Server Administrator classifies events affecting certain critical components in your system using an event type. Normal, warning, and critical are the three most common event types displayed for component status.
This help module defines terms for event types, states, and severities. Read this help module if you want more details about the different terms that Server Administrator uses to classify events and to identify component health.
Whether Server Administrator is reporting the health of a component or classifying an event, every event has the same distinguishing attributes: the component or redundancy being monitored, and the type, state, and severity of the event that the component is undergoing.
Server Administrator classifies both components and the redundancy of some components according to type, severity, and state.
All components in a system are important in some way. Systems management applications such as Server Administrator single out some components for special attention. Healthy systems rely especially on a steady supply of electrical power in appropriate voltages to operate system components properly. The electrical power is delivered across the system's alternating current (AC) switch and into the system's power supplies. Components of systems also require a functional range of temperatures inside the chassis. Running programs and doing calculations on data requires well-functioning random-access memory. As a result of these requirements, the power switch, power supplies, fans, and system memory are some of the most important components that Server Administrator monitors.
Server Administrator monitors the health of redundant components and reports redundancy status for the system.
Depending on how vital a system is to the mission of a business entity or organization, some system components are installed in the system with planned redundancy. A system that is critical to an organization's mission is most likely to have redundant components. A redundant component is designed to take over when its companion component fails. Redundancy helps to protect a system from downtime due to shutdown or component damage.
Full redundancy for the entire system means that all devices are working within normal limits. If a system requires four fans for full redundancy and all four are working, the system has full redundancy for the fan component. Each of the two primary fans has a backup that can take over if the primary fan fails. Full redundancy requires no action other than normal preventive maintenance.
Degraded redundancy means that some of the components that are needed for full redundancy are not working. The system is operational, but not enough components are working to guarantee that an operational component can take over in every case of component failure. For example, if four fans are required for full redundancy, three operational fans represent degraded redundancy. Only one of the two primary fans has a backup if it fails.
Lost redundancy means that the system has only the minimum number of components working to prevent system failure. No redundant components are working. If four fans are required for full redundancy and only two fans are working, neither of the primary fans has a backup if a fan fails.
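The three redundancy levels described above can be sketched as a small classification function. This is an illustrative sketch only, not Server Administrator code: the names `Redundancy` and `classify_redundancy` and the parameters `required_for_full` and `minimum` are hypothetical, and the fan counts are taken from the four-fan example in the text.

```python
from enum import Enum
from typing import Optional


class Redundancy(Enum):
    FULL = "full"
    DEGRADED = "degraded"
    LOST = "lost"


def classify_redundancy(working: int, required_for_full: int, minimum: int) -> Optional[Redundancy]:
    """Map a count of working components to a redundancy level (hypothetical helper)."""
    if working >= required_for_full:
        # Every primary component has a working backup.
        return Redundancy.FULL
    if working > minimum:
        # Some, but not all, primary components still have a backup.
        return Redundancy.DEGRADED
    if working == minimum:
        # Only the minimum set is running; no backups remain.
        return Redundancy.LOST
    # Fewer than the minimum: outside the three levels described in the text.
    return None


# Four fans required for full redundancy, two needed to keep the system running.
for fans in (4, 3, 2):
    print(fans, classify_redundancy(fans, required_for_full=4, minimum=2).value)
```

Returning `None` when fewer than the minimum number of components are working is an assumption of this sketch; the text does not name that condition as a redundancy level.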
An event is classified by type. Example event types are normal, warning, and critical.
Normal events indicate that a component is operating within a range of values that enables the component to perform its function in the system well. Another term often applied to components whose status is normal is OK. When a component is OK, or an event is normal, the system operator does not have to take corrective action.
Warning events occur when a managed component is not operating optimally, but is still able to operate. Warning events provide some lead time to system operators. The appropriate action for a warning event is often to investigate further and to schedule maintenance on the component. Warnings also alert the system operator to pay more attention to a component until the component returns to normal. Power Users and Administrators can define the minimum and maximum values for a warning event. The privilege of defining the warning range allows Power Users and Administrators to build in the reaction time they want in dealing with an operational component that is starting to show signs of degraded performance.
A critical event indicates that a component is either operating outside the bounds of proper functioning or not operating at all. A component that is not operating at all is often called nonrecoverable. The system manufacturer defines the critical range for a component because the manufacturer best knows the engineering that goes into the component and its proper functioning. Critical carries more urgency than warning, and system operators take this type of degradation in component performance more seriously. Actions appropriate for a critical or failing component may include immediately shutting down the system or arranging to replace the component very soon.
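The relationship between the normal, warning, and critical event types can be illustrated with a threshold check on a probe reading. This is a minimal sketch under stated assumptions: the class `ProbeThresholds`, the function `classify_event`, and the temperature values are hypothetical and do not come from Server Administrator. The warning bounds stand for the user-defined range, the failure bounds stand for the manufacturer-defined critical range, and treating a reading exactly on a bound as belonging to the less severe type is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeThresholds:
    min_failure: float  # manufacturer-defined lower critical bound (hypothetical value)
    min_warning: float  # user-defined lower warning bound (hypothetical value)
    max_warning: float  # user-defined upper warning bound (hypothetical value)
    max_failure: float  # manufacturer-defined upper critical bound (hypothetical value)


def classify_event(reading: float, t: ProbeThresholds) -> str:
    """Return the event type for a probe reading: normal, warning, or critical."""
    if reading < t.min_failure or reading > t.max_failure:
        return "critical"
    if reading < t.min_warning or reading > t.max_warning:
        return "warning"
    return "normal"


# Example chassis temperature thresholds in degrees Celsius (illustrative only).
chassis = ProbeThresholds(min_failure=5.0, min_warning=10.0, max_warning=40.0, max_failure=50.0)
print(classify_event(25.0, chassis))  # normal
print(classify_event(45.0, chassis))  # warning
print(classify_event(55.0, chassis))  # critical
```

Narrowing the gap between the warning bounds and the failure bounds shortens the lead time an operator has before a warning turns critical, which is the reaction time the warning range is meant to provide.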
The state of a component or system attribute is one of the following: operational, degraded, or nonoperational.
An operational temperature means that temperature probes inside a chassis are reading temperatures in the normal range of operation.
A degraded temperature means that temperature probes inside a chassis are reading temperatures that are somewhere within the warning range defined by the minimum and maximum operating temperatures required for a warning. The temperature in the chassis is either below the normal minimum or above the normal maximum temperature.
A degraded redundancy means that there are not enough components working to ensure that each critical component has a backup to take over in case it fails.
A nonoperational component or component attribute is one that is operating in either the failure range or the nonrecoverable range. Using the temperature example, if the system is still working at all, the temperature is so far above or below the normal range that it may trigger a thermal shutdown of the system, or it may damage or destroy system components.
Each event type and state for a component is rated according to its severity. Severities for events include informational, minor, major, and critical.
A normal event or component status corresponds to an operational state, and the severity associated with a normal event is informational. The only action that Server Administrator takes for a normal event is to inform the system operator that the component is normal.
A warning event may be minor or major depending on the component. For example, if you remove a fan in a fan-redundant system, the severity of that event is minor.
Some warning events can indicate major risks to the system. If a fan remains outside of the system for an extended period of time, the event could become major because redundancy would be compromised. Extended absence of a component in a system whose mission in an organization requires redundancy could result in component failures without available backups, and could lead to eventual system failure.
Events that detect components within the failure range are critical. Failure of components such as the fans, AC power cords, or memory modules endangers the system's ability to operate and to preserve data.
The following table provides example events for important components and shows how the event type, severity, and state are related.
Component | Event or Alert Type | Severity | State |
AC Power Cord | Normal | Informational | Operational |
AC Power Cord | Failure | Critical | Degraded |
Power Supply | Failure | Critical | Degraded |
Redundancy (for Power System) | Normal | Informational | Operational |
Redundancy (for Power System) | Degraded | Minor | Degraded |
Redundancy (for Power System) | Lost | Major | Degraded |
Temperature | Normal | Informational | Operational |
Temperature | Warning | Minor | Degraded |
Temperature | Failure | Critical | Degraded |
Thermal | Shutdown | Critical | Nonoperational |
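The rows of the table can also be read as a lookup from a component and event type to a severity and state. The sketch below transcribes the rows as they appear above. The names `Classification`, `EVENT_TABLE`, and `classify` are hypothetical and are not part of Server Administrator.

```python
from typing import NamedTuple, Optional


class Classification(NamedTuple):
    severity: str
    state: str


# Keys are (component, event or alert type), copied from the table rows.
EVENT_TABLE = {
    ("AC Power Cord", "Normal"): Classification("Informational", "Operational"),
    ("AC Power Cord", "Failure"): Classification("Critical", "Degraded"),
    ("Power Supply", "Failure"): Classification("Critical", "Degraded"),
    ("Redundancy (for Power System)", "Normal"): Classification("Informational", "Operational"),
    ("Redundancy (for Power System)", "Degraded"): Classification("Minor", "Degraded"),
    ("Redundancy (for Power System)", "Lost"): Classification("Major", "Degraded"),
    ("Temperature", "Normal"): Classification("Informational", "Operational"),
    ("Temperature", "Warning"): Classification("Minor", "Degraded"),
    ("Temperature", "Failure"): Classification("Critical", "Degraded"),
    ("Thermal", "Shutdown"): Classification("Critical", "Nonoperational"),
}


def classify(component: str, event_type: str) -> Optional[Classification]:
    """Look up severity and state; returns None for rows not in the example table."""
    return EVENT_TABLE.get((component, event_type))


result = classify("Redundancy (for Power System)", "Lost")
print(result.severity, result.state)
```

Because the table lists example events only, a lookup that returns `None` means the combination is not shown here, not that the event cannot occur.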