Description of the Alarm Management.

An alarm is a notification of a specific event; it may or may not represent an error. The application that observes the event reports it as an alarm. In the ASP, the Alarm Service lets you define the set of managed resources and the alarms that can occur on those resources, and it provides alarm reporting, alarm clearing, alarm soaking, and alarm correlation. The ASP Alarm service enables applications to notify the northbound entity about erroneous conditions that can occur on managed resources. The service is compliant with the ITU-T X.733 specification and models probable cause, severity, and category in line with that specification.

The ASP Alarm service also provides filters that can be applied to an Alarm. It provides a mechanism to specify a soaking time, a generation rule, and a suppression rule.

A burst of events may need to be reported as alarms within a short interval. To handle this condition, a soaking time is defined for the alarm. The soaking time is the period for which the alarm is observed before it is actually reported.

The generation rule and the suppression rule are part of the alarm correlation feature: whether an alarm is reported depends on the status of the alarms named in these rules. Both rules are evaluated when the alarm is reported. Within a rule, the alarms are combined with an AND or an OR relation.

The usage model of the Alarm service is a producer-consumer model. Applications that raise Alarms are the producers, and the management applications are the consumers. Any application can raise an Alarm on detecting an erroneous condition on a managed resource. The usage model also resembles a publisher-subscriber model, because application components and management applications are unaware of each other and each management application receives Alarms after subscribing to the Alarm's event channel.

Working with Alarm Service

Alarm Characteristics: The mandatory parameters of the X.733 specification are incorporated as attributes of an Alarm. The category, the probable cause, and the perceived severity are the mandatory parameters of an Alarm notification. There are five Alarm categories, and the probable cause parameter identifies the likely cause of the Alarm. The severity parameter defines six severity levels: cleared, indeterminate, warning, minor, major, and critical. The following list maps each category to its probable causes.

Alarm Category and Probable Cause:

Communication
- Loss of signal
- Loss of frame
- Framing error
- Local node transmission error
- Remote node transmission error
- Call establishment error
- Degraded signal
- Communications subsystem failure
- Communications protocol error
- LAN error
- DTE-DCE interface error
Quality of service
- Response time excessive
- Queue size exceeded
- Bandwidth reduced
- Retransmission rate excessive
- Threshold crossed
- Performance degraded
- Congestion
- Resource at or nearing capacity
Processing Error
- Storage capacity problem
- Version mismatch
- Corrupt data
- CPU cycles limit exceeded
- Software error
- Software program error
- Software program abnormally terminated
- File error
- Out of memory
- Underlying resource unavailable
- Application subsystem failure
- Configuration or customization error
Equipment
- Power problem
- Timing problem
- Processor problem
- Dataset or modem error
- Multiplexer problem
- Receiver failure
- Transmitter failure
- Receive failure
- Transmit failure
- Output device error
- Input device error
- I/O device error
- Equipment malfunction
- Adapter error
Environmental
- Temperature unacceptable
- Humidity unacceptable
- Heating/ventilation/cooling system problem
- Fire detected
- Flood detected
- Toxic leak detected
- Leak detected
- Pressure unacceptable
- Excessive vibration
- Material supply exhausted
- Pump failure
- Enclosure door open
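
The category-to-cause mapping above can be sketched as a lookup table. The following is a minimal illustration only: every type and function name in it (AlarmCategory, AlarmSeverity, alarmCategoryOf, and so on) is hypothetical and is not taken from the ASP headers, and the table holds just two causes per category as an excerpt.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration; not the ASP header definitions. */
typedef enum {
    ALARM_CATEGORY_COMMUNICATION,
    ALARM_CATEGORY_QUALITY_OF_SERVICE,
    ALARM_CATEGORY_PROCESSING_ERROR,
    ALARM_CATEGORY_EQUIPMENT,
    ALARM_CATEGORY_ENVIRONMENTAL,
    ALARM_CATEGORY_UNKNOWN      /* returned when a cause is not in the table */
} AlarmCategory;

/* The six severity levels named in the text. */
typedef enum {
    ALARM_SEVERITY_CLEARED,
    ALARM_SEVERITY_INDETERMINATE,
    ALARM_SEVERITY_WARNING,
    ALARM_SEVERITY_MINOR,
    ALARM_SEVERITY_MAJOR,
    ALARM_SEVERITY_CRITICAL
} AlarmSeverity;

typedef struct {
    const char   *probableCause;
    AlarmCategory category;
} CauseEntry;

/* Excerpt only: two causes per category, copied from the list above. */
static const CauseEntry kCauseTable[] = {
    { "Loss of signal",      ALARM_CATEGORY_COMMUNICATION },
    { "Loss of frame",       ALARM_CATEGORY_COMMUNICATION },
    { "Queue size exceeded", ALARM_CATEGORY_QUALITY_OF_SERVICE },
    { "Congestion",          ALARM_CATEGORY_QUALITY_OF_SERVICE },
    { "Version mismatch",    ALARM_CATEGORY_PROCESSING_ERROR },
    { "Out of memory",       ALARM_CATEGORY_PROCESSING_ERROR },
    { "Power problem",       ALARM_CATEGORY_EQUIPMENT },
    { "Adapter error",       ALARM_CATEGORY_EQUIPMENT },
    { "Fire detected",       ALARM_CATEGORY_ENVIRONMENTAL },
    { "Pump failure",        ALARM_CATEGORY_ENVIRONMENTAL },
};

/* Look up the category of a probable cause by exact string match. */
AlarmCategory alarmCategoryOf(const char *probableCause)
{
    assert(probableCause != NULL);
    size_t n = sizeof kCauseTable / sizeof kCauseTable[0];
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(kCauseTable[i].probableCause, probableCause) == 0) {
            return kCauseTable[i].category;
        }
    }
    return ALARM_CATEGORY_UNKNOWN;
}
```

A string-keyed table keeps the sketch readable; a real model would more likely identify probable causes by enumeration value.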

Alarm State Definition

This section specifies the states an alarm passes through before the alarm server actually reports it as an event. An alarm is triggered after an erroneous condition is detected. From the time the Alarm is raised until it is reported, it goes through the following states:

Alarm Raised State: The Alarm is in the raised state when the application, after detecting the erroneous condition, raises the Alarm.
Alarm Soaked State: After the Alarm is soaked for a certain period of time, it enters the Alarm soaked state.
Alarm Generated State: After the Alarm qualifies the generation rule, it enters the Alarm generated state. If the Alarm does not qualify the generation rule because of some dependent Alarm, it remains in the soaked state until the dependent Alarms are generated.
Alarm Masked State: If the Alarm is masked by the masking logic, it enters the Alarm masked state. This happens in either of the following cases:
- An alarm is already raised on a managed resource that is higher in the hierarchy. The alarm raised on the lower resource is then masked.
- The suppression rule is satisfied for the alarm.
Alarm Reported State: If the Alarm is not masked, it enters the Alarm reported state.

Raising an Alarm on a managed resource: The managed resources that raise alarms are modeled as Alarm Managed Objects in the ASP. A managed resource can be a hardware or a software resource. The Alarms on a managed resource correspond to the probable cause attribute of the alarm managed object. An application can raise an Alarm on a managed resource using the ASP Alarm service when it detects a deviation from normal operation.

The Alarm service also allows the application to pass additional context information when raising the Alarm. After the Alarm is successfully reported, the Alarm Manager returns a unique Alarm handle for the raised Alarm. This handle is useful for service impacting alarms, when the failure is reported to the Availability Management Framework (AMF) service.

For example, an application managing a GigePort resource follows these steps to raise an Alarm:

The application can use the Alarm service by creating an Alarm MSO class for the GigePort resource.
The likely Alarms that can occur on the GigePort resource, such as Loss of Signal (LOS) and Loss of Frame (LOF), are associated with the GigePort class while modeling.
The application managing the resource, on detecting an erroneous condition on the specific instance of the GigePort, raises an Alarm by specifying the MOID and the probable cause.

Alarm Filters

The filtering mechanism provided by the Alarm service prevents spurious Alarms from flooding the Alarm Manager. It also applies constraints that an Alarm must satisfy before it is reported. The filtering mechanism can be classified into soaking, masking, the generation rule, and the suppression rule.

Soaking: Soaking is the time duration for which the erroneous condition must persist before it is reported as an Alarm. Soaking allows an Alarm to be monitored for a specific time period.

In the following figures, the notations "A" and "B" denote the detection and the clearing of the erroneous condition, respectively.

Figure Assert Soaking Time illustrates an Alarm raised for a specified assert soaking time. In case 1, the Alarm is raised and cleared within the assert soaking time. Hence, the Alarm is not processed. In case 2, the Alarm stays raised, qualifies the assert soaking time, and is processed by the Alarm Manager.

Assert Soaking Time

Figure Clear Soaking Time illustrates the clearing of a raised Alarm for a specified clear soaking time. In case 1, the Alarm is cleared and raised again within the clear soaking time, so the Alarm is not processed. In case 2, the Alarm stays cleared, qualifies the clear soaking time, and is processed by the Alarm Manager.

Clear Soaking Time

Soaking time is configured per Alarm. Two different soak periods can be configured, one for assert and the other for clear.
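
The assert and clear soaking behavior can be sketched as a small filter that the caller samples periodically. This is an illustration under stated assumptions, not the Alarm client's implementation: the names Soaker, soakerInit, and soakerUpdate are hypothetical, time is an arbitrary millisecond counter supplied by the caller, and a change that reverts before its soak period elapses is simply dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical soak filter for one Alarm; not an ASP type. */
typedef struct {
    long assertSoakMs;   /* how long the condition must persist before assert */
    long clearSoakMs;    /* how long it must stay absent before clear */
    bool asserted;       /* last status that qualified its soak period */
    bool pending;        /* a status change is currently being soaked */
    long pendingSinceMs; /* when the pending change was first observed */
} Soaker;

Soaker soakerInit(long assertSoakMs, long clearSoakMs)
{
    assert(assertSoakMs >= 0 && clearSoakMs >= 0);
    Soaker s = { assertSoakMs, clearSoakMs, false, false, 0 };
    return s;
}

/*
 * Feed one observation of the erroneous condition at time nowMs.
 * Returns true when a status change has qualified its soak period and
 * should be processed; the new status is then available in s->asserted.
 */
bool soakerUpdate(Soaker *s, bool conditionPresent, long nowMs)
{
    assert(s != NULL);
    if (conditionPresent == s->asserted) {
        /* No change, or the change reverted inside the soak window. */
        s->pending = false;
        return false;
    }
    if (!s->pending) {
        s->pending = true;
        s->pendingSinceMs = nowMs;
    }
    long soakMs = s->asserted ? s->clearSoakMs : s->assertSoakMs;
    if (nowMs - s->pendingSinceMs >= soakMs) {
        s->asserted = conditionPresent;
        s->pending = false;
        return true;
    }
    return false;
}
```

With an assert soak of 100 ms, a condition seen at 0 ms and gone at 50 ms is never processed (case 1), while one still present at 100 ms is processed (case 2). The clear path works the same way with the clear soak period.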

Generation and Suppression Rules: An application can specify a generation rule to model dependencies between Alarms.

A generation rule specifies the following:

The Alarm that needs to be generated.
Set of dependent Alarms.
The condition that has to be satisfied for the Alarm to be generated. This condition is specified as a logical relationship between the dependent Alarms. For example:
Generation rule of A1: A1 = A2 OR A3 OR A4

implies that Alarm A1 enters the generated state when any of the Alarms A2, A3, or A4 is generated.

A suppression rule specifies the following:

The Alarm that needs to be suppressed.
Set of dependent Alarms.
The condition that needs to be met for the Alarm to be suppressed. This condition is specified as a logical relationship between the dependent Alarms. For example:
Suppression rule of A1: A1 = A5 AND A6

implies that Alarm A1 is in the suppressed state when both A5 and A6 are suppressed. Alarm A1 is generated when it qualifies the generation rule and A5 or A6 is in the cleared state.

The term Alarm rule refers to either a generation rule or a suppression rule. A generation rule specifies constraints on the generation of Alarms. A suppression rule specifies constraints on the suppression of Alarms. An Alarm rule can contain a maximum of four Alarms. An Alarm cannot be generated if it is not specified in the Alarm generation rule. An Alarm rule can use either logical OR or logical AND operators, but not both.
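
The rule constraints described here (one operator per rule, at most four Alarms) lend themselves to a short evaluation sketch. The names AlarmRule, alarmRuleHolds, and alarmMayBeGenerated are hypothetical and not part of the ASP API. The sketch also assumes that each dependent Alarm has a single boolean "active" flag and that an Alarm may be generated when its generation rule holds and its suppression rule does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ALARM_RULE_MAX 4   /* the text allows at most four Alarms per rule */

/* A rule uses one operator throughout: OR or AND, never both. */
typedef enum { ALARM_RULE_OR, ALARM_RULE_AND } AlarmRuleOp;

/* Hypothetical rule representation; alarms[] holds indexes into a status array. */
typedef struct {
    AlarmRuleOp op;
    size_t      count;
    size_t      alarms[ALARM_RULE_MAX];
} AlarmRule;

/* Evaluate the rule against the current status of the dependent Alarms. */
bool alarmRuleHolds(const AlarmRule *rule, const bool *active, size_t activeLen)
{
    assert(rule != NULL && active != NULL);
    assert(rule->count >= 1 && rule->count <= ALARM_RULE_MAX);
    for (size_t i = 0; i < rule->count; ++i) {
        assert(rule->alarms[i] < activeLen);
        bool on = active[rule->alarms[i]];
        if (rule->op == ALARM_RULE_OR && on) {
            return true;    /* one active Alarm satisfies an OR rule */
        }
        if (rule->op == ALARM_RULE_AND && !on) {
            return false;   /* one inactive Alarm defeats an AND rule */
        }
    }
    /* OR: nothing was active. AND: everything was active. */
    return rule->op == ALARM_RULE_AND;
}

/* Assumption: generated only if the generation rule holds and suppression does not. */
bool alarmMayBeGenerated(const AlarmRule *generation, const AlarmRule *suppression,
                         const bool *active, size_t activeLen)
{
    return alarmRuleHolds(generation, active, activeLen)
        && !alarmRuleHolds(suppression, active, activeLen);
}
```

For the examples above, the generation rule of A1 would be encoded as an OR rule over indexes 2, 3, and 4, and the suppression rule as an AND rule over indexes 5 and 6.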

Alarm Masking: The Alarm service views the containment relationship of the Alarm MSO to implement hierarchical masking. The Alarm masking algorithm masks all Alarms of the same category within a subtree in the hierarchy. The Alarm masking logic is explained in the following example:

COR Class Tree
An Alarm of probable cause Loss of Signal, which belongs to the Communication category, is reported on Blade1. The Alarm service then masks all Alarms of the Communication category on its children, Port1 and Port2, because Loss of Signal falls under the Communication category. After the Loss of Signal Alarm is cleared on Blade1, the masked Alarms that satisfy the hierarchical masking rule are published as Alarm notifications and the status is updated in COR.
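
The Blade1/Port1 example can be sketched as a walk up the containment tree. This is a simplified model, not the Alarm server's algorithm: ManagedObject, Category, and alarmIsMasked are hypothetical names, and each object is assumed to carry one "asserted" flag per category rather than real COR state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical category enumeration for this sketch. */
typedef enum {
    CAT_COMMUNICATION,
    CAT_QUALITY_OF_SERVICE,
    CAT_PROCESSING_ERROR,
    CAT_EQUIPMENT,
    CAT_ENVIRONMENTAL,
    CAT_COUNT
} Category;

/* Hypothetical managed object: a node in the containment hierarchy. */
typedef struct ManagedObject {
    const char                 *name;
    const struct ManagedObject *parent;             /* NULL at the root */
    bool                        asserted[CAT_COUNT]; /* reported Alarm per category */
} ManagedObject;

/*
 * An Alarm of the given category on mo is masked when any object higher
 * in the hierarchy already has a reported Alarm of the same category.
 */
bool alarmIsMasked(const ManagedObject *mo, Category category)
{
    assert(mo != NULL && category < CAT_COUNT);
    for (const ManagedObject *p = mo->parent; p != NULL; p = p->parent) {
        if (p->asserted[category]) {
            return true;
        }
    }
    return false;
}
```

With a Communication Alarm asserted on Blade1, a Communication Alarm on Port1 is masked while an Equipment Alarm on Port1 is not. Once the flag on Blade1 is cleared, the same check returns false and the Port1 Alarm can be reported.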

Event Generation

An event is published for each reported Alarm. Interested services can obtain these events by using the alarm client event subscribe API. The payload of the event consists of the Alarm information and an Alarm handle. The Alarm information comprises the probable cause, category, severity, specific problem, resource on which the Alarm was reported, Alarm state (raised or cleared), time stamp, and additional information of the Alarm. The Alarm handle is unique for every node and is also known as the Notification Identifier. The mapping of the Alarm handle to the reported Alarm is maintained in the Alarm Manager. The Alarm Service depends on the Event Service for the delivery of the event.

Alarm Life Cycle

State Machine

Figure Alarm State Machine illustrates the stages of an Alarm from the time it is raised until it is reported.

The Alarm enters the raised state when an application detects an erroneous condition and raises the Alarm. A raised Alarm can have an assert or clear status.
After the Alarm is soaked it enters the soaked state.
When the Alarm qualifies the generation rule it enters the generated state. This leads to generation of other Alarms dependent on this Alarm. If the Alarm does not qualify the generation rule, the Alarm stays in the soaked state.
The Alarm is reported if the status of the Alarm is clear after it successfully qualifies the generation rule.
If the status of the Alarm is assert after the generated state, the Alarm is checked against the hierarchical masking rule. If the parent MO has an Alarm of the same category, the Alarm enters the masked state. Otherwise, the Alarm enters the reported state.
Alarms in the masked state move into reported state after the Alarm on the parent MO of the same category is cleared.

Alarm State Machine
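
The transitions listed above can be written as a single step function. This is a sketch assuming hypothetical names (AlarmState, AlarmInputs, alarmNextState) and boolean inputs computed elsewhere. It follows only the hierarchical masking path described in this section, so the suppression rule is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state names mirroring the states described in the text. */
typedef enum {
    ALARM_STATE_RAISED,
    ALARM_STATE_SOAKED,
    ALARM_STATE_GENERATED,
    ALARM_STATE_MASKED,
    ALARM_STATE_REPORTED
} AlarmState;

/* Inputs assumed to be evaluated by other parts of the system. */
typedef struct {
    bool soakElapsed;             /* the Alarm qualified its soaking time */
    bool generationRuleMet;       /* the Alarm qualified its generation rule */
    bool statusIsAssert;          /* true = assert, false = clear */
    bool parentHasSameCategory;   /* an ancestor MO has an Alarm of the same category */
} AlarmInputs;

/* Advance the Alarm by one transition; returns the unchanged state if none applies. */
AlarmState alarmNextState(AlarmState state, const AlarmInputs *in)
{
    assert(in != NULL);
    switch (state) {
    case ALARM_STATE_RAISED:
        return in->soakElapsed ? ALARM_STATE_SOAKED : ALARM_STATE_RAISED;
    case ALARM_STATE_SOAKED:
        return in->generationRuleMet ? ALARM_STATE_GENERATED : ALARM_STATE_SOAKED;
    case ALARM_STATE_GENERATED:
        if (!in->statusIsAssert) {
            return ALARM_STATE_REPORTED;   /* a clear is reported without masking */
        }
        return in->parentHasSameCategory ? ALARM_STATE_MASKED : ALARM_STATE_REPORTED;
    case ALARM_STATE_MASKED:
        /* Leaves the masked state once the parent Alarm is cleared. */
        return in->parentHasSameCategory ? ALARM_STATE_MASKED : ALARM_STATE_REPORTED;
    case ALARM_STATE_REPORTED:
        return ALARM_STATE_REPORTED;
    }
    return state;
}
```

Calling the function repeatedly with fresh inputs walks an Alarm from raised to reported, pausing in the soaked or masked state while the corresponding condition is unmet.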

Alarm Flow

Figure Alarm Flow Sequence illustrates the Alarm flow from the detection of the erroneous condition to reporting of the Alarm.

The application detects the erroneous condition and raises the Alarm.
Alarm client soaks the Alarm and on persistence of the erroneous condition it moves the Alarm from raised to soaked state.
When the Alarm qualifies the generation rule, the Alarm client moves the Alarm from the soaked state to the generated state.
After the generation of the Alarm, it is passed to the Alarm server. If the Alarm is not being masked or if the status of the Alarm is clear then the Alarm server performs the following:
1. Generates a unique Alarm handle.
2. Updates the Alarm status in COR.
3. Publishes an event as described in the Event Generation section.
4. Reports the Alarm to the Fault Manager if it is a non service impacting Alarm. Non service impacting Alarms are described in the next section.

Alarm Flow Sequence

Modules in Alarm Detection and Reporting

Component Level Interaction of Alarm Manager: The application component detects the erroneous condition and raises the Alarm. The Alarm Manager updates the status in COR only after the Alarm qualifies the generation rule. It publishes an event for this notification after the update is performed in COR. These notifications can result in a trap generated by SNMP. The Alarm Manager reports the non service impacting Alarms to the Fault Manager.

Figure Alarm Flow Sequence Diagram shows the component level interaction of Alarm Manager:

The application component raises an Alarm using the API provided by the Alarm client.
Alarm client soaks the Alarm and after it qualifies the generation rule, sends it to Alarm Server.
Alarm Server checks whether the Alarm is masked. If it is not masked, Alarm Server updates the current Alarm status and time stamp in COR. The Alarm status indicates whether the Alarm is asserted or cleared.
Alarm Server also publishes a notification for this Alarm.
Alarm Server reports to the Fault Manager, if it is a non service impacting Alarm. The next section provides more information about non service impacting Alarms.

Alarm Flow Sequence Diagram

Alarm Handling

Figure Alarm Handling explains how service impacting and non service impacting Alarms are handled. It describes the interaction between the Alarm Manager (AM), Fault Manager (FM), Clovis Object Repository (COR), Availability Management Framework (AMF), the application component that raises an Alarm, and the management application.

The application component uses the Alarm service by linking with the Alarm client.
When an erroneous condition is detected, it raises an Alarm by calling the Alarm raise API provided by the Alarm client.
After the Alarm satisfies the soaking time and the generation rule, it is sent to the Alarm server.
Alarm server checks if the Alarm qualifies the hierarchical masking rule, updates COR, and publishes an event.
Alarm Server returns a unique handle, called the notification identifier, to the application component that raised the Alarm. The mapping of this handle to the reported Alarm is maintained on the Alarm server side. The Alarm handle is unique for every node.
1. If the Alarm is a service impacting Alarm, Alarm server does not report the Alarm to the Fault Manager.
2. The application component calls the saAmfComponentErrorReport API for service recovery and the Alarm handle is passed to AMF. After recovering the service, AMF calls the Fault Manager to repair the managed resource.
If the Alarm is a non service impacting Alarm, Alarm server reports the Alarm directly to the Fault Manager.
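
The routing decision in the steps above can be sketched as follows. This is a toy model, not the Alarm server: AlarmServerSim, ReportOutcome, and alarmServerReport are hypothetical names, the handle is a simple per-server counter, and the AMF call is represented only by a flag because the real call is made by the application component.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the Alarm server state on one node. */
typedef struct {
    unsigned nextHandle;           /* source of unique Alarm handles */
    unsigned faultManagerReports;  /* count of Alarms passed to the Fault Manager */
} AlarmServerSim;

/* What happens to one reported Alarm. */
typedef struct {
    unsigned handle;                  /* notification identifier returned to the caller */
    bool     reportedToFaultManager;  /* non service impacting path */
    bool     appReportsToAmf;         /* service impacting path: the application is
                                         expected to call saAmfComponentErrorReport
                                         and pass the handle to AMF */
} ReportOutcome;

/* Assign a handle and pick the path based on whether the Alarm impacts service. */
ReportOutcome alarmServerReport(AlarmServerSim *server, bool serviceImpacting)
{
    assert(server != NULL);
    ReportOutcome out;
    out.handle = ++server->nextHandle;
    out.reportedToFaultManager = !serviceImpacting;
    out.appReportsToAmf = serviceImpacting;
    if (out.reportedToFaultManager) {
        server->faultManagerReports++;
    }
    return out;
}
```

The two paths are mutually exclusive in this sketch: a service impacting Alarm reaches the Fault Manager only indirectly, after AMF has recovered the service.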

Service Impacting and Non Service Impacting Alarms