Functional Description
Fault Manager

Description of the Fault Management Service.

Overview

The OpenClovis Fault Manager infrastructure provides a mechanism for managing faults in a system and initiating actions defined by the user. It handles user-defined run-time faults, both hardware and software, and supports progressive fault reporting based on a "probation period".

Non-service-impacting alarms are alarms with a severity below CL_ALARM_SEVERITY_MAJOR. They are reported directly to the Fault Manager located on the same node. The action taken on receiving a fault is defined by the user and is specific to the alarm type and severity. Typically, when identical faults are received within a configured probation period, they are escalated.

Service-impacting faults are alarms with a severity of CL_ALARM_SEVERITY_MAJOR or above. The Fault Manager interacts with the Availability Management Framework (AMF) while handling service-impacting alarms. The AMF performs service recovery, and the Fault Manager takes repair actions as defined by the user. The AMF and the Fault Manager depend on each other when taking actions in a system during recovery and repair.
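The split between the two handling paths can be illustrated with a minimal sketch. Everything in it is hypothetical: the DemoSeverity and DemoPath types and the demo_* functions are invented for illustration and are not part of the OpenClovis API. The real CL_ALARM_SEVERITY_* constants come from the OpenClovis alarm headers, and their numeric values are not assumed here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the alarm severity levels named in this page.
 * The real CL_ALARM_SEVERITY_* values are NOT assumed to match these. */
typedef enum {
    DEMO_SEV_CRITICAL,
    DEMO_SEV_MAJOR,
    DEMO_SEV_MINOR,
    DEMO_SEV_WARNING,
    DEMO_SEV_INDETERMINATE,
    DEMO_SEV_COUNT
} DemoSeverity;

/* Hypothetical names for the two handling paths described above. */
typedef enum {
    DEMO_PATH_LOCAL_REPORT, /* reported to the Fault Manager on the same node */
    DEMO_PATH_AMF_RECOVERY  /* AMF recovers service, Fault Manager repairs */
} DemoPath;

/* An alarm is service impacting when its severity is major or above,
 * which among the five levels means major or critical. */
static bool demo_is_service_impacting(DemoSeverity sev)
{
    assert((unsigned)sev < (unsigned)DEMO_SEV_COUNT);
    return sev == DEMO_SEV_CRITICAL || sev == DEMO_SEV_MAJOR;
}

/* Pick the handling path for an alarm of the given severity. */
static DemoPath demo_select_path(DemoSeverity sev)
{
    return demo_is_service_impacting(sev) ? DEMO_PATH_AMF_RECOVERY
                                          : DEMO_PATH_LOCAL_REPORT;
}
```

The sketch spells the rule out as an explicit comparison rather than a numeric ordering, so it does not depend on how the real severity constants are numbered.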

The Fault Manager provides the following services:

Fault repair is performed for service-impacting alarms. To support it, the OpenClovis IDE generates a callback function for each managed resource that has the alarm service associated with it. When an alarm is raised on a resource and the fault repair action API is called, the Fault Manager calls the callback function registered for that resource. The user can implement a customized repair action inside the callback for that resource. The callback functions are generated in the file clFaultRepairHandler.c, in the config directory of the code generated by the IDE.
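The per-resource dispatch described above can be pictured with a small, self-contained sketch. It is not the code the IDE generates: the integer resource identifier, the DemoRepairHandler signature, and all demo_* functions are assumptions made for illustration, and the real callbacks in clFaultRepairHandler.c have their own signatures.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_RESOURCES 8

/* Hypothetical repair callback: receives the resource that raised the
 * alarm plus an opaque user pointer; returns 0 on success. */
typedef int (*DemoRepairHandler)(int resourceId, void *userData);

typedef struct {
    int               resourceId;
    DemoRepairHandler handler;
} DemoRepairEntry;

static DemoRepairEntry demoRepairTable[DEMO_MAX_RESOURCES];
static size_t          demoRepairCount = 0;

/* Associate a repair callback with a managed resource.
 * Returns 0 on success, -1 if the handler is NULL or the table is full. */
static int demo_register_repair(int resourceId, DemoRepairHandler handler)
{
    assert(demoRepairCount <= DEMO_MAX_RESOURCES);
    if (handler == NULL || demoRepairCount == DEMO_MAX_RESOURCES) {
        return -1;
    }
    demoRepairTable[demoRepairCount].resourceId = resourceId;
    demoRepairTable[demoRepairCount].handler = handler;
    demoRepairCount++;
    return 0;
}

/* Look up the callback registered for the resource and invoke it.
 * Returns the callback's result, or -1 if none is registered. */
static int demo_repair_action(int resourceId, void *userData)
{
    for (size_t i = 0; i < demoRepairCount; i++) {
        if (demoRepairTable[i].resourceId == resourceId) {
            return demoRepairTable[i].handler(resourceId, userData);
        }
    }
    return -1;
}

/* Example user-defined repair: here it only counts its invocations. */
static int demo_restart_port(int resourceId, void *userData)
{
    int *calls = userData;
    (void)resourceId;
    (*calls)++;
    return 0;
}
```

In this sketch a repair request for an unregistered resource simply fails with -1; how the Fault Manager itself treats that case is not stated here.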

Fault reporting is done by the Alarm Server for non-service-impacting alarms. It is implemented using a matrix of callback functions: each combination of alarm category and severity has five callback functions, one per escalation level. There are five severity levels: critical, major, minor, warning, and indeterminate. There are five categories: communication, quality of service, equipment, processing, and environment. For example, the communication category can have five callback functions for each of its five severities, that is, twenty-five callback functions in total. The callback function definitions are in the file clFaultReportHandler.c, which is generated in the config directory of the model code generated by the IDE.
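One way to picture the callback matrix is as a three-dimensional table indexed by category, severity, and escalation level. The sketch below is illustrative only: the enum names, the DemoReportHandler signature, and the demo_* functions are hypothetical and do not reproduce the contents of clFaultReportHandler.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical enums for the five categories and five severities. */
typedef enum {
    DEMO_CAT_COMMUNICATION,
    DEMO_CAT_QOS,
    DEMO_CAT_EQUIPMENT,
    DEMO_CAT_PROCESSING,
    DEMO_CAT_ENVIRONMENT,
    DEMO_CAT_COUNT
} DemoCategory;

typedef enum {
    DEMO_SEV_CRITICAL,
    DEMO_SEV_MAJOR,
    DEMO_SEV_MINOR,
    DEMO_SEV_WARNING,
    DEMO_SEV_INDETERMINATE,
    DEMO_SEV_COUNT
} DemoSeverity;

#define DEMO_LEVELS 5 /* escalation levels 1..5 */

/* Hypothetical report callback signature. */
typedef int (*DemoReportHandler)(void);

/* One slot per (category, severity, level); unset slots stay NULL. */
static DemoReportHandler demoMatrix[DEMO_CAT_COUNT][DEMO_SEV_COUNT][DEMO_LEVELS];

static int demo_slot_valid(DemoCategory cat, DemoSeverity sev, int level)
{
    return (unsigned)cat < (unsigned)DEMO_CAT_COUNT &&
           (unsigned)sev < (unsigned)DEMO_SEV_COUNT &&
           level >= 1 && level <= DEMO_LEVELS;
}

/* Install the callback for one category, severity, and level (1-based). */
static void demo_set_handler(DemoCategory cat, DemoSeverity sev, int level,
                             DemoReportHandler handler)
{
    assert(demo_slot_valid(cat, sev, level));
    demoMatrix[cat][sev][level - 1] = handler;
}

/* Fetch the callback for one slot, or NULL if none was installed. */
static DemoReportHandler demo_get_handler(DemoCategory cat, DemoSeverity sev,
                                          int level)
{
    assert(demo_slot_valid(cat, sev, level));
    return demoMatrix[cat][sev][level - 1];
}

/* Number of callback slots in a single category (severities x levels). */
static size_t demo_handlers_per_category(void)
{
    return sizeof demoMatrix[0] / sizeof demoMatrix[0][0][0];
}

/* Example callbacks that just return their escalation level. */
static int demo_comm_minor_level1(void) { return 1; }
static int demo_comm_minor_level2(void) { return 2; }
```

With five severities and five levels, demo_handlers_per_category() evaluates to 25, matching the count given for the communication category.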

A probation period serves as the escalation policy for fault reporting. When a fault is reported on a category and severity pair for the first time, the callback function at the first level is called. If the fault occurs again within the probation period, the fault is escalated and the callback at the second level is called. Escalation continues only up to level five. Execution of a callback at a higher level therefore indicates the increasing gravity of the problem for that category and severity pair.
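The escalation policy can be sketched as a small state machine kept per category and severity pair. The sketch makes several assumptions that are not stated above: the probation window is measured from the previous report, a report exactly at the window boundary still counts as inside it, and a report arriving after the window has expired starts again at level one. The DemoEscalation type and demo_next_level() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_LEVEL 5

/* Escalation state for ONE category and severity pair. */
typedef struct {
    long lastReportSec; /* time of the previous report, in seconds */
    int  level;         /* current escalation level; 0 = never reported */
} DemoEscalation;

/* Record a fault report at time nowSec and return the escalation level
 * whose callback would be invoked.
 * Assumptions: window measured from the previous report, inclusive
 * boundary, reset to level 1 once the window has expired. */
static int demo_next_level(DemoEscalation *state, long nowSec,
                           long probationSec)
{
    assert(state != NULL && probationSec >= 0);

    if (state->level > 0 &&
        nowSec - state->lastReportSec <= probationSec) {
        /* Repeat within the probation period: escalate, capped at 5. */
        if (state->level < DEMO_MAX_LEVEL) {
            state->level++;
        }
    } else {
        /* First report, or the probation period has expired. */
        state->level = 1;
    }
    state->lastReportSec = nowSec;
    return state->level;
}
```

For example, with a 10-second probation period, reports at t = 100, 105, and 110 would select levels 1, 2, and 3 under these assumptions, while a later report at t = 200 would fall back to level 1.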

Before a component can use the clFaultReport() or clFaultRepairAction() functions, it needs to:


Generated on Tue Jan 10 10:29:15 PST 2012 for OpenClovis SDK using Doxygen