Functional Description
Checkpointing Service

Description of the Checkpointing Service.

The OpenClovis Checkpoint Service (CPS) is a high availability infrastructure component that synchronizes run-time data and context so that applications can fail over or switch over seamlessly. CPS allows an application to store its internal state and retrieve it immediately, at regular intervals, or during a failover, switchover, or restart. Processes can record Checkpoint data incrementally to protect the application against failures. When recovering from a failover, restart, or switchover, the application can retrieve the Checkpoint data and resume execution from the state recorded before the failure. CPS maintains at least one backup replica of each Checkpoint (provided resources are available) so that the information is not lost. A Checkpoint can be replicated immediately on a different blade and stored indefinitely, or it can be consumed immediately.

CPS provides two types of Checkpointing:
  • File/Library based Checkpointing
  • Server based Checkpointing

Usage Model

The users of CPS are usually highly available applications. The active application checkpoints enough data to recreate its internal state, as indicated by application-specific logic. Standby components read the checkpointed data when activated, or as otherwise indicated by application-specific logic.

CPS neither interprets the stored information nor provides any logic on when to read or write a Checkpoint (an application that has registered for immediate consumption is an exception). The application is responsible for interpreting the checkpointed data as well as deciding when to read or write that data.

CPS replicates the Checkpointed information in persistent memory (for file based Checkpointing) or on replica nodes (for server based Checkpointing).

Users can also request a trigger from CPS by registering a callback function. This callback function is invoked when the Checkpoint that the user is interested in is updated. This is also referred to as registering for immediate consumption.

Use cases

File/Library based Checkpointing
CPS replicates the information in persistent memory. This means that the user application can survive application restart and node restart scenarios. Typically, this form of Checkpointing is used when there are no other nodes in the system where the replicas can be stored.
Server based Checkpointing
CPS replicates the information on other nodes in the system where the replicas can be stored (if such nodes are available). Typically, this form of Checkpointing is used by applications that need to recover from fail-over or switch-over situations. Server based Checkpointing supports three types of Checkpoints:
  • Synchronous Checkpoints - CPS ensures consistency of data across the replicas of such Checkpoints.
  • Asynchronous non-collocated Checkpoints - CPS does not ensure consistency of data across replicas, but writes to the Checkpoint are faster. For asynchronous non-collocated Checkpoints, CPS decides where the active replica resides.
  • Asynchronous collocated Checkpoints - These are similar to asynchronous non-collocated Checkpoints, except that the user application controls where the active replica resides.

Service Description

CPS provides a service to store information that users can read immediately or later, depending on the logic of the user application. The CPS functional description is divided into the following sections based on the type of service used.

File/Library based Checkpointing

Checkpoint Life cycle
To use the services of CPS, the application must initialize the library using the clCkptLibraryInitialize() function. clCkptLibraryInitialize() returns a handle of type ClCkptSvcHdlT that identifies this initialization in subsequent CPS operations. When the handle is no longer required, the association can be closed using clCkptLibraryFinalize(), which frees all the allocated resources. To use File/Library based Checkpointing, the application must be linked with the DBAL library, because the Checkpoint library uses DBAL internally to persist the data into a database. When modeling applications that use the Checkpoint library, the DBAL library must be enabled.
Checkpoint Management
An application can create a Checkpoint by using the clCkptLibraryCkptCreate() function. Once a Checkpoint is created, datasets can be created within it using the clCkptLibraryCkptDataSetCreate() function. The data is stored in and retrieved from these datasets. While creating a dataset, the user has to specify the serializer and deserializer functions that contain the logic for packing and unpacking the Checkpointed information, respectively. CPS does not interpret the Checkpointed information.
The existence of a dataset can be checked using the clCkptLibraryDoesDatasetExist() function. The corresponding function for checking the existence of a Checkpoint is clCkptLibraryDoesCkptExist().
An element in a dataset is created using the clCkptLibraryCkptElementCreate() function. The user has to specify the element serializer and deserializer functions for packing and unpacking the element from a Checkpoint. An element can be deleted using the clCkptLibraryCkptElementDelete() function, a dataset using the clCkptLibraryCkptDataSetDelete() function, and a Checkpoint using the clCkptLibraryCkptDelete() function.
Data access
CPS provides the clCkptLibraryCkptDataSetWrite() function to write an entire dataset into the database. A single element is written to a dataset using the clCkptLibraryCkptElementWrite() function. In both cases, CPS invokes the corresponding serializer function and provides a buffer into which the user has to pack the information to be Checkpointed. CPS then stores this buffer in the database.
The contents of a dataset can be read using the clCkptLibraryCkptDataSetRead() function. CPS invokes the corresponding deserializer function and provides a buffer containing the data that was stored earlier. There is no provision for reading a particular element in a dataset.
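The serializer and deserializer contract described above can be sketched in plain C. This is an illustration only: AppState, app_state_serialize(), and app_state_deserialize() are hypothetical names, and the callback prototypes that CPS actually expects are not shown on this page and should be taken from the SDK headers. The sketch packs the fields in a fixed byte order so that the stored buffer does not depend on the layout of the structure in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical application state; not an OpenClovis type. */
typedef struct {
    uint32_t sessionCount;
    uint32_t lastSequence;
} AppState;

/* Size of the packed representation: two 32-bit fields. */
#define APP_STATE_PACKED_SIZE ((size_t)8)

/* Store a 32-bit value in big-endian byte order. */
static void put_u32(uint8_t *dst, uint32_t value)
{
    dst[0] = (uint8_t)(value >> 24);
    dst[1] = (uint8_t)(value >> 16);
    dst[2] = (uint8_t)(value >> 8);
    dst[3] = (uint8_t)value;
}

/* Load a 32-bit value stored in big-endian byte order. */
static uint32_t get_u32(const uint8_t *src)
{
    return ((uint32_t)src[0] << 24) | ((uint32_t)src[1] << 16) |
           ((uint32_t)src[2] << 8) | (uint32_t)src[3];
}

/* Serializer: packs the state into the buffer provided by the caller.
 * Returns the number of bytes written, or 0 if the buffer is too small. */
size_t app_state_serialize(const AppState *state, uint8_t *buf, size_t bufLen)
{
    assert(state != NULL && buf != NULL);
    if (bufLen < APP_STATE_PACKED_SIZE) {
        return 0;
    }
    put_u32(buf, state->sessionCount);
    put_u32(buf + 4, state->lastSequence);
    return APP_STATE_PACKED_SIZE;
}

/* Deserializer: unpacks a previously stored buffer into the state.
 * Returns 0 on success, or -1 if the buffer is too short. */
int app_state_deserialize(const uint8_t *buf, size_t bufLen, AppState *state)
{
    assert(buf != NULL && state != NULL);
    if (bufLen < APP_STATE_PACKED_SIZE) {
        return -1;
    }
    state->sessionCount = get_u32(buf);
    state->lastSequence = get_u32(buf + 4);
    return 0;
}
```

In a real application, thin wrappers with the prototypes required by the SDK would call functions like these from the dataset or element callbacks.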
Persistence
CPS stores the Checkpointed data in the main memory using the database configured by the user in clDbalConfig.xml.

Server Based Checkpointing

Checkpoint Life cycle
To use the services of CPS, the application has to initialize the CPS using the clCkptInitialize() function. The clCkptInitialize() function returns a handle, ckptSvcHandle, that identifies this initialization in subsequent CPS operations. When this handle is no longer required, the association can be closed using the clCkptFinalize() function.
CPS provides a selection object based approach for handling pending callbacks. Applications can use the clCkptSelectionObjectGet() and clCkptDispatch() functions for this purpose.
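The general shape of a selection object based approach can be sketched with POSIX primitives. The code below does not use the CPS API: DemoSvc and the demo_* functions are hypothetical, and a pipe stands in for whatever mechanism the service uses internally. The idea is that the service exposes a file descriptor that becomes readable when callbacks are pending, and the application calls a dispatch function to run them in its own thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical callback type; not a CPS prototype. */
typedef void (*DemoCallback)(int eventId, void *cookie);

typedef struct {
    int pipeFds[2];        /* [0]: read end (selection object), [1]: write end */
    DemoCallback callback; /* invoked once per pending event during dispatch */
    void *cookie;          /* opaque pointer passed back to the callback */
} DemoSvc;

/* Initialize the demo service. Returns 0 on success, -1 on error. */
int demo_init(DemoSvc *svc, DemoCallback callback, void *cookie)
{
    assert(svc != NULL);
    svc->callback = callback;
    svc->cookie = cookie;
    return pipe(svc->pipeFds);
}

/* Return the descriptor the application can pass to poll() or select(). */
int demo_selection_object_get(const DemoSvc *svc)
{
    return svc->pipeFds[0];
}

/* Service side: queue one pending callback. Returns 0 on success. */
int demo_post_event(DemoSvc *svc, int eventId)
{
    ssize_t n = write(svc->pipeFds[1], &eventId, sizeof eventId);
    return n == (ssize_t)sizeof eventId ? 0 : -1;
}

/* Application side: run every callback that is pending right now,
 * without blocking. Returns the number of callbacks invoked. */
int demo_dispatch_all(DemoSvc *svc)
{
    int dispatched = 0;
    struct pollfd pfd = { .fd = svc->pipeFds[0], .events = POLLIN, .revents = 0 };

    while (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN)) {
        int eventId;
        if (read(svc->pipeFds[0], &eventId, sizeof eventId) != (ssize_t)sizeof eventId) {
            break;
        }
        if (svc->callback != NULL) {
            svc->callback(eventId, svc->cookie);
        }
        dispatched++;
    }
    return dispatched;
}

/* Release the descriptors held by the demo service. */
void demo_finalize(DemoSvc *svc)
{
    close(svc->pipeFds[0]);
    close(svc->pipeFds[1]);
}

/* Example callback: adds the event id to the int that cookie points to. */
void demo_sum_events(int eventId, void *cookie)
{
    *(int *)cookie += eventId;
}
```

An application would typically block in poll() on the descriptor returned by demo_selection_object_get(), together with its other descriptors, and call the dispatch function when the descriptor becomes readable.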
Checkpoint Management
After the CPS is initialized, the application can open a Checkpoint in create/read/write mode using the clCkptCheckpointOpen() or clCkptCheckpointOpenAsync() functions. A handle, CheckpointHandle, that identifies this Checkpoint is returned and the other Checkpoint management and data access functions can use this handle for accessing the Checkpoint.
Applications that have opened the Checkpoint in read/write mode can close it using the clCkptCheckpointClose() function. If the Checkpoint is no longer required but the clCkptCheckpointDelete() function has not been called, the retention timer (specified when the Checkpoint was opened) is started. When the timer expires, the Checkpoint is deleted. This avoids the accumulation of unused Checkpoints in the system. The application can update the retention time of the Checkpoint using the clCkptCheckpointRetentionDurationSet() function.
Checkpoints can be deleted using the clCkptCheckpointDelete() function. However, the Checkpointed data is retained for as long as it is in use. For an asynchronous collocated Checkpoint, the user application decides where the active replica resides. The application can set a particular replica as the active replica using the clCkptActiveReplicaSet() function. The user application can also query the status (various attributes) of the Checkpoint using the clCkptCheckpointStatusGet() function.
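The retention behaviour can be modelled with a small state machine. The sketch below is a simplified model, not CPS code: the DemoCkpt type and the demo_ckpt_* functions are hypothetical, and it assumes that "not required" means the last user has closed the Checkpoint and that reopening the Checkpoint cancels the timer. Neither assumption is stated on this page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one checkpoint; not an OpenClovis type. */
typedef struct {
    unsigned openCount;   /* users that currently have the checkpoint open */
    int64_t retentionMs;  /* retention duration in milliseconds */
    int64_t lastCloseMs;  /* time at which the last user closed it */
    bool timerRunning;    /* true while the retention timer is active */
} DemoCkpt;

/* A user opens the checkpoint; any running retention timer is cancelled
 * (assumption of this sketch). */
void demo_ckpt_open(DemoCkpt *ckpt)
{
    ckpt->openCount++;
    ckpt->timerRunning = false;
}

/* A user closes the checkpoint; the timer starts when no users remain. */
void demo_ckpt_close(DemoCkpt *ckpt, int64_t nowMs)
{
    assert(ckpt->openCount > 0);
    ckpt->openCount--;
    if (ckpt->openCount == 0) {
        ckpt->lastCloseMs = nowMs;
        ckpt->timerRunning = true;
    }
}

/* Update the retention duration, as a retention-duration-set call would. */
void demo_ckpt_retention_set(DemoCkpt *ckpt, int64_t retentionMs)
{
    ckpt->retentionMs = retentionMs;
}

/* True when the retention timer has expired and the checkpoint
 * may be deleted automatically. */
bool demo_ckpt_expired(const DemoCkpt *ckpt, int64_t nowMs)
{
    return ckpt->timerRunning && (nowMs - ckpt->lastCloseMs) >= ckpt->retentionMs;
}
```

With a retention duration of 1000 ms, a Checkpoint closed at time 5000 becomes eligible for deletion at time 6000, unless it is reopened or the duration is extended first.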
Section management
A Checkpoint can have one or more sections. By default, CPS creates a section (called the default section) for every Checkpoint. The information to be Checkpointed is stored in these sections. Each section has an associated section identifier that identifies it. The user application can create a section in a Checkpoint using the clCkptSectionCreate() function and delete a section using the clCkptSectionDelete() function.
A section created by the user application is deleted automatically after its expiration time (specified in clCkptSectionCreate()) has passed. This expiration time can be modified using the clCkptSectionExpirationTimeSet() function. The user can also iterate through the sections of a Checkpoint to find sections with an infinite expiration time, sections with an expiration time greater than a particular value, or corrupted sections. The clCkptSectionIterationInitialize(), clCkptSectionIterationNext(), and clCkptSectionIterationFinalize() functions provide this facility.
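The initialize/next/finalize iteration pattern with a selection criterion can be sketched as follows. All names here (DemoSection, DemoSectionFilter, demo_iter_init(), demo_iter_next()) are hypothetical, the sections live in a plain array, and the actual CPS iteration functions and their arguments should be taken from the SDK reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Marker for "never expires" in this sketch. */
#define DEMO_TIME_FOREVER INT64_MAX

/* Hypothetical selection criteria mirroring the three searches above. */
typedef enum {
    DEMO_SECTIONS_FOREVER,       /* infinite expiration time */
    DEMO_SECTIONS_EXPIRES_AFTER, /* expiration time greater than a threshold */
    DEMO_SECTIONS_CORRUPTED      /* sections flagged as corrupted */
} DemoSectionFilter;

typedef struct {
    const char *id;       /* section identifier */
    int64_t expirationMs; /* absolute expiration time, or DEMO_TIME_FOREVER */
    bool corrupted;
} DemoSection;

typedef struct {
    const DemoSection *sections;
    size_t count;
    size_t next;              /* index of the next section to examine */
    DemoSectionFilter filter;
    int64_t thresholdMs;      /* used only by DEMO_SECTIONS_EXPIRES_AFTER */
} DemoSectionIter;

/* Start an iteration over the given sections with the given criterion. */
void demo_iter_init(DemoSectionIter *it, const DemoSection *sections, size_t count,
                    DemoSectionFilter filter, int64_t thresholdMs)
{
    assert(it != NULL && (sections != NULL || count == 0));
    it->sections = sections;
    it->count = count;
    it->next = 0;
    it->filter = filter;
    it->thresholdMs = thresholdMs;
}

/* Check one section against the criterion of the iteration. */
static bool demo_iter_matches(const DemoSectionIter *it, const DemoSection *s)
{
    switch (it->filter) {
    case DEMO_SECTIONS_FOREVER:
        return s->expirationMs == DEMO_TIME_FOREVER;
    case DEMO_SECTIONS_EXPIRES_AFTER:
        return s->expirationMs > it->thresholdMs;
    case DEMO_SECTIONS_CORRUPTED:
        return s->corrupted;
    }
    return false;
}

/* Return the next matching section, or NULL when the iteration is done. */
const DemoSection *demo_iter_next(DemoSectionIter *it)
{
    while (it->next < it->count) {
        const DemoSection *s = &it->sections[it->next++];
        if (demo_iter_matches(it, s)) {
            return s;
        }
    }
    return NULL;
}
```

The sketch holds no resources, so it needs no finalize step; a real iteration handle would be released with the finalize function once the loop ends.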
Data Access
CPS uses ioVectors to read and write data to a Checkpoint. CPS provides two functions to write data: clCkptCheckpointWrite() for writing to multiple sections in a Checkpoint and clCkptSectionOverwrite() for updating a particular section of a Checkpoint.
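To illustrate what writing to multiple sections through an I/O vector means, here is a minimal in-memory sketch. DemoIoVector, DemoStoreSection, and demo_checkpoint_write() are hypothetical; the layout of the real CPS I/O vector element and the behaviour of clCkptCheckpointWrite() on partial failure are not described on this page and are not modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_SECTION_CAPACITY ((size_t)64)

/* Hypothetical in-memory section; not an OpenClovis type. */
typedef struct {
    char id[16];                         /* section identifier */
    uint8_t data[DEMO_SECTION_CAPACITY]; /* section contents */
    size_t size;                         /* number of valid bytes in data */
} DemoStoreSection;

/* Hypothetical I/O vector element: one write to one section. */
typedef struct {
    const char *sectionId; /* which section to write */
    size_t offset;         /* byte offset within the section */
    const void *data;      /* bytes to write */
    size_t dataSize;       /* number of bytes to write */
} DemoIoVector;

/* Find a section by identifier, or return NULL if it does not exist. */
static DemoStoreSection *demo_find_section(DemoStoreSection *sections, size_t count,
                                           const char *id)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(sections[i].id, id) == 0) {
            return &sections[i];
        }
    }
    return NULL;
}

/* Apply every element of the vector in order. Returns 0 on success.
 * On failure returns -1 and stores the index of the failing element in
 * *errIndex; elements before it have already been written. */
int demo_checkpoint_write(DemoStoreSection *sections, size_t sectionCount,
                          const DemoIoVector *iov, size_t iovCount, size_t *errIndex)
{
    assert(sections != NULL && iov != NULL && errIndex != NULL);
    for (size_t i = 0; i < iovCount; i++) {
        DemoStoreSection *s = demo_find_section(sections, sectionCount, iov[i].sectionId);
        /* Reject unknown sections and writes that would overflow the section. */
        if (s == NULL || iov[i].offset > DEMO_SECTION_CAPACITY ||
            iov[i].dataSize > DEMO_SECTION_CAPACITY - iov[i].offset) {
            *errIndex = i;
            return -1;
        }
        memcpy(s->data + iov[i].offset, iov[i].data, iov[i].dataSize);
        if (iov[i].offset + iov[i].dataSize > s->size) {
            s->size = iov[i].offset + iov[i].dataSize;
        }
    }
    return 0;
}
```

A single call can therefore update several sections, whereas an overwrite style call replaces the contents of exactly one section.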
The data can be read from a Checkpoint using the clCkptCheckpointRead() function. CPS does not ensure consistency of data across replicas of asynchronous Checkpoints. The Checkpoint data at replicas can be synchronized using the clCkptCheckpointSynchronize() or clCkptCheckpointSynchronizeAsync() functions.
CPS also provides notification of immediate consumption to the users. To receive this notification, users can register their callback functions using the clCkptImmediateConsumptionRegister() function. The callback is invoked when a change in Checkpointed data occurs.
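The register-then-notify flow of immediate consumption can be sketched as a plain observer pattern. The names below (DemoNotifyCkpt, demo_immediate_consumption_register(), demo_notify_write()) are hypothetical and single-process; the real clCkptImmediateConsumptionRegister() prototype and the way CPS delivers the callback across nodes are not covered by this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CKPT_CAPACITY ((size_t)32)

/* Hypothetical notification callback; not a CPS prototype. */
typedef void (*DemoUpdateCallback)(const char *ckptName, const void *data,
                                   size_t size, void *cookie);

typedef struct {
    const char *name;
    DemoUpdateCallback onUpdate;      /* NULL until a consumer registers */
    void *cookie;                     /* passed back to the consumer */
    uint8_t data[DEMO_CKPT_CAPACITY]; /* last written contents */
    size_t size;
} DemoNotifyCkpt;

/* Register interest in updates to this checkpoint. */
void demo_immediate_consumption_register(DemoNotifyCkpt *ckpt,
                                         DemoUpdateCallback onUpdate, void *cookie)
{
    assert(ckpt != NULL);
    ckpt->onUpdate = onUpdate;
    ckpt->cookie = cookie;
}

/* Store new data, then notify the registered consumer, if any.
 * Returns 0 on success, or -1 if the data does not fit. */
int demo_notify_write(DemoNotifyCkpt *ckpt, const void *data, size_t size)
{
    assert(ckpt != NULL && data != NULL);
    if (size > DEMO_CKPT_CAPACITY) {
        return -1;
    }
    memcpy(ckpt->data, data, size);
    ckpt->size = size;
    if (ckpt->onUpdate != NULL) {
        ckpt->onUpdate(ckpt->name, ckpt->data, ckpt->size, ckpt->cookie);
    }
    return 0;
}

/* Example consumer: records the size of the latest update in *cookie. */
void demo_track_latest_size(const char *ckptName, const void *data,
                            size_t size, void *cookie)
{
    (void)ckptName;
    (void)data;
    *(size_t *)cookie = size;
}
```

A standby component could use such a callback to apply each update as it arrives instead of reading the whole Checkpoint when it is activated.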
Replica management
CPS ensures that at least one more replica of a Checkpoint exists in the system, provided resources are available. If only the System Controller is present, the Checkpoint is not replicated anywhere; in this scenario, the Checkpoint service does not support restart. CPS marks the local replica (where the Checkpoint is opened in create mode) as the active replica for all Checkpoints other than asynchronous collocated Checkpoints.

Generated on Tue Jan 10 10:29:15 PST 2012 for OpenClovis SDK using Doxygen