Revision as of 05:02, 1 October 2010
csa103 Checkpointing
Objective
csa103 demonstrates how to use the Clovis Checkpoint service to provide basic failure recovery. csa103 builds on csa102.
What Will You Learn
- how to initialize the checkpoint client library,
- how to create a checkpoint and open a checkpoint,
- how to create a section in the checkpoint
- how to save data to the checkpoint section and read data from the checkpoint section.
The Code
The code can be found within the following directory
<project-area_dir>/eval/src/app/csa103Comp
clCompAppMain.c
|
ClCompAppInitialize()
...
if ((rc = checkpoint_initialize()) != CL_OK)
{
clprintf(CL_LOG_SEV_ERROR,"%s: Failed [0x%x] to initialize checkpoint",
appname, rc);
return rc;
}
if ((rc = checkpoint_read_seq(&seq)) != CL_OK)
{
clprintf(CL_LOG_SEV_ERROR,"%s: Failed [0x%x] to read checkpoint",
appname, rc);
checkpoint_finalize();
return rc;
}
#ifdef CL_INST
if ((rc = clDataTapInit(DATA_TAP_DEFAULT_FLAGS, 103)) != CL_OK)
{
clprintf(CL_LOG_SEV_ERROR,"%s: Failed [0x%x] to initialize data tap",
appname, rc);
}
#endif
|
The code above in clCompAppMain.c is new to csa103. The call to checkpoint_initialize is straightforward; the function is discussed in detail later in this section. The call to checkpoint_read_seq reads the current sequence value from the checkpoint (at this point it should be zero) and loads it into the variable seq. Note the call to checkpoint_finalize if checkpoint_read_seq does not return CL_OK: it closes the checkpoint and cleanly finalizes the checkpoint library. We'll look more closely at checkpoint_read_seq and checkpoint_finalize later.
clCompAppMain.c
|
/* Checkpoint new sequence number */
rc = checkpoint_write_seq(seq);
if (rc != CL_OK)
{
clprintf(CL_LOG_SEV_ERROR,"%s: ERROR: Checkpoint write failed. Exiting.", appname);
break;
}
|
The rest of csa103's main loop is the same as the main loop in csa102, but the seven lines above are new. Here we write the new value of the sequence number to the checkpoint section; if this fails, we print an error and leave the loop. We'll look more closely at checkpoint_write_seq later.
clCompAppMain.c
|
ClRcT
clCompAppAMFCSISet(
ClInvocationT invocation,
const ClNameT *compName,
ClAmsHAStateT haState,
ClAmsCSIDescriptorT csiDescriptor)
{
/*
* ---BEGIN_APPLICATION_CODE---
*/
ClCharT compname[100]={0};
/*
* ---END_APPLICATION_CODE---
*/
/*
* Print information about the CSI Set
*/
strncpy(compname, compName->value, compName->length);
clprintf (CL_LOG_SEV_INFO, "Component [%s] : PID [%d]. CSI Set Received",
compname, (int)mypid);
clCompAppAMFPrintCSI(csiDescriptor, haState);
/*
Take appropriate action based on state
*/
switch ( haState )
{
case CL_AMS_HA_STATE_ACTIVE:
{
/*
AMF has requested application to take the active HA state
for the CSI.
*/
/*
---BEGIN_APPLICATION_CODE---
*/
|
clCompAppMain.c
|
clprintf(CL_LOG_SEV_INFO,"%s: Active state requested from state %d",
appname, ha_state);
checkpoint_replica_activate();
if (ha_state == SA_AMF_HA_STANDBY)
{
/* Read checkpoint, make our replica the active replica */
clprintf(CL_LOG_SEV_INFO,"%s reading checkpoint", appname);
checkpoint_read_seq(&seq);
clprintf(CL_LOG_SEV_INFO,"%s read checkpoint: seq = %u", appname, seq);
}
ha_state = SA_AMF_HA_ACTIVE;
/*
* ---END_APPLICATION_CODE---
*/
clCpmResponse(cpmHandle, invocation, CL_OK);
break;
}
case CL_AMS_HA_STATE_STANDBY:
{
/*
* AMF has requested application to take the standby HA state
* for this CSI.
*/
/*
* ---BEGIN_APPLICATION_CODE---
*/
clprintf(CL_LOG_SEV_INFO," Standby state requested from state %d",ha_state);
ha_state = SA_AMF_HA_STANDBY;
/*
* ---END_APPLICATION_CODE---
*/
clCpmResponse(cpmHandle, invocation, CL_OK);
break;
}
|
clCompAppMain.c
|
case CL_AMS_HA_STATE_QUIESCED:
{
/*
* AMF has requested application to quiesce the CSI currently
* assigned the active or quiescing HA state. The application
* must stop work associated with the CSI immediately.
*/
/*
* ---BEGIN_APPLICATION_CODE---
*/
clprintf(CL_LOG_SEV_INFO,"%s: QUIESCED", appname);
ha_state = haState;
/*
* ---END_APPLICATION_CODE---
*/
clCpmResponse(cpmHandle, invocation, CL_OK);
break;
}
case CL_AMS_HA_STATE_QUIESCING:
{
/*
* AMF has requested application to quiesce the CSI currently
* assigned the active HA state. The application must stop work
* associated with the CSI gracefully and not accept any new
* workloads while the work is being terminated.
*/
/*
* ---BEGIN_APPLICATION_CODE---
*/
clprintf(CL_LOG_SEV_INFO,"%s: QUIESCING", appname);
ha_state = haState;
/*
* ---END_APPLICATION_CODE---
*/
clCpmCSIQuiescingComplete(cpmHandle, invocation, CL_OK);
break;
}
|
clCompAppMain.c
|
default:
{
break;
}
}
return CL_OK;
}
|
Looking at csa103's implementation of clCompAppAMFCSISet, we see that it handles the QUIESCING, QUIESCED, ACTIVE, and STANDBY cases. In each case, the global variable ha_state is set to the new state that is passed in. This is similar to csa102, except that in the CL_AMS_HA_STATE_ACTIVE case we first call checkpoint_replica_activate and, if our previous state was standby, read the sequence number from the checkpoint so that the main loop will pick it up.
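The state handling described above can be sketched in isolation. This is a minimal sketch, not the tutorial's code: the enum, `fake_checkpoint`, `fake_checkpoint_read`, and `on_csi_set` are hypothetical stand-ins for the AMF states, the checkpoint section, checkpoint_read_seq, and the CSI set callback.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the AMF HA states used in this tutorial. */
typedef enum { HA_NONE = 0, HA_ACTIVE, HA_STANDBY, HA_QUIESCED, HA_QUIESCING } ha_state_t;

static ha_state_t ha_state = HA_NONE;   /* last state assigned to us */
static uint32_t   seq = 0;              /* counter used by the main loop */
static uint32_t   fake_checkpoint = 0;  /* stands in for the checkpoint section */

/* Stand-in for checkpoint_read_seq(): copies the stored value out. */
static void fake_checkpoint_read(uint32_t *out) { *out = fake_checkpoint; }

/* Record the new state; reload seq only on a standby -> active promotion. */
static void on_csi_set(ha_state_t new_state)
{
    if (new_state == HA_ACTIVE && ha_state == HA_STANDBY)
        fake_checkpoint_read(&seq);
    ha_state = new_state;
}
```

A component that goes straight to active keeps its own counter; only a standby that is promoted picks up the value left behind by the previous active component.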
clCompAppMain.c
|
static ClRcT
checkpoint_initialize()
{
SaAisErrorT rc = CL_OK;
SaVersionT ckpt_version = {'B', 1, 1};
SaNameT ckpt_name = { strlen(CKPT_NAME), CKPT_NAME };
ClUint32T seq_no;
SaCkptCheckpointCreationAttributesT create_atts = {
.creationFlags = SA_CKPT_WR_ACTIVE_REPLICA_WEAK |
SA_CKPT_CHECKPOINT_COLLOCATED,
.checkpointSize = sizeof(ClUint32T),
.retentionDuration = (ClTimeT)10,
.maxSections = 2, // two sections
.maxSectionSize = sizeof(ClUint32T),
.maxSectionIdSize = (ClSizeT)64
};
SaCkptSectionCreationAttributesT section_atts = {
.sectionId = &ckpt_sid,
.expirationTime = SA_TIME_END
};
clprintf(CL_LOG_SEV_INFO,"%s: checkpoint_initialize", appname);
/* Initialize checkpointing service instance */
rc = saCkptInitialize(&ckpt_svc_handle, /* Checkpoint service handle */
NULL, /* Optional callbacks table */
&ckpt_version); /* Required version number */
if (rc != SA_AIS_OK)
{
clprintf(CL_LOG_SEV_ERROR,"%s: ERROR: Failed to initialize checkpoint service",
appname);
return rc;
}
clprintf(CL_LOG_SEV_INFO,"%s: Checkpoint service initialized (handle=0x%llx)",
appname, ckpt_svc_handle);
|
Here is checkpoint_initialize. We first initialize the SaNameT structure that holds the name of the checkpoint to be opened or created, with the line SaNameT ckpt_name = { strlen(CKPT_NAME), CKPT_NAME };. We then define the set of attributes to associate with the checkpoint when it is created. The creationFlags field includes SA_CKPT_WR_ACTIVE_REPLICA_WEAK, which means that updates are asynchronous: a write returns to the application as soon as the active replica has been updated, and all other replicas are updated in parallel. SA_CKPT_CHECKPOINT_COLLOCATED is also set, which is why the code later calls saCkptActiveReplicaSet to make the local replica the active one. checkpointSize is set to the size of a 32-bit unsigned integer, which is the sequence number we print in the main loop. retentionDuration is the time that the checkpoint service keeps the checkpoint after the last client has closed it. maxSections is 2, as this checkpoint is sized for two sections. maxSectionSize is set to the size of a 32-bit unsigned integer, as the application stores only an unsigned integer. And maxSectionIdSize is 64 bytes.
We next use SaCkptSectionCreationAttributesT section_atts to define the creation attributes for the section that this application creates. The sectionId is just the name of the section along with the number of bytes in the name. The expirationTime is set to SA_TIME_END so that the section is never deleted automatically while the checkpoint itself remains.
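The definition of ckpt_sid is not shown in this section. As a rough illustration of "a name along with the number of bytes in the name", here is a minimal sketch that uses a hypothetical `section_id_t` type and `make_section_id` helper; neither is part of the checkpoint API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of a section identifier: a length plus a byte pointer. */
typedef struct {
    uint16_t       id_len;  /* number of bytes in the name */
    const uint8_t *id;      /* the name itself, not necessarily NUL-terminated */
} section_id_t;

/* Build an identifier from a C string; the terminating NUL is not counted. */
static section_id_t make_section_id(const char *name)
{
    section_id_t sid;
    sid.id_len = (uint16_t)strlen(name);
    sid.id     = (const uint8_t *)name;
    return sid;
}
```

With maxSectionIdSize set to 64, any name passed to such a helper would need to be at most 64 bytes long.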
saCkptInitialize must be called before any other checkpoint function. The handle is returned in ckpt_svc_handle and must be passed to later checkpoint API calls, namely saCkptCheckpointOpen and saCkptFinalize. The callbacks table is passed as NULL, which means that our application doesn't provide any callbacks. Finally, ckpt_version identifies the version of the API that this application is written to.
clCompAppMain.c
|
//
// Create the checkpoint for read and write.
rc = saCkptCheckpointOpen(ckpt_svc_handle, // Service handle
&ckpt_name, // Checkpoint name
&create_atts, // Optional creation attr.
(SA_CKPT_CHECKPOINT_READ |
SA_CKPT_CHECKPOINT_WRITE |
SA_CKPT_CHECKPOINT_CREATE),
(SaTimeT)-1, // No timeout
&ckpt_handle); // Checkpoint handle
if (rc != SA_AIS_OK)
{
clprintf(CL_LOG_SEV_ERROR,"%s: ERROR: Failed [0x%x] to open checkpoint",
appname, rc);
(void)saCkptFinalize(ckpt_svc_handle);
return rc;
}
clprintf(CL_LOG_SEV_INFO,"%s: Checkpoint opened (handle=0x%llx)", appname, ckpt_handle);
|
clCompAppMain.c
|
/*
* Try to create a section so that updates can operate by overwriting
* the section over and over again.
* If subsequent processes come through here, they will fail to create
* the section. That is OK, even though it will cause an error message
* If the section create fails because the section is already there, then
* read the sequence number
*/
// Put data in network byte order
seq_no = htonl(seq);
// Creating the section
checkpoint_replica_activate();
rc = saCkptSectionCreate(ckpt_handle, // Checkpoint handle
&section_atts, // Section attributes
(SaUint8T*)&seq_no, // Initial data
(SaSizeT)sizeof(seq_no)); // Size of data
if (rc != SA_AIS_OK && (CL_GET_ERROR_CODE(rc) != SA_AIS_ERR_EXIST))
{
clprintf(CL_LOG_SEV_ERROR,"%s: ERROR: Failed to create checkpoint section", appname);
(void)saCkptCheckpointClose(ckpt_handle);
(void)saCkptFinalize(ckpt_svc_handle);
return rc;
}
else if (rc != SA_AIS_OK && (CL_GET_ERROR_CODE(rc) == SA_AIS_ERR_EXIST))
{
rc = checkpoint_read_seq(&seq);
if (rc != CL_OK)
{
clprintf(CL_LOG_SEV_ERROR,"%s: ERROR: Failed [0x%x] to read checkpoint section",
appname, rc);
(void)saCkptCheckpointClose(ckpt_handle);
(void)saCkptFinalize(ckpt_svc_handle);
return rc;
}
}
else
{
clprintf(CL_LOG_SEV_INFO,"%s: Section created", appname);
}
return CL_OK;
}
|
With the call to saCkptCheckpointOpen we attempt to open the specified checkpoint. When checkpoint_initialize is called, the checkpoint may or may not exist, so we pass SA_CKPT_CHECKPOINT_CREATE along with the READ and WRITE flags: the checkpoint is created if necessary and opened for read/write access.
Next we make our replica the active one and try to create the section with a call to saCkptSectionCreate. We pass the checkpoint handle obtained from saCkptCheckpointOpen, the section attributes declared earlier, and the address of the initial value along with its size in bytes.
We convert the initial value to network byte order with seq_no = htonl(seq) before passing it to saCkptSectionCreate.
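The byte-order conversion can be tried on its own. The sketch below uses the POSIX htonl and ntohl functions; the helper names `encode_seq` and `decode_seq` are hypothetical and only illustrate the round trip that checkpoint_write_seq and checkpoint_read_seq perform.

```c
#include <arpa/inet.h>  /* htonl, ntohl (POSIX) */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Encode a sequence number into a 4-byte buffer in network byte order. */
static void encode_seq(uint32_t seq, uint8_t out[4])
{
    uint32_t wire = htonl(seq);
    memcpy(out, &wire, sizeof wire);
}

/* Decode a 4-byte network-order buffer back into host order. */
static uint32_t decode_seq(const uint8_t in[4])
{
    uint32_t wire;
    memcpy(&wire, in, sizeof wire);
    return ntohl(wire);
}
```

Because the stored bytes are always most significant first, a replica on a node with a different endianness decodes the same value.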
If the section already exists (SA_AIS_ERR_EXIST), we do not create it; instead we read the current value in the checkpoint into our global seq variable.
Note that whenever this function returns an error code, either the checkpoint service was never initialized or the checkpoint was never opened, or we have already closed the checkpoint and finalized the checkpoint service with calls to saCkptCheckpointClose and saCkptFinalize.
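The cleanup rule above (release whatever has been acquired, in reverse order, before returning an error) can be sketched with stubs. All function names and the `fail_*` switches below are hypothetical; they stand in for the initialize, open, section-create, close, and finalize calls, and the goto-based unwinding is one common C idiom rather than the structure used in checkpoint_initialize.

```c
#include <assert.h>

/* Hypothetical stubs: each returns 0 on success; counters record cleanup. */
static int fail_open = 0, fail_create = 0;
static int closed = 0, finalized = 0;

static int  svc_init(void)       { return 0; }
static int  ckpt_open(void)      { return fail_open ? -1 : 0; }
static int  section_create(void) { return fail_create ? -1 : 0; }
static void ckpt_close(void)     { closed++; }
static void svc_finalize(void)   { finalized++; }

/* Acquire resources in order; on failure undo only what was acquired. */
static int init_with_cleanup(void)
{
    int rc;

    if ((rc = svc_init()) != 0)
        return rc;               /* nothing acquired yet */
    if ((rc = ckpt_open()) != 0)
        goto undo_init;          /* only the service needs finalizing */
    if ((rc = section_create()) != 0)
        goto undo_open;          /* close the checkpoint, then finalize */
    return 0;

undo_open:
    ckpt_close();
undo_init:
    svc_finalize();
    return rc;
}
```

A caller that sees a non-zero return can therefore assume nothing is left open and does not need to call a finalize routine itself.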
clCompAppMain.c
|
static ClRcT
checkpoint_finalize(void)
{
SaAisErrorT rc;
rc = saCkptCheckpointClose(ckpt_handle);
if (rc != SA_AIS_OK)
{
clprintf(CL_LOG_SEV_ERROR,"%s: failed: [0x%x] to close checkpoint handle 0x%llx",
appname, rc, ckpt_handle);
}
rc = saCkptFinalize(ckpt_svc_handle);
if (rc != SA_AIS_OK)
{
clprintf(CL_LOG_SEV_ERROR,"%s: failed: [0x%x] to finalize checkpoint",
appname, rc);
}
return CL_OK;
}
|
The code snippet above presents checkpoint_finalize. It's quite simple. First it closes the checkpoint handle with a call to saCkptCheckpointClose. Next it finalizes the checkpoint library with a call to saCkptFinalize. Failures are logged, but the function always returns CL_OK.
clCompAppMain.c
|
static ClRcT
checkpoint_write_seq(ClUint32T seq)
{
SaAisErrorT rc = SA_AIS_OK;
ClUint32T seq_no;
/* Putting data in network byte order */
seq_no = htonl(seq);
/* Write checkpoint */
retry:
rc = saCkptSectionOverwrite(ckpt_handle,
&ckpt_sid,
&seq_no,
sizeof(ClUint32T));
if (rc != SA_AIS_OK)
{
clprintf(CL_LOG_SEV_ERROR,"Failed [0x%x] to write to section", rc);
if(rc == SA_AIS_ERR_NOT_EXIST)
rc = checkpoint_replica_activate();
if(rc == CL_OK) goto retry;
}
else
{
/*
* Synchronize the checkpoint to all the replicas.
*/
rc = saCkptCheckpointSynchronize(ckpt_handle, SA_TIME_END );
if (rc != SA_AIS_OK)
{
clprintf(CL_LOG_SEV_ERROR,"Failed [0x%x] to synchronize the checkpoint", rc);
}
}
return CL_OK;
}
|
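In checkpoint_write_seq we first convert the sequence number passed in to network byte order. Then we pass it to saCkptSectionOverwrite, which writes it to the checkpoint section created in checkpoint_initialize. If the write fails with SA_AIS_ERR_NOT_EXIST, we activate our replica and retry; on success we call saCkptCheckpointSynchronize to propagate the update to all replicas.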
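The activate-and-retry idea can be sketched with a bounded loop in place of a goto. This is an alternative sketch under stated assumptions, not the tutorial's implementation: the return codes, `fake_activate`, `fake_overwrite`, and `write_with_retry` are all hypothetical, and the retry limit is an addition that the original code does not have.

```c
#include <assert.h>
#include <stdint.h>

enum { RC_OK = 0, RC_NOT_EXIST = 1, RC_OTHER = 2 };  /* hypothetical codes */

static int      replica_active = 0;   /* is our replica the active one? */
static uint32_t stored_seq = 0;       /* stands in for the section data */

/* Stand-in for checkpoint_replica_activate(). */
static int fake_activate(void) { replica_active = 1; return RC_OK; }

/* Stand-in for the section overwrite: fails until a replica is active. */
static int fake_overwrite(uint32_t value)
{
    if (!replica_active)
        return RC_NOT_EXIST;
    stored_seq = value;
    return RC_OK;
}

/* Write value, activating the replica and retrying at most max_retries times. */
static int write_with_retry(uint32_t value, int max_retries)
{
    int rc = fake_overwrite(value);
    while (rc == RC_NOT_EXIST && max_retries-- > 0) {
        if (fake_activate() != RC_OK)
            break;
        rc = fake_overwrite(value);
    }
    return rc;
}
```

Returning the final status, rather than a fixed success code, lets the caller decide whether to leave its main loop when the write cannot be completed.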
clCompAppMain.c
|
static ClRcT
checkpoint_read_seq(ClUint32T *seq)
{
ClRcT rc = CL_OK;
ClUint32T err_idx; /* Error index in ioVector */
ClUint32T seq_no = 0xffffffff;
SaCkptIOVectorElementT iov = {
.sectionId = ckpt_sid,
.dataBuffer = (ClPtrT)&seq_no,
.dataSize = sizeof(ClUint32T),
.dataOffset = (ClOffsetT)0,
.readSize = sizeof(ClUint32T)
};
rc = saCkptCheckpointRead(ckpt_handle, &iov, 1, &err_idx);
if (rc != SA_AIS_OK)
{
clprintf(CL_LOG_SEV_ERROR,"Error: [0x%x] from checkpoint read, err_idx = %u",
rc, err_idx);
}
/* FIXME: How to process this err_idx? */
*seq = ntohl(seq_no);
return CL_OK;
}
|
In checkpoint_read_seq we initialize seq_no to all ones so that it is obvious if the saCkptCheckpointRead call fails and the value of seq_no doesn't get overwritten. Then we set up the iov variable. We set the sectionId field to ckpt_sid, which is the ID of the section created in checkpoint_initialize. dataBuffer is set to the address of the seq_no variable, which is where we want the checkpoint data loaded. dataSize is set to the size of the seq_no variable. dataOffset is set to zero since the sequence number is the only thing stored in the section, so it resides at the front of the section. readSize is set to the size of the sequence number variable.
We then pass the I/O vector to saCkptCheckpointRead along with ckpt_handle, the constant 1, and the address of the err_idx variable. The constant 1 is the number of elements in the array of SaCkptIOVectorElementT structs that we're passing; that is, we're passing the address of a single SaCkptIOVectorElementT struct. err_idx will hold the index in that array where saCkptCheckpointRead found an error. If there is an error, it will be at index 0, since we're passing only one entry in the I/O vector.
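The role of the element count and the error index is easier to see with more than one element. The sketch below is a simplified mock, not the checkpoint API: `iov_elem_t` omits the section ID and offset fields, and `fake_read` and `fake_section` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified I/O vector element: where to copy and how much. */
typedef struct {
    void   *data_buffer;  /* destination for the section data */
    size_t  data_size;    /* bytes requested */
    size_t  read_size;    /* bytes actually copied (filled in by the read) */
} iov_elem_t;

/* Stands in for the section contents: 456 in network byte order. */
static const uint8_t fake_section[4] = { 0x00, 0x00, 0x01, 0xC8 };

/* Fill each element in turn; on failure store the failing index and stop. */
static int fake_read(iov_elem_t *iov, uint32_t count, uint32_t *err_idx)
{
    for (uint32_t i = 0; i < count; i++) {
        if (iov[i].data_buffer == NULL || iov[i].data_size > sizeof fake_section) {
            *err_idx = i;
            return -1;
        }
        memcpy(iov[i].data_buffer, fake_section, iov[i].data_size);
        iov[i].read_size = iov[i].data_size;
    }
    return 0;
}
```

In this mock the error index is written only on failure, so a caller should look at it only after checking the return code.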
clCompAppMain.c
|
static ClRcT
checkpoint_replica_activate(void)
{
SaAisErrorT rc = SA_AIS_OK;
if ((rc = saCkptActiveReplicaSet(ckpt_handle)) != SA_AIS_OK)
{
clprintf(CL_LOG_SEV_ERROR,
"checkpoint_replica_activate failed [0x%x] in ActiveReplicaSet",
rc);
}
else rc = CL_OK;
return rc;
}
===csa203===
csa203 demonstrates the usage of SA Forum's Checkpointing service. This sample application does not deviate functionally from csa103. The code differences are due to using SA Forum data types (structures) and APIs, as presented in the following two tables. (Note that we have not repeated data types and APIs covered previously.)
{| cellspacing="0" border="1" align="center"
|+align="bottom"| '''SA Forum Data Types with the SAFplus Platform equivalent'''
|- style="color:black;background-color:#ffffaa;" align="center"
!SA Forum Data Types
!OpenClovis Data Types
|-
|SaCkptHandleT
|ClCkptSvcHdlT
|-
|SaCkptCheckpointHandleT
|ClCkptHdlT
|-
|SaCkptSectionIdT
|ClCkptSectionIdT
|-
|SaCkptCheckpointCreationAttributesT
|ClCkptCheckpointCreationAttributesT
|-
|SaCkptSectionCreationAttributesT
|ClCkptSectionCreationAttributesT
|-
|SaCkptIOVectorElementT
|ClCkptIOVectorElementT
|}
{| cellspacing="0" border="1" align="center"
|+align="bottom"| '''SA Forum APIs with the SAFplus Platform equivalent'''
|- style="color:black;background-color:#ffffaa;" align="center"
!SA Forum APIs
!OpenClovis APIs
|-
|saCkptInitialize
|clCkptInitialize
|-
|saCkptCheckpointOpen
|clCkptCheckpointOpen
|-
|saCkptSectionCreate
|clCkptSectionCreate
|-
|saCkptCheckpointClose
|clCkptCheckpointClose
|-
|saCkptFinalize
|clCkptFinalize
|-
|saCkptSectionOverwrite
|clCkptSectionOverwrite
|-
|saCkptCheckpointSynchronize
|clCkptCheckpointSynchronize
|-
|saCkptCheckpointRead
|clCkptCheckpointRead
|}
===How to Run csa103 and What to Observe===
This sample application can run on all Runtime Hardware Setups. The following list shows which node each <code>csa103compI*</code> instance runs on.
<ul>
<li>'''Runtime Hardware Setup 1.1 and 2.1'''
<br>csa103compI0 and csa103compI1 will run upon the single node.
<li>'''Runtime Hardware Setup 1.2 and 2.2'''
<br>csa103compI0 runs on PayloadNodeI0 and csa103compI1 runs on PayloadNodeI1
<li>'''Runtime Hardware Setup 1.3 and 2.3'''
<br>csa103compI2 and csa103compI3 run on SCNodeI0 and SCNodeI1 respectively. csa103compI0 and csa103compI1 run on PayloadNodeI0 and PayloadNodeI1 respectively.
<br><br>In the four-node setup, SCNodeI0 and SCNodeI1 output data via <code>/root/asp/var/log/csa103CompI2</code> and <code>/root/asp/var/log/csa103CompI3</code> respectively. Therefore, in this case follow the instructions below, replacing csa103compI0 and csa103compI1 with csa103compI2 and csa103compI3 respectively.
</ul>
We run csa103 very much the same way we run any of the rest of the sample applications: with the SAFplus Platform Console.
<ol>
<li>First, on the active System Controller, move csa103SGI0 to the LockAssignment state with the following commands. (To run csa203, unlock csa203SGI0 instead of csa103SGI0.)
<code><pre>
cli[Test]-> setc 1
cli[Test:SCNodeI0]-> setc cpm
cli[Test:SCNodeI0:CPM]-> amsLockAssignment sg csa103SGI0
</pre></code>
The following output is given when you run tail -f on the csa103 log files. For example:
# tail -f /root/asp/var/log/csa103CompI0.log
/root/asp/var/log/csa103CompI0Log.latest
|
Mon Jul 14 00:01:09 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00030 : INFO)
Component [csa103CompI0] : PID [15238]. Initializing
Mon Jul 14 00:01:09 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00031 : INFO)
IOC Address : 0x1
Mon Jul 14 00:01:09 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00032 : INFO)
IOC Port : 0x81
Mon Jul 14 00:01:09 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00033 : INFO)
csa103: Instantiated as component instance csa103CompI0.
Mon Jul 14 00:01:09 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00034 : INFO)
csa103CompI0: Waiting for CSI assignment...
Mon Jul 14 00:01:09 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00035 : INFO)
csa103CompI0: Waiting for CSI assignment...
Mon Jul 14 00:01:09 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00036 : INFO)
csa103CompI0: checkpoint_initialize
Mon Jul 14 00:01:10 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00043 : INFO)
csa103CompI0: Checkpoint service initialized (handle=0x1)
Mon Jul 14 00:01:10 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00045 : INFO)
csa103CompI0: Checkpoint opened (handle=0x2)
|
# tail -f /root/asp/var/log/csa103CompI1.log
/root/asp/var/log/csa103CompI1Log.latest
|
Mon Jul 14 00:01:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00030 : INFO)
Component [csa103CompI1] : PID [15234]. Initializing
Mon Jul 14 00:01:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00031 : INFO)
IOC Address : 0x1
Mon Jul 14 00:01:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00032 : INFO)
IOC Port : 0x80
Mon Jul 14 00:01:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00033 : INFO)
csa103: Instantiated as component instance csa103CompI1.
Mon Jul 14 00:01:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00034 : INFO)
csa103CompI1: Waiting for CSI assignment...
Mon Jul 14 00:01:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00035 : INFO)
csa103CompI1: Waiting for CSI assignment...
Mon Jul 14 00:01:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00036 : INFO)
csa103CompI1: checkpoint_initialize
Mon Jul 14 00:01:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00043 : INFO)
csa103CompI1: Checkpoint service initialized (handle=0x1)
Mon Jul 14 00:01:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00045 : INFO)
csa103CompI1: Checkpoint opened (handle=0x2)
Mon Jul 14 00:01:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00053 : INFO)
csa103CompI1: Section created
|
Then, unlock the application by running
cli[Test:SCNodeI0:CPM]-> amsUnlock sg csa103SGI0
In the application log files, you should see
/root/asp/var/log/csa103CompI0Log.latest
|
Mon Jul 14 00:09:08 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00058 : INFO)
Standby state requested from state 0
|
/root/asp/var/log/csa103CompI1Log.latest
|
Mon Jul 14 00:09:08 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00064 : INFO)
csa103CompI1: Active state requested from state 0
Mon Jul 14 00:09:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00065 : INFO)
csa103CompI1: Hello World! (seq=0)
Mon Jul 14 00:09:10 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00066 : INFO)
csa103CompI1: Hello World! (seq=1)
Mon Jul 14 00:09:11 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00067 : INFO)
csa103CompI1: Hello World! (seq=2)
Mon Jul 14 00:09:12 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00068 : INFO)
csa103CompI1: Hello World! (seq=3)
Mon Jul 14 00:09:13 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00069 : INFO)
csa103CompI1: Hello World! (seq=4)
|
Next, find the active csa103 process (the one that's printing the Hello World lines) and kill it. To find the process ID, issue the following command from a bash shell.
# ps -eaf | grep csa103
This should produce an output that looks similar to the following.
root 17830 15663 0 14:21 ? 00:00:00 csa103Comp -p
root 17839 15663 0 14:21 ? 00:00:00 csa103Comp -p
root 18558 16145 0 14:32 pts/4 00:00:00 grep csa103
Notice the two entries that end with csa103Comp -p. These are our two component processes. The first one is usually the active process. This is the one that we will kill. In this case the process ID is 17830. So to kill the active component you issue the command:
# kill -9 17830
If this step does not result in the active component being killed then it is likely that the standby component was killed. In this case simply try killing the other process.
After killing the active component you should see lines in the log files like the following:
/root/asp/var/log/csa103CompI1Log.latest
|
Mon Jul 14 00:16:43 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00515 : INFO)
csa103CompI1: Hello World! (seq=452)
Mon Jul 14 00:16:44 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00516 : INFO)
csa103CompI1: Hello World! (seq=453)
Mon Jul 14 00:16:45 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00517 : INFO)
csa103CompI1: Hello World! (seq=454)
Mon Jul 14 00:16:46 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00518 : INFO)
csa103CompI1: Hello World! (seq=455)
|
/var/log/csa103CompI0Log.latest
|
Mon Jul 14 00:16:48 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00065 : INFO)
csa103CompI0: Active state requested from state 2
Mon Jul 14 00:16:48 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00066 : INFO)
csa103CompI0 reading checkpoint
Mon Jul 14 00:16:48 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00067 : INFO)
csa103CompI0 read checkpoint: seq = 456
Mon Jul 14 00:16:49 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00068 : INFO)
csa103CompI0: Hello World! (seq=456)
Mon Jul 14 00:16:50 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00069 : INFO)
csa103CompI0: Hello World! (seq=457)
Mon Jul 14 00:16:51 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00070 : INFO)
csa103CompI0: Hello World! (seq=458)
|
Here we can see CompI1 printing 455 and then dying, whereupon CompI0 gets the notice to take over processing, reads the checkpoint, and then takes over with seq=456 and so on.
Then, in the CompI1 log file we see:
/root/asp/var/log/csa103CompI1Log.latest
|
Mon Jul 14 00:16:49 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00044 : INFO)
csa103: Instantiated as component instance csa103CompI1.
Mon Jul 14 00:16:49 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00045 : INFO)
csa103CompI1: Waiting for CSI assignment...
Mon Jul 14 00:16:49 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00046 : INFO)
csa103CompI1: Waiting for CSI assignment...
Mon Jul 14 00:16:49 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00047 : INFO)
csa103CompI1: checkpoint_initialize
Mon Jul 14 00:16:50 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00054 : INFO)
csa103CompI1: Checkpoint service initialized (handle=0x1)
Mon Jul 14 00:16:50 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00056 : INFO)
csa103CompI1: Checkpoint opened (handle=0x2)
|
That is the component that had been killed being restarted.
CompI0 is moving along just fine, and we see CompI1 come back up. If we then kill CompI0 we see:
/root/asp/var/log/csa103CompI1Log.latest
|
Mon Jul 14 00:29:44 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00065 : INFO)
csa103CompI1: Active state requested from state 2
Mon Jul 14 00:29:44 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00066 : INFO)
csa103CompI1 reading checkpoint
Mon Jul 14 00:29:44 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00067 : INFO)
csa103CompI1 read checkpoint: seq = 1225
Mon Jul 14 00:29:45 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00068 : INFO)
csa103CompI1: Hello World! (seq=1225)
Mon Jul 14 00:29:46 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00069 : INFO)
csa103CompI1: Hello World! (seq=1226)
|
Here we can see the CompI0 process die, and the CompI1 process read the sequence number from the checkpoint and then take over from where CompI0 left off.
To stop csa103 use the following SAFplus Platform Console command.
cli[Test:SCNodeI0:CPM]-> amsLockAssignment sg csa103SGI0
Now change the state of csa103SGI0 to LockInstantiation and close the SAFplus Platform Console.
cli[Test:SCNodeI0:CPM]-> amsLockInstantiation sg csa103SGI0
cli[Test:SCNodeI0:CPM] -> end
cli[Test:SCNodeI0] -> end
cli[Test] -> bye
</ol>
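The failover seen in the logs above can be modelled in a few lines. This is a minimal simulation, assuming (as the logs suggest) that the active component prints the current value, increments it, and then checkpoints the new value; `component_t`, `active_tick`, `promote`, and the `checkpoint` variable are hypothetical stand-ins, not part of the sample application.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t checkpoint = 0;  /* stands in for the shared checkpoint section */

typedef struct {
    uint32_t seq;           /* next value to print */
    uint32_t last_printed;  /* value in the most recent "Hello World!" line */
} component_t;

/* One main-loop iteration of the active component: print, advance, checkpoint. */
static void active_tick(component_t *c)
{
    c->last_printed = c->seq;  /* the "Hello World! (seq=N)" line */
    c->seq++;
    checkpoint = c->seq;       /* corresponds to checkpoint_write_seq(seq) */
}

/* Promotion from standby: reload the counter from the checkpoint. */
static void promote(component_t *c) { c->seq = checkpoint; }
```

If the active component is killed just after printing 455, the checkpoint already holds 456, so the promoted standby continues at 456 without repeating or skipping a value.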
Additional Tests for Runtime Hardware Setup 1.3 and 2.3
Now repeat the experiment up to the stage before we kill the active process, and then follow these steps.
Step A
- Stop SAFplus Platform on SCNodeI0 using /etc/init.d/asp stop. Observe the logs in /root/asp/var/log/csa103CompI3Log.latest on SCNodeI1. This component will become active and start printing the Hello World logs as above. Note that in this case the active System Controller, SCNodeI1, is still aware of PayloadNodeI0 and PayloadNodeI1.
- Look at the logs in /root/asp/var/log/csa103CompI0Log.latest on PayloadNodeI0. This will print the standby logs as above.
- Stop SAFplus Platform on PayloadNodeI0 and observe the logs on PayloadNodeI1 in /root/asp/var/log/csa103CompI1Log.latest for standby. In this case you can see, via the active System Controller logs, that PayloadNodeI0 is not active whilst PayloadNodeI1 is.
Step B (Hardware Setup 1.3 only)
- Repeat Step A, except that the blade running SCNodeI0 is yanked out (/etc/init.d/asp zap) instead of SAFplus Platform being shut down gracefully in step 1.
Summary and References
We've seen:
- how to use Clovis' checkpoint service to save some program state and recover that state from a separate process
- how to initialize the client checkpoint library
- how to create and open checkpoints
- how to create and use sections in an opened checkpoint
Further information can be found within the following: OpenClovis API Reference Guide.
|