Difference between revisions of "Doc:latest/evalguide/csa103"

(Writing)
(Reading)
Line 279: Line 279:
  
  
{| cellspacing="0" cellpadding = "0" border="0" align = "center" width="680"
+
====Putting it all Together====
! style="color:black;background-color:#ffccaa;" align="center"| clCompAppMain.c
+
|-
+
|<code><pre>
+
ClCompAppInitialize()
+
  
        ...
+
These functions are called from the application's clCompAppMain.c file to implement an application that simply prints a sequence count. But this count is checkpointed so it is not affected by process failure!
  
        if ((rc = checkpoint_initialize()) != SA_AIS_OK)
+
First, let's define what the program will do when it becomes active. The first step is to take over the "active replica" of the checkpoint so that the process can write the checkpoint. Next, it attempts to read the checkpoint to recover the last sequence number checkpointed by any prior "active" process. Finally, it enters the "active" loop and repeatedly increments the sequence number and writes the checkpoint.
        {
+
            clprintf(CL_LOG_SEV_ERROR,"%s: Failed [0x%x] to initialize checkpoint",
+
                    appname, rc);
+
            return rc;
+
        }
+
        if ((rc = checkpoint_read_seq(&seq)) != SA_AIS_OK)
+
        {
+
            clprintf(CL_LOG_SEV_ERROR,"%s: Failed [0x%x] to read checkpoint",
+
                    appname, rc);
+
            checkpoint_finalize();
+
            return rc;
+
        }
+
    #ifdef CL_INST
+
        if ((rc = clDataTapInit(DATA_TAP_DEFAULT_FLAGS, 103)) != SA_AIS_OK)
+
        {
+
            clprintf(CL_LOG_SEV_ERROR,"%s: Failed [0x%x] to initialize data tap",
+
                    appname, rc);
+
        }
+
    #endif
+
 
+
  </pre></code>
+
|}
+
 
+
The above software in <code>clCompAppMain.c</code> is new to csa103. The call to <code>checkpoint_initialize</code> is straightforward; the function is discussed in detail later in this section. The call to <code>checkpoint_read_seq</code> is equally simple: it reads the current sequence value from the checkpoint (at this point it should be zero) and loads it into the variable <code>seq</code>. Note the call to <code>checkpoint_finalize</code> if <code>checkpoint_read_seq</code> does not return <code>SA_AIS_OK</code>. This call closes the checkpoint and cleanly finalizes the checkpoint library. We'll look more at <code>checkpoint_read_seq</code> and <code>checkpoint_finalize</code> later.
+
  
 
{| cellspacing="0" cellpadding = "0" border="0" align = "center" width="680"
 
{| cellspacing="0" cellpadding = "0" border="0" align = "center" width="680"
Line 317: Line 289:
 
  |-  
 
  |-  
 
  |<code><pre>
 
  |<code><pre>
        /* Checkpoint new sequence number */
+
void* csa103CkptActive(ClPtrT unused)
        rc = checkpoint_write_seq(seq);
+
{
        if (rc != SA_AIS_OK)
+
    ClRcT rc = CL_OK;
        {
+
            clprintf(CL_LOG_SEV_ERROR,"%s: ERROR: Checkpoint write failed. Exiting.", appname);
+
            break;
+
        }
+
  </pre></code>
+
|}
+
  
The rest of csa103's main loop is the same as the main loop in csa102, but the seven lines above are new. Here we write the new value of the checkpoint variable to the checkpoint section and print an error if this fails. We'll look closer at <code>checkpoint_write_seq</code> later.
+
    /* This thread will only be started when the application becomes active */
 +
    assert(ha_state == SA_AMF_HA_ACTIVE);   
  
{| cellspacing="0" cellpadding = "0" border="0" align = "center" width="680"
+
    clprintf(CL_LOG_SEV_INFO,"Active thread has started");
! style="color:black;background-color:#ffccaa;" align="center"| clCompAppMain.c
+
    /* Set this replica "active" so I can write to the checkpoint */
|-
+
    if ((rc = saCkptActiveReplicaSet(ckpt_handle)) != SA_AIS_OK)
|<code><pre>
+
     {
     ClRcT
+
        clprintf(CL_LOG_SEV_ERROR, "checkpoint_replica_activate failed [0x%x] in ActiveReplicaSet", rc);
    clCompAppAMFCSISet(
+
     }
        ClInvocationT      invocation,
+
        const ClNameT      *compName,
+
        ClAmsHAStateT      haState,
+
        ClAmsCSIDescriptorT csiDescriptor)
+
     {   
+
        /*
+
        * ---BEGIN_APPLICATION_CODE---
+
        */
+
        ClCharT    compname[100]={0};
+
  
        /*
+
    /* Attempt to recover the state of the prior active */
        * ---END_APPLICATION_CODE---
+
    checkpoint_read_seq(&seq);
        */
+
 
+
        /*
+
        * Print information about the CSI Set
+
        */
+
        strncpy(compname, compName->value, compName->length);
+
 
+
        clprintf (CL_LOG_SEV_INFO, "Component [%s] : PID [%d]. CSI Set Received",
+
              compname, (int)mypid);
+
 
+
        clCompAppAMFPrintCSI(csiDescriptor, haState);
+
 
      
 
      
        /*
+
    while (!unblockNow)
          Take appropriate action based on state
+
     {
        */
+
         clprintf(CL_LOG_SEV_INFO,"Hello World! (seq=%d)", seq++);           
      
+
         /* Checkpoint new sequence number */
         switch ( haState )
+
        checkpoint_write_seq(seq);
         {
+
        sleep(1);
            case CL_AMS_HA_STATE_ACTIVE:
+
    }
            {
+
     return CL_OK;
                /*
+
}
                  AMF has requested application to take the active HA state
+
</pre></code>
                  for the CSI.
+
                */
+
      
+
                /*
+
                  ---BEGIN_APPLICATION_CODE---
+
                */
+
  </pre></code>
+
 
|}  
 
|}  
  
 +
Next, we move to the "main" function.  In the initialization portion we will also initialize the checkpoint:
  
  
Line 383: Line 324:
 
  ! style="color:black;background-color:#ffccaa;" align="center"| clCompAppMain.c
 
  ! style="color:black;background-color:#ffccaa;" align="center"| clCompAppMain.c
 
  |-  
 
  |-  
  |<code><pre>  
+
  |<code><pre>
                clprintf(CL_LOG_SEV_INFO,"%s: Active state requested from state %d",
+
...
                        appname, ha_state);
+
  /*
 +
    * Print out standard information for this component.
 +
    */
  
                if (ha_state == SA_AMF_HA_STANDBY)
+
    clEoMyEoIocPortGet(&iocPort);
                {
+
   
                    /* Read checkpoint, make our replica the active replica */
+
    clprintf (CL_LOG_SEV_INFO, "Component [%.*s] : PID [%d]. Initializing\n", appName.length, appName.value, mypid);
                    clprintf(CL_LOG_SEV_INFO,"%s reading checkpoint", appname);
+
    clprintf (CL_LOG_SEV_INFO, "   IOC Address            : 0x%x\n", clIocLocalAddressGet());
                    rc = checkpoint_read_seq(&seq);
+
    clprintf (CL_LOG_SEV_INFO, "   IOC Port                : 0x%x\n", iocPort);
                    clprintf(CL_LOG_SEV_INFO,"%s read checkpoint: seq = %u", appname, seq);
+
                }
+
                checkpoint_replica_activate();           
+
                ha_state = SA_AMF_HA_ACTIVE;
+
 
+
 
+
                /*
+
                * ---END_APPLICATION_CODE---
+
                */
+
 
+
                clCpmResponse(cpmHandle, invocation, CL_OK);
+
                break;
+
            }
+
 
+
            case CL_AMS_HA_STATE_STANDBY:
+
            {
+
                /*
+
                * AMF has requested application to take the standby HA state
+
                * for this CSI.
+
                */
+
 
+
                /*
+
                * ---BEGIN_APPLICATION_CODE---
+
                */
+
 
+
                clprintf(CL_LOG_SEV_INFO," Standby state requested from state %d",ha_state);
+
           
+
                ha_state = SA_AMF_HA_STANDBY;
+
 
+
                /*
+
                * ---END_APPLICATION_CODE---
+
                */
+
 
+
                clCpmResponse(cpmHandle, invocation, CL_OK);
+
                break;
+
        }
+
  
 +
    checkpoint_initialize();
 +
...
 
   </pre></code>
 
   </pre></code>
 
|}  
 
|}  
  
  
 
+
Next, we enter the dispatch loop. The auto-generated code creates a loop that handles the AMF processing. This generated loop has been extended to also handle checkpoint processing. Note that it is also perfectly acceptable (even easier) to spawn two threads and have one handle the AMF dispatches and the other the checkpoint dispatches. This code has been implemented in this manner to show how it can all be done in a single thread.
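The single-threaded approach described above rests on waiting for two file descriptors at once. The following is a minimal sketch of that idea in plain POSIX C, not the SAFplus code itself: <code>wait_for_either</code> is a hypothetical helper name, and the two descriptors stand in for the AMF and checkpoint selection objects.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>   /* pipe() and write(), handy when trying the helper out */

/*
 * Hypothetical helper: block until at least one of two descriptors is
 * readable. Returns a bitmask (bit 0 = fd_a ready, bit 1 = fd_b ready),
 * or -1 if select() fails for a reason other than an interrupting signal.
 */
static int wait_for_either(int fd_a, int fd_b)
{
    int max_fd = (fd_a > fd_b) ? fd_a : fd_b;

    assert(fd_a >= 0 && fd_b >= 0);

    for (;;)
    {
        fd_set read_fds;
        int ready = 0;

        /* select() modifies the set, so rebuild it on every pass */
        FD_ZERO(&read_fds);
        FD_SET(fd_a, &read_fds);
        FD_SET(fd_b, &read_fds);

        if (select(max_fd + 1, &read_fds, NULL, NULL, NULL) < 0)
        {
            if (errno == EINTR)
                continue;     /* interrupted by a signal: just retry */
            return -1;
        }

        if (FD_ISSET(fd_a, &read_fds)) ready |= 1;
        if (FD_ISSET(fd_b, &read_fds)) ready |= 2;
        return ready;
    }
}
```

In a real dispatch loop, each set bit would be followed by the matching dispatch call for that service, and the loop would repeat until the application is asked to stop.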
{| cellspacing="0" cellpadding = "0" border="0" align = "center" width="680"
+
! style="color:black;background-color:#ffccaa;" align="center"| clCompAppMain.c
+
  |-  
+
|<code><pre>   
+
            case CL_AMS_HA_STATE_QUIESCED:
+
            {
+
                /*
+
                * AMF has requested application to quiesce the CSI currently
+
                * assigned the active or quiescing HA state. The application
+
                * must stop work associated with the CSI immediately.
+
                */
+
 
+
                /*
+
                * ---BEGIN_APPLICATION_CODE---
+
                */
+
 
+
                clprintf(CL_LOG_SEV_INFO,"%s: QUIESCED", appname);
+
                ha_state = haState;
+
 
+
                /*
+
                * ---END_APPLICATION_CODE---
+
                */
+
 
+
                clCpmResponse(cpmHandle, invocation, CL_OK);
+
                break;
+
            }
+
 
+
            case CL_AMS_HA_STATE_QUIESCING:
+
            {
+
                /*
+
                * AMF has requested application to quiesce the CSI currently
+
                * assigned the active HA state. The application must stop work
+
                * associated with the CSI gracefully and not accept any new
+
                * workloads while the work is being terminated.
+
                */
+
 
+
                /*
+
                * ---BEGIN_APPLICATION_CODE---
+
                */
+
 
+
                clprintf(CL_LOG_SEV_INFO,"%s: QUIESCING", appname);
+
                ha_state = haState;
+
 
+
                /*
+
                * ---END_APPLICATION_CODE---
+
                */
+
 
+
                clCpmCSIQuiescingComplete(cpmHandle, invocation, CL_OK);
+
                break;
+
            }
+
  </pre></code>
+
|}
+
 
+
 
+
 
+
{| cellspacing="0" cellpadding = "0" border="0" align = "center" width="680"
+
  ! style="color:black;background-color:#ffccaa;" align="center"| clCompAppMain.c
+
|-
+
|<code><pre>
+
            default:
+
            {
+
                break;
+
            }
+
        }
+
   
+
        return CL_OK;
+
    }
+
  </pre></code>
+
|}
+
 
+
 
+
Now looking at csa103's implementation of <code>clCompAppAMFCSISet</code>, we see that it handles four named cases: <code>QUIESCING</code>, <code>QUIESCED</code>, <code>ACTIVE</code>, and <code>STANDBY</code>. In each case, the global variable <code>ha_state</code> is set to the new state value that is passed in. This is similar to csa102, except that in the <code>CL_AMS_HA_STATE_ACTIVE</code> case, if our previous state was standby (<code>SA_AMF_HA_STANDBY</code>), we read the sequence from the checkpoint so that the main loop will pick it up.
+
  
 
{| cellspacing="0" cellpadding = "0" border="0" align = "center" width="680"
 
{| cellspacing="0" cellpadding = "0" border="0" align = "center" width="680"
Line 511: Line 348:
 
  |-  
 
  |-  
 
  |<code><pre>
 
  |<code><pre>
     static ClRcT
+
     /*
     checkpoint_initialize()
+
    * Block on AMF dispatch file descriptor for callbacks
 +
    */
 +
     if (1)
 
     {
 
     {
        SaAisErrorT      rc = CL_OK;
 
        SaVersionT ckpt_version = {'B', 1, 1};
 
        SaNameT    ckpt_name = { strlen(CKPT_NAME), CKPT_NAME };
 
        ClUint32T  seq_no;
 
        SaCkptCheckpointCreationAttributesT create_atts = {
 
            .creationFlags    = SA_CKPT_WR_ACTIVE_REPLICA_WEAK |
 
                            SA_CKPT_CHECKPOINT_COLLOCATED,
 
            .checkpointSize    = sizeof(ClUint32T),
 
            .retentionDuration = (ClTimeT)10,
 
            .maxSections      = 2, // two sections
 
            .maxSectionSize    = sizeof(ClUint32T),
 
            .maxSectionIdSize  = (ClSizeT)64
 
 
 
          
 
          
         };
+
         SaSelectionObjectT amf_dispatch_fd;
         SaCkptSectionCreationAttributesT section_atts = {
+
         SaSelectionObjectT ckpt_dispatch_fd;
         .sectionId = &ckpt_sid,
+
         int maxFd;
         .expirationTime = SA_TIME_END
+
         fd_set read_fds;
  
        };
 
 
        clprintf(CL_LOG_SEV_INFO,"%s: checkpoint_initialize", appname);
 
        /* Initialize checkpointing service instance */
 
        rc = saCkptInitialize(&ckpt_svc_handle, /* Checkpoint service handle */
 
  NULL,     /* Optional callbacks table */
 
  &ckpt_version);  /* Required version number */
 
        if (rc != SA_AIS_OK)
 
        {
 
            clprintf(CL_LOG_SEV_ERROR,"%s: ERROR: Failed to initialize checkpoint service",
 
                    appname);
 
            return rc;
 
        }
 
        clprintf(CL_LOG_SEV_INFO,"%s: Checkpoint service initialized (handle=0x%llx)",
 
          appname, ckpt_svc_handle);
 
  </pre></code>
 
|}
 
 
Here is <code>checkpoint_initialize</code>. We first initialize the <code>SaNameT</code> structure that holds the name of the checkpoint to be opened or created, with the line <code>SaNameT    ckpt_name = { strlen(CKPT_NAME), CKPT_NAME };</code>. We then define a set of attributes to associate with the checkpoint when creating it. The <code>creationFlags</code> include <code>SA_CKPT_WR_ACTIVE_REPLICA_WEAK</code>. This makes the checkpoint asynchronous: a write call returns to the application as soon as the active replica has been updated, and all other replicas are updated in parallel. <code>checkpointSize</code> is set to the size of a 32-bit unsigned integer, which is the sequence number we print in the main loop. The <code>retentionDuration</code> is the time that the checkpoint service will keep the checkpoint in store after the last client has closed it. <code>maxSections</code> is 2, so the checkpoint can hold up to two sections. <code>maxSectionSize</code> is also set to the size of a 32-bit unsigned integer, as the application stores only an unsigned integer. Finally, <code>maxSectionIdSize</code> is 64 bytes.
 
 
We next use <code>SaCkptSectionCreationAttributesT section_atts</code> to define the creation attributes for the sole section of the checkpoint. The <code>sectionId</code> holds the name of the section along with the number of bytes in the name. The <code>expirationTime</code> is set to <code>SA_TIME_END</code>, so the section is never deleted automatically as long as the checkpoint itself remains.
 
 
<code>saCkptInitialize</code> must be called before any other checkpoint function. The handle is returned in <code>ckpt_svc_handle</code> and must be passed to later checkpoint API calls, namely <code>saCkptCheckpointOpen</code> and <code>saCkptFinalize</code>. The callbacks table is passed as <code>NULL</code>, which means that our application doesn't provide any callbacks. Finally, <code>ckpt_version</code> identifies which version of the API this application is written to.
 
 
{| cellspacing="0" cellpadding = "0" border="0" align = "center" width="680"
 
! style="color:black;background-color:#ffccaa;" align="center"| clCompAppMain.c
 
|-
 
|<code><pre>
 
        //
 
        // Create the checkpoint for read and write.           
 
 
        rc = saCkptCheckpointOpen(ckpt_svc_handle,      // Service handle
 
                              &ckpt_name,        // Checkpoint name
 
                              &create_atts,      // Optional creation attr.
 
                              (SA_CKPT_CHECKPOINT_READ |
 
                              SA_CKPT_CHECKPOINT_WRITE |
 
                              SA_CKPT_CHECKPOINT_CREATE),
 
                              (SaTimeT)-1,        // No timeout
 
                              &ckpt_handle);      // Checkpoint handle
 
 
        if (rc != SA_AIS_OK)
 
 
        {
 
            clprintf(CL_LOG_SEV_ERROR,"%s: ERROR: Failed [0x%x] to open checkpoint",
 
                    appname, rc);
 
            (void)saCkptFinalize(ckpt_svc_handle);
 
            return rc;
 
        }
 
        clprintf(CL_LOG_SEV_INFO,"%s: Checkpoint opened (handle=0x%llx)", appname, ckpt_handle);
 
  </pre></code>
 
|}
 
 
{| cellspacing="0" cellpadding = "0" border="0" align = "center" width="680"
 
! style="color:black;background-color:#ffccaa;" align="center"| clCompAppMain.c
 
|-
 
|<code><pre>
 
 
         /*
 
         /*
         * Try to create a section so that updates can operate by overwriting
+
         * Get the AMF dispatch FD for the callbacks
        * the section over and over again.
+
        * If subsequent processes come through here, they will fail to create
+
        * the section.  That is OK, even though it will cause an error message
+
        * If the section create fails because the section is already there, then
+
        * read the sequence number
+
 
         */
 
         */
         // Put data in network byte order
+
         if ( (rc = saAmfSelectionObjectGet(amfHandle, &amf_dispatch_fd)) != SA_AIS_OK)
        seq_no = htonl(seq);
+
            goto errorexit;
 
+
         if ( (rc = saCkptSelectionObjectGet(ckptLibraryHandle, &ckpt_dispatch_fd)) != SA_AIS_OK)
        // Creating the section
+
            goto errorexit;
        checkpoint_replica_activate();
+
   
        rc = saCkptSectionCreate(ckpt_handle,           // Checkpoint handle
+
        maxFd = max(amf_dispatch_fd,ckpt_dispatch_fd);
                            &section_atts,        // Section attributes
+
        do
                            (SaUint8T*)&seq_no,    // Initial data
+
                            (SaSizeT)sizeof(seq_no)); // Size of data
+
         if (rc != SA_AIS_OK && (CL_GET_ERROR_CODE(rc) != SA_AIS_ERR_EXIST))
+
 
         {
 
         {
             clprintf(CL_LOG_SEV_ERROR,"%s: ERROR: Failed to create checkpoint section", appname);
+
             FD_ZERO(&read_fds);
             (void)saCkptCheckpointClose(ckpt_handle);
+
             FD_SET(amf_dispatch_fd, &read_fds);
             (void)saCkptFinalize(ckpt_svc_handle);
+
             FD_SET(ckpt_dispatch_fd, &read_fds);
            return rc;
+
          
         }
+
            if( select(maxFd + 1, &read_fds, NULL, NULL, NULL) < 0)
        else if (rc != SA_AIS_OK && (CL_GET_ERROR_CODE(rc) == SA_AIS_ERR_EXIST))
+
        {
+
            rc = checkpoint_read_seq(&seq);
+
            if (rc != CL_OK)
+
 
             {
 
             {
                 clprintf(CL_LOG_SEV_ERROR,"%s: ERROR: Failed [0x%x] to read checkpoint section",
+
                 char errorStr[80];
                            appname, rc);
+
                int err = errno;
                 (void)saCkptCheckpointClose(ckpt_handle);
+
                 if (EINTR == err)
                 (void)saCkptFinalize(ckpt_svc_handle);
+
                {
                 return rc;
+
                    continue;
 +
                }
 +
                errorStr[0] = 0; /* just in case strerror does not fill it in */
 +
                strerror_r(err, errorStr, 80);
 +
                 clprintf (CL_LOG_SEV_ERROR, "Error [%d] during dispatch loop select() call: [%s]",err,errorStr);
 +
                 break;
 
             }
 
             }
        }
+
            if (FD_ISSET(amf_dispatch_fd,&read_fds)) saAmfDispatch(amfHandle, SA_DISPATCH_ALL);
        else
+
             if (FD_ISSET(ckpt_dispatch_fd,&read_fds)) saCkptDispatch(ckptLibraryHandle, SA_DISPATCH_ALL);
        {
+
         }while(!unblockNow);    
             clprintf(CL_LOG_SEV_INFO,"%s: Section created", appname);
+
         }
+
 
+
        return CL_OK;
+
 
     }
 
     }
 
   </pre></code>
 
   </pre></code>
|}
+
|}  
  
The call to <code>saCkptCheckpointOpen</code> attempts to open the specified checkpoint. When <code>checkpoint_initialize</code> is called the checkpoint may or may not exist, so we pass the CREATE flag along with READ/WRITE access.
 
  
Next we need to create the section with a call to <code>saCkptSectionCreate</code>. We pass the checkpoint handle obtained from <code>saCkptCheckpointOpen</code>, the section attributes declared earlier, and the address of the initial value along with its size in bytes.
+
Finally, when this application becomes active, we need to kick off the csa103CkptActive function in a new thread:
  
We convert the initial value to network byte order with <code>seq_no = htonl(seq)</code> before passing it to <code>saCkptSectionCreate</code>.
 
 
If the section already exists, and so we do not create it, we instead read the current value in the checkpoint into our global <code>seq</code> variable.
 
 
Note that whenever the function returns an error code, we have either not yet opened the checkpoint or initialized the checkpoint service, or we have already closed the checkpoint and finalized the checkpoint service with calls to <code>saCkptCheckpointClose</code> and <code>saCkptFinalize</code>.
 
  
 
{| cellspacing="0" cellpadding = "0" border="0" align = "center" width="680"
 
{| cellspacing="0" cellpadding = "0" border="0" align = "center" width="680"
Line 649: Line 402:
 
  |-  
 
  |-  
 
  |<code><pre>
 
  |<code><pre>
    static ClRcT
+
         case SA_AMF_HA_ACTIVE:
    checkpoint_finalize(void)
+
    {
+
         SaAisErrorT rc;
+
 
+
        rc = saCkptCheckpointClose(ckpt_handle);
+
        if (rc != SA_AIS_OK)
+
        {
+
            clprintf(CL_LOG_SEV_ERROR,"%s: failed: [0x%x] to close checkpoint handle 0x%llx",
+
                        appname, rc, ckpt_handle);
+
        }
+
        rc = saCkptFinalize(ckpt_svc_handle);
+
        if (rc != SA_AIS_OK)
+
        {
+
            clprintf(CL_LOG_SEV_ERROR,"%s: failed: [0x%x] to finalize checkpoint",
+
                    appname, rc);
+
        }
+
        return CL_OK;
+
 
+
    }
+
  </pre></code>
+
|}
+
 
+
The above code snippet presents <code>checkpoint_finalize</code>. It's quite simple. First it closes the checkpoint handle with a call to <code>saCkptCheckpointClose</code>. Next it finalizes the checkpoint library with a call to <code>saCkptFinalize</code>.
+
 
+
{| cellspacing="0" cellpadding = "0" border="0" align = "center" width="680"
+
! style="color:black;background-color:#ffccaa;" align="center"| clCompAppMain.c
+
|-
+
|<code><pre>
+
    static ClRcT
+
    checkpoint_write_seq(ClUint32T seq)
+
    {
+
        SaAisErrorT rc = SA_AIS_OK;
+
        ClUint32T seq_no;
+
 
+
        /* Putting data in network byte order */
+
        seq_no = htonl(seq);
+
 
+
        /* Write checkpoint */
+
        retry:
+
        rc = saCkptSectionOverwrite(ckpt_handle,
+
                                &ckpt_sid,
+
                                &seq_no,
+
                                sizeof(ClUint32T));
+
        if (rc != SA_AIS_OK)
+
        {
+
            clprintf(CL_LOG_SEV_ERROR,"Failed [0x%x] to write to section", rc);
+
            if(rc == SA_AIS_ERR_NOT_EXIST)
+
                rc = checkpoint_replica_activate();
+
            if(rc == CL_OK) goto retry;
+
        }
+
        else
+
 
         {
 
         {
 
             /*
 
             /*
            * Synchronize the checkpoint to all the replicas.
+
            * AMF has requested application to take the active HA state
            */
+
            * for the CSI.
             rc = saCkptCheckpointSynchronize(ckpt_handle, SA_TIME_END );
+
            */
            if (rc != SA_AIS_OK)
+
             clprintf(CL_LOG_SEV_INFO,"Active state requested from state %d", ha_state);
            {
+
              
                clprintf(CL_LOG_SEV_ERROR,"Failed [0x%x] to synchronize the checkpoint", rc);
+
            ha_state = SA_AMF_HA_ACTIVE;
             }
+
        }
+
  
        return CL_OK;
+
            pthread_t thr;
    }
+
             pthread_create(&thr,NULL,csa103CkptActive,NULL);
  </pre></code>
+
           
|}
+
            saAmfResponse(amfHandle, invocation, SA_AIS_OK);
 
+
            break;
With <code>checkpoint_write_seq</code> we first convert the sequence number passed in to network byte order. Then we pass it to <code>saCkptSectionOverwrite</code>, which writes it to the checkpoint section created in <code>checkpoint_initialize</code>.
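The byte-order conversion used when writing and reading the sequence number can be illustrated on its own. The sketch below is only an assumption about how one might isolate that step; <code>seq_encode</code> and <code>seq_decode</code> are hypothetical helper names that do not appear in csa103, and the 4-byte buffer stands in for the checkpoint section.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store seq in a 4-byte buffer in network byte order. */
static void seq_encode(uint32_t seq, unsigned char buf[4])
{
    uint32_t wire = htonl(seq);   /* host order -> network (big-endian) order */

    assert(buf != NULL);
    memcpy(buf, &wire, sizeof wire);
}

/* Hypothetical helper: recover seq from a buffer written by seq_encode(). */
static uint32_t seq_decode(const unsigned char buf[4])
{
    uint32_t wire;

    assert(buf != NULL);
    memcpy(&wire, buf, sizeof wire);
    return ntohl(wire);           /* network order -> host order */
}
```

Because the stored bytes are always big-endian, a standby running on a machine with a different native byte order would still decode the same sequence number.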
+
 
+
{| cellspacing="0" cellpadding = "0" border="0" align = "center" width="680"
+
! style="color:black;background-color:#ffccaa;" align="center"| clCompAppMain.c
+
|-
+
|<code><pre>
+
    static ClRcT
+
    checkpoint_read_seq(ClUint32T *seq)
+
    {
+
        ClRcT rc = CL_OK;
+
        ClUint32T err_idx; /* Error index in ioVector */
+
        ClUint32T seq_no = 0xffffffff;
+
        SaCkptIOVectorElementT iov = {
+
             .sectionId  = ckpt_sid,
+
            .dataBuffer = (ClPtrT)&seq_no,
+
            .dataSize  = sizeof(ClUint32T),
+
            .dataOffset = (ClOffsetT)0,
+
            .readSize  = sizeof(ClUint32T)
+
        };
+
 
+
        rc = saCkptCheckpointRead(ckpt_handle, &iov, 1, &err_idx);
+
        if (rc != SA_AIS_OK)
+
        {
+
          clprintf(CL_LOG_SEV_ERROR,"Error: [0x%x] from checkpoint read, err_idx = %u",
+
                    rc, err_idx);
+
 
         }
 
         }
 
        /* FIXME: How to process this err_idx? */
 
        *seq = ntohl(seq_no);
 
 
        return CL_OK;
 
 
    }
 
 
 
   </pre></code>
 
   </pre></code>
 
|}  
 
|}  
  
With <code>checkpoint_read_seq</code> we initialize <code>seq_no</code> to all ones just so it will be more obvious if the <code>saCkptCheckpointRead</code> call fails and the value of <code>seq_no</code> doesn't get overwritten. Then, we set up the <code>iov</code> variable. We set the <code>sectionId</code> field to <code>ckpt_sid</code>, which is the id of the section created in <code>checkpoint_initialize</code>. The <code>dataBuffer</code> is set to the address of the <code>seq_no</code> variable, which is where we want the checkpoint data loaded. The <code>dataSize</code> field is set to the size of the <code>seq_no</code> variable. <code>dataOffset</code> is set to zero since the sequence number is the only thing stored in the section, so it resides at the front of the section. <code>readSize</code> is set to the size of the sequence number variable.
 
  
Within <code>checkpoint_read_seq</code> we pass the iovector to <code>saCkptCheckpointRead</code> along with the <code>ckpt_handle</code>, the constant 1, and the address of the <code>err_idx</code> variable. The constant 1 is the number of elements in the array of <code>SaCkptIOVectorElementT</code> structs that we're passing. That is, we're passing the address of a single <code>SaCkptIOVectorElementT</code> struct. The <code>err_idx</code> will hold the index in that array where <code>saCkptCheckpointRead</code> found an error. If there is an error, it will be at index 0, since we're only passing one entry in the <code>iovector</code>.
+
In this example, there is nothing to do while the process is assigned standby. This can actually be a problem: if the checkpoint is large, it may take quite some time for the newly assigned active process to completely read and process the checkpoint. The standby can read the checkpoint, but how can it determine what it needs to read, that is, which sections the active has changed?
  
 +
Thus the SA-Forum checkpoint definition is not a "cold standby" because the process is running, but it is also not quite "hot"; we call it a "warm-standby". 
  
{| cellspacing="0" cellpadding = "0" border="0" align = "center" width="680"
+
But the OpenClovis SAFplus Platform provides an extension to the SA-Forum checkpoint definition that realizes true hot standby behavior. This extension calls an application-registered callback function, passing all the write information, whenever the active writes the checkpoint. This allows the standby to remain in sync with the active so the checkpoint does not need to be read upon failover.
! style="color:black;background-color:#ffccaa;" align="center"| clCompAppMain.c
+
  |-  
+
  |<code><pre>
+
 
+
    static ClRcT
+
    checkpoint_replica_activate(void)
+
    {
+
        SaAisErrorT rc = SA_AIS_OK;
+
 
+
        if ((rc = saCkptActiveReplicaSet(ckpt_handle)) != SA_AIS_OK)
+
        {
+
            clprintf(CL_LOG_SEV_ERROR,
+
                    "checkpoint_replica_activate failed [0x%x] in ActiveReplicaSet",
+
                    rc);
+
        }
+
        else rc = CL_OK;
+
 
+
        return rc;
+
    }
+
  </pre></code>
+
|}
+
  
 
===How to Run csa103 and What to Observe===
 
===How to Run csa103 and What to Observe===

Revision as of 20:28, 15 January 2013

csa103 Checkpointing

Checkpointing refers to the practice of saving crucial program state in the cluster's "cloud" so that a standby process can resume operation without loss of state upon failure of the active process. This data is saved via the SAFplus Checkpointing APIs into what appears to be a simple key/value in-RAM database (like a hash table). But this "database" is actually replicated by our checkpointing servers and can be shared between multiple components.

Adding checkpointing to your application is simple. You first identify crucial program state and then you have the "active" process write it to a special key/value-database-like subsystem via the Checkpointing APIs. When the "standby" process becomes active, you read the checkpoint and update the program state before resuming service.
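The write-then-recover flow described above can be sketched without any SAFplus dependency. The code below is a toy model under stated assumptions: <code>mock_ckpt_t</code>, <code>mock_ckpt_write</code>, <code>mock_ckpt_read</code> and <code>failover_demo</code> are hypothetical names, and a single in-process struct stands in for the replicated checkpoint table.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a replicated checkpoint table with one row. */
typedef struct {
    char     key[16];
    uint32_t value;
    int      valid;    /* 0 until the first write */
} mock_ckpt_t;

/* Active side: save the crucial state under a well-known key. */
static void mock_ckpt_write(mock_ckpt_t *c, const char *key, uint32_t v)
{
    assert(strlen(key) < sizeof c->key);
    strcpy(c->key, key);
    c->value = v;
    c->valid = 1;
}

/* New-active side: returns 0 and fills *out, or -1 if nothing was saved. */
static int mock_ckpt_read(const mock_ckpt_t *c, const char *key, uint32_t *out)
{
    if (!c->valid || strcmp(c->key, key) != 0)
        return -1;
    *out = c->value;
    return 0;
}

/*
 * The "active" counts to n, checkpointing each value, and then fails.
 * The "standby" takes over and returns the first sequence number it
 * would print after recovering the saved state.
 */
static uint32_t failover_demo(uint32_t n)
{
    mock_ckpt_t ckpt = {0};
    uint32_t seq = 0;
    uint32_t recovered = 0;
    uint32_t i;

    for (i = 0; i < n; i++)
        mock_ckpt_write(&ckpt, "1", ++seq);

    if (mock_ckpt_read(&ckpt, "1", &recovered) != 0)
        recovered = 0;        /* no checkpoint yet: start from scratch */

    return recovered + 1;
}
```

Calling <code>failover_demo(5)</code> models an active process that fails after checkpointing the value 5, so the standby resumes at 6 rather than restarting at 1.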

Objective

csa103 demonstrates the use of the SAFplus Checkpoint service to provide basic failure recovery. csa103 builds on csa102. The SAFplus Checkpoint service is SA-Forum compliant and this tutorial expects that you will refer to the SA-Forum document for an in-depth description of the checkpointing APIs used.

What Will You Learn

  • how to initialize the checkpoint client library,
  • how to create and open a checkpoint,
  • how to create a section in the checkpoint,
  • how to save data to the checkpoint section and read data from the checkpoint section,
  • how to extend the SAF event dispatch loop to handle multiple services.

The Code

The code can be found in the following directory:

<project-area_dir>/eval/src/app/csa103Comp

To increase readability, all checkpointing code has been isolated into a single module that consists of 2 files: checkpointFns.c and checkpointFns.h. These files provide the following APIs.

checkpointFns.h
extern SaAisErrorT checkpoint_initialize(void);
extern void        checkpoint_finalize(void);
extern ClRcT       checkpoint_write_seq(ClUint32T);
extern ClRcT       checkpoint_read_seq(ClUint32T*);
  

These APIs mirror the basic operations of any messaging library (initialize, open, send, and receive): here they initialize and open the checkpoint, write to it, and read from it.

This file also declares some checkpoint handles as global variables and defines a "well-known" name for our checkpoint table. This simple example will only have a single row/key in the table (called a "section" in SA-Forum terminology), so we also define a well-known name for the row.

checkpointFns.h
#define CKPT_NAME     "csa103Ckpt"    /* Checkpoint table name for this application  */
SaCkptHandleT  ckptLibraryHandle;     /* library handle                              */
SaCkptCheckpointHandleT ckpt_handle;  /* Checkpoint table handle                     */
#define CKPT_SID_NAME "1"             /* Checkpoint section id                       */
extern SaCkptSectionIdT ckpt_sid;


Initialization

Initialization follows the standard SA-Forum library lifecycle pattern: before the library is used, an initialize call must be made, and it returns a handle that is used in all subsequent library accesses.
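As a rough illustration of this lifecycle (initialize, use the handle, finalize), here is a small mock in plain C. Everything in it is an assumption made for the example: the <code>mock_lib_*</code> functions, the handle type and the error codes are hypothetical and are not part of the SA-Forum or SAFplus APIs.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long mock_handle_t;       /* hypothetical handle type */
enum { MOCK_OK = 0, MOCK_ERR_BAD_HANDLE = 1 };

static mock_handle_t g_live_handle = 0;    /* 0 means "not initialized" */
static mock_handle_t g_next_handle = 1;

/* Must be called first; hands back the handle used by every later call. */
static int mock_lib_initialize(mock_handle_t *out)
{
    assert(out != NULL);
    g_live_handle = g_next_handle++;
    *out = g_live_handle;
    return MOCK_OK;
}

/* Any service call: rejected unless it carries the live handle. */
static int mock_lib_use(mock_handle_t h)
{
    return (h != 0 && h == g_live_handle) ? MOCK_OK : MOCK_ERR_BAD_HANDLE;
}

/* Ends the session; the handle is no longer valid afterwards. */
static int mock_lib_finalize(mock_handle_t h)
{
    if (h == 0 || h != g_live_handle)
        return MOCK_ERR_BAD_HANDLE;
    g_live_handle = 0;
    return MOCK_OK;
}
```

The point of the pattern is that a call made before initialize, or after finalize, fails cleanly instead of touching library state.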

In this example, we will both initialize the library and open the checkpoint table in the same function.

The following code excerpts handle the library initialization:

checkpointFns.c
SaAisErrorT checkpoint_initialize(void)
{
    SaAisErrorT      rc = SA_AIS_OK;
    SaVersionT ckpt_version = {'B', 1, 1};
    SaCkptCallbacksT callbacks;

 ...

    /* These callbacks are only called when an Async style function call completes */
    callbacks.saCkptCheckpointOpenCallback        = checkpointOpenCompleted;
    callbacks.saCkptCheckpointSynchronizeCallback = checkpointSynchronizeCompleted;
        
    clprintf(CL_LOG_SEV_INFO,"Checkpoint Initialize");

        
    /* Initialize checkpointing library */
    rc = saCkptInitialize(&ckptLibraryHandle,  /* Checkpoint service handle */
                          &callbacks,          /* Optional callbacks table */
                          &ckpt_version);      /* Required version number */
    if (rc != SA_AIS_OK)
    {
        clprintf(CL_LOG_SEV_ERROR,"Failed to initialize checkpoint service with rc [%#x]", rc);
        return rc;
    }
    clprintf(CL_LOG_SEV_INFO,"Checkpoint service initialized (handle=0x%llx)", ckptLibraryHandle);
...
}
  

In this example, we register callbacks that are invoked when the asynchronous versions of the open and synchronize APIs complete. For threaded applications, it is simpler to use the default, synchronous APIs. The asynchronous versions are used in this example solely for expository purposes.

Now let's focus on the portions that open the checkpoint table:

checkpointFns.c
SaAisErrorT checkpoint_initialize(void)
{
    SaAisErrorT      rc = SA_AIS_OK;
    SaNameT    ckpt_name = { strlen(CKPT_NAME), CKPT_NAME };
    SaCkptCheckpointCreationAttributesT attrs;

    ...

    /* Set up the checkpoint configuration */
    attrs.creationFlags     = SA_CKPT_WR_ACTIVE_REPLICA_WEAK | SA_CKPT_CHECKPOINT_COLLOCATED;        
    attrs.checkpointSize    = sizeof(ClUint32T);
    attrs.retentionDuration = (ClTimeT) 10;        /* How long to keep the checkpoint around even if no one has it open */
    attrs.maxSections       = 2;                   /* Max number of 'rows' in the table */
    attrs.maxSectionSize    = sizeof(ClUint32T);   /* Max 'row' size */
    attrs.maxSectionIdSize  = (ClSizeT)64;         /* Max 'row' identifier size */

    ...    

    /* Create the checkpoint table for read and write. */
    rc = saCkptCheckpointOpen(ckptLibraryHandle,           /* Service handle */
                              &ckpt_name,                  /* Checkpoint name */
                              &attrs,                      /* Optional creation attributes */
                              (SA_CKPT_CHECKPOINT_READ |
                               SA_CKPT_CHECKPOINT_WRITE |
                               SA_CKPT_CHECKPOINT_CREATE), /* Open flags */
                              (SaTimeT)SA_TIME_MAX,        /* No timeout */
                              &ckpt_handle);               /* Checkpoint handle */

    if (rc != SA_AIS_OK)
    {
        clprintf(CL_LOG_SEV_ERROR,"Failed [0x%x] to open checkpoint",rc);
        return rc;
    }
    clprintf(CL_LOG_SEV_INFO,"Checkpoint opened (handle=0x%llx)", ckpt_handle);
    return rc;
}
  

A checkpoint creation attributes structure is first defined and then its fields are set to define the behavior of the checkpoint. The definitions of these fields are beyond the scope of this guide, but are available in the SA-Forum checkpointing documentation.

Next, the saCkptCheckpointOpen function is called to open or create the checkpoint table. As the processes that constitute a redundant application start, their initialization routines race each other to create or open the table first. It is therefore necessary to specify the SA_CKPT_CHECKPOINT_CREATE flag in all processes, so that whichever process wins the race creates the checkpoint and the others simply open it.
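
The open-or-create behavior can be sketched with a toy in-memory registry. Everything here (demo_open, DEMO_OPEN_CREATE, the demo_tables array) is a hypothetical stand-in rather than the checkpoint service itself. It only illustrates why every process passes the create flag: the first caller creates the table and later callers get the same table back.

```c
#include <assert.h>
#include <string.h>

#define DEMO_OPEN_CREATE 0x4   /* hypothetical flag playing the role of "create if missing" */
#define DEMO_MAX_TABLES  4

typedef struct {
    char name[32];
    int  in_use;
    int  creator;              /* id of the process that created the table */
} DemoTableT;

static DemoTableT demo_tables[DEMO_MAX_TABLES];

/* Returns the table index, or -1 if the table is missing and the create
 * flag was not given (or if the registry is full). */
int demo_open(const char *name, int flags, int caller_id)
{
    int i;
    int free_slot = -1;

    assert(name != NULL);
    for (i = 0; i < DEMO_MAX_TABLES; i++) {
        if (demo_tables[i].in_use && strcmp(demo_tables[i].name, name) == 0)
            return i;                      /* another process won the race: just open */
        if (!demo_tables[i].in_use && free_slot < 0)
            free_slot = i;
    }
    if (!(flags & DEMO_OPEN_CREATE) || free_slot < 0)
        return -1;

    strncpy(demo_tables[free_slot].name, name, sizeof(demo_tables[free_slot].name) - 1);
    demo_tables[free_slot].name[sizeof(demo_tables[free_slot].name) - 1] = '\0';
    demo_tables[free_slot].in_use  = 1;
    demo_tables[free_slot].creator = caller_id;
    return free_slot;
}
```

If only one designated process passed the create flag, any process that started before it would fail to open the table, which is the failure mode the shared flag avoids.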


Writing

Writing to a checkpoint section can fail if the section does not exist or if the process is not the checkpoint's "active replica". The active replica is essentially the checkpoint table's "master" copy and is only needed for checkpoints that are configured to allow only one writer. The active replica is assigned via a simple API call that is wrapped here:

checkpointFns.c
ClRcT checkpoint_replica_activate(void)
{
    SaAisErrorT rc = SA_AIS_OK;

    if ((rc = saCkptActiveReplicaSet(ckpt_handle)) != SA_AIS_OK)
    {
        clprintf(CL_LOG_SEV_ERROR, "checkpoint_replica_activate failed [0x%x] in ActiveReplicaSet", rc);
    }
    else rc = CL_OK;

    return rc;
}

The application will call this API when the AMF tells it to become "active" for the CSI (work assignment), as shown in the "Putting it all Together" section.

The following code wraps the section write API. It first attempts to write the section and if that fails, the code creates the section.

checkpointFns.c
SaAisErrorT checkpoint_write_seq(ClUint32T seq)
{
    SaAisErrorT rc = SA_AIS_OK;
    ClUint32T seq_no;
    
    /* Putting data in network byte order */
    seq_no = htonl(seq);
    
    /* Write checkpoint */
    rc = saCkptSectionOverwrite(ckpt_handle, &ckpt_sid, &seq_no, sizeof(ClUint32T));
    if (rc != SA_AIS_OK)
    {
        /* SA_AIS_ERR_NOT_EXIST can occur either because we are not the active replica
           or because the section does not exist.

           In this case we know our application will not attempt to write unless it has set itself as
           the active replica, so the problem MUST be a missing section: try to create it.
        */
        if ((rc == 0x1000a) || (rc == SA_AIS_ERR_NOT_EXIST))
        {
            SaCkptSectionCreationAttributesT    section_cr_attr;
            
            /* Expiration time for section can also be varied
             * once parameters are read from a configuration file
             */
            section_cr_attr.expirationTime = SA_TIME_END;
            section_cr_attr.sectionId = &ckpt_sid;
            rc = saCkptSectionCreate(ckpt_handle,&section_cr_attr,(SaUint8T*) &seq_no, sizeof(ClUint32T));            
        }
        if (rc != SA_AIS_OK)
        {
            clprintf(CL_LOG_SEV_ERROR,"Failed [0x%x] to write to section", rc);
        }
    }

    if (rc == SA_AIS_OK)
    {
        /*
         * Synchronize the checkpoint to all the replicas.
         */
#if 0          /* Synchronous synchronize example */
        rc = saCkptCheckpointSynchronize(ckpt_handle, SA_TIME_END );
        if (rc != SA_AIS_OK)
        {
            clprintf(CL_LOG_SEV_ERROR,"Failed [0x%x] to synchronize the checkpoint", rc);
        }
#else
        rc = saCkptCheckpointSynchronizeAsync(ckpt_handle,syncCount);
        syncCount++;
        
#endif        
    }

    return rc;
}

  

The code also synchronizes the checkpoint. This synchronize operation ensures that all replicas have up-to-date data, and is only necessary for certain checkpoint configurations. In this case, the code demonstrates both the synchronous and asynchronous APIs. The async API is currently enabled simply to demonstrate the proper function of the combined AMF and checkpoint event dispatch loop (see the "Putting it all Together" section).
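
The overwrite-then-create fallback used by checkpoint_write_seq can be exercised in isolation. The following is a minimal sketch against a toy in-memory section; the mock_* names and error codes are hypothetical and do not belong to the checkpoint library. Only htonl and ntohl are real calls, used here exactly as in the wrapper to store the value in network byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>   /* htonl, ntohl */

enum { MOCK_OK = 1, MOCK_ERR_NOT_EXIST = 12 };   /* hypothetical return codes */

/* Toy section: zero-initialized means "section has not been created yet". */
typedef struct {
    int      exists;
    uint32_t data;       /* stored in network byte order */
} MockSectionT;

/* Overwrite fails when the section is missing. */
static int mock_overwrite(MockSectionT *s, const uint32_t *buf)
{
    if (!s->exists)
        return MOCK_ERR_NOT_EXIST;
    s->data = *buf;
    return MOCK_OK;
}

/* Create makes the section and stores its initial contents. */
static int mock_create(MockSectionT *s, const uint32_t *buf)
{
    s->exists = 1;
    s->data   = *buf;
    return MOCK_OK;
}

/* Try the cheap path first; fall back to creation only on "not exist". */
int mock_write_seq(MockSectionT *s, uint32_t seq)
{
    uint32_t wire;
    int rc;

    assert(s != NULL);
    wire = htonl(seq);
    rc = mock_overwrite(s, &wire);
    if (rc == MOCK_ERR_NOT_EXIST)
        rc = mock_create(s, &wire);
    return rc;
}
```

Trying the overwrite first keeps the common case to a single call; the create path runs only once, on the very first write.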

Reading

Reading the checkpoint is quite simple. The API allows for reads of multiple sections and allows subsets of the section's data to be read so an "IO-vector" style data structure is first initialized. To do multiple-section reads, this data structure should be an array of SaCkptIOVectorElementT. But in this example we only read a single section so an array is not needed.


checkpointFns.c
ClRcT checkpoint_read_seq(ClUint32T *seq)
{
    SaAisErrorT rc = SA_AIS_OK;
    ClUint32T err_idx; /* Error index in ioVector */
    ClUint32T seq_no = 0xffffffff;
    SaCkptIOVectorElementT iov = {
        .sectionId  = ckpt_sid,
        .dataBuffer = (ClPtrT)&seq_no,
        .dataSize   = sizeof(ClUint32T),
        .dataOffset = (ClOffsetT)0,
        .readSize   = sizeof(ClUint32T)
    };
        
    rc = saCkptCheckpointRead(ckpt_handle, &iov, 1, &err_idx);
    if (rc != SA_AIS_OK)
    {
        if (rc == SA_AIS_ERR_NOT_EXIST)  /* NOT_EXIST is ok, just means no active has written */
            return CL_OK;
        
        clprintf(CL_LOG_SEV_ERROR,"Error: [0x%x] from checkpoint read, err_idx = %u", rc, err_idx);
        return rc;                       /* Do not report an unread value as valid */
    }

    *seq = ntohl(seq_no);
    
    return CL_OK;
}
  

It is important to handle the case of a missing section properly. Whether your application sees this error depends on its design, but it is common to see missing sections on application start because no prior "active" existed to fill the checkpoint. In this design, it is only on failover that the sections will have been written.
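
One way to treat a missing section as a normal start-up condition is sketched below against a toy section. The rd_* names and error codes are hypothetical, chosen only for this illustration. A missing section returns success and leaves the caller's default untouched, while any other failure is reported and also leaves the value untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>   /* htonl, ntohl */

enum { RD_OK = 0, RD_ERR_OTHER = 2, RD_ERR_NOT_EXIST = 12 };   /* hypothetical codes */

/* Toy section: 'broken' simulates any failure other than a missing section. */
typedef struct {
    int      exists;
    int      broken;
    uint32_t wire_data;  /* network byte order */
} RdSectionT;

static int rd_raw_read(const RdSectionT *s, uint32_t *out)
{
    if (s->broken)
        return RD_ERR_OTHER;
    if (!s->exists)
        return RD_ERR_NOT_EXIST;
    *out = s->wire_data;
    return RD_OK;
}

/* Reads the sequence number; *seq is modified only on a successful read. */
int rd_read_seq(const RdSectionT *s, uint32_t *seq)
{
    uint32_t wire = 0;
    int rc;

    assert(s != NULL && seq != NULL);
    rc = rd_raw_read(s, &wire);
    if (rc == RD_ERR_NOT_EXIST)
        return RD_OK;            /* fresh start: keep the caller's default */
    if (rc != RD_OK)
        return rc;               /* real failure: report it */
    *seq = ntohl(wire);
    return RD_OK;
}
```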


Putting it all Together

These functions are called from the application's clCompAppMain.c file to implement an application that simply prints a sequence count. But this count is checkpointed so it is not affected by process failure!

First, let's define what the program will do when it becomes active. The first step is to take over the "active replica" of the checkpoint so that the process can write the checkpoint. Next, it attempts to read the checkpoint to recover the last sequence number checkpointed by any prior "active" process. Finally, it enters the "active" loop and repeatedly increments the sequence number and writes the checkpoint.

clCompAppMain.c
void* csa103CkptActive(ClPtrT unused)
{
    ClRcT rc = CL_OK;

    /* This thread will only be started when the application becomes active */
    assert(ha_state == SA_AMF_HA_ACTIVE);    

    clprintf(CL_LOG_SEV_INFO,"Active thread has started");
    /* Set this replica "active" so I can write to the checkpoint */
    if ((rc = saCkptActiveReplicaSet(ckpt_handle)) != SA_AIS_OK)
    {
        clprintf(CL_LOG_SEV_ERROR, "checkpoint_replica_activate failed [0x%x] in ActiveReplicaSet", rc);
    }

    /* Attempt to recover the state of the prior active */
    checkpoint_read_seq(&seq);
    
    while (!unblockNow)
    {
        clprintf(CL_LOG_SEV_INFO,"Hello World! (seq=%d)", seq++);            
        /* Checkpoint new sequence number */
        checkpoint_write_seq(seq);
        sleep(1);
    }
    return NULL;
}
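
The ordering inside the loop matters: the current value is logged, then incremented, and the incremented value is what gets checkpointed. The following self-contained simulation shows why a new active resumes one past the last value its predecessor printed. The fo_* names and the in-memory "checkpoint" are hypothetical stand-ins, not SAFplus APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the replicated checkpoint section. */
typedef struct {
    int      has_data;
    uint32_t seq;
} FoCheckpointT;

/* Runs 'steps' iterations of the active loop against the toy checkpoint
 * and returns the last value that would have been printed. */
uint32_t fo_run_active(FoCheckpointT *ckpt, uint32_t steps)
{
    uint32_t seq = 0;
    uint32_t last_printed = 0;
    uint32_t i;

    assert(ckpt != NULL);
    if (ckpt->has_data)
        seq = ckpt->seq;             /* recover the prior active's state */

    for (i = 0; i < steps; i++) {
        last_printed = seq++;        /* "Hello World! (seq=N)" logs N, then increments */
        ckpt->seq = seq;             /* the NEXT value is what gets checkpointed */
        ckpt->has_data = 1;
    }
    return last_printed;
}
```

With this ordering, an active that dies after printing 455 leaves 456 in the checkpoint, so its successor starts printing at 456 and no value is repeated.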
 

Next, we move to the "main" function. In the initialization portion we will also initialize the checkpoint:


clCompAppMain.c
...
   /*
     * Print out standard information for this component.
     */

    clEoMyEoIocPortGet(&iocPort);
    
    clprintf (CL_LOG_SEV_INFO, "Component [%.*s] : PID [%d]. Initializing\n", appName.length, appName.value, mypid);
    clprintf (CL_LOG_SEV_INFO, "   IOC Address             : 0x%x\n", clIocLocalAddressGet());
    clprintf (CL_LOG_SEV_INFO, "   IOC Port                : 0x%x\n", iocPort);

    checkpoint_initialize();
...
  


Next, we enter the dispatch loop. The auto-generated code creates a loop that handles the AMF processing. This generated loop has been extended to also handle checkpoint processing. Note that it is also perfectly acceptable (even easier) to spawn 2 threads and have one handle the AMF dispatches and the other checkpoint dispatches. This code has been implemented in this manner to show how it can all be done in a single thread.

clCompAppMain.c
    /*
     * Block on AMF dispatch file descriptor for callbacks
     */
    if (1)
    {
        
        SaSelectionObjectT amf_dispatch_fd;
        SaSelectionObjectT ckpt_dispatch_fd;
        int maxFd;
        fd_set read_fds;

        /*
         * Get the AMF dispatch FD for the callbacks
         */
        if ( (rc = saAmfSelectionObjectGet(amfHandle, &amf_dispatch_fd)) != SA_AIS_OK)
            goto errorexit;
        if ( (rc = saCkptSelectionObjectGet(ckptLibraryHandle, &ckpt_dispatch_fd)) != SA_AIS_OK)
            goto errorexit;
    
        maxFd = max(amf_dispatch_fd,ckpt_dispatch_fd);
        do
        {
            FD_ZERO(&read_fds);
            FD_SET(amf_dispatch_fd, &read_fds);
            FD_SET(ckpt_dispatch_fd, &read_fds);
        
            if( select(maxFd + 1, &read_fds, NULL, NULL, NULL) < 0)
            {
                char errorStr[80];
                int err = errno;
                if (EINTR == err)
                {
                    continue;
                }
                errorStr[0] = 0; /* just in case strerror does not fill it in */
                strerror_r(err, errorStr, 80);
                clprintf (CL_LOG_SEV_ERROR, "Error [%d] during dispatch loop select() call: [%s]",err,errorStr);
                break;
            }
            if (FD_ISSET(amf_dispatch_fd,&read_fds)) saAmfDispatch(amfHandle, SA_DISPATCH_ALL);
            if (FD_ISSET(ckpt_dispatch_fd,&read_fds)) saCkptDispatch(ckptLibraryHandle, SA_DISPATCH_ALL);
        }while(!unblockNow);      
    }
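
The core of the loop above is a single select() call that waits on two descriptors at once. A minimal sketch of that technique using only POSIX calls is shown here; wait_two_fds is a hypothetical helper written for illustration, and plain pipes can stand in for the AMF and checkpoint selection objects when trying it out.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

/* Blocks until at least one of the two descriptors is readable.
 * Returns a bitmask (1 = fd_a ready, 2 = fd_b ready) or -1 on error. */
int wait_two_fds(int fd_a, int fd_b)
{
    fd_set read_fds;
    int max_fd = (fd_a > fd_b) ? fd_a : fd_b;
    int ready = 0;

    assert(fd_a >= 0 && fd_b >= 0);
    for (;;) {
        /* select() modifies the set, so rebuild it before every call */
        FD_ZERO(&read_fds);
        FD_SET(fd_a, &read_fds);
        FD_SET(fd_b, &read_fds);

        if (select(max_fd + 1, &read_fds, NULL, NULL, NULL) < 0) {
            if (errno == EINTR)
                continue;            /* interrupted by a signal: retry */
            return -1;
        }
        break;
    }
    if (FD_ISSET(fd_a, &read_fds)) ready |= 1;
    if (FD_ISSET(fd_b, &read_fds)) ready |= 2;
    return ready;
}
```

Both descriptors must be checked after every wake-up, because a single return from select() can report more than one of them as ready.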
  


Finally, when this application becomes active, we need to kick off the csa103CkptActive function in a new thread:


clCompAppMain.c
        case SA_AMF_HA_ACTIVE:
        {
            /*
             * AMF has requested application to take the active HA state 
             * for the CSI.
             */
            clprintf(CL_LOG_SEV_INFO,"Active state requested from state %d", ha_state);
            
            ha_state = SA_AMF_HA_ACTIVE;

            pthread_t thr;
            pthread_create(&thr,NULL,csa103CkptActive,NULL);
            
            saAmfResponse(amfHandle, invocation, SA_AIS_OK);
            break;
        }
  


In this example, there is nothing to do while the process is assigned standby. This actually can be a problem; if the checkpoint is large it may take quite some time for the newly-assigned active process to completely read and process the checkpoint. The standby can read the checkpoint but how can it determine what it needs to read; what sections the active has changed?

Thus the SA-Forum checkpoint definition is not a "cold standby" because the process is running, but it is also not quite "hot"; we call it a "warm-standby".

However, the OpenClovis SAFplus Platform provides an extension to the SA-Forum checkpoint definition that realizes true hot-standby behavior. Whenever the active writes the checkpoint, this extension calls an application-registered callback function and passes it all the write information. This allows the standby to remain in sync with the active, so the checkpoint does not need to be read upon failover.
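
The idea behind the hot-standby extension can be sketched with a toy checkpoint that notifies a registered callback on every write. This guide does not show the actual SAFplus registration API, so the hs_* names and signatures below are purely hypothetical; the sketch illustrates only the notification pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: receives the value that was just written. */
typedef void (*HsUpdateCallbackT)(uint32_t new_seq, void *cookie);

/* Toy checkpoint holding one value and at most one registered listener. */
typedef struct {
    uint32_t          seq;
    HsUpdateCallbackT on_update;
    void             *cookie;
} HsCheckpointT;

/* The standby registers interest in updates. */
void hs_register(HsCheckpointT *c, HsUpdateCallbackT cb, void *cookie)
{
    assert(c != NULL);
    c->on_update = cb;
    c->cookie    = cookie;
}

/* The active writes; the listener is told about the write immediately. */
void hs_write(HsCheckpointT *c, uint32_t seq)
{
    assert(c != NULL);
    c->seq = seq;
    if (c->on_update != NULL)
        c->on_update(seq, c->cookie);
}

/* Standby-side callback: mirror the value into local state. */
void hs_standby_mirror(uint32_t new_seq, void *cookie)
{
    *(uint32_t *)cookie = new_seq;
}
```

Because the standby's local copy is updated on every write, it already holds the latest state when it is promoted and does not need to read the whole checkpoint first.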

How to Run csa103 and What to Observe

This sample application can run on all Runtime Hardware Setups. The following list shows which node each csa103compI* instance runs on.

  • Runtime Hardware Setup 1.1 and 2.1
    csa103compI0 and csa103compI1 will run upon the single node.
  • Runtime Hardware Setup 1.2 and 2.2
    csa103compI0 runs on PayloadNodeI0 and csa103compI1 runs on PayloadNodeI1
  • Runtime Hardware Setup 1.3 and 2.3
    csa103compI2 and csa103compI3 run on SCNodeI0 and SCNodeI1 respectively. csa103compI0 and csa103compI1 run on PayloadNodeI0 and PayloadNodeI1 respectively.

    In the four-node setup, SCNodeI0 and SCNodeI1 write their output to /root/asp/var/log/csa103CompI2 and /root/asp/var/log/csa103CompI3 respectively. In this case, follow the instructions below, replacing csa103CompI0 and csa103CompI1 with csa103CompI2 and csa103CompI3 respectively.

We run csa103 very much the same way we run any of the rest of the sample applications: with the SAFplus Platform Console.

  1. First, on the active System Controller, move csa103SGI0 to the LockAssignment state (to run csa203, use csa203SGI0 instead of csa103SGI0):
    cli[Test]-> setc 1
    cli[Test:SCNodeI0]-> setc cpm
    cli[Test:SCNodeI0:CPM]-> amsLockAssignment sg csa103SGI0
    

    The following output is given when you run tail -f on the csa103 log files. For example:

    # tail -f /root/asp/var/log/csa103CompI0.log
    
    /root/asp/var/log/csa103CompI0Log.latest
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00030 :   INFO)
     Component [csa103CompI0] : PID [15238]. Initializing
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00031 :   INFO)
        IOC Address             : 0x1
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00032 :   INFO)
        IOC Port                : 0x81
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00033 :   INFO)
     csa103: Instantiated as component instance csa103CompI0.
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00034 :   INFO)
     csa103CompI0: Waiting for CSI assignment...
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00035 :   INFO)
     csa103CompI0: Waiting for CSI assignment...
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00036 :   INFO)
     csa103CompI0: checkpoint_initialize
    Mon Jul 14 00:01:10 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00043 :   INFO)
     csa103CompI0: Checkpoint service initialized (handle=0x1)
    Mon Jul 14 00:01:10 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00045 :   INFO)
     csa103CompI0: Checkpoint opened (handle=0x2)
      
    # tail -f /root/asp/var/log/csa103CompI1.log
    
    /root/asp/var/log/csa103CompI1Log.latest
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00030 :   INFO)
     Component [csa103CompI1] : PID [15234]. Initializing
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00031 :   INFO)
        IOC Address             : 0x1
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00032 :   INFO)
        IOC Port                : 0x80
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00033 :   INFO)
     csa103: Instantiated as component instance csa103CompI1.
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00034 :   INFO)
     csa103CompI1: Waiting for CSI assignment...
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00035 :   INFO)
     csa103CompI1: Waiting for CSI assignment...
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00036 :   INFO)
     csa103CompI1: checkpoint_initialize
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00043 :   INFO)
     csa103CompI1: Checkpoint service initialized (handle=0x1)
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00045 :   INFO)
     csa103CompI1: Checkpoint opened (handle=0x2)
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00053 :   INFO)
     csa103CompI1: Section created
      
  2. Then, unlock the application by running
    cli[Test:SCNodeI0:CPM]-> amsUnlock sg csa103SGI0
    

    In the application log files, you should see

    /root/asp/var/log/csa103CompI0Log.latest
    Mon Jul 14 00:09:08 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00058 :   INFO)
      Standby state requested from state 0
      
    /root/asp/var/log/csa103CompI1Log.latest
    Mon Jul 14 00:09:08 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00064 :   INFO)
     csa103CompI1: Active state requested from state 0
    Mon Jul 14 00:09:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00065 :   INFO)
     csa103CompI1: Hello World! (seq=0)
    Mon Jul 14 00:09:10 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00066 :   INFO)
     csa103CompI1: Hello World! (seq=1)
    Mon Jul 14 00:09:11 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00067 :   INFO)
     csa103CompI1: Hello World! (seq=2)
    Mon Jul 14 00:09:12 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00068 :   INFO)
     csa103CompI1: Hello World! (seq=3)
    Mon Jul 14 00:09:13 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00069 :   INFO)
     csa103CompI1: Hello World! (seq=4)
      
  3. Next, find the active csa103 process, the one that's printing the Hello World lines and kill it. To find the process ID issue the following command from a bash shell.
     # ps -eaf | grep csa103
    

    This should produce an output that looks similar to the following.

    root     17830 15663  0 14:21 ?        00:00:00 csa103Comp -p
    root     17839 15663  0 14:21 ?        00:00:00 csa103Comp -p
    root     18558 16145  0 14:32 pts/4    00:00:00 grep csa103
    

    Notice the two entries that end with csa103Comp -p. These are our two component processes. The first one is usually the active process. This is the one that we will kill. In this case the process ID is 17830. So to kill the active component you issue the command:

    # kill -9 17830
    

    Note: If this step does not result in the active component being killed, then it is likely that the standby component was killed. In this case, simply try killing the other process.

    After killing the active component you should see lines in the log files like the following:

    /root/asp/var/log/csa103CompI1Log.latest
    Mon Jul 14 00:16:43 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00515 :   INFO)
     csa103CompI1: Hello World! (seq=452)
    Mon Jul 14 00:16:44 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00516 :   INFO)
     csa103CompI1: Hello World! (seq=453)
    Mon Jul 14 00:16:45 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00517 :   INFO)
     csa103CompI1: Hello World! (seq=454)
    Mon Jul 14 00:16:46 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00518 :   INFO)
     csa103CompI1: Hello World! (seq=455)
      
    /root/asp/var/log/csa103CompI0Log.latest
    Mon Jul 14 00:16:48 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00065 :   INFO)
     csa103CompI0: Active state requested from state 2
    Mon Jul 14 00:16:48 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00066 :   INFO)
     csa103CompI0 reading checkpoint
    Mon Jul 14 00:16:48 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00067 :   INFO)
     csa103CompI0 read checkpoint: seq = 456
    Mon Jul 14 00:16:49 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00068 :   INFO)
     csa103CompI0: Hello World! (seq=456)
    Mon Jul 14 00:16:50 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00069 :   INFO)
     csa103CompI0: Hello World! (seq=457)
    Mon Jul 14 00:16:51 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00070 :   INFO)
     csa103CompI0: Hello World! (seq=458)
      

    Here we can see CompI1 printing 455 and then dying, whereupon CompI0 gets the notice to take over processing, reads the checkpoint and then takes over with seq=456 and so on.

    Then, in the CompI1 log file we see:

    /root/asp/var/log/csa103CompI1Log.latest
    Mon Jul 14 00:16:49 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00044 :   INFO)
     csa103: Instantiated as component instance csa103CompI1.
    Mon Jul 14 00:16:49 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00045 :   INFO)
     csa103CompI1: Waiting for CSI assignment...
    Mon Jul 14 00:16:49 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00046 :   INFO)
     csa103CompI1: Waiting for CSI assignment...
    Mon Jul 14 00:16:49 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00047 :   INFO)
     csa103CompI1: checkpoint_initialize
    Mon Jul 14 00:16:50 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00054 :   INFO)
     csa103CompI1: Checkpoint service initialized (handle=0x1)
    Mon Jul 14 00:16:50 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00056 :   INFO)
     csa103CompI1: Checkpoint opened (handle=0x2)
      

    That is the component that had been killed being restarted.

    CompI0 is moving along just fine, and we see CompI1 come back up. If we then kill CompI0 we see:

    /root/asp/var/log/csa103CompI1Log.latest
    Mon Jul 14 00:29:44 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00065 :   INFO)
     csa103CompI1: Active state requested from state 2
    Mon Jul 14 00:29:44 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00066 :   INFO)
     csa103CompI1 reading checkpoint
    Mon Jul 14 00:29:44 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00067 :   INFO)
     csa103CompI1 read checkpoint: seq = 1225
    Mon Jul 14 00:29:45 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00068 :   INFO)
     csa103CompI1: Hello World! (seq=1225)
    Mon Jul 14 00:29:46 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00069 :   INFO)
     csa103CompI1: Hello World! (seq=1226)
      

    Here we can see the CompI0 process die, and the CompI1 process read the sequence number from the checkpoint and then take over from where CompI0 left off.

  4. To stop csa103 use the following SAFplus Platform Console command.
    cli[Test:SCNodeI0:CPM]-> amsLockAssignment sg csa103SGI0
    
  5. Now change the state of csa103SGI0 to LockInstantiation and close the SAFplus Platform Console.
    cli[Test:SCNodeI0:CPM]-> amsLockInstantiation sg csa103SGI0
    cli[Test:SCNodeI0:CPM] -> end
    cli[Test:SCNodeI0] -> end
    cli[Test] -> bye
    

csa203

csa203 demonstrates the usage of SA Forum's Checkpointing service. This sample application does not deviate functionally from csa103. The code differences are due to the use of SA Forum data types (structures) and APIs, as presented in the following two tables. (Note that data types and APIs covered previously are not repeated.)

SA Forum Data Types with the SAFplus Platform equivalent

SA Forum Data Type                      OpenClovis Data Type
SaCkptHandleT                           ClCkptSvcHdlT
SaCkptCheckpointHandleT                 ClCkptHdlT
SaCkptSectionIdT                        ClCkptSectionIdT
SaCkptCheckpointCreationAttributesT     ClCkptCheckpointCreationAttributesT
SaCkptSectionCreationAttributesT        ClCkptSectionCreationAttributesT
SaCkptIOVectorElementT                  ClCkptIOVectorElementT



SA Forum APIs with the SAFplus Platform equivalent

SA Forum API                    OpenClovis API
saCkptInitialize                clCkptInitialize
saCkptCheckpointOpen            clCkptCheckpointOpen
saCkptSectionCreate             clCkptSectionCreate
saCkptCheckpointClose           clCkptCheckpointClose
saCkptFinalize                  clCkptFinalize
saCkptSectionOverwrite          clCkptSectionOverwrite
saCkptCheckpointSynchronize     clCkptCheckpointSynchronize
saCkptCheckpointRead            clCkptCheckpointRead


How to Run csa103 and What to Observe

This sample application can run on all Runtime Hardware Setups. The following list shows which node each csa103compI* instance runs on.

  • Runtime Hardware Setup 1.1 and 2.1
    csa103compI0 and csa103compI1 will run upon the single node.
  • Runtime Hardware Setup 1.2 and 2.2
    csa103compI0 runs on PayloadNodeI0 and csa103compI1 runs on PayloadNodeI1
  • Runtime Hardware Setup 1.3 and 2.3
    csa103compI2 and csa103compI3 run on SCNodeI0 and SCNodeI1 respectively. csa103compI0 and csa103compI1 run on PayloadNodeI0 and PayloadNodeI1 respectively.

    In the four-node setup, SCNodeI0 and SCNodeI1 write their output to /root/asp/var/log/csa103CompI2 and /root/asp/var/log/csa103CompI3 respectively. In this case, follow the instructions below, replacing csa103CompI0 and csa103CompI1 with csa103CompI2 and csa103CompI3 respectively.

We run csa103 very much the same way we run any of the rest of the sample applications: with the SAFplus Platform Console.

  1. First, on the active System Controller, move csa103SGI0 to the LockAssignment state (to run csa203, use csa203SGI0 instead of csa103SGI0):
    cli[Test]-> setc 1
    cli[Test:SCNodeI0]-> setc cpm
    cli[Test:SCNodeI0:CPM]-> amsLockAssignment sg csa103SGI0
    

    The following output is given when you run tail -f on the csa103 log files. For example:

    # tail -f /root/asp/var/log/csa103CompI0.log
    
    /root/asp/var/log/csa103CompI0Log.latest
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00030 :   INFO)
     Component [csa103CompI0] : PID [15238]. Initializing
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00031 :   INFO)
        IOC Address             : 0x1
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00032 :   INFO)
        IOC Port                : 0x81
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00033 :   INFO)
     csa103: Instantiated as component instance csa103CompI0.
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00034 :   INFO)
     csa103CompI0: Waiting for CSI assignment...
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00035 :   INFO)
     csa103CompI0: Waiting for CSI assignment...
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00036 :   INFO)
     csa103CompI0: checkpoint_initialize
    Mon Jul 14 00:01:10 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00043 :   INFO)
     csa103CompI0: Checkpoint service initialized (handle=0x1)
    Mon Jul 14 00:01:10 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00045 :   INFO)
     csa103CompI0: Checkpoint opened (handle=0x2)
      
    # tail -f /root/asp/var/log/csa103CompI1.log
    
    /root/asp/var/log/csa103CompI1Log.latest
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00030 :   INFO)
     Component [csa103CompI1] : PID [15234]. Initializing
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00031 :   INFO)
        IOC Address             : 0x1
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00032 :   INFO)
        IOC Port                : 0x80
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00033 :   INFO)
     csa103: Instantiated as component instance csa103CompI1.
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00034 :   INFO)
     csa103CompI1: Waiting for CSI assignment...
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00035 :   INFO)
     csa103CompI1: Waiting for CSI assignment...
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00036 :   INFO)
     csa103CompI1: checkpoint_initialize
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00043 :   INFO)
     csa103CompI1: Checkpoint service initialized (handle=0x1)
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00045 :   INFO)
     csa103CompI1: Checkpoint opened (handle=0x2)
    Mon Jul 14 00:01:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00053 :   INFO)
     csa103CompI1: Section created
      
  2. Then, unlock the application by running
    cli[Test:SCNodeI0:CPM]-> amsUnlock sg csa103SGI0
    

    In the application log files, you should see

    /root/asp/var/log/csa103CompI0Log.latest
    Mon Jul 14 00:09:08 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00058 :   INFO)
      Standby state requested from state 0
      
    /root/asp/var/log/csa103CompI1Log.latest
    Mon Jul 14 00:09:08 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00064 :   INFO)
     csa103CompI1: Active state requested from state 0
    Mon Jul 14 00:09:09 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00065 :   INFO)
     csa103CompI1: Hello World! (seq=0)
    Mon Jul 14 00:09:10 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00066 :   INFO)
     csa103CompI1: Hello World! (seq=1)
    Mon Jul 14 00:09:11 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00067 :   INFO)
     csa103CompI1: Hello World! (seq=2)
    Mon Jul 14 00:09:12 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00068 :   INFO)
     csa103CompI1: Hello World! (seq=3)
    Mon Jul 14 00:09:13 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00069 :   INFO)
     csa103CompI1: Hello World! (seq=4)
      
  3. Next, find the active csa103 process, the one that's printing the Hello World lines and kill it. To find the process ID issue the following command from a bash shell.
     # ps -eaf | grep csa103
    

    This should produce an output that looks similar to the following.

    root     17830 15663  0 14:21 ?        00:00:00 csa103Comp -p
    root     17839 15663  0 14:21 ?        00:00:00 csa103Comp -p
    root     18558 16145  0 14:32 pts/4    00:00:00 grep csa103
    

    Notice the two entries that end with csa103Comp -p. These are our two component processes. The first one is usually the active process. This is the one that we will kill. In this case the process ID is 17830. So to kill the active component you issue the command:

    # kill -9 17830
    

    Note: If this step does not result in the active component being killed, then it is likely that the standby component was killed. In this case, simply try killing the other process.

    After killing the active component you should see lines in the log files like the following:

    /root/asp/var/log/csa103CompI1Log.latest
    Mon Jul 14 00:16:43 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00515 :   INFO)
     csa103CompI1: Hello World! (seq=452)
    Mon Jul 14 00:16:44 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00516 :   INFO)
     csa103CompI1: Hello World! (seq=453)
    Mon Jul 14 00:16:45 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00517 :   INFO)
     csa103CompI1: Hello World! (seq=454)
    Mon Jul 14 00:16:46 2008   (SCNodeI0.15234 : csa103CompEO.---.---.00518 :   INFO)
     csa103CompI1: Hello World! (seq=455)
      
    /root/asp/var/log/csa103CompI0Log.latest
    Mon Jul 14 00:16:48 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00065 :   INFO)
     csa103CompI0: Active state requested from state 2
    Mon Jul 14 00:16:48 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00066 :   INFO)
     csa103CompI0 reading checkpoint
    Mon Jul 14 00:16:48 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00067 :   INFO)
     csa103CompI0 read checkpoint: seq = 456
    Mon Jul 14 00:16:49 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00068 :   INFO)
     csa103CompI0: Hello World! (seq=456)
    Mon Jul 14 00:16:50 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00069 :   INFO)
     csa103CompI0: Hello World! (seq=457)
    Mon Jul 14 00:16:51 2008   (SCNodeI0.15238 : csa103CompEO.---.---.00070 :   INFO)
     csa103CompI0: Hello World! (seq=458)
      

    Here CompI1 prints seq=455 and then dies, whereupon CompI0 is notified to take over, reads the checkpoint, and continues from seq=456.

    Then, in the CompI1 log file we see:

    /root/asp/var/log/csa103CompI1Log.latest
    Mon Jul 14 00:16:49 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00044 :   INFO)
     csa103: Instantiated as component instance csa103CompI1.
    Mon Jul 14 00:16:49 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00045 :   INFO)
     csa103CompI1: Waiting for CSI assignment...
    Mon Jul 14 00:16:49 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00046 :   INFO)
     csa103CompI1: Waiting for CSI assignment...
    Mon Jul 14 00:16:49 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00047 :   INFO)
     csa103CompI1: checkpoint_initialize
    Mon Jul 14 00:16:50 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00054 :   INFO)
     csa103CompI1: Checkpoint service initialized (handle=0x1)
    Mon Jul 14 00:16:50 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00056 :   INFO)
     csa103CompI1: Checkpoint opened (handle=0x2)
      

    This is the killed component being restarted.

    CompI0 keeps running as the active component while CompI1 comes back up as standby. If we then kill CompI0, we see:

    /root/asp/var/log/csa103CompI1Log.latest
    Mon Jul 14 00:29:44 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00065 :   INFO)
     csa103CompI1: Active state requested from state 2
    Mon Jul 14 00:29:44 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00066 :   INFO)
     csa103CompI1 reading checkpoint
    Mon Jul 14 00:29:44 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00067 :   INFO)
     csa103CompI1 read checkpoint: seq = 1225
    Mon Jul 14 00:29:45 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00068 :   INFO)
     csa103CompI1: Hello World! (seq=1225)
    Mon Jul 14 00:29:46 2008   (SCNodeI0.15553 : csa103CompEO.---.---.00069 :   INFO)
     csa103CompI1: Hello World! (seq=1226)
      

    Here the CompI0 process has died, and the CompI1 process reads the sequence number from the checkpoint and takes over where CompI0 left off.

  4. To stop csa103, use the following SAFplus Platform Console command.
    cli[Test:SCNodeI0:CPM]-> amsLockAssignment sg csa103SGI0
    
  5. Now change the state of csa103SGI0 to LockInstantiation and close the SAFplus Platform Console.
    cli[Test:SCNodeI0:CPM]-> amsLockInstantiation sg csa103SGI0
    cli[Test:SCNodeI0:CPM] -> end
    cli[Test:SCNodeI0] -> end
    cli[Test] -> bye
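
The takeover seen in step 3 can be mimicked without SAFplus at all. The following is a minimal sketch, not the csa103 source: sim_checkpoint and the sim_* functions are hypothetical names, the "checkpoint" is a plain in-memory struct rather than a replicated one, and the print, increment, write ordering is an assumption inferred from the logs above, where CompI1 last prints seq=455 and CompI0 then reads 456.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical in-memory stand-in for the replicated checkpoint. */
typedef struct {
    bool     written;  /* false until the first write */
    unsigned seq;      /* next sequence number to print */
} sim_checkpoint;

/* Read the saved sequence number; an empty checkpoint yields 0. */
unsigned sim_checkpoint_read(const sim_checkpoint *ckpt)
{
    assert(ckpt != NULL);
    return ckpt->written ? ckpt->seq : 0;
}

/* Save the next sequence number so a successor can resume from it. */
void sim_checkpoint_write(sim_checkpoint *ckpt, unsigned seq)
{
    assert(ckpt != NULL);
    ckpt->seq = seq;
    ckpt->written = true;
}

/*
 * Act as the active component for `iterations` steps: recover the
 * sequence number from the checkpoint, then print it, increment it,
 * and write it back on every step. Returning early stands in for the
 * process being killed. Returns the value left in the checkpoint.
 */
unsigned sim_run_active(sim_checkpoint *ckpt, const char *name,
                        unsigned iterations)
{
    unsigned seq = sim_checkpoint_read(ckpt);

    printf("%s read checkpoint: seq = %u\n", name, seq);
    for (unsigned i = 0; i < iterations; i++) {
        printf("%s: Hello World! (seq=%u)\n", name, seq);
        seq++;
        sim_checkpoint_write(ckpt, seq);
    }
    return seq;
}
```

Calling sim_run_active(&ckpt, "csa103CompI1", 456) on an empty checkpoint and then sim_run_active(&ckpt, "csa103CompI0", 3) on the same struct prints seq=0 through seq=455 under the first name and resumes at seq=456 under the second, which is the same handover pattern as in the logs. Because the write happens after the print, a process killed between the two would cause its successor to repeat one number; whether the real component has the same window depends on its source.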
    

Additional Tests for Runtime Hardware Setup 1.3 and 2.3

Now repeat the experiment up to the point just before the active process is killed, then follow these steps.

Step A

  1. Stop SAFplus Platform on SCNodeI0 using /etc/init.d/asp stop. Observe the log /root/asp/var/log/csa103CompI3Log.latest on SCNodeI1. This component becomes active and starts printing the Hello World lines as above. Note that in this case the active System Controller, SCNodeI1, is still aware of PayloadNodeI0 and PayloadNodeI1.
  2. Watch the log /root/asp/var/log/csa103CompI0Log.latest on PayloadNodeI0. It shows the standby log lines as above.
  3. Stop SAFplus Platform on PayloadNodeI0 and watch the log /root/asp/var/log/csa103CompI1Log.latest on PayloadNodeI1 for the standby log lines. The active System Controller logs show that PayloadNodeI0 is no longer active while PayloadNodeI1 is.

Step B (for Hardware Setup 1.3 only)

  1. Repeat Step A, except that in step 1 the blade running SCNodeI0 is yanked out (/etc/init.d/asp zap) instead of SAFplus Platform being shut down gracefully.
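
In each of these failover tests, the interesting question is whether the newly active component resumes at the sequence number the previous one reached, as in the seq=455 to seq=456 handover shown earlier. A small helper such as the one below could check this on Hello World lines gathered from the component logs. It is only a sketch: parse_seq and seq_is_contiguous are hypothetical names, the "(seq=N)" format is taken from the log excerpts in this guide, and the lines are assumed to be supplied already merged in timestamp order.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract N from a log line containing "(seq=N)".
 * Returns 1 and stores N in *seq on success, 0 if the marker is absent.
 */
int parse_seq(const char *line, unsigned long *seq)
{
    assert(line != NULL && seq != NULL);

    const char *p = strstr(line, "(seq=");
    if (p == NULL)
        return 0;
    return sscanf(p, "(seq=%lu)", seq) == 1;
}

/*
 * Return 1 if every "(seq=N)" line follows the previous one by exactly 1.
 * Lines without the marker (state changes, checkpoint reads) are skipped.
 */
int seq_is_contiguous(const char *const lines[], size_t count)
{
    int have_prev = 0;
    unsigned long prev = 0, cur = 0;

    for (size_t i = 0; i < count; i++) {
        if (!parse_seq(lines[i], &cur))
            continue;
        if (have_prev && cur != prev + 1)
            return 0;  /* a number was repeated or skipped */
        prev = cur;
        have_prev = 1;
    }
    return 1;
}
```

Lines such as "csa103CompI0 read checkpoint: seq = 456" do not contain the "(seq=" marker and are therefore ignored, so only the Hello World lines take part in the comparison.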

Summary and References

We've seen:

  • how to use Clovis' checkpoint service to save some program state and recover that state from a separate process
  • how to initialize the client checkpoint library
  • how to create and open checkpoints
  • how to create and use sections in an opened checkpoint

Further information can be found in the OpenClovis API Reference Guide.