OpenClovis Product Documentation

Revision as of 19:49, 8 January 2013

Debugging Applications and SAFplus Platform Components

The high availability features of the SAFplus Platform interfere with debugging: stopping an application in a debugger breaks the performance and availability "contract" between the SAFplus Platform and the application, so the SAFplus Platform restarts the application.

This HOWTO describes strategies for debugging within a high availability environment. It is not a primer on gdb; it assumes an in-depth knowledge of debugging with gdb (or gdb derivatives such as Wind River's IDE) and therefore focuses exclusively on the interactions between the debugger and the SAFplus Platform.

It is useful to toggle the following debugging interventions on an as-needed basis, since many of them "defeat" the high availability features of the SAFplus Platform. The examples below therefore use a global variable called debuggingOn (defined in your component's clCompAppMain.c) to control these features.

int debuggingOn = 1; /* Set to 0 to turn debugging off, 1 to turn it on */

These features will be generally available and integrated into the SAFplus Platform in the upcoming release.
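Recompiling to flip debuggingOn can be inconvenient on a target system. As a minimal sketch, the flag could instead be initialized from an environment variable at startup. The helper names and the variable name APP_DEBUGGING_ON are hypothetical and are not part of the SAFplus Platform API.

```c
#include <stdlib.h>
#include <string.h>

int debuggingOn = 0;  /* Same global the examples in this HOWTO use */

/* Interpret a flag string: NULL, "" and "0" mean off; anything else means on. */
int parseDebugFlag(const char *value)
{
    if (value == NULL || value[0] == '\0')
        return 0;
    return strcmp(value, "0") != 0;
}

/* Read the flag from the environment (the variable name is an assumption). */
int debuggingFromEnv(const char *varName)
{
    return parseDebugFlag(getenv(varName));
}
```

A component would call debuggingOn = debuggingFromEnv("APP_DEBUGGING_ON"); once, early in its initialization, before any of the checks that read debuggingOn.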

Stopping Process Health Monitoring

It is extremely irritating to attach to your process with a debugger only to have the SAFplus Platform's high availability features intervene and kill the process you are trying to debug. Certain techniques can temporarily "defeat" this monitoring so that debugging can proceed. Of course, while the monitoring is "off", the high availability features will not work.

Turning off heartbeat

When the SAFplus Platform runs over the TIPC communications infrastructure, it detects process death automatically via TIPC notifications, so dropping into a debugger breakpoint will not cause a keepalive failure.

However, if you have enabled SA-Forum compliant application level health checking then you will run into this issue. In that case, the SAFplus Platform expects your process to periodically state that it is "OK"; if it does not (for example, if stopped in a debugger) the SAFplus Platform will kill your process.

Replace the clCompAppHealthCheck function in clCompAppMain.c (in every one of your components) with the function below, and create a "debuggingOn" global variable. This turns off the polling of your component, so the SAFplus Platform will not kill your component while it is being debugged. You can then load gdb and "attach" to your component.

int debuggingOn = 1;

ClRcT clCompAppHealthCheck(ClEoSchedFeedBackT* schFeedback)
{
   /*
      Add code for application specific health check below. The defaults
      indicate EO is healthy and polling interval is unaltered.
   */

   ClRcT rc = CL_OK;

   /* If you want to attach to gdb */
   if (debuggingOn) schFeedback->freq   = CL_EO_DONT_POLL;  
   else 
     {
       schFeedback->freq   = CL_EO_DEFAULT_POLL; 
       schFeedback->status = CL_CPM_EO_ALIVE;
       rc = eoapp->HealthCheck();
     }

   return rc;
}

"ACK"ing Messages Early

The SAFplus Platform also monitors your component's responses to certain crucial messages. If your component does not respond in time, the SAFplus Platform kills it. This feature guards against bugs in your application's handling of these messages; however, when debugging the message handling itself, we must "defeat" this feature. The most common messages are the CSI (work) assignment messages, which trigger the "clCompAppAMFCSISet" function call (in clCompAppMain.c). When debugging your component, it is best to "ACK" the message right away so that the SAFplus Platform thinks you have handled it. However, it is important NOT to do so outside of debugging mode, so that the HA features keep working.

In the routine below, note that the "clCpmResponse(cpmHandle, invocation, CL_OK);" call is made above the switch statement when debugging is on, and below the switch statement (after the message has actually been handled) when debugging is off.

ClRcT clCompAppAMFCSISet(
   ClInvocationT       invocation,
   const ClNameT       *compName,
   ClAmsHAStateT       haState,
   ClAmsCSIDescriptorT csiDescriptor)
{ 
   /* If component hang detection is off, then reply with an OK first, 
      so that this call won't time out causing the controller to kill me. */
   if (debuggingOn)
     {
       if (haState != CL_AMS_HA_STATE_QUIESCING)
         clCpmResponse(cpmHandle, invocation, CL_OK);
     }
 
   /* DO NOT SET BREAKPOINTS ABOVE THIS LINE (IN THIS FUNCTION)

      (the controller will think that your process is dead)

      To set a breakpoint in this routine:

      1. First set "debuggingOn" to 1
      2. Set your breakpoint below this comment
      3. Continue
   */  


   /* Print information about the CSI Set */
   clprintf ("Component [%s] : PID [%d]. CSI Set Received\n", 
              compName->value, mypid);
   clCompAppAMFPrintCSI(csiDescriptor, haState);


   /* Take appropriate action based on state */
   switch ( haState )
   {
       case CL_AMS_HA_STATE_ACTIVE:
       {
           /*
              AMF has requested application to take the active HA state 
              for the CSI.
           */
           break;
       }
       case CL_AMS_HA_STATE_STANDBY:
       {
           /*
              AMF has requested application to take the standby HA state 
              for this CSI.
           */
           break;
       }
       case CL_AMS_HA_STATE_QUIESCED:
       {
           /*
              AMF has requested application to quiesce the CSI currently
              assigned the active or quiescing HA state. The application 
              must stop work associated with the CSI immediately.
           */
           break;
       }
       case CL_AMS_HA_STATE_QUIESCING:
       {
           /*
              AMF has requested application to quiesce the CSI currently
              assigned the active HA state. The application must stop work
              associated with the CSI gracefully and not accept any new
              workloads while the work is being terminated.
           */
           clCpmCSIQuiescingComplete(cpmHandle, invocation, CL_OK);
           break;
       }
       default:
       {
           /*
              Should never happen. Ignore.
           */
       }
   }

   /* If component hang detection is on, then reply with an OK only after
      execution, to ensure that the handling did not hang */
   if (!debuggingOn)
     {
       if (haState != CL_AMS_HA_STATE_QUIESCING)
         clCpmResponse(cpmHandle, invocation, CL_OK);
     }
   return CL_OK;
}

Component State Transition Timeouts

The SAFplus Platform also ensures that your component transitions through various states within a reasonable time. To debug during these transitions (startup, shutdown, etc.) it is necessary to increase these timeouts. The timeouts can be changed in the IDE or, for temporary changes, directly in the XML configuration files.

Search for the following XML in your "etc/clAmfDefinitions.xml" file and change the numbers for every component you need to debug to large values.

          <timeouts>
            <instantiateTimeout>10000</instantiateTimeout>
            <terminateTimeout>10000</terminateTimeout>
            <cleanupTimeout>10000</cleanupTimeout>
            <amStartTimeout>10000</amStartTimeout>
            <amStopTimeout>10000</amStopTimeout>
            <quiescingCompleteTimeout>10000</quiescingCompleteTimeout>
            <csiSetTimeout>10000</csiSetTimeout>
            <csiRemoveTimeout>10000</csiRemoveTimeout>
            <proxiedCompInstantiateTimeout>10000</proxiedCompInstantiateTimeout>
            <proxiedCompCleanupTimeout>10000</proxiedCompCleanupTimeout>
          </timeouts>

Creating core files on failure

Core files are configured to be created automatically and will be found in the "var/run" directory. If core files are not being created, your environment configuration may be overriding this setting.

Make sure the core file size is "unlimited" by running "ulimit -c unlimited" before executing SAFplus.

You can test core creation by forcing a segmentation fault in a running program: send it the SIGSEGV signal with "kill -SEGV <process id>".
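If the shell environment cannot easily be changed, a process can also raise its own core file size limit at startup. The following is a minimal sketch using the POSIX getrlimit/setrlimit calls; the function name enableCoreDumps is hypothetical and is not part of the SAFplus Platform. It raises the soft limit only as far as the hard limit, so it has no effect if the hard limit is already 0.

```c
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit.
   Returns 0 on success, -1 on error (errno is set by the failing call). */
int enableCoreDumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;

    lim.rlim_cur = lim.rlim_max;  /* soft limit may not exceed the hard limit */
    return setrlimit(RLIMIT_CORE, &lim);
}
```

Calling enableCoreDumps() near the top of the component's main routine roughly corresponds to running "ulimit -c" with the largest value the environment permits.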

Pausing and Resuming a Thread

Sometimes a problem occurs during startup -- before you have the opportunity to attach to a process, set a breakpoint, and resume the process. In a standard application this is easy to debug because "gdb" starts up with your program "stopped". But in a high availability framework, where the SAFplus Platform starts and stops programs, this is trickier.

To debug this, two functions are available from "clDebugApi.h": one "pauses" the calling thread and the other "resumes" it. The programmer adds a call to the "pause" function to the code and recompiles. When "debuggingOn" is true and the function is called, the thread writes a log message and blocks on a semaphore. The programmer can then check the logs to see whether a thread was paused. If so, the programmer starts gdb, attaches to the process, finds the paused thread (run "thread apply all backtrace", abbreviated "thr a a bt", and look for a call to "pause"), switches to that thread ("thread x"), and then "resumes" it by calling the "resume" function from gdb's prompt:

gdb> call clDbgResume()

The "pause" function is actually a macro so that it can also pass the file and line from which it was called. This information is very useful if you accidentally forget to remove a call to clDbgPause.
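The pause/resume mechanism described above can be sketched with a POSIX semaphore. This is an illustration only, not the code shipped in the SAFplus Platform: the names myDbgPause, myDbgPauseFn and myDbgResume are hypothetical stand-ins for the functions declared in "clDebugApi.h", and the log message goes to stderr rather than to the SAFplus log.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

int debuggingOn = 1;  /* Same global the examples in this HOWTO use */

static sem_t dbgSem;
static pthread_once_t dbgSemOnce = PTHREAD_ONCE_INIT;

static void dbgSemInit(void)
{
    /* Initial count 0: the first wait blocks until someone posts. */
    sem_init(&dbgSem, 0, 0);
}

/* Blocks the calling thread until myDbgResume() is called.
   Returns 1 if the thread waited on the semaphore, 0 if debugging is off. */
int myDbgPauseFn(const char *file, int line)
{
    int rc;

    if (!debuggingOn)
        return 0;

    pthread_once(&dbgSemOnce, dbgSemInit);
    fprintf(stderr, "Thread paused at %s:%d. Attach gdb and run "
                    "'call myDbgResume()' to continue.\n", file, line);
    do {
        rc = sem_wait(&dbgSem);          /* retry if interrupted by a signal */
    } while (rc == -1 && errno == EINTR);
    return 1;
}

/* The macro records where the pause was requested. */
#define myDbgPause() myDbgPauseFn(__FILE__, __LINE__)

/* Call this from the gdb prompt to release one paused thread. */
void myDbgResume(void)
{
    pthread_once(&dbgSemOnce, dbgSemInit);
    sem_post(&dbgSem);
}
```

Because the wait is on a semaphore rather than a busy loop, a paused thread consumes no CPU, and the file and line in the log message point directly at the pause call that needs to be removed later.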

Intercepting the Automatic Start of a Component

It is possible to run your component within a wrapper application (such as gdb or purify). When the SAFplus Platform is deployed on a target, all applications are placed in the "bin" subdirectory of the SAFplus Platform tree (for example "/root/asp/bin"). To intercept the application start for debugging purposes:

1. Rename your application.
2. Create a script and give it the original application name.
3. Edit the script to add your wrapper.
   1. Note that you will have to start a separate window for applications that need a console (like gdb).

For example, the following simple script makes your application start within gdb. Unless you are a very fast typist, it must be used in conjunction with the techniques described in the "Stopping Process Health Monitoring" section above, since your application's initialization is monitored by the OpenClovis HA infrastructure to ensure that it succeeds. The script assumes that your application is called "myapp" and that you moved it to "/root/asp/bin/myapp.orig". It also assumes that your Linux installation has the X Window System and the GNOME desktop environment installed (a similar technique can be used with other desktop environments):

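#!/bin/bash
# "--args" makes gdb pass the remaining arguments to the program instead of
# treating the first one as a core file or process id.
/usr/bin/gnome-terminal --window --command "gdb --args /root/asp/bin/myapp.orig $*"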
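The same interception can be done with a small compiled wrapper instead of a shell script, which forwards any number of arguments unchanged (including arguments that contain spaces). The sketch below is only an illustration under stated assumptions: it uses xterm with its -e option as the terminal rather than gnome-terminal, and the function names buildWrapperArgv and launchUnderGdb are hypothetical.

```c
#include <stdlib.h>
#include <unistd.h>

/* Build the command: xterm -e gdb --args <origPath> <argv[1..argc-1]>.
   Returns a NULL-terminated array the caller must free, or NULL on error. */
char **buildWrapperArgv(const char *origPath, int argc, char **argv)
{
    int extra = (argc > 1) ? argc - 1 : 0;
    /* 5 fixed entries + forwarded arguments + NULL terminator */
    char **out = malloc((size_t)(5 + extra + 1) * sizeof *out);
    int n = 0, i;

    if (out == NULL)
        return NULL;

    out[n++] = "xterm";
    out[n++] = "-e";
    out[n++] = "gdb";
    out[n++] = "--args";
    out[n++] = (char *)origPath;
    for (i = 1; i < argc; i++)
        out[n++] = argv[i];       /* forward the original arguments as-is */
    out[n] = NULL;
    return out;
}

/* Replace the current process with the terminal running gdb.
   Returns only if the exec fails. */
int launchUnderGdb(const char *origPath, int argc, char **argv)
{
    char **cmd = buildWrapperArgv(origPath, argc, argv);

    if (cmd == NULL)
        return -1;
    execvp(cmd[0], cmd);
    free(cmd);                    /* reached only when execvp failed */
    return -1;
}
```

A wrapper program built this way would call launchUnderGdb("/root/asp/bin/myapp.orig", argc, argv) from its main function and be installed under the original application name, exactly as the script is.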
