Revision as of 19:49, 8 January 2013
Debugging Applications and SAFplus Platform Components
The high availability features of the SAFplus Platform interfere with debugging applications because the act of debugging an application breaks the performance and availability "contract" between the SAFplus Platform and an application and therefore causes the SAFplus Platform to restart the application.
This HOWTO describes strategies that can be used to debug within a high availability environment. It is not a primer on gdb. In fact, it assumes an in-depth knowledge of debugging using gdb (or gdb derivatives such as Wind River's IDE) and therefore focuses exclusively on the interactions between the debugger and the SAFplus Platform.
It is useful to toggle the following debugging interventions on an as-needed basis, since many of them "defeat" the high availability features of the SAFplus Platform. The examples below therefore use a global variable called debuggingOn (defined in your component's clCompAppMain.c) to control these features.
int debuggingOn = 1; /* Set to 0 to turn debugging off, 1 to turn it on */
These features will be generally available and integrated into the SAFplus Platform in the upcoming release.
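Because "debuggingOn" is compiled in, switching it requires a rebuild. One possible variation, shown here only as a minimal sketch, is to set it once at startup from an environment variable. The variable name "MYAPP_DEBUGGING_ON" and both helper functions are hypothetical and are not part of the SAFplus Platform API.

```c
#include <stdlib.h>
#include <string.h>

int debuggingOn = 0; /* default: high availability features stay fully active */

/* Hypothetical helper: only the exact string "1" enables debugging;
   a missing variable or any other value leaves it off. */
int parseDebugFlag(const char *value)
{
    return (value != NULL && strcmp(value, "1") == 0) ? 1 : 0;
}

/* Hypothetical helper: call once, early in your component's startup code.
   "MYAPP_DEBUGGING_ON" is an assumed name chosen for this sketch. */
void initDebuggingOnFromEnv(void)
{
    debuggingOn = parseDebugFlag(getenv("MYAPP_DEBUGGING_ON"));
}
```

Defaulting to 0 means a component that is deployed without the variable keeps its normal high availability behaviour.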
Stopping Process Health Monitoring
It is extremely irritating to attach to your process via a debugger only to have the SAFplus Platform's high availability features intervene and kill the process you are trying to debug! Certain techniques can be used to temporarily "defeat" these features so that debugging can occur. Of course, while a feature is "off", the corresponding high availability protection will not work.
Turning off heartbeat
When the SAFplus Platform runs over the TIPC communications infrastructure, it automatically detects process death via TIPC notifications, so dropping into a debugger breakpoint will not cause a keepalive failure.
However, if you have enabled SA-Forum compliant application level health checking then you will run into this issue. In that case, the SAFplus Platform expects your process to periodically state that it is "OK"; if it does not (for example, if stopped in a debugger) the SAFplus Platform will kill your process.
Replace the clCompAppHealthCheck function in clCompAppMain.c (in every one of your components) with the version below, and create a "debuggingOn" global variable. When "debuggingOn" is set, this turns off the polling of your component, so the SAFplus Platform will not kill your component while it is being debugged. You can then start gdb and "attach" to your component.
int debuggingOn = 1;
ClRcT clCompAppHealthCheck(ClEoSchedFeedBackT* schFeedback)
{
    /* Add code for application specific health check below. The defaults
       indicate EO is healthy and polling interval is unaltered. */
    ClRcT rc = CL_OK;

    /* If you want to attach to gdb */
    if (debuggingOn)
        schFeedback->freq = CL_EO_DONT_POLL;
    else
    {
        schFeedback->freq = CL_EO_DEFAULT_POLL;
        schFeedback->status = CL_CPM_EO_ALIVE;
        rc = eoapp->HealthCheck();
    }
    return rc;
}
"ACK"ing Messages Early
The SAFplus Platform also monitors your component's responses to certain crucial messages. If your component does not respond in time, the SAFplus Platform kills it. This feature ensures that your application does not have a bug in the handling of these messages; however, when debugging the message handling we must "defeat" this feature. The most common messages are the CSI (work) assignment messages, which trigger the "clCompAppAMFCSISet" (clCompAppMain.c) function call. When debugging your component, it is best to "ACK" the message right away so the SAFplus Platform will think that you have handled the message. However, it is important NOT to do so when not in debugging mode, to ensure that the HA features work.
In the routine below, note that the "clCpmResponse(cpmHandle, invocation, CL_OK);" call is made above the switch statement when debugging is on, and below it when debugging is off. In both places the call is skipped for the quiescing state, because that case responds through "clCpmCSIQuiescingComplete" instead.
ClRcT clCompAppAMFCSISet(
    ClInvocationT       invocation,
    const ClNameT       *compName,
    ClAmsHAStateT       haState,
    ClAmsCSIDescriptorT csiDescriptor)
{
    /* If component hang detection is off, then reply with an OK first, so
       that this call won't time out causing the controller to kill me. */
    if (debuggingOn)
    {
        if (haState != CL_AMS_HA_STATE_QUIESCING)
            clCpmResponse(cpmHandle, invocation, CL_OK);
    }

    /* DO NOT SET BREAKPOINTS ABOVE THIS LINE (IN THIS FUNCTION)
       (the controller will think that your process is dead)

       To set a breakpoint in this routine:
       1. First set "debuggingOn" to 1
       2. Set your breakpoint below this comment
       3. Continue */

    /* Print information about the CSI Set */
    clprintf ("Component [%s] : PID [%d]. CSI Set Received\n",
              compName->value, mypid);
    clCompAppAMFPrintCSI(csiDescriptor, haState);

    /* Take appropriate action based on state */
    switch ( haState )
    {
        case CL_AMS_HA_STATE_ACTIVE:
        {
            /* AMF has requested application to take the active HA state
               for the CSI. */
            break;
        }
        case CL_AMS_HA_STATE_STANDBY:
        {
            /* AMF has requested application to take the standby HA state
               for this CSI. */
            break;
        }
        case CL_AMS_HA_STATE_QUIESCED:
        {
            /* AMF has requested application to quiesce the CSI currently
               assigned the active or quiescing HA state. The application
               must stop work associated with the CSI immediately. */
            break;
        }
        case CL_AMS_HA_STATE_QUIESCING:
        {
            /* AMF has requested application to quiesce the CSI currently
               assigned the active HA state. The application must stop work
               associated with the CSI gracefully and not accept any new
               workloads while the work is being terminated. */
            clCpmCSIQuiescingComplete(cpmHandle, invocation, CL_OK);
            break;
        }
        default:
        {
            /* Should never happen. Ignore. */
        }
    }

    /* If component hang detection is on, then reply with an OK only after
       execution, to ensure that the handling did not hang. */
    if (!debuggingOn)
    {
        if (haState != CL_AMS_HA_STATE_QUIESCING)
            clCpmResponse(cpmHandle, invocation, CL_OK);
    }
    return CL_OK;
}
Component State Transition Timeouts
The SAFplus Platform also ensures that your component transitions through various states within a reasonable time. To debug during these transitions (startup, shutdown, etc.) it is necessary to increase these timeouts. The timeouts can be changed in the IDE or, for temporary changes, directly in the XML configuration files.
Search for the following XML in your "etc/clAmfDefinitions.xml" file and change the numbers for every component you need to debug to large values.
<timeouts>
<instantiateTimeout>10000</instantiateTimeout>
<terminateTimeout>10000</terminateTimeout>
<cleanupTimeout>10000</cleanupTimeout>
<amStartTimeout>10000</amStartTimeout>
<amStopTimeout>10000</amStopTimeout>
<quiescingCompleteTimeout>10000</quiescingCompleteTimeout>
<csiSetTimeout>10000</csiSetTimeout>
<csiRemoveTimeout>10000</csiRemoveTimeout>
<proxiedCompInstantiateTimeout>10000</proxiedCompInstantiateTimeout>
<proxiedCompCleanupTimeout>10000</proxiedCompCleanupTimeout>
</timeouts>
Creating core files on failure
Core files are configured to be created automatically and will be found in the "var/run" directory. If core files are not being created, your environment configuration may be overriding our changes.
Make sure the core file size is "unlimited" by running "ulimit -c unlimited" before executing SAFplus.
You can test core creation by sending a segmentation fault signal to a running program with "kill -SEGV <process id>".
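If you cannot control the shell that starts SAFplus, a component can also raise its own core file limit early in its startup code. The following is a minimal sketch using the POSIX getrlimit/setrlimit calls; the function name "raiseCoreLimit" is hypothetical and not part of the SAFplus Platform. It raises the soft limit only as far as the hard limit, so it matches "ulimit -c unlimited" only when the hard limit is itself unlimited.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core-file size limit to the hard limit.
   Returns 0 on success, -1 on failure (errno is set by getrlimit/setrlimit). */
int raiseCoreLimit(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;

    /* An unprivileged process may raise its soft limit up to the hard limit. */
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_CORE, &lim);
}
```

Call it before any code that might crash; the new limit applies to the calling process and to any children it starts afterwards.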
Pausing and Resuming a Thread
Sometimes a problem occurs during startup -- before you have the opportunity to attach to a process, set a breakpoint, and resume the process. In a standard application this is easy to debug because "gdb" starts up with your program "stopped". But in a high availability framework, where the SAFplus Platform starts and stops programs, this is trickier.
To debug this, two functions are available from "clDebugApi.h": clDbgPause, which "pauses" the calling thread, and clDbgResume, which "resumes" it. The programmer adds the "pause" call to the code and recompiles. When "debuggingOn" is true and the function is called, the thread will output a log and block on a semaphore. The programmer can then check the logs to see if a thread was paused. If so, the programmer starts gdb, attaches to the process, finds the paused thread by running "thread apply all backtrace" ("thr a a bt") and looking for a call to "pause", switches to that thread ("thread x"), and then "resumes" the thread by calling the "resume" function from gdb's prompt:
gdb> call clDbgResume()
The "pause" function is actually a macro, so that it can also pass the file and line from which it was called. This information is very useful if you accidentally forget to remove a call to clDbgPause.
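To make the mechanism concrete, here is a minimal sketch of how a pause/resume pair of this kind could be built on a POSIX semaphore. It is not the SAFplus implementation: the names "sketchDbgPauseAt", "SKETCH_DBG_PAUSE" and "sketchDbgResume" are hypothetical, and the real clDbgPause and clDbgResume may behave differently.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static int debuggingOn = 1;  /* same toggle used throughout this HOWTO */
static int pauseCount = 0;   /* illustration only; not thread-safe */

static sem_t pauseSem;
static pthread_once_t pauseOnce = PTHREAD_ONCE_INIT;

/* Create the semaphore with a count of 0 so the first wait blocks. */
static void pauseInit(void)
{
    sem_init(&pauseSem, 0, 0);
}

/* Log where the pause was requested, then block until a resume is posted. */
void sketchDbgPauseAt(const char *file, int line)
{
    if (!debuggingOn)
        return;
    pthread_once(&pauseOnce, pauseInit);
    pauseCount++;
    fprintf(stderr, "Thread paused at %s:%d\n", file, line);
    /* Retry if the wait is interrupted by a signal. */
    while (sem_wait(&pauseSem) == -1 && errno == EINTR)
        continue;
}

/* The macro form lets the call site supply its own file and line. */
#define SKETCH_DBG_PAUSE() sketchDbgPauseAt(__FILE__, __LINE__)

/* Release one paused thread; intended to be called from the debugger. */
void sketchDbgResume(void)
{
    pthread_once(&pauseOnce, pauseInit);
    sem_post(&pauseSem);
}
```

In this sketch a single counting semaphore is shared by all paused threads, so each call to sketchDbgResume releases one waiting thread, not necessarily the one currently selected in gdb. A resume posted before the pause is remembered, and the next pause then returns immediately.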
Intercepting the Automatic Start of a Component
It is possible to run your component within a wrapper application (such as gdb or purify). When the SAFplus Platform is deployed on a target, all applications are placed in the "bin" subdirectory of the SAFplus Platform tree (for example "/root/asp/bin"). To intercept the application start for debugging purposes:
- rename your application
- create a script and name it the same as the original application name
- edit the script to add your wrapper.
- Note that you will have to start a separate window for applications that need a console (like gdb)
For example, the following simple script makes your application start within gdb. It must be used in conjunction with the techniques described in the "Stopping Process Health Monitoring" section above (unless you are a very fast typist), because the OpenClovis HA infrastructure monitors your application's initialization to ensure that it succeeds. This script assumes that your application is called "myapp" and that you moved it to "/root/asp/bin/myapp.orig". It also assumes that your Linux installation has X-windows and the gnome desktop environment installed (however, a similar technique can be used for other window managers):
#!/bin/bash
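# "--args" makes gdb pass the remaining words to the program instead of
# treating the first one as a core file or process id.
/usr/bin/gnome-terminal --window --command "gdb --args /root/asp/bin/myapp.orig $1 $2 $3 $4 $5 $6 $7 $8 $9"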