csa104 Messaging
Messaging refers to communication between components within the cluster. SAFplus provides an SA-Forum compliant implementation of the messaging service. In one sentence, the messaging service is a reliable, packet-based communications mechanism that stores messages in, and addresses endpoints via, cluster-wide message queues that are identified by a well-known name and can be bound to any running process. For more details and the API reference, please see the SA-Forum spec SA-AIS-MSG-B*.pdf.
Objective
csa104 demonstrates the use of the SAFplus Messaging service to provide basic communications during process and node failure.
What Will You Learn
- how to initialize the messaging client library
- how to create a message queue
- how to send messages to the queue
- how to receive messages from the queue
- how to take over the message queue after process failure
The Code
The code can be found within the following directory:
<SAFplus_installation_dir>/src/examples/eval/src/app/csa104Comp
To increase readability, all messaging code has been isolated into a single module consisting of two files, msgFns.c and msgFns.h. These files provide the following APIs.
msgFns.h |
---|
|
These APIs constitute the basic operations of any messaging library: initialize, open, send and receive.
The following constants are also defined:
msgFns.h |
---|
|
The ACTIVE_COMP_QUEUE defines the well-known name of the messaging queue which will be used in this example. The QUEUE_LENGTH defines the maximum size of the buffer allocated for each priority within a particular queue. There are 4 possible message priorities and the maximum buffer size can be set on a per priority basis.
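As a rough illustration of per-priority sizing, the sketch below fills a queue-attributes structure with one buffer limit for each of the four priorities. The types are local stand-ins loosely modelled on the SA Forum queue creation attributes so that the snippet builds without the SAFplus headers. All `Demo`/`DEMO_` names, and the value 2048, are assumptions made for this example and are not taken from msgFns.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins (assumptions) so the sketch builds without SAFplus headers.
 * Four priorities are modelled, numbered 0 (highest) to 3 (lowest). */
#define DEMO_HIGHEST_PRIORITY 0
#define DEMO_LOWEST_PRIORITY  3
#define DEMO_QUEUE_LENGTH     2048u  /* hypothetical; msgFns.h defines the real value */

typedef struct {
    uint32_t creationFlags;                   /* left at 0 in this sketch */
    size_t   size[DEMO_LOWEST_PRIORITY + 1];  /* one buffer limit per priority */
    int64_t  retentionTime;                   /* how long a closed queue is retained */
} DemoQueueCreationAttributes;

/* Give every priority the same maximum buffer size. */
static void demo_fill_queue_attributes(DemoQueueCreationAttributes *attrs,
                                       size_t bytesPerPriority)
{
    assert(attrs != NULL);
    attrs->creationFlags = 0;
    attrs->retentionTime = 0;
    for (int prio = DEMO_HIGHEST_PRIORITY; prio <= DEMO_LOWEST_PRIORITY; ++prio) {
        attrs->size[prio] = bytesPerPriority;
    }
}
```

Calling `demo_fill_queue_attributes(&attrs, DEMO_QUEUE_LENGTH)` sets all four limits to the same value. An application that wants more room for high-priority traffic could instead assign the `size` entries individually.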
Initialization
Initialization follows the standard SA Forum library lifecycle pattern:
msgFns.c |
---|
|
The messaging library supports either a callback or threaded paradigm. In this example, we will use threading so no callbacks are installed and therefore it is not necessary to periodically call saMsgDispatch(...).
To receive messages from the queue, the application must first open it. The open essentially binds the well-known name to the application process so that senders know where to direct messages. By providing an explicit bind operation (rather than folding it into the library initialize), the API allows the application to choose when it takes ownership of the queue. For example, this could be when it becomes active (or standby), allowing senders to address messages to a single well-known queue name if they need to communicate with the active component.
Designing addresses that represent a concept such as "The currently active transaction server" rather than a physical entity is an extremely powerful design pattern used throughout highly available applications. It means that the sender does not need to access the real-time cluster state to determine this mapping and does not need to handle errors caused by application failure (except perhaps with a simple retry-on-error loop).
msgFns.c |
---|
|
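The "simple retry-on-error loop" mentioned above might look like the following minimal sketch. It is not the csa104 code: the status codes, the `DemoSendFn` callback type and the `demo_flaky_send` test double are all hypothetical names introduced here. In a real application the callback would wrap the messaging send call addressed to the well-known queue name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes standing in for the library's return values. */
typedef enum { DEMO_OK = 0, DEMO_ERR_TRY_AGAIN = 1, DEMO_ERR_FATAL = 2 } DemoStatus;

/* Hypothetical send callback type. */
typedef DemoStatus (*DemoSendFn)(const char *queueName, const char *payload, void *ctx);

/* Retry a send a bounded number of times while the error is transient. */
static DemoStatus demo_send_with_retry(DemoSendFn send, const char *queueName,
                                       const char *payload, void *ctx,
                                       int maxAttempts, int *attemptsUsed)
{
    assert(send != NULL && maxAttempts > 0);
    DemoStatus rc = DEMO_ERR_TRY_AGAIN;
    int attempt = 0;
    while (attempt < maxAttempts) {
        ++attempt;
        rc = send(queueName, payload, ctx);
        if (rc != DEMO_ERR_TRY_AGAIN) {
            break;  /* success or a non-retryable error */
        }
        /* A real sender would sleep briefly here before retrying. */
    }
    if (attemptsUsed != NULL) {
        *attemptsUsed = attempt;
    }
    return rc;
}

/* Test double: returns a transient error until the counter in ctx reaches zero. */
static DemoStatus demo_flaky_send(const char *queueName, const char *payload, void *ctx)
{
    (void)queueName;
    (void)payload;
    int *failuresLeft = ctx;
    if (*failuresLeft > 0) {
        --*failuresLeft;
        return DEMO_ERR_TRY_AGAIN;
    }
    return DEMO_OK;
}
```

Because the sender addresses the queue by name rather than by process, the same loop keeps working across a failover: the retries simply bridge the interval until the new active component has opened the queue.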
Transmission
Message transmission uses the saMsgMessageSend API, which allows the application to pass a message buffer together with associated metadata (such as message version, sender's name, etc.) that will be sent to the receiver:
msgFns.c |
---|
|
For simplicity, this example creates the destination SaNameT and the SaMsgMessageT structures each time a message is sent. However, when sending multiple messages to the same destination it is more efficient to create these objects once and reuse them.
Receipt
This example will create a dedicated message receiver thread and run the following code within that thread:
msgFns.c |
---|
|
This code simply loops "forever", receiving messages and printing the contents of each message. If anything goes wrong, it kicks itself out of the message processing loop and closes the message queue. Closing the queue allows some other application to open it, in effect "taking over" the queue. If an application is killed or disappears due to node death (or other events), the AMF will close all queues opened by the application. This allows the new active application to take control of the queue.
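The control flow described here (receive until the first error, then close the queue so that another process can take it over) can be sketched in isolation as follows. This is not the msgFns.c listing. The `DemoQueueOps` function pointers and the in-memory `DemoFakeQueue` are hypothetical stand-ins, so the loop can be exercised without the messaging library.

```c
#include <assert.h>
#include <stdio.h>

typedef enum { DEMO_RX_OK = 0, DEMO_RX_ERROR = 1 } DemoRxStatus;

/* Hypothetical queue operations; a real receiver would call the messaging
 * library's blocking receive and close functions instead. */
typedef struct {
    DemoRxStatus (*receive)(void *queue, char *buf, size_t bufLen);
    void         (*close)(void *queue);
} DemoQueueOps;

/* Receive and print messages until the first error, then release the queue
 * so another process can open it. Returns the number of messages handled. */
static int demo_receive_loop(const DemoQueueOps *ops, void *queue)
{
    assert(ops != NULL && ops->receive != NULL && ops->close != NULL);
    char buf[128];
    int handled = 0;
    for (;;) {
        if (ops->receive(queue, buf, sizeof buf) != DEMO_RX_OK) {
            break;  /* any failure ends the loop */
        }
        printf("Received Message : %s\n", buf);
        ++handled;
    }
    ops->close(queue);  /* lets a new owner take over the queue */
    return handled;
}

/* Test double: an in-memory "queue" that yields a fixed number of messages. */
typedef struct { int remaining; int closed; } DemoFakeQueue;

static DemoRxStatus demo_fake_receive(void *queue, char *buf, size_t bufLen)
{
    DemoFakeQueue *q = queue;
    if (q->remaining <= 0) {
        return DEMO_RX_ERROR;
    }
    snprintf(buf, bufLen, "Msg %d", q->remaining);
    --q->remaining;
    return DEMO_RX_OK;
}

static void demo_fake_close(void *queue)
{
    ((DemoFakeQueue *)queue)->closed = 1;
}
```

In the example application this loop would run in its own thread. The important property is that the close always happens on the way out, whatever caused the loop to end.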
Putting it all Together
These functions are called from the application's clCompAppMain.c file to implement an application that periodically sends a message from the standby to the active application. First we define a helper function that will repeatedly send messages. This function will loop so long as the application is standby, as indicated by a global variable "standby" that is set in the work assignment callback.
clCompAppMain.c |
---|
|
Next, we initialize the messaging library from main():
clCompAppMain.c |
---|
|
In the work assignment callback, we open the queue and spawn the message receiver thread when an active assignment is received, and spawn a message sender thread when a standby assignment is received.
clCompAppMain.c |
---|
|
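The decision made in the work assignment callback can be summarized by the sketch below. It is only an outline of the branching, not the clCompAppMain.c code. The `DemoHaState` enum and `DemoApp` structure are hypothetical, and the boolean fields stand in for the real actions (opening the queue and spawning the receiver or sender thread).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the HA state passed to the callback. */
typedef enum { DEMO_HA_ACTIVE = 1, DEMO_HA_STANDBY = 2, DEMO_HA_QUIESCED = 3 } DemoHaState;

typedef struct {
    bool standby;          /* polled by the sender loop */
    bool queueOpen;        /* stand-in for "queue has been opened" */
    bool receiverStarted;  /* stand-in for "receiver thread spawned" */
    bool senderStarted;    /* stand-in for "sender thread spawned" */
} DemoApp;

/* Choose what to start based on the assigned HA state. */
static void demo_handle_assignment(DemoApp *app, DemoHaState state)
{
    assert(app != NULL);
    switch (state) {
    case DEMO_HA_ACTIVE:
        app->standby = false;         /* ends any running sender loop */
        app->queueOpen = true;        /* take ownership of the queue */
        app->receiverStarted = true;  /* begin receiving */
        break;
    case DEMO_HA_STANDBY:
        app->standby = true;
        app->senderStarted = true;    /* begin sending to the active */
        break;
    default:
        app->standby = false;         /* any other state: stop sending */
        break;
    }
}
```

Keeping the branch this small means the standby can be promoted simply by delivering an active assignment: the sender loop notices the cleared flag and the component starts receiving instead.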
How to Run csa104 and What to Observe
This sample application runs 2 processes on SCNodeI0 (first system controller) in all the hardware setups described at the beginning of this eval guide. While it is certainly possible to run messaging across multiple nodes, this single node configuration makes evaluation simpler.
csa104 is "enabled" by default when SAFplus is started so there is no need to enter the SAFplus Debug Console and change its program state.
The following output is given when you run tail -f on the csa104 log files. For example:
# ./eval start
# tail -f var/log/csa104CompI?Log.latest
/root/asp/var/log/csa104CompI?Log.latest |
---|
|
The logs show the work assignment occurring with csa104CompI0 as ACTIVE and csa104CompI1 as STANDBY. This causes the csa104CompI1 (standby) component to start sending messages and the csa104CompI0 (active) component to begin receiving them.
Sometimes the output shows a double send, double receive or even a receive before a send! This is an effect of the tail program polling the logs and not an actual issue.
Next, find the active csa104 process and kill it. In this case, it's the one that's receiving the messages. The process ID is available in the log, which is formatted as follows:
date (Node.PID : compname.---.---.logCount : SEVERITY) Message
Fri Jan 11 14:38:55.255 2013 (SCNodeI0.3301 : csa104Comp.---.---.00020 : INFO) Received Message : Msg 6 from csa104CompI1
So in the log above the active process is pid 3301.
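If you want to pick the PID out of such a line programmatically, the following sketch shows one way to do it with the C standard library. The function name `demo_pid_from_log_line` is hypothetical. It relies only on the log layout described above: the PID follows the first '.' after the opening parenthesis.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Extract the PID from a log line of the form
 *   date (Node.PID : compname.---.---.logCount : SEVERITY) Message
 * Returns -1 if the line does not match that layout. */
static long demo_pid_from_log_line(const char *line)
{
    assert(line != NULL);
    const char *open = strchr(line, '(');  /* start of "(Node.PID : ..." */
    if (open == NULL) {
        return -1;
    }
    const char *dot = strchr(open, '.');   /* separator between node name and PID */
    if (dot == NULL) {
        return -1;
    }
    char *end = NULL;
    long pid = strtol(dot + 1, &end, 10);
    if (end == dot + 1 || pid <= 0) {
        return -1;                         /* no digits after the dot */
    }
    return pid;
}
```

The search starts at the parenthesis rather than at the beginning of the line because the timestamp itself contains a '.' (before the milliseconds).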
While keeping the tail running, open a new window and run the kill command:
# kill 3301
If this step does not result in the active component being killed then it is likely that the standby component was killed. In this case simply try killing the other process.
After killing the active component you should see lines in the log files like the following:
/root/asp/var/log/csa104CompI1Log.latest |
---|
|
csa203
csa203 demonstrates the usage of SA Forum's Checkpointing service. This sample application does not deviate functionally from csa103. The code differences are due to using SA Forum data types (structures) and APIs, as presented in the following two tables. (Note that we have not repeated data types and APIs covered previously.)
SA Forum Data Types | OpenClovis Data Types |
---|---|
SaCkptHandleT | ClCkptSvcHdlT |
SaCkptCheckpointHandleT | ClCkptHdlT |
SaCkptSectionIdT | ClCkptSectionIdT |
SaCkptCheckpointCreationAttributesT | ClCkptCheckpointCreationAttributesT |
SaCkptSectionCreationAttributesT | ClCkptSectionCreationAttributesT |
SaCkptIOVectorElementT | ClCkptIOVectorElementT |
SA Forum APIs | OpenClovis APIs |
---|---|
saCkptInitialize | clCkptInitialize |
saCkptCheckpointOpen | clCkptCheckpointOpen |
saCkptSectionCreate | clCkptSectionCreate |
saCkptCheckpointClose | clCkptCheckpointClose |
saCkptFinalize | clCkptFinalize |
saCkptSectionOverwrite | clCkptSectionOverwrite |
saCkptCheckpointSynchronize | clCkptCheckpointSynchronize |
saCkptCheckpointRead | clCkptCheckpointRead |
How to Run csa103 and What to Observe
This sample application can run on all Runtime Hardware Setups. The following lists which node each csa103compI* runs on.
- Runtime Hardware Setup 1.1 and 2.1: csa103compI0 and csa103compI1 both run on the single node.
- Runtime Hardware Setup 1.2 and 2.2: csa103compI0 runs on PayloadNodeI0 and csa103compI1 runs on PayloadNodeI1.
- Runtime Hardware Setup 1.3 and 2.3: csa103compI2 and csa103compI3 run on SCNodeI0 and SCNodeI1 respectively. csa103compI0 and csa103compI1 run on PayloadNodeI0 and PayloadNodeI1 respectively.
In the four-node setup, SCNodeI0 and SCNodeI1 output data via /root/asp/var/log/csa103CompI2 and /root/asp/var/log/csa103CompI3 respectively. Therefore, in this case follow the instructions below, replacing csa103CompI0 and csa103compI1 with csa103compI2 and csa103compI3 respectively.
We run csa103 very much the same way we run any of the rest of the sample applications: with the SAFplus Platform Console.
- First, on the active System Controller, move csa103SGI0 to the LockAssignment state (to run csa203, use csa203SGI0 instead of csa103SGI0):
cli[Test]-> setc 1
cli[Test:SCNodeI0]-> setc cpm
cli[Test:SCNodeI0:CPM]-> amsLockAssignment sg csa103SGI0
The following output is given when you run tail -f on the csa103 log files. For example:
# tail -f /root/asp/var/log/csa103CompI0.log
/root/asp/var/log/csa103CompI0Log.latest
Mon Jul 14 00:01:09 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00030 : INFO) Component [csa103CompI0] : PID [15238]. Initializing
Mon Jul 14 00:01:09 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00031 : INFO) IOC Address : 0x1
Mon Jul 14 00:01:09 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00032 : INFO) IOC Port : 0x81
Mon Jul 14 00:01:09 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00033 : INFO) csa103: Instantiated as component instance csa103CompI0.
Mon Jul 14 00:01:09 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00034 : INFO) csa103CompI0: Waiting for CSI assignment...
Mon Jul 14 00:01:09 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00035 : INFO) csa103CompI0: Waiting for CSI assignment...
Mon Jul 14 00:01:09 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00036 : INFO) csa103CompI0: checkpoint_initialize
Mon Jul 14 00:01:10 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00043 : INFO) csa103CompI0: Checkpoint service initialized (handle=0x1)
Mon Jul 14 00:01:10 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00045 : INFO) csa103CompI0: Checkpoint opened (handle=0x2)
# tail -f /root/asp/var/log/csa103CompI1.log
/root/asp/var/log/csa103CompI1Log.latest
Mon Jul 14 00:01:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00030 : INFO) Component [csa103CompI1] : PID [15234]. Initializing
Mon Jul 14 00:01:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00031 : INFO) IOC Address : 0x1
Mon Jul 14 00:01:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00032 : INFO) IOC Port : 0x80
Mon Jul 14 00:01:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00033 : INFO) csa103: Instantiated as component instance csa103CompI1.
Mon Jul 14 00:01:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00034 : INFO) csa103CompI1: Waiting for CSI assignment...
Mon Jul 14 00:01:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00035 : INFO) csa103CompI1: Waiting for CSI assignment...
Mon Jul 14 00:01:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00036 : INFO) csa103CompI1: checkpoint_initialize
Mon Jul 14 00:01:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00043 : INFO) csa103CompI1: Checkpoint service initialized (handle=0x1)
Mon Jul 14 00:01:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00045 : INFO) csa103CompI1: Checkpoint opened (handle=0x2)
Mon Jul 14 00:01:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00053 : INFO) csa103CompI1: Section created
- Then, unlock the application by running
cli[Test:SCNodeI0:CPM]-> amsUnlock sg csa103SGI0
In the application log files, you should see
/root/asp/var/log/csa103CompI0Log.latest
Mon Jul 14 00:09:08 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00058 : INFO) Standby state requested from state 0
/root/asp/var/log/csa103CompI1Log.latest
Mon Jul 14 00:09:08 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00064 : INFO) csa103CompI1: Active state requested from state 0
Mon Jul 14 00:09:09 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00065 : INFO) csa103CompI1: Hello World! (seq=0)
Mon Jul 14 00:09:10 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00066 : INFO) csa103CompI1: Hello World! (seq=1)
Mon Jul 14 00:09:11 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00067 : INFO) csa103CompI1: Hello World! (seq=2)
Mon Jul 14 00:09:12 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00068 : INFO) csa103CompI1: Hello World! (seq=3)
Mon Jul 14 00:09:13 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00069 : INFO) csa103CompI1: Hello World! (seq=4)
- Next, find the active csa103 process (the one that's printing the Hello World lines) and kill it. To find the process ID, issue the following command from a bash shell.
# ps -eaf | grep csa103
This should produce an output that looks similar to the following.
root 17830 15663 0 14:21 ? 00:00:00 csa103Comp -p
root 17839 15663 0 14:21 ? 00:00:00 csa103Comp -p
root 18558 16145 0 14:32 pts/4 00:00:00 grep csa103
Notice the two entries that end with csa103Comp -p. These are our two component processes. The first one is usually the active process. This is the one that we will kill. In this case the process ID is 17830. So to kill the active component you issue the command:
# kill -9 17830
If this step does not result in the active component being killed then it is likely that the standby component was killed. In this case simply try killing the other process.
After killing the active component you should see lines in the log files like the following:
/root/asp/var/log/csa103CompI1Log.latest
Mon Jul 14 00:16:43 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00515 : INFO) csa103CompI1: Hello World! (seq=452)
Mon Jul 14 00:16:44 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00516 : INFO) csa103CompI1: Hello World! (seq=453)
Mon Jul 14 00:16:45 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00517 : INFO) csa103CompI1: Hello World! (seq=454)
Mon Jul 14 00:16:46 2008 (SCNodeI0.15234 : csa103CompEO.---.---.00518 : INFO) csa103CompI1: Hello World! (seq=455)
/var/log/csa103CompI0Log.latest
Mon Jul 14 00:16:48 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00065 : INFO) csa103CompI0: Active state requested from state 2
Mon Jul 14 00:16:48 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00066 : INFO) csa103CompI0 reading checkpoint
Mon Jul 14 00:16:48 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00067 : INFO) csa103CompI0 read checkpoint: seq = 456
Mon Jul 14 00:16:49 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00068 : INFO) csa103CompI0: Hello World! (seq=456)
Mon Jul 14 00:16:50 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00069 : INFO) csa103CompI0: Hello World! (seq=457)
Mon Jul 14 00:16:51 2008 (SCNodeI0.15238 : csa103CompEO.---.---.00070 : INFO) csa103CompI0: Hello World! (seq=458)
Here we can see CompI1 printing seq=455 and then dying, whereupon CompI0 gets the notice to take over processing, reads the checkpoint and then continues with seq=456 and so on.
Then, in the CompI1 log file we see:
/root/asp/var/log/csa103CompI1Log.latest
Mon Jul 14 00:16:49 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00044 : INFO) csa103: Instantiated as component instance csa103CompI1.
Mon Jul 14 00:16:49 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00045 : INFO) csa103CompI1: Waiting for CSI assignment...
Mon Jul 14 00:16:49 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00046 : INFO) csa103CompI1: Waiting for CSI assignment...
Mon Jul 14 00:16:49 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00047 : INFO) csa103CompI1: checkpoint_initialize
Mon Jul 14 00:16:50 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00054 : INFO) csa103CompI1: Checkpoint service initialized (handle=0x1)
Mon Jul 14 00:16:50 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00056 : INFO) csa103CompI1: Checkpoint opened (handle=0x2)
That is the component that had been killed being restarted.
CompI0 is moving along just fine, and we see CompI1 come back up. If we then kill CompI0 we see:
/root/asp/var/log/csa103CompI1Log.latest
Mon Jul 14 00:29:44 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00065 : INFO) csa103CompI1: Active state requested from state 2
Mon Jul 14 00:29:44 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00066 : INFO) csa103CompI1 reading checkpoint
Mon Jul 14 00:29:44 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00067 : INFO) csa103CompI1 read checkpoint: seq = 1225
Mon Jul 14 00:29:45 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00068 : INFO) csa103CompI1: Hello World! (seq=1225)
Mon Jul 14 00:29:46 2008 (SCNodeI0.15553 : csa103CompEO.---.---.00069 : INFO) csa103CompI1: Hello World! (seq=1226)
Here the CompI0 process has died, and the CompI1 process reads the sequence number from the checkpoint and then takes over from where CompI0 left off.
- To stop csa103 use the following SAFplus Platform Console command.
cli[Test:SCNodeI0:CPM]-> amsLockAssignment sg csa103SGI0
- Now change the state of csa103SGI0 to LockInstantiation and close the SAFplus Platform Console.
cli[Test:SCNodeI0:CPM]-> amsLockInstantiation sg csa103SGI0
cli[Test:SCNodeI0:CPM] -> end
cli[Test:SCNodeI0] -> end
cli[Test] -> bye
Additional Tests for Runtime Hardware Setup 1.3 and 2.3
Now repeat the experiment till the stage before we kill the active process and follow these steps.
Step A
- Stop SAFplus Platform on SCNodeI0 using /etc/init.d/asp stop. Observe the log /root/asp/var/log/csa103CompI3Log.latest on SCNodeI1. This component will become active and start printing the Hello World logs as above. Note that in this case the active System Controller, SCNodeI1, is still aware of PayloadNodeI0 and PayloadNodeI1.
- Start looking at the log /root/asp/var/log/csa103CompI0Log.latest on PayloadNodeI0. This will print the standby logs as above.
- Stop SAFplus Platform on PayloadNodeI0 and start observing the log /root/asp/var/log/csa103CompI1Log.latest on PayloadNodeI1 for standby. In this case you can see, via the active System Controller logs, that PayloadNodeI0 is not active whilst PayloadNodeI1 is.
Step B For Hardware Setup 1.3 only
- Repeat Step A, except that in step 1 the blade running SCNodeI0 is yanked out (/etc/init.d/asp zap) instead of gracefully shutting down SAFplus Platform.
Summary and References
We've seen:
- How to use SAFplus Messaging to communicate between any 2 processes
- How to take over a communications channel after a failover
Further information can be found in the following: SA-AIS-MSG-B* (SA-Forum Messaging Specification) and the OpenClovis API Reference Guide.