Doc:latest/sdkguide/knowledgebase

Revision as of 04:12, 27 August 2010 by Senthilk (Talk)




Knowledge Base

Logging

What is the default log rotation policy, and how can it be set?

The default file rotation policy is WRAP, as specified by the fileFullAction tag in the clLog.xml file in <model-name>/src/config or <sandbox>/etc/. This value has to be specified per stream.

The value WRAP means that the log service will start to overwrite the file once the log file becomes full. The file size is specified by the fileUnitSize tag (again per stream).

Other possible actions are HALT and ROTATE. HALT means that the log service will stop writing into the file as soon as it becomes full. ROTATE means that the log service will rotate the logs up to a certain number of files (as specified by the maximumFilesRotated tag) before overwriting the oldest log file. This is similar to the policy used by /var/log/messages.
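The three policies can be compared with a small model. This is only an illustrative sketch, not the log service's implementation: the function name simulate is hypothetical, unit_size and max_files stand in for fileUnitSize and maximumFilesRotated, and file size is counted in records rather than bytes.

```python
def simulate(records, policy, unit_size, max_files=0):
    """Toy model of one log stream. Returns the files on disk, oldest first.

    unit_size stands in for fileUnitSize (counted in records here) and
    max_files for maximumFilesRotated. Both mappings are assumptions.
    """
    if policy not in ("WRAP", "HALT", "ROTATE"):
        raise ValueError("unknown policy: %r" % policy)
    if policy == "ROTATE" and max_files < 1:
        raise ValueError("ROTATE needs max_files >= 1")

    files = [[]]      # each file is a list of records
    wrap_pos = 0      # next slot to overwrite once a WRAP file is full
    for rec in records:
        current = files[-1]
        if len(current) < unit_size:
            current.append(rec)          # room left: same for every policy
        elif policy == "HALT":
            break                        # file full: stop writing
        elif policy == "WRAP":
            current[wrap_pos] = rec      # file full: overwrite from the start
            wrap_pos = (wrap_pos + 1) % unit_size
        else:                            # ROTATE
            files.append([rec])          # file full: start a new file
            if len(files) > max_files:
                files.pop(0)             # too many files: drop the oldest
    return files
```

For seven records and a file size of three, HALT keeps only the first three records, WRAP ends with the newest records overwriting the oldest slots in the single file, and ROTATE with two files keeps the two most recent files.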

It can be set either in the IDE or by editing the clLog.xml file by hand. In the IDE you can set it under the log section of the SAFplus Platform component configuration in the Clovis menu. Please refer to section 7.2.2 of the IDE User Guide. The only thing to remember is that you need to set maximumFilesRotated to 0 if the rotation policy is either HALT or WRAP.

Programmatically, it can be set when you create a new log stream using the clLogStreamOpen() API. You need to set the fileFullAction field of pStreamAttr (of type ClLogStreamAttributesT), which is the fourth parameter to clLogStreamOpen().

Can it be set dynamically?

No. Once a stream has been opened with a set of attributes (through the APIs at runtime), its attributes cannot be changed. For example, if you opened a stream with fileFullAction set to ROTATE, you cannot later change it to HALT.

Is there a way to operationally force a log file rotation?

The rotation of the nodename.log files can be forced through the debug CLI in the cpm context:

bin/asp_console

setc master ;setc cpm; logrotate

(setc master operates directly on the AMF master. You can also use setc <slotid> to target a particular node shown by "list".)

After the above debug CLI command, the output of a subsequent amsdbprint goes to a new nodename.log, and the previous file is moved to nodename.log.1.

Separately, the size and number of rotations of the nodename log files can be controlled through environment variables. For example:

export CPM_LOG_FILE_SIZE=5m

export CPM_LOG_FILE_ROTATIONS=3

This forces the nodename log files to be rotated at a 5MB limit, 3 times. The default is 10MB and 4 rotations.


Is the flush interval applicable to log server or log client?

This is applicable to the log server. When an application has logs to be written, the log client immediately writes the records into shared memory; it does not wait for the flush interval. The log server then flushes these records from shared memory to the log handler (a file) at the given flush interval. This means that if a component crashes or hangs, all of its logs will still be written, because the records are already in shared memory. However, if the board fails, it is possible to lose all logs that have not yet been flushed.
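The division of work between client and server can be pictured with a toy model. This is a sketch for illustration only: the class and method names are hypothetical, and a Python list stands in for the shared memory segment and for the log file.

```python
class LogStreamModel:
    """Toy model: clients write to shared memory at once; the server
    copies pending records to the file only when it flushes."""

    def __init__(self):
        self.shared_memory = []   # records written by clients, not yet on disk
        self.log_file = []        # records the server has flushed

    def client_log(self, record):
        # The client never waits for the flush interval.
        self.shared_memory.append(record)

    def server_flush(self):
        # Called by the server once per flush interval.
        self.log_file.extend(self.shared_memory)
        self.shared_memory.clear()

    def board_failure(self):
        # Shared memory does not survive a board failure: unflushed records are lost.
        lost = list(self.shared_memory)
        self.shared_memory.clear()
        return lost
```

A component crash is not modelled as a separate event because it does not touch shared_memory: the records stay there and the next server_flush() writes them out.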


Does SAFplus Platform take any action when we fail to log? For example, if the log server is not able to log the records because the disk is full, then do we take any action so that we failover?

No. As of now the log server does not escalate this error to the AMF and hence failover is not done.

Can logs appear on both the system controller nodes?

By default the SAFplus Platform is configured to log to a local file for debugging convenience. However, the SAFplus Platform allows logs to be sent to other nodes in the system via configuration in the log.xml file. Additionally, a program on any node can always open any global log stream and process or forward the logs in any desired manner. See the Log API documentation for more details.

Startup Log Messages

Invalid [identifier] value in tag or attribute, rc=[0x0]

ERROR) Cannot get IOC master, return code [0x50016]

These can be ignored during bootup. The "Cannot get IOC master" error from the log server appears during the phase in which AMF is waiting for the leader election callback from GMS for the cluster leader. The log service depends on this election because it is also highly available.

ERROR) Configuration file [.../etc/clAmfEntityTrigger.xml] is missing

The AMF trigger XML error is harmless unless, of course, you are using the AMF trigger feature. That feature feeds triggers into the AMF trigger framework, controlled by thresholds for various entities as defined in clAmfEntityTrigger.xml. The file is not present in normal cases.

ERROR) Parser open error for file [clAmfEntityTrigger.xml]

This is the parse error for the same missing clAmfEntityTrigger.xml file. It is equally harmless unless you are using the AMF trigger feature, since the file is not present in normal cases.

WARN) DB not present. Reading AMS config from XML

The DB read warning from AMS means that you are either doing a first-time startup or starting with the --remove-persistent-db option (./etc/init.d/asp start --remove-persistent-db). The model configuration is being read from the XML files instead of from the database, so the model that comes up will be the unmodified model created in the IDE. Any modifications you made through the AMF management API calls are stored in the DB and are therefore not part of a model loaded from XML.

Alarms

Could I replace OpenClovis's alarm MIB (current alarm table...) with a proprietary MIB?

First of all, there is no proprietary OpenClovis alarm MIB. Customers can use any proprietary MIB with alarms defined, and the generated code takes care of raising traps when an alarm is raised.

Is there an alarm history MIB?

No, but the alarm history can be viewed and queried with the debug CLI commands queryAlarm, querypublishedAlarms, and showpublishedAlarms.

Mixed Endian and Mixed Word-size (Heterogeneous) Clusters

How are mixed endian and mixed word-size clusters supported?

Certain OpenClovis communications mechanisms, notably RMD (our remote procedure call library), are endian agnostic: they allow communication between machines of different endianness. Since all OpenClovis components are based on RMD, this allows OpenClovis to run in a heterogeneous environment.

XDR is the library that transforms data structures to and from network-endian format (and into canonical-size data items) so that they can be sent to other machines. RMD uses XDR internally to implement its endian-agnostic feature.

The customer can also take advantage of XDR and RMD. There are 2 ways. First, you can always access the APIs directly (for example, the APIs located in clXdrApi.h to do endian-swapping). Second, you can define your remote procedure calls in an XML format that we call "IDL". We then automatically generate wrappers around the RMD and XDR APIs that implement the marshalling and unmarshalling required for a remote procedure call, so that on the client side the remote procedure call looks just like a normal function call.

Is Mixed Endian Checkpointing supported?

The SAF checkpoint API does not handle endian-swapping because it is defined in terms of a byte buffer. Without knowing the structure of the data, it is impossible to byte-swap the data to be checkpointed. Therefore, to implement endian-safe checkpointing, you first transform the buffer via calls to the XDR library and then write the checkpoint. Upon read, you must again use XDR to transform the buffer back to the machine's native endianness.
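The encode-before-write, decode-after-read pattern can be sketched as follows. This sketch uses Python's standard struct module as a stand-in for the XDR library and does not call any OpenClovis API; the record layout (a 32-bit counter, a 16-bit state, and a 64-bit timestamp) and the function names are hypothetical.

```python
import struct

# "!" selects network (big-endian) byte order with fixed, unpadded field
# sizes, so the encoded buffer is identical on every architecture.
# Hypothetical record layout: uint32 counter, uint16 state, uint64 timestamp.
_RECORD = struct.Struct("!IHQ")


def encode_record(counter, state, timestamp):
    """Build the byte buffer that would be handed to the checkpoint write."""
    return _RECORD.pack(counter, state, timestamp)


def decode_record(buf):
    """Turn a buffer read from the checkpoint back into native values."""
    return _RECORD.unpack(buf)
```

The buffer returned by encode_record() is what gets written to the checkpoint section; a reader of either endianness and word size recovers the same three values with decode_record().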

Are there any limitations to using mixed endian in the cluster?

The 2 system controllers must be of the same architecture (endian and word size). By requiring this, we reduce the latency and the CPU utilization of the communication that occurs between the 2 system controllers.


Failovers

When a split brain scenario (double failure of ethernet link) is healed, who becomes master?

When a split brain is healed, the node with the higher node ID always takes over as leader. So you can assign a higher node ID to the node that you want to be the leader, and this will take effect on the merge from split brain.

Healthcheck/Heartbeat

Is component health check supported?

Yes, there are two component healthcheck mechanisms, passive and SAF. Passive healthchecking is enabled by default. To enable the SAF healthcheck, just use the SAF APIs supported by OpenClovis. The minimum SAF healthcheck period is 5 seconds, whereas the passive healthcheck operates in the order of a few milliseconds.

Essentially, the SAFplus Platform intranode communications mechanism implements the passive healthcheck. Each component has open TIPC sockets, so the TIPC Linux driver is notified whenever a component dies. The SAFplus Platform AMF receives TIPC notifications of component deaths and publishes an event. Any of your applications that need to know about the departure of the component can subscribe to these events and act appropriately. Because this is the faster means of healthcheck (in the order of a few milliseconds), the active healthcheck is disabled by default. To use this means of healthcheck, your program simply must quit or abort (exit(), abort(), assert() for example) when it has an unrecoverable error.

This passive heartbeating through TIPC removes most of the need for a SAF-style healthcheck, except for the case in which the component is deadlocked. To take care of this case you can enable the SAF healthcheck, or you can create your own periodic thread that verifies the operation of your program and exits if there is a problem.
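A periodic checking thread of this kind can be sketched as a simple watchdog. This is a minimal illustration in Python rather than in the component's own language, and none of it is an OpenClovis API: the Watchdog class, its kick() method, and the on_stall hook are hypothetical names. The main loop calls kick() whenever it makes progress; if no kick arrives within the timeout, the process exits so that the passive healthcheck can detect the death.

```python
import os
import threading
import time


class Watchdog:
    """Exit the process if the main loop stops reporting progress."""

    def __init__(self, timeout, on_stall=None):
        self._timeout = timeout
        # Default action: leave immediately, without cleanup, like abort().
        self._on_stall = on_stall or (lambda: os._exit(1))
        self._last_kick = time.monotonic()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def kick(self):
        """Called by the main loop each time it completes a unit of work."""
        with self._lock:
            self._last_kick = time.monotonic()

    def _run(self):
        # Check several times per timeout period until stopped.
        while not self._stop.wait(self._timeout / 4):
            with self._lock:
                stalled = time.monotonic() - self._last_kick > self._timeout
            if stalled:
                self._on_stall()
                return
```

The on_stall hook exists mainly so that the behaviour can be exercised without killing the test process; in a real component the default exit is the point of the exercise.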

AMF Configuration (Cluster-wide Application Structure)

Load Balancing Applications: We have two applications that have a dependency on each other, such that if either one restarts the other must also be restarted. The complication is that they are running on separate blades, and therefore in separate service groups. Can you suggest how we can restart the dependent application when either is restarted?

There are 2 ways of adding dependencies in the system, inter-node and inter-Service Instance.

If you want application 2 to depend on application 1, then I would recommend using SI (service instance) dependencies. An SI dependency essentially means that the dependent application (2) will not be "unlocked" until application 1 is unlocked, and 2 will be moved to "locked assignment" if 1 fails. Now this may not be quite what you were asking for, because you wanted 2 to be "restarted" (aka "locked instantiation" in SAF terminology), not just moved to "locked assignment". However, it should not be too hard to implement "shutdown" logic at the application layer in app 2 whenever the app is moved to "locked assignment".

The best "shutdown" implementation would be to stop the app's processing and wait in an idle loop. That interpretation is most in line with SAF concepts and will help you when you add HA to the systems.

To set up SI dependencies, in the IDE use Clovis->AMF Configuration->Service Group List->(pick dependent SG)->Service Instance List->(Pick your SI)->Dependencies Edit button. A dialog box will pop up showing all SIs in the system. Pick the one you want to be dependent on.

The inter-node method creates dependencies between nodes, which is only useful for solving the above problem if you are running just one application on each node. If you want the system to require blade B, then you can specify all other blades to be dependent on B. This is available in the IDE under Clovis->AMF Configuration->Node Instance List->(pick dependent node)->Dependencies Edit button. Check node B and then click OK. You need to do this for every node that is dependent on B.

How do I make an application automatically run (or not run) when the SAFplus Platform is started?

You can start and stop applications using our debug CLI; refer to the evaluation guide for how to do so. But to have the system start up with a particular application stopped, you must modify the startup configuration. You may do so via the IDE and regenerate the configuration files, but in this case it will be faster to simply edit the config files directly.

Go to the system controller blade and load the file "/root/asp/etc/amfDefinitions.xml". In here, all of the "service groups" (SAF terminology for a highly available group of applications) are defined. Each definition looks like this:

   <sgType name="vlcGroup_MN">
     <failbackOption>CL_FALSE</failbackOption>
     <restartDuration>10000</restartDuration>
     <numPrefActiveSUs>2</numPrefActiveSUs>
     <numPrefStandbySUs>1</numPrefStandbySUs>
     <numPrefInserviceSUs>3</numPrefInserviceSUs>
     <numPrefAssignedSUs>3</numPrefAssignedSUs>
     <numPrefActiveSUsPerSI>1</numPrefActiveSUsPerSI>
     <maxActiveSIsPerSU>1</maxActiveSIsPerSU>
     <maxStandbySIsPerSU>2</maxStandbySIsPerSU>
     <compRestartDuration>10000</compRestartDuration>
     <compRestartCountMax>0</compRestartCountMax>
     <suRestartDuration>10000</suRestartDuration>
     <suRestartCountMax>0</suRestartCountMax>
     <adminState>CL_AMS_ADMIN_STATE_LOCKED_I</adminState>
     <redundancyModel>CL_AMS_SG_REDUNDANCY_MODEL_M_PLUS_N</redundancyModel>
     <loadingStrategy>CL_AMS_SG_LOADING_STRATEGY_LEAST_SI_PER_SU</loadingStrategy>
     <autoRepair>CL_TRUE</autoRepair>
   </sgType>

Note that the first tag names the "service group": <sgType name="vlcGroup_MN">

In this example the service group is named "vlcGroup_MN". Go down to the <adminState> tag. It is probably "CL_AMS_ADMIN_STATE_UNLOCKED". "Unlocked" is SAF terminology for "run this application". To not run the application when the system is started, change this field to "CL_AMS_ADMIN_STATE_LOCKED_I". "LOCKED_I" is short for "Locked Instantiation", which is the SAF terminology for "do not run this application".

So, specifically, set all service groups to "CL_AMS_ADMIN_STATE_LOCKED_I" except for the one named "vlcGroup". That one is the havlc program.

Remember to make this edit on the system controller! Making this change on any other blade will have no effect, since it is the system controller that starts and stops applications on all blades.
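If there are many service groups, the edit can be scripted. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The function name set_admin_states is hypothetical, and the sketch assumes only what the sample above shows: sgType elements with a name attribute and an adminState child. The overall layout of amfDefinitions.xml is not assumed, so the function searches the whole tree.

```python
import xml.etree.ElementTree as ET

LOCKED_I = "CL_AMS_ADMIN_STATE_LOCKED_I"
UNLOCKED = "CL_AMS_ADMIN_STATE_UNLOCKED"


def set_admin_states(xml_text, keep_running):
    """Return xml_text with every service group locked except those named
    in keep_running, which are set to unlocked."""
    root = ET.fromstring(xml_text)
    for sg in root.iter("sgType"):        # finds sgType at any depth
        admin = sg.find("adminState")
        if admin is None:
            continue                      # leave groups without the tag untouched
        admin.text = UNLOCKED if sg.get("name") in keep_running else LOCKED_I
    return ET.tostring(root, encoding="unicode")
```

For the example in this section, keep_running would be {"vlcGroup"}. Keep a backup of the original file before writing the result back, since ElementTree does not preserve XML comments or the original formatting exactly.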