Red Hat Bugzilla – Bug 144806
ccsd not handling all clu_connect errors on startup appropriately
Last modified: 2010-01-27 13:03:52 EST
Description of problem:
The magma plugin that I have for gulm is for protocol version
0x67000014 and my server is protocol version 0x67000015. lock_gulmd
will not allow the plugin to connect and is causing errno to be set to
EAFNOSUPPORT. This particular error will never allow clu_connect() to
When starting the cluster_communicator thread, ccsd checks for a
few error cases, but otherwise determines that all other errors are
acceptable. Perhaps a better solution would be to treat all errors
as terminal and report the problem to the parent process. Perhaps a
retry count could also be added to add a little more robustness for
when cman or gulm have yet to be started.
Version-Release number of selected component (if applicable):
/ccsd.c/18.104.22.168/Tue Jan 4 23:31:30 2005//TRHEL4
/cluster_mgr.c/22.214.171.124/Tue Jan 4 21:59:14 2005//TRHEL4
/cluster_mgr.h/1.2/Thu Aug 12 18:21:03 2004//TRHEL4
Steps to Reproduce:
While in the clu_connect loop, ccsd will die if it receives a SIGHUP
and then exit without any messages in the logs. If ccsd does not
return until a successful clu_connect call, dying on SIGHUP is a
little less unexpected.
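One way to avoid the silent exit described here is to catch SIGHUP and
log that the request was deferred. The following is only a sketch under
that assumption; install_sighup_handler() and report_deferred_sighup()
are hypothetical names and are not taken from ccsd.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Set by the handler, consumed by the connect loop. */
static volatile sig_atomic_t sighup_pending = 0;

/* Only record the signal; logging is not async-signal-safe. */
static void on_sighup(int signo)
{
    (void)signo;
    sighup_pending = 1;
}

/* Replace the default SIGHUP action (terminate) with the handler above.
 * Returns 0 on success, -1 with errno set on failure. */
static int install_sighup_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sighup;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGHUP, &sa, NULL);
}

/* Hypothetical helper to call from the connect loop.
 * Returns 1 if a SIGHUP arrived and was logged, 0 otherwise. */
static int report_deferred_sighup(void)
{
    if (!sighup_pending)
        return 0;
    sighup_pending = 0;
    fprintf(stderr, "SIGHUP received before the cluster connection is up; "
                    "cluster.conf will not be processed yet\n");
    return 1;
}
```

The handler only sets a flag, so the message is written from normal
context rather than from inside the signal handler.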
Created attachment 109621 [details]
add additional error checks on startup
This adds additional error checking on startup. If ccsd can't connect to magma
after CCSD_CONNECT_RETRY seconds, it will fail and print an error to stderr.
(The #define for CCSD_CONNECT_RETRY is in a gross spot, but it at least
demonstrates my intent.)
The above patch does lead to other problems in that ccsd will not
return until it connects to cman or gulm... this will cause problems
for the init scripts since gulm/cman are started after ccsd.
Is it better for ccsd to stop after failing to connect w/ clu_connect
after so many seconds? At the very least, there should probably be
some messages that are printed after a certain number of failed
clu_connect() calls indicating in the logs that ccs is having issues.
(This is not obvious unless you are looking at the code)
We might also want to consider ignoring SIGHUP or logging a message
stating that ccsd is not ready to process the cluster.conf file until
the clu_connect call succeeds instead of dying by default as we do
A warning is now printed every ten seconds if a connection to the
cluster infrastructure cannot be made.
This is like saying the user must run 'cman_tool join' or 'lock_gulmd'
within 10 seconds of starting ccsd.
Perhaps it would be wise to bump this value to a larger number and
special case the EAFNOSUPPORT.
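The policy suggested in this thread (retry for a while, warn
periodically, and give up at once on EAFNOSUPPORT) could look roughly
like the sketch below. It does not call the real clu_connect();
connect_fn, connect_with_retry() and the fake_* stand-ins are
hypothetical, and the limits are passed in as parameters rather than
taken from ccsd.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical connect callback: returns 0 on success, -1 with errno set. */
typedef int (*connect_fn)(void);

/*
 * Try to connect up to max_attempts times, sleeping sleep_secs between tries.
 * Returns 0 on success, -EAFNOSUPPORT immediately on a protocol mismatch
 * (retrying cannot fix that), or -ETIMEDOUT once all attempts have failed.
 * A warning is printed after every warn_every failed attempts.
 */
static int connect_with_retry(connect_fn try_connect, int max_attempts,
                              int warn_every, unsigned int sleep_secs)
{
    int attempt;

    for (attempt = 1; attempt <= max_attempts; attempt++) {
        int saved_errno;

        if (try_connect() == 0)
            return 0;
        saved_errno = errno;   /* fprintf() below may change errno */

        if (saved_errno == EAFNOSUPPORT) {
            fprintf(stderr, "protocol version mismatch; not retrying\n");
            return -EAFNOSUPPORT;
        }
        if (warn_every > 0 && attempt % warn_every == 0)
            fprintf(stderr, "unable to connect after %d attempts: %s\n",
                    attempt, strerror(saved_errno));
        sleep(sleep_secs);
    }
    return -ETIMEDOUT;
}

/* Stand-ins for the cluster connection, for demonstration only. */
static int fake_attempts;

static int fake_mismatch(void)
{
    fake_attempts++;
    errno = EAFNOSUPPORT;
    return -1;
}

static int fake_not_ready(void)
{
    fake_attempts++;
    errno = ECONNREFUSED;
    return -1;
}

static int fake_ready_on_third(void)
{
    if (++fake_attempts >= 3)
        return 0;
    errno = ECONNREFUSED;
    return -1;
}
```

With sleep_secs set to 1, max_attempts is approximately a timeout in
seconds; for example, connect_with_retry(fake_not_ready, 60, 10, 1)
would warn every ten seconds and give up after about a minute.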