Bug 144806 - ccsd not handling all clu_connect errors on startup appropriately
Status: CLOSED CURRENTRELEASE
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: ccs
Version: 4
Hardware: All Linux
Priority: medium, Severity: medium
Assigned To: Jonathan Earl Brassow
QA Contact: Cluster QE
Reported: 2005-01-11 12:24 EST by Adam "mantis" Manthei
Modified: 2010-01-27 13:03 EST (History)
Doc Type: Bug Fix
Last Closed: 2010-01-27 13:03:52 EST
Attachments
add additional error checks on startup (1.48 KB, text/plain)
2005-01-11 12:27 EST, Adam "mantis" Manthei

Description Adam "mantis" Manthei 2005-01-11 12:24:19 EST
Description of problem:
The magma plugin that I have for gulm is for protocol version
0x67000014 and my server is protocol version 0x67000015.  lock_gulmd
will not allow the plugin to connect and sets errno to EAFNOSUPPORT.
With this particular error, clu_connect() can never succeed.

When starting the cluster_communicator thread, ccsd checks for a
few error cases, but otherwise treats all other errors as
acceptable.  Perhaps a better solution would be to treat all errors
as terminal and report the problem to the parent process.  Perhaps a
retry count could also be added for a little more robustness when
cman or gulm has yet to be started.
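
One way to read the retry-count idea is sketched below. This is not the ccsd code: `connect_fn` is a hypothetical stand-in for a wrapper around clu_connect(), assumed to return a descriptor on success or -1 with errno set, and the choice of EAFNOSUPPORT as the only unrecoverable error is an assumption taken from this report. The two stub functions exist only to exercise the loop.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a clu_connect() wrapper: returns a descriptor
 * >= 0 on success, or -1 with errno set on failure. */
typedef int (*connect_fn)(void);

/* Assumption: EAFNOSUPPORT (protocol version mismatch) is the only error
 * known to be unrecoverable; everything else is worth retrying. */
static bool connect_error_is_fatal(int err)
{
    return err == EAFNOSUPPORT;
}

/* Try to connect at most max_attempts times.
 * Returns the descriptor on success, or the negated errno of the last
 * failure so the caller can report it to the parent process. */
static int connect_with_retries(connect_fn try_connect, int max_attempts)
{
    int last_err = 0;

    assert(max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        int fd = try_connect();
        if (fd >= 0)
            return fd;
        last_err = errno;
        if (connect_error_is_fatal(last_err))
            break;              /* retrying cannot help */
        /* A real daemon would sleep between attempts here. */
    }
    return -last_err;
}

/* Illustrative stubs only; they are not part of any real API. */
static int stub_calls;

static int stub_version_mismatch(void)
{
    stub_calls++;
    errno = EAFNOSUPPORT;
    return -1;
}

static int stub_ready_on_third_call(void)
{
    if (++stub_calls < 3) {
        errno = ECONNREFUSED;   /* cluster manager not started yet */
        return -1;
    }
    return 7;                   /* arbitrary descriptor value */
}
```

With these stubs, a version mismatch stops after a single attempt, while a cluster manager that comes up late is picked up as long as the retry count is large enough.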


Version-Release number of selected component (if applicable):
/ccsd.c/1.14.2.2/Tue Jan  4 23:31:30 2005//TRHEL4
/cluster_mgr.c/1.10.2.1/Tue Jan  4 21:59:14 2005//TRHEL4
/cluster_mgr.h/1.2/Thu Aug 12 18:21:03 2004//TRHEL4


Additional info:
While in the clu_connect loop, ccsd dies if it receives a SIGHUP,
exiting without any messages in the logs.  If ccsd did not return
until clu_connect succeeded, dying on SIGHUP would be a little less
unexpected.
Comment 1 Adam "mantis" Manthei 2005-01-11 12:27:20 EST
Created attachment 109621
add additional error checks on startup

This adds additional error checking on startup.  If ccsd can't connect to magma
within CCSD_CONNECT_RETRY seconds, it fails and prints an error to stderr.
(The #define for CCSD_CONNECT_RETRY is in a gross spot, but it at least
demonstrates my intent.)
Comment 2 Adam "mantis" Manthei 2005-01-11 13:11:50 EST
The above patch does lead to other problems, in that ccsd will not
return until it connects to cman or gulm.  This will cause problems
for the init scripts, since gulm/cman are started after ccsd.

Is it better for ccsd to stop after failing to connect with
clu_connect for so many seconds?  At the very least, there should
probably be some messages printed after a certain number of failed
clu_connect() calls, indicating in the logs that ccs is having issues.
(This is not obvious unless you are looking at the code.)

We might also want to consider ignoring SIGHUP, or logging a message
stating that ccsd is not ready to process the cluster.conf file until
the clu_connect call succeeds, instead of dying by default as we do
right now.
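
A minimal sketch of the "log instead of dying" option follows. It is not taken from ccsd: the function names, the `connected` flag, and the message text are all hypothetical, and the only real API used is POSIX sigaction(). The handler just records the signal; the main loop later decides whether a reload can happen.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Set by the handler, read by the main loop. */
static volatile sig_atomic_t sighup_pending = 0;

static void on_sighup(int signo)
{
    (void)signo;
    sighup_pending = 1;     /* only async-signal-safe work in the handler */
}

/* Replace the default SIGHUP action (terminate) with our handler.
 * Returns 0 on success, -1 on failure, as sigaction() does. */
static int install_sighup_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sighup;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGHUP, &sa, NULL);
}

/* Called from the main loop. Returns true if cluster.conf should be
 * reprocessed now. A SIGHUP that arrives before the cluster connection
 * exists is logged and dropped (an assumed policy, not the ccsd one). */
static bool consume_sighup(bool connected)
{
    if (!sighup_pending)
        return false;
    sighup_pending = 0;
    if (!connected) {
        fprintf(stderr, "ccsd: SIGHUP ignored; not ready to process "
                        "cluster.conf until the cluster connection is up\n");
        return false;
    }
    return true;
}
```

Dropping the early SIGHUP keeps the sketch simple; deferring the reload until the connection succeeds would be an equally reasonable policy.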

Comment 3 Jonathan Earl Brassow 2005-01-11 19:15:36 EST
A warning is now printed every ten seconds if a connection to the
cluster infrastructure cannot be made.

In effect, the user must run 'cman_tool join' or 'lock_gulmd' within
10 seconds of starting ccsd to avoid the warning.

Perhaps it would be wise to bump this value to a larger number and
special-case EAFNOSUPPORT.
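
The behaviour described here, plus the proposed EAFNOSUPPORT special case, could be expressed as a small decision function. This is a sketch under assumptions, not the committed change: the enum, the function name, and the idea that the loop polls once per second are all hypothetical. The warning interval is a parameter so it can be raised above ten seconds.

```c
#include <errno.h>

/* What the connect loop should do after a failed attempt. */
enum connect_action {
    RETRY_QUIET,    /* try again without logging */
    RETRY_WARN,     /* try again, but log a warning first */
    GIVE_UP         /* the error can never clear; stop and report it */
};

/* err:            errno from the failed connection attempt
 * secs_elapsed:   whole seconds since ccsd started trying (assumes the
 *                 loop runs about once per second)
 * warn_interval:  seconds between warnings, e.g. 10 */
static enum connect_action next_action(int err, int secs_elapsed,
                                       int warn_interval)
{
    if (err == EAFNOSUPPORT)
        return GIVE_UP;     /* protocol mismatch: waiting will not help */
    if (warn_interval > 0 && secs_elapsed > 0 &&
        secs_elapsed % warn_interval == 0)
        return RETRY_WARN;
    return RETRY_QUIET;
}
```

Keeping the decision separate from the loop means the interval and the list of unrecoverable errors can change without touching the sleep and connect logic.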
