Bug 144806

Summary: ccsd not handling all clu_connect errors on startup appropriately
Product: [Retired] Red Hat Cluster Suite Reporter: Adam "mantis" Manthei <amanthei>
Component: ccs Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED CURRENTRELEASE QA Contact: Cluster QE <mspqa-list>
Severity: medium Docs Contact:
Priority: medium    
Version: 4 CC: cluster-maint
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-01-27 18:03:52 UTC Type: ---
Attachments:
Description Flags
add additional error checks on startup none

Description Adam "mantis" Manthei 2005-01-11 17:24:19 UTC
Description of problem:
The magma plugin that I have for gulm speaks protocol version
0x67000014, while my server speaks protocol version 0x67000015.
lock_gulmd refuses to let the plugin connect, which causes errno to be
set to EAFNOSUPPORT.  Because this mismatch is permanent, clu_connect()
can never succeed while this error is being returned.

When starting the cluster_communicator thread, ccsd checks for a
few error cases, but otherwise treats all other errors as
acceptable.  Perhaps a better solution would be to treat all errors
as terminal and report the problem to the parent process.  A retry
count could also be added to give a little more robustness for the
case where cman or gulm has yet to be started.
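
The retry-and-classify idea above could be sketched roughly as follows.
This is only an illustration, not the ccsd code: connect_fn,
is_terminal_error(), connect_with_retry() and the two stub functions are
hypothetical names, and the stubs stand in for clu_connect(), whose real
signature is not shown in this report.  The sketch assumes EAFNOSUPPORT
is the only error that should abort immediately.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for the real connect call: returns a
 * descriptor >= 0 on success, or -1 with errno set on failure. */
typedef int (*connect_fn)(void);

/* Assumption: a protocol mismatch can never clear up by waiting. */
static int is_terminal_error(int err)
{
    return err == EAFNOSUPPORT;
}

/* Try to connect up to max_tries times, sleeping delay_sec between
 * attempts.  Returns the descriptor on success or -errno on failure. */
static int connect_with_retry(connect_fn try_connect, int max_tries,
                              unsigned int delay_sec)
{
    int err = EINVAL; /* reported if max_tries <= 0 */

    for (int attempt = 1; attempt <= max_tries; attempt++) {
        int fd = try_connect();
        if (fd >= 0)
            return fd;

        err = errno; /* save before any other call can change it */
        if (is_terminal_error(err)) {
            fprintf(stderr, "connect failed permanently: %s\n",
                    strerror(err));
            return -err;
        }
        if (attempt < max_tries && delay_sec > 0)
            sleep(delay_sec);
    }

    fprintf(stderr, "giving up after %d attempts: %s\n",
            max_tries, strerror(err));
    return -err;
}

/* Demo stubs only; they are not part of any real cluster API. */
static int stub_calls;

static int stub_version_mismatch(void)
{
    stub_calls++;
    errno = EAFNOSUPPORT;
    return -1;
}

static int stub_ready_on_third_try(void)
{
    if (++stub_calls < 3) {
        errno = ECONNREFUSED;
        return -1;
    }
    return 3; /* pretend descriptor */
}
```

With these stubs, connect_with_retry(stub_version_mismatch, 5, 0) stops
after a single call, while connect_with_retry(stub_ready_on_third_try,
5, 0) keeps trying until the third call succeeds.  The negative return
value is what a caller could pass back to the parent process.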


Version-Release number of selected component (if applicable):
/ccsd.c/1.14.2.2/Tue Jan  4 23:31:30 2005//TRHEL4
/cluster_mgr.c/1.10.2.1/Tue Jan  4 21:59:14 2005//TRHEL4
/cluster_mgr.h/1.2/Thu Aug 12 18:21:03 2004//TRHEL4


Additional info:
While in the clu_connect loop, ccsd dies if it receives a SIGHUP,
exiting without leaving any message in the logs.  If ccsd did not
return until a clu_connect call had succeeded, dying on SIGHUP would
be a little less unexpected.

Comment 1 Adam "mantis" Manthei 2005-01-11 17:27:20 UTC
Created attachment 109621
add additional error checks on startup

This adds additional error checking on startup.  If ccsd can't connect to magma
after CCSD_CONNECT_RETRY seconds, it will fail and print an error to stderr.
(The #define for CCSD_CONNECT_RETRY is in a gross spot, but it at least
demonstrates my intent.)

Comment 2 Adam "mantis" Manthei 2005-01-11 18:11:50 UTC
The above patch does lead to other problems: ccsd will not return
until it connects to cman or gulm.  This will cause problems for the
init scripts, since gulm/cman are started after ccsd.

Is it better for ccsd to stop after failing to connect with clu_connect
for so many seconds?  At the very least, there should probably be
some messages printed to the logs after a certain number of failed
clu_connect() calls, indicating that ccs is having issues.
(This is not obvious unless you are looking at the code.)

We might also want to consider ignoring SIGHUP, or logging a message
stating that ccsd is not ready to process the cluster.conf file until
the clu_connect call succeeds, instead of dying by default as we do
right now.



Comment 3 Jonathan Earl Brassow 2005-01-12 00:15:36 UTC
A warning is now printed every ten seconds if a connection to the
cluster infrastructure cannot be made.

This is like saying the user must run 'cman_tool join' or 'lock_gulmd'
within 10 seconds of starting ccsd.

Perhaps it would be wise to increase this value and to special-case
EAFNOSUPPORT.
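
A periodic warning with an adjustable interval might be sketched like
this.  The names WARN_INTERVAL_SEC and warning_due() are hypothetical,
and the value 30 is only a placeholder for "a larger number", not a
value taken from the fix.

```c
#include <time.h>

/* Hypothetical interval; the value is a placeholder. */
#define WARN_INTERVAL_SEC 30

/* Returns 1 when at least interval_sec seconds have passed since
 * *last_mark, and moves the mark forward; otherwise returns 0.
 * The caller initializes *last_mark to the time ccsd started. */
static int warning_due(time_t now, time_t *last_mark, int interval_sec)
{
    if (difftime(now, *last_mark) < (double)interval_sec)
        return 0;

    *last_mark = now;
    return 1;
}
```

The connect loop would call warning_due(time(NULL), &mark,
WARN_INTERVAL_SEC) after each failed attempt and print the warning only
when it returns 1.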