Bug 149288 - cman_tool: Error waiting for cluster
Summary: cman_tool: Error waiting for cluster
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cman
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Christine Caulfield
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2005-02-21 23:24 UTC by Adam "mantis" Manthei
Modified: 2009-04-24 14:35 UTC
1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-04-24 14:35:33 UTC
Embargoed:


Attachments
bug demonstration script (1.96 KB, text/plain)
2005-02-21 23:50 UTC, Adam "mantis" Manthei
log files from test run (2.50 KB, text/plain)
2005-02-21 23:53 UTC, Adam "mantis" Manthei

Description Adam "mantis" Manthei 2005-02-21 23:24:31 UTC
Description of problem:
I'm seeing the error "cman_tool: Error waiting for cluster" when
running "cman_tool -w join" while the cluster has other nodes logging
in and out of it.  This can cause machines to fail to start cman on
startup, which means that they will also fail to start clvmd and GFS.

Version-Release number of selected component (if applicable):
[root@trin-04 ~]# rpm -qa | grep cman
cman-kernheaders-2.6.9-18.0
cman-1.0-0.pre23.1
cman-kernel-2.6.9-21.0


How reproducible:
I can reproduce this every time with my test script, which I'll be
attaching shortly.

Steps to Reproduce:
1. I have six nodes in my cluster.  on each, I run:
    while : ; do
        service cman start || break
        service cman stop  || break
    done

2. Three nodes typically join the cluster (the cman initscript uses
`cman_tool -w join`) and start shutting down before the other three
finish joining.

3. The three nodes shutting down will then succeed (they were using
`cman_tool -w leave`) and try to join the cluster again with `cman_tool
-w join`.  At that point the nodes that failed to join previously
error out with a message on the console:

    CMAN: Been in JOINWAIT for too long - giving up

  
Actual results:
cman fails to start on all nodes

Expected results:
cman should start on all the nodes regardless of other nodes
joining and leaving

Additional info:
I have not seen this problem during actual tests yet, as I have not
been running any recovery tests lately, nor have I tried to produce
it by actually rebooting nodes (where I think it may be a problem).

I think that this is a case that needs to be handled by
cman_tool/cman.ko, but it is something that I might be able to
work around in the initscripts if need be.

Comment 1 Adam "mantis" Manthei 2005-02-21 23:50:34 UTC
Created attachment 111281 [details]
bug demonstration script

Comment 2 Adam "mantis" Manthei 2005-02-21 23:53:08 UTC
Created attachment 111282 [details]
log files from test run

Comment 3 Adam "mantis" Manthei 2005-02-21 23:59:34 UTC
BTW, in the previous run there was a bug in the test script that
caused the first node (in this case, trin-04) to be ignored.

ccsd has also been running for quite a while, as you can see, without
being in a quorate cluster :)

Lastly, the version of /etc/init.d/cman that I am using is:
    [void] grep cman cman/init.d/CVS/Entries 
    /cman/1.1.2.8/Mon Feb 21 19:26:53 2005//TRHEL4


Comment 4 Christine Caulfield 2005-02-22 10:36:44 UTC
To be honest this is not a surprise to me. If you are continually
joining and leaving nodes then there is a reasonable chance that
one will be squeezed out, as only one node can join the cluster at a
time, and it can't do that during another node's (up or down) transition.

cman_tool join -w waits until the node joins the cluster OR an error
occurs - it's clear in this case that an error /has/ occurred. There is
no point in cman_tool waiting any longer, because the node will not
join the cluster without being invoked again.

IMHO if there is a bug at all here it is the fact that a node doesn't
join the cluster in this situation, but any customer that has
join/leave in a loop like that is probably not of sound mind ;-)

Comment 5 Adam "mantis" Manthei 2005-02-22 15:00:15 UTC
Why can't `cman_tool -w join` invoke the join again in this case?  It
seems safe enough to me to allow it to attempt to rejoin, especially
if a timeout feature is added to the wait (Bug #149292).
Otherwise, won't I have to put something like this in the initscript:

start()
{
    ...
    until cman_tool -w join
    do
        echo retrying...
    done
    ...
}

Does `cman_tool -w leave` have the same problem?  
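The retry pattern proposed above can be sketched with a stub; `try_join` below is a stand-in that simulates a `cman_tool -w join` squeezed out twice before succeeding (the real command and its exit codes are not part of this sketch):

```shell
#!/bin/sh
# Stub standing in for `cman_tool -w join`: fail twice, then succeed,
# mimicking a node squeezed out by other nodes' transitions.
attempts=0
try_join() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]
}

# The proposed initscript pattern: keep reinvoking the join until it
# succeeds. A real initscript would want a cap on the number of retries.
until try_join; do
    echo "retrying... (attempt $attempts failed)"
done
echo "joined after $attempts attempts"
```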

Comment 6 Christine Caulfield 2005-02-22 16:40:54 UTC
With the timeout added this would make sense, I suppose, particularly
as the cman_tool binary has a better idea of the reason for the
failure and can still exit if something more drastic happens (e.g. out
of memory) where a retry would be inappropriate.

leave -w shouldn't be affected in the same way; the only reasons it
can fail are:
- not in the cluster
- subsystems active and "force" not requested
- in transition (fixed with -w)

For the first two there is no point in repeating the operation!
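The distinction drawn here (retry only when the failure can clear up on its own) can be sketched as a generic classifier; the reason strings below are illustrative stand-ins taken from the list above, not cman_tool's actual error reporting:

```shell
#!/bin/sh
# Hypothetical classifier for `cman_tool leave` failures: of the three
# reasons listed above, only "in transition" is transient, so it is the
# only one worth retrying.
leave_retryable() {
    case "$1" in
        "in transition") return 0 ;;                       # transient: retry
        "not in cluster"|"subsystems active") return 1 ;;  # permanent: give up
        *) return 1 ;;                                     # unknown: don't loop
    esac
}

leave_retryable "in transition"  && echo "would retry"
leave_retryable "not in cluster" || echo "would give up"
```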

Comment 7 Christine Caulfield 2005-02-23 16:52:18 UTC
I've added the join retry to cman_tool; see bug #149292.

If you don't think this fixes the problem then send this bug back. It
won't actually fix the underlying timeout, but it's such a bizarre
pathological situation that it's hardly a major bug IMHO.

Comment 8 Adam "mantis" Manthei 2005-02-24 06:06:51 UTC
The -t parameter to cman_tool definitely seems to help.  Using the
cman initscript (version 1.1.2.9), I am able to get much further in
the test run above before running into any issues.  I am also using
the cman_tool -q wait option in the script.

This has helped tremendously.  However, I still see a bug, though I'm
not sure if it is this one.  Somehow I managed to get all my nodes
wedged in the joining state, which prevented me from unloading the
module.  I think in your commit message for bug #149292 you mentioned
the possibility of a cman_tool cancel operation.  Perhaps that is
needed here?  I really don't know though.  I'm going to reset my test
and see what sort of results I have in the morning.

Comment 9 Christine Caulfield 2005-02-24 08:47:01 UTC
That goes back to my original (badly explained) point. If you start
all nodes looping in join/leave, there will never be a stable cluster
for anyone to join.

As soon as a joining node gets a join acknowledgement from a member
node, that member will cease to be a member and can no longer complete
the new node's join.

As a bug (and I won't deny it's a bug) it's almost impossible to fix
in this architecture. It's also not a problem that (m)any customers
will hit, I hope ;-)


