Red Hat Bugzilla – Bug 149288
cman_tool: Error waiting for cluster
Last modified: 2009-04-24 10:35:33 EDT
Description of problem:
I'm seeing the error "cman_tool: Error waiting for cluster" when
running "cman_tool -w join" while the cluster has other nodes logging
in and out of it. This can cause machines to fail to start cman on
startup, which means they will also fail to start clvmd and GFS.
Version-Release number of selected component (if applicable):
[root@trin-04 ~]# rpm -qa | grep cman
I can reproduce this every time with the test script that I'll attach
shortly.
Steps to Reproduce:
1. I have six nodes in my cluster. On each, I run:
while : ; do
    service cman start || break
    service cman stop || break
done
2. Three nodes typically join the cluster (the cman initscript uses
`cman_tool -w join`) and start shutting down before the other three
can join.
3. The three nodes shutting down will then succeed (they were using
`cman_tool -w leave`) and try to join the cluster again with
`cman_tool -w join`. At that point, the nodes that failed to join
previously will error out with a message on the console:
CMAN: Been in JOINWAIT for too long - giving up
Actual results:
cman fails to start on all nodes.

Expected results:
cman should start on all nodes regardless of other nodes joining and
leaving.
I have not seen this problem during actual tests yet, as I have not
been running any recovery tests lately, nor have I tried to produce
this by actually rebooting nodes (where I think this may be a problem).
I think that this is a case that needs to be handled by
cman_tool/cman.ko, but it is something that I might be able to work
around in the initscripts if need be.
Created attachment 111281 [details]
bug demonstration script
Created attachment 111282 [details]
log files from test run
BTW, in the previous run there was a bug in the test script that
caused the first node to be ignored. (In this case, trin-04.)
ccsd has also been running for quite a while, as you can see, without
being in a quorate cluster :)
Lastly, the version of /etc/init.d/cman that I am using is:
[void] grep cman cman/init.d/CVS/Entries
/cman/18.104.22.168/Mon Feb 21 19:26:53 2005//TRHEL4
To be honest, this is not a surprise to me. If you are continually
joining and leaving nodes, then there is quite a reasonable chance that
one will be squeezed out, as only one node can join the cluster at a
time and it can't do that during another node's (up or down) transition.
cman_tool join -w waits until the node joins the cluster OR an error
occurs - it's clear in this case that an error /has/ occurred. There is
no point in cman_tool waiting any longer, because the node will not
join the cluster without it being invoked again.
IMHO if there is a bug at all here it is the fact that a node doesn't
join the cluster in this situation, but any customer that has
join/leave in a loop like that is probably not of sound mind ;-)
Why can't `cman_tool -w join` invoke the join again in this case? It
seems safe enough to me to allow it to attempt to rejoin, especially
if there is a timeout feature added to the wait (Bug #149292).
Otherwise, won't I have to put something like this in the initscript?

until cman_tool -w join; do
    sleep 1
done
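To make the retry idiom concrete, here is a minimal sketch of that
`until` loop with the join simulated by a stand-in function that fails
twice before succeeding (everything here is hypothetical; a real
initscript would call `cman_tool -w join` and sleep between attempts):

```shell
#!/bin/sh
# Sketch: retry 'join' until it succeeds. 'try_join' is a stand-in for
# 'cman_tool -w join'; it is simulated to fail twice, then succeed.
attempts=0
try_join() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]    # simulated: succeeds on the third attempt
}
until try_join; do
    :    # a real initscript would sleep here before retrying
done
echo "joined after $attempts attempts"
```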
Does `cman_tool -w leave` have the same problem?
With the timeout added this would make sense I suppose. Particularly
as the cman_tool binary has a better idea of the reason for the
failure and can still exit if something more drastic happens (e.g. out
of memory) where a retry would be inappropriate.
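That triage could be sketched like this: retry on a transient failure,
give up immediately on a fatal one. The exit codes below are invented
for illustration and the join is simulated; real cman_tool codes may
differ:

```shell
#!/bin/sh
# Sketch: retry transient join failures, bail out on fatal ones.
# Simulated exit codes: 0 = joined, 2 = transient failure, 1 = fatal.
attempt=0
join_once() {
    attempt=$((attempt + 1))
    if [ "$attempt" -lt 3 ]; then return 2; fi   # transient twice
    return 0                                     # then success
}
result=""
while :; do
    join_once
    case $? in
        0) result="joined on attempt $attempt"; break ;;
        2) continue ;;                      # transient: worth retrying
        *) result="fatal error"; break ;;   # e.g. out of memory: give up
    esac
done
echo "$result"
```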
leave -w shouldn't be affected in the same way; the only reasons it
can fail are:
- not in the cluster
- subsystems active and "force" not requested
- in transition (fixed with -w)
For the first two there is no point in repeating the operation!
I've added the join retry to cman_tool; see bug #149292.
If you don't think this fixes the problem then send this bug back. It
won't actually fix the underlying timeout but it's such a bizarre
pathological situation that it's hardly a major bug IMHO.
The -t parameters to cman_tool definitely seem to help. Using the
cman initscript (version 22.214.171.124), I am able to get much further in
the test run above before running into any issues. I am also using
the cman_tool -q wait option in the script.
This has helped tremendously. However, I still see a bug; I'm not sure
if it is this one. Somehow I managed to get all my nodes wedged in
the joining state, which prevented me from unloading the module. I
think in your commit message for bug #149292 you mentioned the
possibility of a cman_tool cancel operation. Perhaps that is needed
here? I really don't know, though. I'm going to reset my test and see
what sort of results I have in the morning.
That goes back to my original (badly explained) point. If you start
all nodes looping in join/leave there will never be a stable cluster
for anyone to join.
As soon as a joining node gets a join acknowledgement from a member
node, that member will then cease to be a member and can no longer
admit the new node.
As a bug (and I won't deny it's a bug) it's almost impossible to fix
in this architecture. It's also not a problem that (m)any customers
will hit, I hope ;-)