Red Hat Bugzilla – Bug 149288
cman_tool: Error waiting for cluster
Last modified: 2009-04-24 10:35:33 EDT
Description of problem:
I'm seeing the error "cman_tool: Error waiting for cluster" when
running "cman_tool -w join" while the cluster has other nodes logging
in and out of it. This can cause machines to fail to start cman on
startup, which means they will also fail to start clvmd and GFS.
Version-Release number of selected component (if applicable):
[root@trin-04 ~]# rpm -qa | grep cman
I can reproduce this every time with the test script that I'll attach
shortly.
Steps to Reproduce:
1. I have six nodes in my cluster. On each, I run:
while : ; do
    service cman start || break
    service cman stop || break
done
2. Three nodes typically join the cluster (the cman initscript uses
`cman_tool -w join`) and start shutting down before the other three
can join.
3. The three nodes shutting down will then succeed (they were using
`cman_tool -w leave`) and try to join the cluster again with
`cman_tool -w join`. At that point, the nodes that failed to join
previously will error out with a message on the console:
CMAN: Been in JOINWAIT for too long - giving up
Actual results:
cman fails to start on all nodes.

Expected results:
cman should start on all nodes regardless of other nodes joining and
leaving.
I have not seen this problem during actual tests yet, as I have not
been running any recovery tests lately, nor have I tried to produce
this by actually rebooting nodes (where I think this may be a problem).
I think that this is a case that needs to be handled by
cman_tool/cman.ko, but it is something that I might be able to work
around in the initscripts if need be.
Created attachment 111281 [details]
bug demonstration script
Created attachment 111282 [details]
log files from test run
BTW, in the previous run there was a bug in the test script that
caused the first node to be ignored. (In this case, trin-04.)
ccsd has also been running for quite a while, as you can see, without
being in a quorate cluster :)
Lastly, the version of /etc/init.d/cman that I am using is:
[void] grep cman cman/init.d/CVS/Entries
/cman/18.104.22.168/Mon Feb 21 19:26:53 2005//TRHEL4
To be honest, this is not a surprise to me. If you are continually
joining and leaving nodes, then there is quite a reasonable chance that
one will be squeezed out, as only one node can join the cluster at a
time and it can't do that during another node's (up or down) transition.
cman_tool join -w waits until the node joins the cluster OR an error
occurs - it's clear in this case that an error /has/ occurred. There is
no point in cman_tool waiting any longer, because the node will not
join the cluster without it being invoked again.
IMHO if there is a bug at all here it is the fact that a node doesn't
join the cluster in this situation, but any customer that has
join/leave in a loop like that is probably not of sound mind ;-)
Why can't `cman_tool -w join` invoke the join again in this case? It
seems safe enough to me to allow it to attempt to rejoin, especially
if there is a timeout feature added to the wait (Bug #149292).
Otherwise, won't I have to put something like this in the initscript?

until cman_tool -w join; do
    sleep 1
done
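To make the retry idiom concrete, here is a minimal sketch of that
`until` loop with the join simulated by a stand-in function that fails
twice before succeeding (everything here is hypothetical; a real
initscript would call `cman_tool -w join` and sleep between attempts):

```shell
#!/bin/sh
# Sketch: retry 'join' until it succeeds. 'try_join' is a stand-in for
# 'cman_tool -w join'; it is simulated to fail twice, then succeed.
attempts=0
try_join() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]    # simulated: succeeds on the third attempt
}
until try_join; do
    :    # a real initscript would sleep here before retrying
done
echo "joined after $attempts attempts"
```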
Does `cman_tool -w leave` have the same problem?
With the timeout added this would make sense I suppose. Particularly
as the cman_tool binary has a better idea of the reason for the
failure and can still exit if something more drastic happens (e.g. out
of memory) where a retry would be inappropriate.
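That triage could be sketched like this: retry on a transient failure,
give up immediately on a fatal one. The exit codes below are invented
for illustration and the join is simulated; real cman_tool codes may
differ:

```shell
#!/bin/sh
# Sketch: retry transient join failures, bail out on fatal ones.
# Simulated exit codes: 0 = joined, 2 = transient failure, 1 = fatal.
attempt=0
join_once() {
    attempt=$((attempt + 1))
    if [ "$attempt" -lt 3 ]; then return 2; fi   # transient twice
    return 0                                     # then success
}
result=""
while :; do
    join_once
    case $? in
        0) result="joined on attempt $attempt"; break ;;
        2) continue ;;                      # transient: worth retrying
        *) result="fatal error"; break ;;   # e.g. out of memory: give up
    esac
done
echo "$result"
```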
leave -w shouldn't be affected in the same way; the only reasons it
can fail are:
- not in the cluster
- subsystems active and "force" not requested
- in transition (fixed with -w)
For the first two there is no point in repeating the operation!
I've added the join retry to cman_tool; see bug #149292.
If you don't think this fixes the problem then send this bug back. It
won't actually fix the underlying timeout but it's such a bizarre
pathological situation that it's hardly a major bug IMHO.
The -t parameters to cman_tool definitely seem to help. Using the
cman initscript (version 22.214.171.124), I am able to get much further in
the test run above before running into any issues. I am also using
the cman_tool -q wait option in the script.
This has helped tremendously. However, I still see a bug; I'm not sure
if it is this one. Somehow I managed to get all my nodes wedged in
the joining state, which prevented me from unloading the module. I
think in your commit message for bug #149292 you mentioned the
possibility of a cman_tool cancel operation. Perhaps that is needed
here? I really don't know, though. I'm going to reset my test and see
what sort of results I have in the morning.
That goes back to my original (badly explained) point. If you start
all nodes looping in join/leave there will never be a stable cluster
for anyone to join.
As soon as a joining node gets a join acknowledgement from a member
node, that member will then cease to be a member and can no longer
admit the new node.
As a bug (and I won't deny it's a bug) it's almost impossible to fix
in this architecture. It's also not a problem that (m)any customers
will hit, I hope ;-)