From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7) Gecko/20040626 Firefox/0.9.1

Description of problem:
For the first time in quite a while I had one of my four nodes form its own cluster while the other three formed another. I was just running my usual cluster startup script. There was no immediate sign that something was wrong until SM error messages started appearing.

Version-Release number of selected component (if applicable):

How reproducible:
Couldn't Reproduce

Steps to Reproduce:
This occurs very rarely on my 4 node cluster. I have all 4 nodes run "cman_tool join -c delta" in parallel

Actual Results:
one node has formed cluster "delta" on its own and the other three nodes have formed cluster "delta" together

Expected Results:
all four nodes form a single cluster

Additional info:
It looks like the delay calculated on receipt of a NEWCLUSTER message could occasionally be higher than the joinwait time which would cause a node to wait too long before trying again, thus the other nodes would have given up and formed a new cluster. I've fixed this exposure and also increased the joinwait timeout to be slightly longer too.

Checking in config.c;
/cvs/cluster/cluster/cman-kernel/src/config.c,v  <--  config.c
new revision: 1.2; previous revision: 1.1
done
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.3; previous revision: 1.2
done
This obviously needs more work...
This should work better. Based on ideas from Dave

Checking in src/config.c;
/cvs/cluster/cluster/cman-kernel/src/config.c,v  <--  config.c
new revision: 1.3; previous revision: 1.2
done
Checking in src/config.h;
/cvs/cluster/cluster/cman-kernel/src/config.h,v  <--  config.h
new revision: 1.2; previous revision: 1.1
done
Checking in src/membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.15; previous revision: 1.14
done
Checking in src/proc.c;
/cvs/cluster/cluster/cman-kernel/src/proc.c,v  <--  proc.c
new revision: 1.3; previous revision: 1.2
done
It's better but still not perfect (and we expect perfect, don't we?). I have a 7-node cluster. Before this last update, 4-5 "clusters" used to form in parallel, each with a node or two in it. After the update, there are 2-3 "clusters", which still isn't good. As a workaround, I have placed "sleep $((RANDOM / 1000 ))s" into my startup script, which somewhat helps the parallel startup situation but slows down the boot process.
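For anyone wanting to try the same workaround: bash's $RANDOM ranges from 0 to 32767, so "RANDOM / 1000" sleeps anywhere from 0 to 32 seconds. Below is a rough sketch of a variant with a configurable upper bound, so the boot-time cost can be capped. The function names random_delay and staggered_join and the MAX_DELAY variable are made up for illustration and are not part of cman; only the "cman_tool join -c delta" call comes from this report.

```shell
#!/bin/bash
# Hypothetical helper (not part of cman): print a random whole number
# of seconds between 0 and the given maximum, inclusive.
# $RANDOM is a bash built-in that yields an integer in 0..32767.
random_delay() {
    local max=${1:-32}
    echo $(( RANDOM % (max + 1) ))
}

# Sketch of a staggered join: wait a random, bounded time so that
# nodes started in parallel are less likely to join at the same moment.
# MAX_DELAY is an assumed tunable; 10 seconds is an arbitrary default.
staggered_join() {
    local max_delay=${MAX_DELAY:-10}
    sleep "$(random_delay "$max_delay")"
    cman_tool join -c delta
}
```

Calling staggered_join from the startup script in place of the bare cman_tool line would apply it. This only makes simultaneous joins less likely; it does not address the underlying race.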
There was a missing condition in that original check-in that made it little better than the original. I've corrected this now and I think it should be fixed. I'll wait for Lazar to confirm before changing the status of this bug report though.
No response from Lazar, but he's not said it's still broken :) It seems OK to me on my 12 node cluster now, so setting it to MODIFIED for the moment.
For info, Lazar said (on IRC) that he hasn't seen this bug since the last fix was applied.
Updating version to the right level in the defects. Sorry for the storm.
Not seen this in a long time.