Description of problem:

Every node is in the cman cluster:

[root@morph-02 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    5   M   morph-01
   2    1    5   M   morph-03
   3    1    5   M   morph-05
   4    1    5   M   morph-04
   5    1    5   M   morph-02

[root@morph-02 root]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: morph-cluster
Cluster ID: 41652
Membership state: Cluster-Member
Nodes: 5
Expected_votes: 5
Total_votes: 5
Quorum: 3
Active subsystems: 0
Node addresses: 192.168.44.62

I then do a 'cman_tool leave' on all nodes at the same time, and the cmd on the "last" node hangs.

All nodes but morph-02 are no longer in the cluster:

[root@morph-01 root]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: morph-cluster
Cluster ID: 41652
Membership state: Not-in-Cluster
[root@morph-01 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name

[root@morph-03 root]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: morph-cluster
Cluster ID: 41652
Membership state: Not-in-Cluster
[root@morph-03 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name

[root@morph-04 root]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: morph-cluster
Cluster ID: 41652
Membership state: Not-in-Cluster
[root@morph-04 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name

[root@morph-05 root]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: morph-cluster
Cluster ID: 41652
Membership state: Not-in-Cluster
[root@morph-05 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name

But morph-02 has a different view:

[root@morph-02 root]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: morph-cluster
Cluster ID: 41652
Membership state: Transition-Master
Nodes: 4
Expected_votes: 5
Total_votes: 4
Quorum: 3
Active subsystems: 0
Node addresses: 192.168.44.62

[root@morph-02 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    5   X   morph-01
   2    1    5   M   morph-03
   3    1    5   M   morph-05
   4    1    5   M   morph-04
   5    1    5   M   morph-02

...and a still hung cman_tool leave cmd.

All the other nodes spit out the following messages:

Jan 26 17:24:38 morph-01 ccsd[3813]: Unable to connect to cluster infrastructure after 990 seconds.

Version-Release number of selected component (if applicable):
CMAN <CVS> (built Jan 25 2005 15:37:28) installed

How reproducible:
Always
How do you manage to do it "at the same time"? Every time I try it, most of the nodes won't leave because they are already doing a transition to remove the first node. In theory (ahem) this should time out once the last node notices that the rest have gone away.
I open sessions to all nodes and then use the "Send Input to All Sessions" ability from this window manager under the "View" tab. I waited quite a while, so I'm not too sure it would time out eventually. It looked pretty hung, but I could wait and actually see if you want me to.
"Window Manager"? "View Tab"? What are these things of which you speak? Is that anything like a screen session? The nearest I can get is screen's :at bench# stuff 'cman_tool leave'\012 which still isn't quick enough to catch the others out. If you've waited more than a couple of minutes and it's not timed out, then I suspect it's not going to. The worst case is TRANSITION_RESTARTS*TRANSITION_TIMER (10x15 seconds, 2.5 minutes). So it looks like the transition timer probably isn't firing.
Ok, I've managed to reproduce this with a slightly hacked up cnxman.c (rip the transition check out of the ioctl code). I need to run some more tests over the weekend. The last node will still take a couple of minutes to die but it's such an odd circumstance that I'm not going to lose any sleep over it. What is really needed here is something like VMS's CLUSTER_SHUTDOWN option, but that will have to wait.
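For reference, the "couple of minutes" figure follows from the worst-case bound quoted earlier in this thread (TRANSITION_RESTARTS*TRANSITION_TIMER). A minimal sketch of that arithmetic, assuming the values 10 and 15 seconds mentioned above; the actual definitions in the cman source may differ, and the function name is hypothetical:

```c
/* Values as quoted in this thread; assumed, not copied from the cman source. */
#define TRANSITION_RESTARTS 10   /* number of transition restarts before giving up */
#define TRANSITION_TIMER    15   /* seconds allowed per transition attempt */

/* Hypothetical helper: worst-case seconds before the last node should time out. */
int worst_case_transition_seconds(void)
{
    return TRANSITION_RESTARTS * TRANSITION_TIMER;
}
```

With these values the bound is 150 seconds, i.e. 2.5 minutes, so a leave that is still hung well past that point suggests the timer is not firing at all.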
The heartbeat thread didn't take any notice of the "quit_threads" flag, relying instead on its friends to shut it down. This was not reliable when we were the last node out of a cluster.

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v <-- membership.c
new revision: 1.57; previous revision: 1.56
done

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v <-- membership.c
new revision: 1.44.2.7; previous revision: 1.44.2.6
done
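To illustrate the general pattern (not the actual membership.c change), here is a userspace sketch using pthreads and C11 atomics in which the worker loop checks a quit flag itself instead of relying on another thread to stop it. Only the name quit_threads comes from the comment above; everything else is hypothetical, and the real code is a kernel thread.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int quit_threads;    /* set to 1 to ask the worker to exit */
static atomic_ulong beats_sent;    /* stand-in for heartbeats actually sent */

/* Hypothetical heartbeat worker: re-checks the flag on every iteration. */
static void *heartbeat_thread(void *arg)
{
    (void)arg;
    while (!atomic_load(&quit_threads))
        atomic_fetch_add(&beats_sent, 1);   /* placeholder for real work */
    return NULL;
}

/* Start the worker, wait until it has run at least once, request shutdown,
 * and join it. Returns 0 if the thread was created and exited cleanly. */
int run_and_stop_heartbeat(void)
{
    pthread_t tid;

    atomic_store(&quit_threads, 0);
    atomic_store(&beats_sent, 0);
    if (pthread_create(&tid, NULL, heartbeat_thread, NULL) != 0)
        return -1;
    while (atomic_load(&beats_sent) == 0)
        ;                                   /* spin until the worker is running */
    atomic_store(&quit_threads, 1);
    return pthread_join(tid, NULL);
}
```

Because the worker observes the flag directly, it exits even when no other thread is left to shut it down, which is the situation described for the last node out of the cluster.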
still seeing this, although not as often.
Take 2. There were places where threads could have been blocked waiting for things to happen that were never going to.

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v <-- cnxman.c
new revision: 1.48; previous revision: 1.47
done
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v <-- membership.c
new revision: 1.59; previous revision: 1.58
done

RHEL4 branch:

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v <-- cnxman.c
new revision: 1.42.2.6; previous revision: 1.42.2.5
done
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v <-- membership.c
new revision: 1.44.2.8; previous revision: 1.44.2.7
done
fix verified.