Description of problem:
I have been seeing this quite a bit lately running revolver. Revolver will shoot its nodes, and when they are brought back up the cman join ends up deadlocking. In this case, tank-03 and tank-05 were shot; they came back up, had ccsd started on them, and then a cman_tool join was attempted. For whatever reason, tank-01 thinks tank-02 is the master and vice versa:

[root@tank-01 ~]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: tank-cluster
Cluster ID: 46516
Cluster Member: Yes
Membership state: State-Transition: Master is tank-02
Nodes: 3
Expected_votes: 5
Total_votes: 3
Quorum: 3
Active subsystems: 9
Node name: tank-01
Node addresses: 10.15.84.91

[root@tank-02 ~]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: tank-cluster
Cluster ID: 46516
Cluster Member: Yes
Membership state: State-Transition: Master is tank-01
Nodes: 3
Expected_votes: 5
Total_votes: 3
Quorum: 3
Active subsystems: 9
Node name: tank-02
Node addresses: 10.15.84.92

[root@tank-03 ~]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: tank-cluster
Cluster ID: 46516
Cluster Member: No
Membership state: Join-Wait

[root@tank-04 ~]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: tank-cluster
Cluster ID: 46516
Cluster Member: Yes
Membership state: State-Transition: Master is tank-01
Nodes: 3
Expected_votes: 5
Total_votes: 3
Quorum: 3
Active subsystems: 9
Node name: tank-04
Node addresses: 10.15.84.94

[root@tank-02 ~]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    5   M   tank-01
   2    1    5   X   tank-03
   3    1    5   M   tank-02
   4    1    5   M   tank-04
   5    1    5   X   tank-05

[root@tank-02 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[3 4 2 5 1]

DLM Lock Space:  "clvmd"                             3   3 run       -
[3 4 2 5 1]

DLM Lock Space:  "corey0"                            4   4 run       -
[3 4 2 5 1]

DLM Lock Space:  "corey1"                            6   6 run       -
[3 4 2 5 1]

GFS Mount Group: "corey0"                            5   5 run       -
[3 4 2 5 1]

GFS Mount Group: "corey1"                            7   7 run       -
[3 4 2 5 1]

[root@tank-01 ~]# cat /proc/cluster/dlm_stats
DLM stats (HZ=1000)

Lock operations:      2862611
Unlock operations:    2853855
Convert operations:  11399480
Completion ASTs:     17115870
Blocking ASTs:              2

Lockqueue        num  waittime   ave
WAIT_RSB       21573     99436     4
WAIT_GRANT      5606      1597     0
WAIT_UNLOCK       30        50     1
Total          27209    101083     3

[root@tank-02 ~]# cat /proc/cluster/dlm_stats
DLM stats (HZ=1000)

Lock operations:      4235278
Unlock operations:    4226584
Convert operations:  14944074
Completion ASTs:     23405844
Blocking ASTs:            115

Lockqueue        num  waittime   ave
WAIT_RSB     1197641  18662165    15
WAIT_CONV         31       501    16
WAIT_GRANT      6009     23749     3
WAIT_UNLOCK      353      3951    11
Total        1204034  18690366    15

[root@tank-04 ~]# cat /proc/cluster/dlm_stats
DLM stats (HZ=1000)

Lock operations:       574693
Unlock operations:     562742
Convert operations:   2112002
Completion ASTs:      3249411
Blocking ASTs:             18

Lockqueue        num  waittime   ave
WAIT_RSB      534095  17185688    32
WAIT_GRANT      5669      5905     1
WAIT_UNLOCK       78      1555    19
Total         539842  17193148    31

[root@tank-01 ~]# cat /proc/cluster/dlm_debug
clvmd move flags 0,1,0 ids 0,2,0
clvmd move use event 2
clvmd recover event 2 (first)
clvmd add nodes
clvmd total nodes 5
clvmd rebuild resource directory
clvmd rebuilt 0 resources
clvmd recover event 2 done
clvmd move flags 0,0,1 ids 0,2,2
clvmd process held requests
clvmd processed 0 requests
clvmd recover event 2 finished
corey0 move flags 0,1,0 ids 0,3,0
corey0 move use event 3
corey0 recover event 3 (first)
corey0 add nodes
corey0 total nodes 5
corey0 rebuild resource directory
corey0 rebuilt 5812 resources
corey0 recover event 3 done
corey0 move flags 0,0,1 ids 0,3,3
corey0 process held requests
corey0 processed 0 requests
corey0 recover event 3 finished
corey1 move flags 0,1,0 ids 0,5,0
corey1 move use event 5
corey1 recover event 5 (first)
corey1 add nodes
corey1 total nodes 5
corey1 rebuild resource directory
corey1 rebuilt 5870 resources
corey1 recover event 5 done
corey1 move flags 0,0,1 ids 0,5,5
corey1 process held requests
corey1 processed 0 requests
corey1 recover event 5 finished

[root@tank-02 ~]# cat /proc/cluster/dlm_debug
00000 node -1/-1 " 7
corey0 resent 4 requests
corey0 recover event 87 finished
corey1 move flags 1,0,0 ids 85,85,85
corey1 move flags 0,1,0 ids 85,89,85
corey1 move use event 89
corey1 recover event 89
corey1 add node 1
corey1 total nodes 5
corey1 rebuild resource directory
corey1 rebuilt 5952 resources
corey1 purge requests
corey1 purged 0 requests
corey1 mark waiting requests
corey1 mark 2be008e lq 1 nodeid -1
corey1 mark 2bb029e lq 1 nodeid -1
corey1 mark 2b20362 lq 1 nodeid -1
corey1 mark 2c70149 lq 1 nodeid -1
corey1 marked 4 requests
corey1 recover event 89 done
corey1 move flags 0,0,1 ids 85,89,89
corey1 process held requests
corey1 processed 0 requests
corey1 resend marked requests
corey1 resend 2be008e lq 1 flg 200000 node -1/-1 " 7
corey1 resend 2bb029e lq 1 flg 200000 node -1/-1 " 11
corey1 resend 2b20362 lq 1 flg 200000 node -1/-1 " 7
corey1 resend 2c70149 lq 1 flg 200000 node -1/-1 " 11
corey1 resent 4 requests
corey1 recover event 89 finished

Version-Release number of selected component (if applicable):
[root@tank-01 ~]# rpm -qa | grep cman
cman-1.0-0.pre33.14
cman-kernheaders-2.6.9-34.3
cman-kernel-smp-2.6.9-34.3

How reproducible:
Revolver appears to always eventually hit this.
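The crossed "Master is ..." lines above are the whole deadlock: each of tank-01 and tank-02 is waiting for the other to drive the state transition, so neither finishes and the joining nodes sit in Join-Wait. A minimal C sketch of how two transitions starting concurrently can produce this crossed view (purely illustrative; the struct, handler, and message flow here are hypothetical, not the cman-kernel internals):

/*
 * Illustrative sketch only -- NOT the actual cman-kernel code.
 * Models two nodes that each start a state transition at nearly
 * the same time; each node's START message crosses the other's
 * on the wire, so each adopts the *other* node as transition
 * master and then waits for it forever.
 */
#include <stdio.h>

struct node {
    int id;
    int master_id;   /* who this node believes is driving the transition */
};

/* Hypothetical handler: on seeing a START-TRANSITION message, a node
 * that has not yet committed to a master adopts the sender. */
static void on_start_transition(struct node *n, int sender_id)
{
    if (n->master_id == 0)
        n->master_id = sender_id;
}

int main(void)
{
    /* node IDs taken from /proc/cluster/nodes above */
    struct node tank01 = { .id = 1, .master_id = 0 };
    struct node tank02 = { .id = 3, .master_id = 0 };

    /* Both start a transition concurrently; each node processes the
     * other's START first. */
    on_start_transition(&tank01, tank02.id);
    on_start_transition(&tank02, tank01.id);

    /* Matches the crossed "Master is ..." lines in the status dumps. */
    printf("tank-01 thinks master is node %d\n", tank01.master_id);
    printf("tank-02 thinks master is node %d\n", tank02.master_id);
    return 0;
}

Once both nodes have adopted each other, neither ever sees itself as master, the transition never completes, and a joining node such as tank-03 stays in Join-Wait indefinitely.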
This might happen if two nodes go into a CHECK transition at slightly different (but still overlapping) times. This checkin fixes that problem. I hope it also fixes this problem!

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.44.2.19; previous revision: 1.44.2.18
done
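For illustration only (the actual change is in membership.c revision 1.44.2.19, which is not reproduced here): one generic way to resolve two overlapping transitions is to make the master choice a deterministic function of the node IDs, so both sides compute the same winner and their views can never cross:

/*
 * Hedged sketch, not the code from the checkin above: when two
 * concurrent transitions collide, break the tie with a pure
 * function of node ID (here, lowest ID wins).  Every node
 * evaluates the same inputs to the same answer.
 */
#include <stdio.h>

static int pick_master(int a, int b)
{
    return a < b ? a : b;
}

int main(void)
{
    /* tank-01 is node 1 and tank-02 is node 3 in /proc/cluster/nodes */
    printf("agreed master: node %d\n", pick_master(1, 3));
    printf("agreed master: node %d\n", pick_master(3, 1)); /* same result */
    return 0;
}

Because the result does not depend on message arrival order, the overlapping-CHECK race in the original report cannot leave the two nodes pointing at each other.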
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-734.html