Description of problem:
After being shot during a revolver run, the three nodes that got shot hung
during startup in the clvmd init script.

Version-Release number of selected component (if applicable):
lvm2-cluster-2.02.06-1.0.RHEL4.i386
GFS-kernel-hugemem-2.6.9-57.0.i686
cman-kernel-hugemem-2.6.9-44.7.i686
cmirror-kernel-hugemem-2.6.9-5.3.i686
dlm-kernel-hugemem-2.6.9-42.7.i686

How reproducible:
Unknown

Steps to Reproduce:
1. Start dlm cluster with GFS file system
2. Start load on all nodes
3. reboot -fin on a node or two

Actual results:
Nodes hung during restart:

Starting ccsd:                                             [  OK  ]
Starting cman: CMAN 2.6.9-44.7 (built Jun 2 2006 17:49:45) installed
               DLM 2.6.9-42.7 (built Jun 2 2006 17:55:25) installed
                                                           [FAILED]
Loading clustered mirror log: dm_cmirror: disagrees about version of symbol struct_module
FATAL: Error inserting dm_cmirror (/lib/modules/2.6.9-39.ELhugemem/kernel/cluster/dm-cmirror.ko): Invalid module format
                                                           [FAILED]
Starting lock_gulmd:                                       [WARNING]
Starting fence domain:                                     [FAILED]
Starting clvmd:

Expected results:
Nodes should restart without issues.

Additional info:
It appears I hit split brain.

[root@morph-02 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    5   X   morph-01
   2    1    5   M   morph-02
   3    1    5   X   morph-05
   4    1    5   X   morph-03
   5    1    5   M   morph-04

[root@morph-02 ~]# cman_tool status
Protocol version: 5.0.1
Config version: 1
Cluster name: morph-cluster
Cluster ID: 41652
Cluster Member: Yes
Membership state: State-Transition: Master is morph-04
Nodes: 2
Expected_votes: 5
Total_votes: 2
Quorum: 3
Active subsystems: 12
Node name: morph-02
Node ID: 2
Node addresses: 10.15.89.62

[root@morph-04 ~]# cman_tool status
Protocol version: 5.0.1
Config version: 1
Cluster name: morph-cluster
Cluster ID: 41652
Cluster Member: Yes
Membership state: State-Transition: Master is morph-02
Nodes: 2
Expected_votes: 5
Total_votes: 2
Quorum: 3
Active subsystems: 12
Node name: morph-04
Node ID: 5
Node addresses: 10.15.89.64
Created attachment 130808 [details]
Proposed patch to fix

Are you in a position to test a patch? The attached works for me, but then I can't reproduce it anyway.
I have the time if you can build me a cman-kernel package for i686.
OK, get one from http://people.redhat.com/pcaulfie/
I tried the cman-kernel-smp package and was still able to make the clvmd init script hang. Unfortunately, all nodes are back in the cluster and the cluster is quorate. Time to move sshd and xinetd up in the boot order.
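[For reference, a minimal sketch of one way to do that on RHEL4 via the init-script chkconfig header; the priority numbers below are assumptions, not necessarily what was used on the test nodes:

    # Hypothetical: raise sshd's start priority from S55 to S18 so it comes
    # up before the cluster init scripts. Assumes the stock header line
    # "# chkconfig: 2345 55 25" in /etc/init.d/sshd.
    sed -i 's/^# chkconfig: 2345 55 25$/# chkconfig: 2345 18 25/' /etc/init.d/sshd
    chkconfig sshd off && chkconfig sshd on   # recreate rc symlinks from the header
    ls /etc/rc3.d/ | grep sshd                # should now show S18sshd

The same edit would apply to xinetd's init script.]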
I made sshd start earlier in the boot sequence and was able to reproduce the clvmd hang again. All nodes are in the cluster, so I'm not going to worry about the cman problems I listed above. On the nodes where clvmd was hung I found cman_userjoin and clvmd in D state:

 2620 ?        D      0:00 [cman_userjoin]
 2629 ?        Ds     0:00 clvmd

I need to get the kernel-debuginfo installed so I can use crash; I can't attach to the hung clvmd process with gdb.
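[For reference, a sketch of the crash workflow against a live system; the debuginfo package name and vmlinux path are the usual RHEL4 conventions but should be treated as assumptions:

    # Install the debuginfo matching the running kernel (exact package
    # name/NVR assumed -- it must match the booted hugemem kernel)
    rpm -ivh kernel-debuginfo-2.6.9-*.i686.rpm

    # Run crash against the live kernel using the debuginfo vmlinux
    crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux

    # Inside crash, find the D-state (UN) tasks and dump their stacks:
    #   crash> ps | grep UN
    #   crash> bt 2629        # clvmd
    #   crash> bt 2620        # cman_userjoin
]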
The output from crash looks totally useless :(

crash> bt 2630
PID: 2630   TASK: f5d1aeb0   CPU: 0   COMMAND: "clvmd"
 #0 [f5aebe14] schedule at c02d20e5
 #1 [f5aebe78] wait_for_completion at c02d2212
 #2 [f5aebecc] kcl_join_service at f892b210
 #3 [f5aebee4] new_lockspace at f89a853a
 #4 [f5aebf0c] dlm_new_lockspace at f89a8613
 #5 [f5aebf28] register_lockspace at f89a22c5
 #6 [f5aebf50] do_user_create_lockspace at f89a2b03
 #7 [f5aebf5c] dlm_write at f89a36b2
 #8 [f5aebf88] vfs_write at c015b08d
 #9 [f5aebfa4] sys_write at c015b155
#10 [f5aebfc0] system_call at c02d4218
    EAX: 00000004  EBX: 00000007  ECX: bff4eb90  EDX: 0000006d
    DS:  007b      ESI: 080922d2  ES:  007b      EDI: 080922d7
    SS:  007b      ESP: bff4eb74  EBP: bff4fcd8
    CS:  0073      EIP: 004c57a2  ERR: 00000004  EFLAGS: 00000246

crash> bt 2621
PID: 2621   TASK: f5def830   CPU: 0   COMMAND: "cman_userjoin"
 #0 [f5b24f0c] schedule at c02d20e5
 #1 [f5b24f70] wait_for_completion at c02d2212
 #2 [f5b24fc4] kcl_join_service at f892b210
 #3 [f5b24fdc] user_join_async at f892bb6c
 #4 [f5b24ff0] kernel_thread_helper at c01041f3

Any thoughts on where to go with this? I can reproduce it reliably, but it takes an hour or so to hit. I have access to the machines while hung via ssh, with crash available. I would like to know what to try next.
That's a totally different bug to the last one if all the nodes are joining. Can you post the output of "cman_tool services" and /proc/cluster/dlm_debug? Also are you sure it's not just clvmd waiting for the cluster to become quorate (sorry, but I have to ask)?
I think there really are two different bugs here:
1. the clvmd hang
2. the weird cman state
(2) doesn't always occur when (1) happens.

Here is the output of cman_tool services for each node:

morph-01::
Service          Name                           GID LID State     Code
Fence Domain:    "default"                       11   2 run       U-1,10,3  [2 1]
DLM Lock Space:  "clvmd"                         12   3 run       U-1,10,4  [2 1]
DLM Lock Space:  "morph-cluster0"                14   5 run       -         [2 1]
DLM Lock Space:  "morph-cluster1"                16   7 run       -         [2 1]
DLM Lock Space:  "morph-cluster2"                18   9 run       -         [2 1]
GFS Mount Group: "morph-cluster0"                15   6 run       -         [2 1]
GFS Mount Group: "morph-cluster1"                17   8 run       -         [2 1]
GFS Mount Group: "morph-cluster2"                19  10 run       -         [2 1]
User:            "usrm::manager"                 13   4 recover 4 -         [2 1]

morph-02::
Service          Name                           GID LID State     Code
Fence Domain:    "default"                       11   2 run       U-1,10,3  [2 1]
DLM Lock Space:  "clvmd"                         12   3 run       U-1,10,4  [2 1]
DLM Lock Space:  "morph-cluster0"                14   5 run       -         [2 1]
DLM Lock Space:  "morph-cluster1"                16   7 run       -         [2 1]
DLM Lock Space:  "morph-cluster2"                18   9 run       -         [2 1]
GFS Mount Group: "morph-cluster0"                15   6 run       -         [2 1]
GFS Mount Group: "morph-cluster1"                17   8 run       -         [2 1]
GFS Mount Group: "morph-cluster2"                19  10 run       -         [2 1]
User:            "usrm::manager"                 13   4 recover 2 -         [2 1]

morph-03::
Service          Name                           GID LID State     Code
Fence Domain:    "default"                        0   2 join      S-1,80,5  []
DLM Lock Space:  "clvmd"                         12   3 join      S-4,4,1   [2 1 4]

morph-04::
Service          Name                           GID LID State     Code
Fence Domain:    "default"                        0   2 join      S-1,80,5  []
DLM Lock Space:  "clvmd"                          0   3 join      S-1,80,5  []

morph-05::
Service          Name                           GID LID State     Code
Fence Domain:    "default"                       11   2 join      S-4,4,1   [2 1 3]
DLM Lock Space:  "clvmd"                          0   3 join      S-1,80,5  []

dlm_debug contents:

morph-01::
morph-cluster1 resend 1b02d4 lq 1 flg 200000 node -1/-1 "
morph-cluster1 resent 5 requests
morph-cluster1 recover event 31 finished
morph-cluster1 grant lock on lockqueue 3
morph-cluster1 process_lockqueue_reply id 540119 state 0
morph-cluster1 grant lock on lockqueue 3
morph-cluster1 process_lockqueue_reply id 4e0068 state 0
morph-cluster1 grant lock on lockqueue 3
morph-cluster1 process_lockqueue_reply id 5400b8 state 0
morph-cluster1 grant lock on lockqueue 3
morph-cluster1 process_lockqueue_reply id 68031e state 0
morph-cluster1 grant lock on lockqueue 3
morph-cluster1 process_lockqueue_reply id 480137 state 0
morph-cluster1 grant lock on lockqueue 3
morph-cluster1 process_lockqueue_reply id 580152 state 0
morph-cluster1 grant lock on lockqueue 3
morph-cluster1 process_lockqueue_reply id 600110 state 0
morph-cluster1 grant lock on lockqueue 3
morph-cluster1 process_lockqueue_reply id 600200 state 0
morph-cluster2 grant lock on lockqueue 3
morph-cluster2 process_lockqueue_reply id 3c0341 state 0

morph-02::
f2 lq 1 nodeid -1
morph-cluster1 marked 2 requests
morph-cluster1 purge locks of departed nodes
morph-cluster0 rebuilt 3 locks
morph-cluster1 purged 3728 locks
morph-cluster1 update remastered resources
morph-cluster0 recover event 57 done
morph-cluster1 updated 3 resources
morph-cluster1 rebuild locks
morph-cluster1 rebuilt 3 locks
morph-cluster1 recover event 57 done
morph-cluster0 move flags 0,0,1 ids 51,57,57
morph-cluster0 process held requests
morph-cluster0 processed 0 requests
morph-cluster0 resend marked requests
morph-cluster0 resend 6902c4 lq 1 flg 200000 node -1/-1 "
morph-cluster0 resent 1 requests
morph-cluster1 move flags 0,0,1 ids 53,57,57
morph-cluster1 process held requests
morph-cluster1 processed 0 requests
morph-cluster1 resend marked requests
morph-cluster1 resend 8601f4 lq 1 flg 200000 node -1/-1 "
morph-cluster1 resend 8d03f2 lq 1 flg 200000 node -1/-1 "
morph-cluster1 resent 2 requests
morph-cluster1 recover event 57 finished
morph-cluster0 recover event 57 finished

morph-03:: <empty>
morph-04:: <empty>
morph-05:: <empty>

All nodes are in the cluster (in this case), so I'm assuming it's quorate. But in this case (2) did not happen.
Yes, this is certainly two bugs. If you're not seeing the cman split any more (2), then I'll commit that patch for U4.

The other looks like it might possibly be fencing related. Are there any fencing messages on the remaining nodes?
I haven't run the test case enough to see if (2) still happens some of the time. There weren't as many fence messages as I expected. Here are the parts from /var/log/messages on morph-01 and morph-02, the surviving nodes, just after the other nodes were rebooted with "reboot -fin".

morph-01::
Jun 14 13:22:46 morph-01 qarshd[4413]: Running cmdline: tcpdump -s0 -x port 6809
Jun 14 13:22:46 morph-01 kernel: device eth0 entered promiscuous mode
Jun 14 13:23:06 morph-01 kernel: CMAN: removing node morph-03 from the cluster : Missed too many heartbeats
Jun 14 13:23:12 morph-01 kernel: CMAN: removing node morph-05 from the cluster : No response to messages
Jun 14 13:23:24 morph-01 kernel: CMAN: node morph-04 has been removed from the cluster : No response to messages
Jun 14 13:23:34 morph-01 kernel: CMAN: quorum lost, blocking activity
Jun 14 13:23:34 morph-01 clurgmgrd[3239]: <emerg> #1: Quorum Dissolved
Jun 14 13:23:37 morph-01 ccsd[2657]: Cluster is not quorate.  Refusing connection.
Jun 14 13:23:37 morph-01 ccsd[2657]: Error while processing connect: Connection refused
Jun 14 13:23:47 morph-01 ccsd[2657]: Cluster is not quorate.  Refusing connection.
Jun 14 13:23:47 morph-01 ccsd[2657]: Error while processing connect: Connection refused
Jun 14 13:23:48 morph-01 qarshd[4424]: Talking to peer 10.15.80.50:50338
Jun 14 13:23:48 morph-01 qarshd[4424]: Running cmdline: cat /proc/cluster/nodes
Jun 14 13:23:53 morph-01 qarshd[4426]: Talking to peer 10.15.80.50:50340
Jun 14 13:23:53 morph-01 qarshd[4426]: Running cmdline: cat /proc/cluster/nodes
Jun 14 13:23:57 morph-01 ccsd[2657]: Cluster is not quorate.  Refusing connection.
Jun 14 13:23:57 morph-01 ccsd[2657]: Error while processing connect: Connection refused
Jun 14 13:23:58 morph-01 qarshd[4429]: Talking to peer 10.15.80.50:50342
Jun 14 13:23:58 morph-01 qarshd[4429]: Running cmdline: cat /proc/cluster/nodes
Jun 14 13:24:07 morph-01 ccsd[2657]: Cluster is not quorate.  Refusing connection.
Jun 14 13:24:07 morph-01 ccsd[2657]: Error while processing connect: Connection refused
Jun 14 13:24:10 morph-01 kernel: CMAN: node morph-03 rejoining
Jun 14 13:24:11 morph-01 kernel: CMAN: quorum regained, resuming activity
Jun 14 13:24:11 morph-01 clurgmgrd[3239]: <notice> Quorum Achieved
Jun 14 13:24:11 morph-01 clurgmgrd[3239]: <info> Magma Event: Membership Change
Jun 14 13:24:11 morph-01 ccsd[2657]: Cluster is quorate.  Allowing connections.
Jun 14 13:24:11 morph-01 clurgmgrd[3239]: <info> State change: Local UP
Jun 14 13:24:11 morph-01 clurgmgrd[3239]: <info> State change: morph-02 UP
Jun 14 13:24:11 morph-01 clurgmgrd[3239]: <info> State change: morph-03 UP
Jun 14 13:24:11 morph-01 clurgmgrd[3239]: <info> Loading Service Data
Jun 14 13:24:12 morph-01 clurgmgrd[3239]: <info> Initializing Services
Jun 14 13:24:25 morph-01 kernel: CMAN: node morph-05 rejoining
Jun 14 13:24:35 morph-01 kernel: CMAN: node morph-04 rejoining

morph-02::
Jun 14 13:22:54 morph-02 qarshd[4561]: Running cmdline: tcpdump -s0 -x port 6809
Jun 14 13:23:13 morph-02 kernel: CMAN: node morph-03 has been removed from the cluster : Missed too many heartbeats
Jun 14 13:23:24 morph-02 kernel: CMAN: removing node morph-05 from the cluster : No response to messages
Jun 14 13:23:30 morph-02 kernel: CMAN: removing node morph-04 from the cluster : No response to messages
Jun 14 13:23:41 morph-02 clurgmgrd[3192]: <emerg> #1: Quorum Dissolved
Jun 14 13:23:41 morph-02 kernel: CMAN: quorum lost, blocking activity
Jun 14 13:23:55 morph-02 qarshd[4569]: Talking to peer 10.15.80.50:36362
Jun 14 13:23:55 morph-02 qarshd[4569]: Running cmdline: cat /proc/cluster/nodes
Jun 14 13:24:00 morph-02 qarshd[4571]: Talking to peer 10.15.80.50:36364
Jun 14 13:24:00 morph-02 qarshd[4571]: Running cmdline: cat /proc/cluster/nodes
Jun 14 13:24:05 morph-02 qarshd[4573]: Talking to peer 10.15.80.50:36366
Jun 14 13:24:05 morph-02 qarshd[4573]: Running cmdline: cat /proc/cluster/nodes
Jun 14 13:24:17 morph-02 kernel: CMAN: node morph-03 rejoining
Jun 14 13:24:18 morph-02 ccsd[2573]: Cluster is quorate.  Allowing connections.
Jun 14 13:24:18 morph-02 kernel: CMAN: quorum regained, resuming activity
Jun 14 13:24:18 morph-02 fenced[2648]: fencing deferred to morph-01
Jun 14 13:24:32 morph-02 kernel: CMAN: node morph-05 rejoining
Jun 14 13:24:42 morph-02 kernel: CMAN: node morph-04 rejoining

I still have the tcpdump for these two nodes; would you like to see it for the time slice from 13:22:54 to 13:24:42?
No thanks, I don't need another tcpdump unless the cman problem recurs. Odd about the fence messages: morph-01 should be in charge of fencing, but there don't seem to be any messages about it. I think we need to see what fenced is doing here.
I'm not sure how you want me to do this. I'm trying to run fenced with the -D option, but I can't get it to stay up reliably with that option. Also, I don't see anything useful in the output. Here's the output from fenced on morph-04 when it was deferred to.

fenced: 1150471093 members:
fenced: 1150471093 nodeid = 1 "morph-04"
fenced: 1150471093 nodeid = 2 "morph-05"
fenced: 1150471093 nodeid = 3 "morph-02"
fenced: 1150471093 nodeid = 4 "morph-01"
fenced: 1150471093 nodeid = 5 "morph-03"
fenced: 1150471093 do_recovery stop 70 start 71 finish 70
fenced: 1150471093 finish:
fenced: 1150471093 event_id = 71
fenced: 1150471093 last_stop = 70
fenced: 1150471093 last_start = 71
fenced: 1150471093 last_finish = 71
fenced: 1150471093 node_count = 0
fenced: 1150471714 stop:
fenced: 1150471714 event_id = 0
fenced: 1150471714 last_stop = 71
fenced: 1150471714 last_start = 71
fenced: 1150471714 last_finish = 71
fenced: 1150471714 node_count = 0
fenced: 1150471748 start:
fenced: 1150471748 event_id = 96
fenced: 1150471748 last_stop = 71
fenced: 1150471748 last_start = 96
fenced: 1150471748 last_finish = 71
fenced: 1150471748 node_count = 2
fenced: 1150471748 start_type = failed
fenced: 1150471748 members:
fenced: 1150471748 nodeid = 1 "morph-04"
fenced: 1150471748 nodeid = 5 "morph-03"
fenced: 1150471748 do_recovery stop 71 start 96 finish 71
fenced: 1150471748 add node 2 to list 0
fenced: 1150471748 add node 3 to list 0
fenced: 1150471748 add node 4 to list 0
fenced: 1150471749 reduce victim morph-01
fenced: 1150471763 reduce victim morph-02
fenced: 1150471778 reduce victim morph-05
fenced: 1150471778 delay of 30s leaves 0 victims
fenced: 1150471778 finish:
fenced: 1150471778 event_id = 96
fenced: 1150471778 last_stop = 71
fenced: 1150471778 last_start = 96
fenced: 1150471778 last_finish = 96
fenced: 1150471778 node_count = 0
fenced: 1150471783 stop:
fenced: 1150471783 event_id = 0
fenced: 1150471783 last_stop = 96
fenced: 1150471783 last_start = 96
fenced: 1150471783 last_finish = 96
fenced: 1150471783 node_count = 0
fenced: 1150471783 start:
fenced: 1150471783 event_id = 97
fenced: 1150471783 last_stop = 96
fenced: 1150471783 last_start = 97
fenced: 1150471783 last_finish = 96
fenced: 1150471783 node_count = 3
fenced: 1150471783 start_type = join
fenced: 1150471783 members:
fenced: 1150471783 nodeid = 1 "morph-04"
fenced: 1150471783 nodeid = 5 "morph-03"
fenced: 1150471783 nodeid = 2 "morph-05"
fenced: 1150471783 do_recovery stop 96 start 97 finish 96
fenced: 1150471783 finish:
fenced: 1150471783 event_id = 97
fenced: 1150471783 last_stop = 96
fenced: 1150471783 last_start = 97
fenced: 1150471783 last_finish = 97
fenced: 1150471783 node_count = 0
fenced: 1150471784 stop:
fenced: 1150471784 event_id = 0
fenced: 1150471784 last_stop = 97
fenced: 1150471784 last_start = 97
fenced: 1150471784 last_finish = 97
fenced: 1150471784 node_count = 0
fenced: 1150471784 start:
fenced: 1150471784 event_id = 99
fenced: 1150471784 last_stop = 97
fenced: 1150471784 last_start = 99
fenced: 1150471784 last_finish = 97
fenced: 1150471784 node_count = 4
fenced: 1150471784 start_type = join
fenced: 1150471784 members:
fenced: 1150471784 nodeid = 1 "morph-04"
fenced: 1150471784 nodeid = 5 "morph-03"
fenced: 1150471784 nodeid = 2 "morph-05"
fenced: 1150471784 nodeid = 3 "morph-02"
fenced: 1150471784 do_recovery stop 97 start 99 finish 97
fenced: 1150471784 finish:
fenced: 1150471784 event_id = 99
fenced: 1150471784 last_stop = 97
fenced: 1150471784 last_start = 99
fenced: 1150471784 last_finish = 99
fenced: 1150471784 node_count = 0
fenced: 1150471787 stop:
fenced: 1150471787 event_id = 0
fenced: 1150471787 last_stop = 99
fenced: 1150471787 last_start = 99
fenced: 1150471787 last_finish = 99
fenced: 1150471787 node_count = 0
fenced: 1150471787 start:
fenced: 1150471787 event_id = 101
fenced: 1150471787 last_stop = 99
fenced: 1150471787 last_start = 101
fenced: 1150471787 last_finish = 99
fenced: 1150471787 node_count = 5
fenced: 1150471787 start_type = join
fenced: 1150471787 members:
fenced: 1150471787 nodeid = 1 "morph-04"
fenced: 1150471787 nodeid = 5 "morph-03"
fenced: 1150471787 nodeid = 2 "morph-05"
fenced: 1150471787 nodeid = 3 "morph-02"
fenced: 1150471787 nodeid = 4 "morph-01"
fenced: 1150471787 do_recovery stop 99 start 101 finish 99
fenced: 1150471787 finish:
fenced: 1150471787 event_id = 101
fenced: 1150471787 last_stop = 99
fenced: 1150471787 last_start = 101
fenced: 1150471787 last_finish = 101
fenced: 1150471787 node_count = 0
fenced: 1150471869 stop:
fenced: 1150471869 event_id = 0
fenced: 1150471869 last_stop = 101
fenced: 1150471869 last_start = 101
fenced: 1150471869 last_finish = 101
fenced: 1150471869 node_count = 0
fenced: 1150471869 start:
fenced: 1150471869 event_id = 124
fenced: 1150471869 last_stop = 101
fenced: 1150471869 last_start = 124
fenced: 1150471869 last_finish = 101
fenced: 1150471869 node_count = 4
fenced: 1150471869 start_type = leave
fenced: 1150471869 members:
fenced: 1150471869 nodeid = 1 "morph-04"
fenced: 1150471869 nodeid = 5 "morph-03"
fenced: 1150471869 nodeid = 2 "morph-05"
fenced: 1150471869 nodeid = 3 "morph-02"
fenced: 1150471869 do_recovery stop 101 start 124 finish 101
fenced: 1150471869 add node 4 to list 2
fenced: 1150471869 finish:
fenced: 1150471869 event_id = 124
fenced: 1150471869 last_stop = 101
fenced: 1150471869 last_start = 124
fenced: 1150471869 last_finish = 124
fenced: 1150471869 node_count = 0
fenced: 1150471884 stop:
fenced: 1150471884 event_id = 0
fenced: 1150471884 last_stop = 124
fenced: 1150471884 last_start = 124
fenced: 1150471884 last_finish = 124
fenced: 1150471884 node_count = 0
fenced: 1150471884 start:
fenced: 1150471884 event_id = 125
fenced: 1150471884 last_stop = 124
fenced: 1150471884 last_start = 125
fenced: 1150471884 last_finish = 124
fenced: 1150471884 node_count = 3
fenced: 1150471884 start_type = leave
fenced: 1150471884 members:
fenced: 1150471884 nodeid = 1 "morph-04"
fenced: 1150471884 nodeid = 5 "morph-03"
fenced: 1150471884 nodeid = 2 "morph-05"
fenced: 1150471884 do_recovery stop 124 start 125 finish 124
fenced: 1150471884 add node 3 to list 2
fenced: 1150471884 finish:
fenced: 1150471884 event_id = 125
fenced: 1150471884 last_stop = 124
fenced: 1150471884 last_start = 125
fenced: 1150471884 last_finish = 125
fenced: 1150471884 node_count = 0
fenced: 1150471899 stop:
fenced: 1150471899 event_id = 0
fenced: 1150471899 last_stop = 125
fenced: 1150471899 last_start = 125
fenced: 1150471899 last_finish = 125
fenced: 1150471899 node_count = 0
fenced: 1150471899 start:
fenced: 1150471899 event_id = 126
fenced: 1150471899 last_stop = 125
fenced: 1150471899 last_start = 126
fenced: 1150471899 last_finish = 125
fenced: 1150471899 node_count = 2
fenced: 1150471899 start_type = leave
fenced: 1150471899 members:
fenced: 1150471899 nodeid = 1 "morph-04"
fenced: 1150471899 nodeid = 5 "morph-03"
fenced: 1150471899 do_recovery stop 125 start 126 finish 125
fenced: 1150471899 add node 2 to list 2
fenced: 1150471899 finish:
fenced: 1150471899 event_id = 126
fenced: 1150471899 last_stop = 125
fenced: 1150471899 last_start = 126
fenced: 1150471899 last_finish = 126
fenced: 1150471899 node_count = 0
fenced: 1150472130 stop:
fenced: 1150472130 event_id = 0
fenced: 1150472130 last_stop = 126
fenced: 1150472130 last_start = 126
fenced: 1150472130 last_finish = 126
fenced: 1150472130 node_count = 0
fenced: 1150472130 start:
fenced: 1150472130 event_id = 127
fenced: 1150472130 last_stop = 126
fenced: 1150472130 last_start = 127
fenced: 1150472130 last_finish = 126
fenced: 1150472130 node_count = 3
fenced: 1150472130 start_type = join
fenced: 1150472130 members:
fenced: 1150472130 nodeid = 1 "morph-04"
fenced: 1150472130 nodeid = 5 "morph-03"
fenced: 1150472130 nodeid = 4 "morph-01"
fenced: 1150472130 do_recovery stop 126 start 127 finish 126
fenced: 1150472130 finish:
fenced: 1150472130 event_id = 127
fenced: 1150472130 last_stop = 126
fenced: 1150472130 last_start = 127
fenced: 1150472130 last_finish = 127
fenced: 1150472130 node_count = 0
fenced: 1150472137 stop:
fenced: 1150472137 event_id = 0
fenced: 1150472137 last_stop = 127
fenced: 1150472137 last_start = 127
fenced: 1150472137 last_finish = 127
fenced: 1150472137 node_count = 0
fenced: 1150472137 start:
fenced: 1150472137 event_id = 128
fenced: 1150472137 last_stop = 127
fenced: 1150472137 last_start = 128
fenced: 1150472137 last_finish = 127
fenced: 1150472137 node_count = 4
fenced: 1150472137 start_type = join
fenced: 1150472137 members:
fenced: 1150472137 nodeid = 1 "morph-04"
fenced: 1150472137 nodeid = 5 "morph-03"
fenced: 1150472137 nodeid = 4 "morph-01"
fenced: 1150472137 nodeid = 3 "morph-02"
fenced: 1150472137 do_recovery stop 127 start 128 finish 127
fenced: 1150472137 finish:
fenced: 1150472137 event_id = 128
fenced: 1150472137 last_stop = 127
fenced: 1150472137 last_start = 128
fenced: 1150472137 last_finish = 128
fenced: 1150472137 node_count = 0
fenced: 1150472176 stop:
fenced: 1150472176 event_id = 0
fenced: 1150472176 last_stop = 128
fenced: 1150472176 last_start = 128
fenced: 1150472176 last_finish = 128
fenced: 1150472176 node_count = 0
fenced: 1150472176 start:
fenced: 1150472176 event_id = 129
fenced: 1150472176 last_stop = 128
fenced: 1150472176 last_start = 129
fenced: 1150472176 last_finish = 128
fenced: 1150472176 node_count = 5
fenced: 1150472176 start_type = join
fenced: 1150472176 members:
fenced: 1150472176 nodeid = 1 "morph-04"
fenced: 1150472176 nodeid = 5 "morph-03"
fenced: 1150472176 nodeid = 4 "morph-01"
fenced: 1150472176 nodeid = 3 "morph-02"
fenced: 1150472176 nodeid = 2 "morph-05"
> I'm not sure how you want me to do this.

No, me neither. I was hoping Dave would chip in with some ideas :-)
The cman_tool services info above shows a pretty messed up cluster; I can't make too much sense of it. I suspect what happened is that after the nodes failed, things went bad. Then they came back and tried to start everything up again, and that made things worse. At that point, it's really hard to decipher what the original buggered state was.

In order to figure this out, I think we'll want to stop everything and diagnose the problem when things first go bad, and not let the nodes restart services. Because you're killing 3 of 5 nodes and losing quorum, the killed nodes need to be allowed to rejoin the cluster, but that's all. So, we want to do something like this:

- kill 3 nodes
- after the 2 live nodes have seen they're dead, save the output of cman_tool services on those 2 nodes
- let the 3 nodes reboot and rejoin the cluster, but not start anything else (like fenced)
- watch cman_tool services on the 2 nodes to see if they successfully recover for the 3 failed nodes (that have now rejoined the cluster) -- e.g. with a capture loop like the sketch below
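[A minimal capture-loop sketch for the "save/watch cman_tool services" steps above; the log path and interval are arbitrary assumptions:

    # Run on each of the 2 surviving nodes; timestamped snapshots make it
    # possible to find the first bad transition afterwards.
    while true; do
        echo "=== $(date '+%F %T') ==="
        cman_tool services
        sleep 5
    done >> /tmp/cman_services.log 2>&1
]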
* Started cluster with morph-0[1-5] w/ 3 GFS file systems mounted.
* Disabled cluster init scripts on all nodes.
* Powered off morph-03, morph-04, and morph-05 via APC power switch.
* morph-01 and morph-02 removed the failed nodes from the cluster.

morph-01:
Jun 19 11:43:24 morph-01 kernel: CMAN: node morph-05 has been removed from the cluster : Missed too many heartbeats
Jun 19 11:43:27 morph-01 kernel: CMAN: removing node morph-04 from the cluster : Missed too many heartbeats
Jun 19 11:43:33 morph-01 kernel: CMAN: removing node morph-03 from the cluster : No response to messages
Jun 19 11:43:46 morph-01 kernel: CMAN: quorum lost, blocking activity

# cman_tool services
Service          Name                           GID LID State     Code
Fence Domain:    "default"                        1   2 recover 0 -  [1 5]
DLM Lock Space:  "clvmd"                          3   4 recover 0 -  [1 5]
DLM Lock Space:  "morph-cluster0"                 4   5 recover 0 -  [1 5]
DLM Lock Space:  "morph-cluster1"                 6   7 recover 0 -  [1 5]
DLM Lock Space:  "morph-cluster2"                 8   9 recover 0 -  [1 5]
GFS Mount Group: "morph-cluster0"                 5   6 recover 0 -  [1 5]
GFS Mount Group: "morph-cluster1"                 7   8 recover 0 -  [1 5]
GFS Mount Group: "morph-cluster2"                 9  10 recover 0 -  [1 5]

morph-02:
Jun 19 11:43:31 morph-02 kernel: CMAN: removing node morph-05 from the cluster : Missed too many heartbeats
Jun 19 11:43:37 morph-02 kernel: CMAN: removing node morph-03 from the cluster : No response to messages
Jun 19 11:43:43 morph-02 kernel: CMAN: removing node morph-04 from the cluster : No response to messages
Jun 19 11:43:53 morph-02 kernel: CMAN: quorum lost, blocking activity

# cman_tool services
Service          Name                           GID LID State     Code
Fence Domain:    "default"                        1   2 recover 0 -  [1 5]
DLM Lock Space:  "clvmd"                          3   4 recover 0 -  [1 5]
DLM Lock Space:  "morph-cluster0"                 4   5 recover 0 -  [1 5]
DLM Lock Space:  "morph-cluster1"                 6   7 recover 0 -  [1 5]
DLM Lock Space:  "morph-cluster2"                 8   9 recover 0 -  [1 5]
GFS Mount Group: "morph-cluster0"                 5   6 recover 0 -  [1 5]
GFS Mount Group: "morph-cluster1"                 7   8 recover 0 -  [1 5]
GFS Mount Group: "morph-cluster2"                 9  10 recover 0 -  [1 5]

* Powered on nodes morph-03, morph-04, and morph-05.
* Started services ccsd and cman on morph-0[3-5].
* All nodes joined the cluster; morph-02 deferred to morph-01 for fencing.

morph-01:
[root@morph-01 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    5   M   morph-01
   2    1    5   M   morph-03
   3    1    5   M   morph-05
   4    1    5   M   morph-04
   5    1    5   M   morph-02

[root@morph-01 ~]# cman_tool services
Service          Name                           GID LID State     Code
Fence Domain:    "default"                        1   2 run       -  [1 5]
DLM Lock Space:  "clvmd"                          3   4 run       -  [1 5]
DLM Lock Space:  "morph-cluster0"                 4   5 run       -  [1 5]
DLM Lock Space:  "morph-cluster1"                 6   7 run       -  [1 5]
DLM Lock Space:  "morph-cluster2"                 8   9 run       -  [1 5]
GFS Mount Group: "morph-cluster0"                 5   6 run       -  [1 5]
GFS Mount Group: "morph-cluster1"                 7   8 run       -  [1 5]
GFS Mount Group: "morph-cluster2"                 9  10 run       -  [1 5]

morph-02:
[root@morph-02 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    5   M   morph-01
   2    1    5   M   morph-03
   3    1    5   M   morph-05
   4    1    5   M   morph-04
   5    1    5   M   morph-02

[root@morph-02 ~]# cman_tool services
Service          Name                           GID LID State     Code
Fence Domain:    "default"                        1   2 run       -  [1 5]
DLM Lock Space:  "clvmd"                          3   4 run       -  [1 5]
DLM Lock Space:  "morph-cluster0"                 4   5 run       -  [1 5]
DLM Lock Space:  "morph-cluster1"                 6   7 run       -  [1 5]
DLM Lock Space:  "morph-cluster2"                 8   9 run       -  [1 5]
GFS Mount Group: "morph-cluster0"                 5   6 run       -  [1 5]
GFS Mount Group: "morph-cluster1"                 7   8 run       -  [1 5]
GFS Mount Group: "morph-cluster2"                 9  10 run       -  [1 5]
So that sequence of events worked OK. I think we need a more exact sequence of events that leads to the hung cluster state. The info in comment #9 seems to show some mix of nodes leaving, joining, and failing, and I suspect that part of that appeared after things had already gone bad. If we can get a specific log/trace of a revolver run that leads to a hang, that would be a good start. One problem I've had with revolver in the past is that it doesn't always detect or quit when things first break, so we may need to add some new checking or logging, either to kill it when things break or to have a record of the state when things first broke.
What is the best way to determine when "things break"? It would be wonderful to hear from the developers what to check for and record, and when.
Yeah, that's pretty difficult, because with many bugs there's no obvious sign apart from nothing happening for a "long time".

After nodes have been killed and recovery is to happen, one good check that everything is back to a good/normal state is that all groups in cman_tool services have both a state of "run" and a code of "-" (see the sketch below).

We don't necessarily need to know exactly when things break, either. If we print the state of cman_tool services at various stages of the test, that can help when we go back and reconstruct what happened in what order.

In this case, we should just start with reproducing the problem, looking at the end result and the revolver log, and see where we'd like revolver to print some extra cman_tool services. We may also find some spots in the sequence where we'd like revolver to pause and wait for some specific state of the services.
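[A minimal sketch of that run/"-" check; the retry interval and timeout are arbitrary assumptions:

    #!/bin/sh
    # Sketch of the "everything is back to normal" check: every group in
    # cman_tool services must be in state "run" with recovery code "-".
    check_services() {
        cman_tool services | awk '
            /^(Fence Domain|DLM Lock Space|GFS Mount Group|User):/ {
                if ($0 !~ / run +-/) pending++   # any other state/code pair
            }
            END { exit pending != 0 }
        '
    }

    # Poll for up to 5 minutes before declaring the recovery stuck
    for i in $(seq 1 60); do
        check_services && { echo "all groups in run state"; exit 0; }
        sleep 5
    done
    echo "groups still not recovered" >&2
    exit 1
]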
I've reproduced (1) and here is the end result from cman_tool services:

morph-01::
Fence Domain:    "default"                        1   2 join      S-4,4,1   [3 2 1]
DLM Lock Space:  "clvmd"                          0   3 join      S-1,80,5  []

morph-02::
Fence Domain:    "default"                        1   2 run       U-1,10,1  [3 2]
DLM Lock Space:  "clvmd"                          3   3 run       U-1,10,5  [3 2]
DLM Lock Space:  "morph-cluster0"                 4   5 run       -         [3 2]
DLM Lock Space:  "morph-cluster1"                 6   7 run       -         [3 2]
DLM Lock Space:  "morph-cluster2"                 8   9 run       -         [3 2]
GFS Mount Group: "morph-cluster0"                 5   6 run       -         [3 2]
GFS Mount Group: "morph-cluster1"                 7   8 run       -         [3 2]
GFS Mount Group: "morph-cluster2"                 9  10 run       -         [3 2]
User:            "usrm::manager"                 10   4 recover 2 -         [3 2]

morph-03::
Fence Domain:    "default"                        0   2 join      S-1,480,5 []
DLM Lock Space:  "clvmd"                          3   3 join      S-4,4,1   [2 3 5]

morph-04::
Fence Domain:    "default"                        0   2 join      S-1,80,5  []
DLM Lock Space:  "clvmd"                          0   3 join      S-1,80,5  []

morph-05::
Fence Domain:    "default"                        1   2 run       U-1,10,1  [3 2]
DLM Lock Space:  "clvmd"                          3   3 run       U-1,10,5  [3 2]
DLM Lock Space:  "morph-cluster0"                 4   5 run       -         [3 2]
DLM Lock Space:  "morph-cluster1"                 6   7 run       -         [3 2]
DLM Lock Space:  "morph-cluster2"                 8   9 run       -         [3 2]
GFS Mount Group: "morph-cluster0"                 5   6 run       -         [3 2]
GFS Mount Group: "morph-cluster1"                 7   8 run       -         [3 2]
GFS Mount Group: "morph-cluster2"                 9  10 run       -         [3 2]
User:            "usrm::manager"                 10   4 recover 4 -         [3 2]
Here is a description of what revolver does:

Start with a cluster with file systems mounted
Start an I/O load on all nodes and all file systems
for each iteration
    wait for five minutes to get the lock count up
    pick a set of nodes to shoot (duellist)
    for victim in duellist
        run reboot -fin on the node
        wait a few seconds
    for victim in duellist
        ping victim to make sure it died, else reboot -fin again
    for victim in duellist
        for node not in duellist
            verify victim has X in /proc/cluster/nodes on node
    wait for all nodes in duellist to be pingable again
    wait for all nodes in duellist to be qarshable again
    for node not in duellist
        check /proc/cluster/services for Fence in "run" state
        check /proc/cluster/services for DLM in "run" state
        check /proc/cluster/services for GFS in "run" state
    for victim in duellist
        make sure vgdisplay runs on victim without error
        remount file systems on victim
I'm fairly certain I see what's happening here. This is a side effect of the fix for bz 162014. rgmanager is ignored in most of our testing/reasoning about how gfs/dlm/fencing services work, but rgmanager does have an effect in a situation like this simply by the fact that it creates an sm service.

What's happening here is that the rgmanager service is sitting in the recover state, and sm doesn't allow new groups to form until all recoveries are done. I suspect rgmanager is sitting in the recover state because it's polling /proc/cluster/services waiting for all the failed nodes to be fenced. They have been fenced, but I'm guessing rgmanager is just looking at the "state" value, seeing it's not "run", and concluding that the nodes haven't been recovered. It's not in the run state because someone is joining it.

If I'm correct about rgmanager, we might be able to fix this problem by making it decipher /proc/cluster/services more precisely (a sketch of that idea follows below). Or, we could add a check in SM to not consider user-level services when blocking activity for pending recoveries. Changing rgmanager may have some other positive effects, but it might be difficult to really determine what's been fenced by parsing /proc/cluster/services.

Lon, does this speculation sound correct, and if so would you prefer to tinker with the proc parsing or have me change cman/sm? I don't have a preference.
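[For reference, a minimal sketch of the "decipher more precisely" idea, based only on the outputs earlier in this report; the interpretation of the State/Code columns here is an assumption:

    # Hypothetical check: don't require State == "run" for the fence domain.
    # In the outputs above, a group that another node is merely joining shows
    # a transition code (S-.../U-...) but not the "recover" state, so only an
    # actual "recover" state is treated as fencing/recovery still pending.
    line=$(grep '^Fence Domain' /proc/cluster/services)
    case "$line" in
        *recover*) echo "fence recovery still in progress" ;;
        *)         echo "fence domain recovered (run, or join in progress)" ;;
    esac
]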
Dave, that sounds very likely. When I set up for revolver I turn on all of the init scripts for cluster stuff, including rgmanager, but I don't start every service. rgmanager usually isn't started during cluster_setup for revolver. After a node is shot, rgmanager is started from the init scripts. This could explain why it doesn't always trip on the first try. I'm going to run two experiments to try to verify this:
1. Start rgmanager before running revolver.
2. Turn off the rgmanager init script before running revolver.
rgmanager doesn't usually parse /proc/cluster/services (except to get members of other user-level groups); its recovery is not (or should not be) started until fencing completes. It's sitting in the recover state because the main thread is somehow blocked and not handling one of the node-death events - it probably got the "inquorate" event before the "group change" event, or vice-versa. When 3/5 have died, the cluster lost quorum, right? So there might be some issue in this case with rgmanager that I haven't seen before. I'll look into it this afternoon or tomorrow morning.

This does not look like a CMAN issue, but changing it to ignore user services in the recover state when higher-priority services (fence, etc.) are trying to form might eliminate this kind of bug report in the future.

If possible, could you get a backtrace of rgmanager when this happens?
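[One way to grab those backtraces: clurgmgrd runs as two processes and the worker is threaded, so dump every thread of each. The gdb commands are standard; which output turns out useful is an assumption:

    # Find both clurgmgrd processes
    ps -C clurgmgrd

    # Attach to each PID in turn and dump all thread stacks
    gdb /usr/sbin/clurgmgrd <PID>
    (gdb) thread apply all bt
    (gdb) detach
    (gdb) quit
]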
Results of experiment 1: It still took until the second iteration for clvmd startup to hang. The output from cman_tool services was similar to #20. Results of experiment 2: Revolver made it through 10 iterations without problems. Going for 50...
Here are the backtraces for clurgmgrd from two surviving nodes:

morph-01:
[root@morph-01 ~]# ps -C clurgmgrd
  PID TTY          TIME CMD
 3258 ?        00:00:00 clurgmgrd
 3259 ?        00:00:00 clurgmgrd

[root@morph-01 ~]# gdb /usr/sbin/clurgmgrd 3258
(gdb) bt
#0  0x002887a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x0041b393 in __waitpid_nocancel () from /lib/tls/libpthread.so.0
#2  0x0805966f in watchdog_init () at watchdog.c:73
#3  0x0804d37a in main (argc=1, argv=0xbff3f7f4) at main.c:740
#4  0x002bade3 in __libc_start_main () from /lib/tls/libc.so.6
#5  0x0804a75d in _start ()

[root@morph-01 ~]# gdb /usr/sbin/clurgmgrd 3259
(gdb) info threads
  3 Thread -1208480848 (LWP 4109)  0x002887a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
  2 Thread -1218970704 (LWP 4456)  0x002887a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
  1 Thread -1208478016 (LWP 3259)  0x002887a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb) thread 1
[Switching to thread 1 (Thread -1208478016 (LWP 3259))]
(gdb) bt
#0  0x002887a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00417b26 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x0804bdc2 in rg_wait_threads () at rg_locks.c:271
#3  0x0804ecd7 in rg_doall (request=10, block=1, debugfmt=0x8063773 "Emergency stop of %s") at groups.c:747
#4  0x0804cd38 in handle_cluster_event (fd=4) at main.c:487
#5  0x0804ceb2 in event_loop (clusterfd=4) at main.c:537
#6  0x0804d71d in main (argc=1, argv=0xbff3f7f4) at main.c:843
#7  0x002bade3 in __libc_start_main () from /lib/tls/libc.so.6
#8  0x0804a75d in _start ()
(gdb) thread 2
[Switching to thread 2 (Thread -1218970704 (LWP 4456))]
(gdb) bt
#0  0x002887a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00417b26 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x0804bdc2 in rg_wait_threads () at rg_locks.c:271
#3  0x0804ecd7 in rg_doall (request=10, block=1, debugfmt=0x8063c14 "Initializing %s\n") at groups.c:747
#4  0x0804f8f4 in init_resource_groups (reconfigure=0) at groups.c:1095
#5  0x0804bfe5 in node_event (local=1, nodeID=1, nodeStatus=1) at main.c:152
#6  0x080593bd in node_event_thread (arg=0x0) at nodeevent.c:56
#7  0x00415371 in start_thread () from /lib/tls/libpthread.so.0
#8  0x0036de5e in clone () from /lib/tls/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread -1208480848 (LWP 4109))]
(gdb) bt
#0  0x002887a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x003668b1 in ___newselect_nocancel () from /lib/tls/libc.so.6
#2  0x0805b897 in vf_event_loop (my_node_id=1) at vft.c:855
#3  0x0805bbd8 in vf_server (arg=0x8ffe170) at vft.c:943
#4  0x00415371 in start_thread () from /lib/tls/libpthread.so.0
#5  0x0036de5e in clone () from /lib/tls/libc.so.6

morph-03:
[root@morph-03 ~]# ps -C clurgmgrd
  PID TTY          TIME CMD
 3139 ?        00:00:00 clurgmgrd
 3140 ?        00:00:00 clurgmgrd

[root@morph-03 ~]# gdb /usr/sbin/clurgmgrd 3139
(gdb) bt
#0  0x004c57a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00658393 in __waitpid_nocancel () from /lib/tls/libpthread.so.0
#2  0x0805966f in watchdog_init () at watchdog.c:73
#3  0x0804d37a in main (argc=1, argv=0xbfe9d1a4) at main.c:740
#4  0x004f7de3 in __libc_start_main () from /lib/tls/libc.so.6
#5  0x0804a75d in _start ()

[root@morph-03 ~]# gdb /usr/sbin/clurgmgrd 3140
(gdb) info threads
  3 Thread -1208554576 (LWP 3913)  0x004c57a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
  2 Thread -1219044432 (LWP 4234)  0x004c57a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
  1 Thread -1208551744 (LWP 3140)  0x004c57a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb) thread 1
[Switching to thread 1 (Thread -1208551744 (LWP 3140))]
(gdb) bt
#0  0x004c57a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x005a38b1 in ___newselect_nocancel () from /lib/tls/libc.so.6
#2  0x0804ce6f in event_loop (clusterfd=4) at main.c:529
#3  0x0804d71d in main (argc=1, argv=0xbfe9d1a4) at main.c:843
#4  0x004f7de3 in __libc_start_main () from /lib/tls/libc.so.6
#5  0x0804a75d in _start ()
(gdb) thread 2
[Switching to thread 2 (Thread -1219044432 (LWP 4234))]
(gdb) bt
#0  0x004c57a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00654b26 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x0804bdc2 in rg_wait_threads () at rg_locks.c:271
#3  0x0804ecd7 in rg_doall (request=10, block=1, debugfmt=0x8063c14 "Initializing %s\n") at groups.c:747
#4  0x0804f8f4 in init_resource_groups (reconfigure=0) at groups.c:1095
#5  0x0804bfe5 in node_event (local=1, nodeID=4, nodeStatus=1) at main.c:152
#6  0x080593bd in node_event_thread (arg=0x0) at nodeevent.c:56
#7  0x00652371 in start_thread () from /lib/tls/libpthread.so.0
#8  0x005aae5e in clone () from /lib/tls/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread -1208554576 (LWP 3913))]
(gdb) bt
#0  0x004c57a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x005a38b1 in ___newselect_nocancel () from /lib/tls/libc.so.6
#2  0x0805b897 in vf_event_loop (my_node_id=4) at vft.c:855
#3  0x0805bbd8 in vf_server (arg=0x9f63008) at vft.c:943
#4  0x00652371 in start_thread () from /lib/tls/libpthread.so.0
#5  0x005aae5e in clone () from /lib/tls/libc.so.6
thread (refcnt) leak in rgmanager...
Not really a leak, but effectively the same thing: node_event_thread was bumping the refcount and then waiting for it to reach zero later on.
Created attachment 131306 [details]
Don't twiddle the rg thread refcount; this is not the right thread to do it in (only rg threads should twiddle the rg thread count)
Fixes are actually in CVS.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0557.html