Description of problem:
During recovery testing (with revolver) the cluster got stuck on an lvm command. Upon inspection I found DLM locks which indicated a dead node was their master.

Scenario iteration 0.5 started at Fri Feb 18 09:48:36 CST 2011
Sleeping 5 minute(s) to let the I/O get its lock count up...
Scenario: DLM kill Quorum plus one
Those picked to face the revolver... dash-01 dash-02

dash-03 shows that -01 and -02 are lost:
Feb 18 09:54:13 dash-03 corosync[1400]: [QUORUM] Members[2]: 2 3
Feb 18 09:54:13 dash-03 corosync[1400]: [CMAN ] quorum lost, blocking activity
Feb 18 09:54:13 dash-03 corosync[1400]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Feb 18 09:54:13 dash-03 corosync[1400]: [QUORUM] Members[1]: 3

dash-01 rejoins the cluster:
Feb 18 09:55:27 dash-03 corosync[1400]: [CMAN ] quorum regained, resuming activity
Feb 18 09:55:27 dash-03 corosync[1400]: [QUORUM] This node is within the primary component and will provide service.
Feb 18 09:55:27 dash-03 corosync[1400]: [QUORUM] Members[2]: 1 3

dash-03 fences dash-02:
Feb 18 09:55:51 dash-03 fenced[1459]: fencing node dash-02
Feb 18 09:55:58 dash-03 fenced[1459]: fence dash-02 success

dash-02 eventually rejoins:
Feb 18 09:58:14 dash-03 corosync[1400]: [QUORUM] Members[3]: 1 2 3

Locks in the clvmd lockspace show dash-02 is still the master of those locks:
[root@dash-03 ~]# dlm_tool lockdump clvmd
id 00ee0001 gr CR rq IV pid 1634 master 2 "4Q9X8YcvUxvflKuRPuiUbX62RA0Yt1hZVrABUV264vkcr0y6DV5ewsVs70ARiLCd"
id 02950001 gr CR rq IV pid 1634 master 2 "4Q9X8YcvUxvflKuRPuiUbX62RA0Yt1hZ27kOjC2raXlLwUDCme3nWBdK0Z2ZriLN"
id 01d20001 gr CR rq IV pid 1634 master 2 "4Q9X8YcvUxvflKuRPuiUbX62RA0Yt1hZkvOX59R5SXzW5Jp0RRnOgHtm8pRxOcOD"

Version-Release number of selected component (if applicable):
cman-3.0.12-33.el6.x86_64
kernel-2.6.32-114.0.1.el6.x86_64
openais-1.1.1-7.el6.x86_64
corosync-1.2.3-26.el6.x86_64

How reproducible:
Unknown

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
dlm_controld is waiting for fenced to report a recent fencing time for node 1 (more recent than the time it failed), but the last_fenced_time from fenced is older than that.

From dlm_controld:
> 1298044453 clvmd check_fencing 1 wait add 1298044112 fail 1298044453 last 1298044007

From fenced:
1298044531 fenced:daemon conf 2 1 0 memb 1 3 join 1 left
1298044531 receive_protocol from 1 max 1.1.1.0 run 0.0.0.0
1298044531 daemon node 1 max 0.0.0.0 run 0.0.0.0
1298044531 daemon node 1 join 1298044531 left 1298044453 local quorum 1298044527
1298044531 receive_protocol from 3 max 1.1.1.0 run 1.1.1.1
1298044531 daemon node 3 max 1.1.1.0 run 1.1.1.1
1298044531 daemon node 3 join 1298043142 left 0 local quorum 1298044527
1298044531 daemon_member 1 is clean
1298044531 reduce victim dash-01
1298044531 send_victim_done cg 7 flags 2 victim nodeid 1
1298044551 delay of 24s leaves 1 victims
1298044551 dash-02 not a cluster member after 20 sec post_join_delay
1298044551 receive_protocol from 1 max 1.1.1.0 run 1.1.1.0
1298044551 daemon node 1 max 1.1.1.0 run 0.0.0.0
1298044551 daemon node 1 join 1298044531 left 1298044453 local quorum 1298044527
1298044551 fencing node dash-02
1298044558 fence dash-02 dev 0.0 agent fence_apc result: success
1298044558 fence dash-02 success
1298044558 failed to get dbus connection
1298044558 send_victim_done cg 7 flags 2 victim nodeid 2
1298044558 send_complete 3:7 flags 2 started 5 m 1 j 0 r 1 f 1
1298044558 receive_victim_done 3:7 flags 2 len 80
1298044558 receive_victim_done 3:7 no victim nodeid 1
1298044558 fenced:default conf 2 1 0 memb 1 3 join 1 left
1298044558 add_change cg 8 joined nodeid 1
1298044558 add_change cg 8 m 2 j 1 r 0 f 0
1298044558 check_ringid done cluster 1328 cpg 1:1328
1298044558 check_quorum done
1298044558 send_start 3:8 flags 2 started 6 m 2 j 1 r 0 f 0
1298044558 receive_start 1:1 len 192
1298044558 match_change 1:1 matches cg 8
1298044558 wait_messages cg 8 need 1 of 2
1298044558 receive_victim_done 3:7 flags 2 len 80
1298044558 receive_victim_done 3:7 remove victim 2 time 1298044558 how 1

The problem appears to originate from the fact that the victim entry for dash-01 is removed because dash-01 is forgiven (after joining cleanly): "no victim nodeid 1".

This is the point where fenced is supposed to set fence_time_local, which would be passed back to dlm_controld as the last_fenced_time. In this case it looks like it would have saved 1298044531 as fence_time_local (from the send_victim_done message for node 1). That would have satisfied the dlm_controld fencing time check.
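To make the timing relationship concrete, here is a minimal sketch of the check described above, using the timestamps from the log; the struct and function names are illustrative, not the actual dlm_controld symbols:

#include <stdio.h>
#include <time.h>

/* Hypothetical record for a failed node that a lockspace is waiting on;
 * the fields mirror the "add/fail/last" values in the log above but are
 * not the real dlm_controld structures. */
struct failed_node {
	int nodeid;
	time_t add_time;   /* when the node was added to the lockspace */
	time_t fail_time;  /* when the node failure was observed */
};

/* Recovery may proceed only once fenced reports a fencing (or forgiveness)
 * time newer than the failure time; otherwise dlm_controld keeps waiting. */
static int fencing_done(const struct failed_node *n, time_t last_fenced_time)
{
	if (last_fenced_time > n->fail_time)
		return 1;
	printf("check_fencing %d wait add %ld fail %ld last %ld\n",
	       n->nodeid, (long)n->add_time, (long)n->fail_time,
	       (long)last_fenced_time);
	return 0;
}

int main(void)
{
	struct failed_node n1 = {
		.nodeid = 1,
		.add_time = 1298044112,
		.fail_time = 1298044453,
	};

	/* fenced reported 1298044007, older than the failure: wait forever. */
	fencing_done(&n1, 1298044007);

	/* Had fenced recorded 1298044531 when node 1 was forgiven, the check
	 * would have passed and clvmd recovery would have continued. */
	fencing_done(&n1, 1298044531);
	return 0;
}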
Created attachment 479623 [details]
possible fix

This is a quick hunch about what may fix this, basically removing a "return". I'll want to step through the code again more closely and test it out.
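To illustrate the hunch (and only to illustrate it; these are hypothetical names, not the attached patch or the real fenced code), the suspected shape of the problem is an early return in the forgiven-victim path that skips recording the local fence time:

#include <time.h>

/* Illustrative only; not the real fenced types or the attached patch. */
struct victim {
	int nodeid;
	int forgiven;             /* rejoined cleanly, no agent run needed */
	time_t fence_time_local;  /* later reported as last_fenced_time */
};

/* Suspected shape of the bug: forgiving a victim returns early and never
 * records fence_time_local, so dlm_controld never sees a time newer than
 * the failure. */
static void victim_done_buggy(struct victim *v, time_t now)
{
	if (v->forgiven)
		return;           /* the suspect "return" */
	v->fence_time_local = now;
}

/* With the return removed, the time is recorded for forgiven victims too. */
static void victim_done_fixed(struct victim *v, time_t now)
{
	v->fence_time_local = now;
}

int main(void)
{
	struct victim v = { .nodeid = 1, .forgiven = 1, .fence_time_local = 0 };
	victim_done_buggy(&v, time(NULL));  /* fence_time_local stays 0 */
	victim_done_fixed(&v, time(NULL));  /* fence_time_local is now set */
	return 0;
}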
I suspect that the patch for bug 624844 allows us to see this problem when we wouldn't have previously. Without that patch I don't believe we would have forgiven node1 when it rejoined, but would have fenced it (unnecessarily), and not arrived in the situation above.

Description of patch for bz 624844:

commit d8a76837ecd24a5f7ab9a2ebf72745e6c6752c0b
Author: David Teigland <teigland>
Date:   Tue Aug 17 16:42:30 2010 -0500

    fenced: use post_join_delay after cluster join

    When the cluster has lost quorum due to a node failure, the next
    event is generally a cluster node join which gives the cluster
    quorum again. With quorum, fenced begins fencing any failed nodes,
    applying post_fail_delay since the last cpg event was a node
    failure. In this case, however, post_join_delay is more appropriate
    since the chances are good that nodes being fenced will be joining.
    Detect this case where a node joins the cluster giving it quorum,
    and use post_join_delay.
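For context, a minimal sketch of the delay selection that commit describes, with hypothetical names rather than the actual fenced code:

#include <stdbool.h>
#include <stdio.h>

/* Illustrative state only; not the fenced sources. */
struct fd_state {
	bool last_cpg_event_was_fail;  /* most recent cpg event was a node failure */
	bool quorum_regained_by_join;  /* a node join just restored quorum */
	int post_join_delay;           /* seconds to wait for victims to rejoin */
	int post_fail_delay;           /* seconds to wait after a plain failure */
};

/* Before the bz 624844 change, a failure as the last cpg event meant
 * post_fail_delay was applied even when the event that restored quorum was
 * a join; the change prefers post_join_delay in that case, giving a
 * rebooted victim time to rejoin and be forgiven instead of fenced. */
static int choose_delay(const struct fd_state *s)
{
	if (!s->last_cpg_event_was_fail || s->quorum_regained_by_join)
		return s->post_join_delay;
	return s->post_fail_delay;
}

int main(void)
{
	struct fd_state s = {
		.last_cpg_event_was_fail = true,
		.quorum_regained_by_join = true,
		.post_join_delay = 20,  /* matches the 20 sec post_join_delay in the logs */
		.post_fail_delay = 0,
	};
	printf("waiting %d seconds before fencing\n", choose_delay(&s));
	return 0;
}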
Tested as follows:
- nodes 2,4,5: service cman start, service clvmd start, mount gfs
- nodes 2,4: ifdown eth0
- nodes 2,4: reboot -fin
- wait for reboot
- node 2: service cman start
- node 5: check status from group_tool

Test run without patch, broken. Output after last step of test:

[root@bull-05 fenced]# group_tool
fence domain
member count  2
victim count  0
victim now    0
master nodeid 5
wait state    none
members       2 5

dlm lockspaces
name          big
id            0x2fd45270
flags         0x0000000c fs_reg,kern_stop
change        member 3 joined 1 remove 0 failed 0 seq 1,1
members       2 4 5
new change    member 1 joined 0 remove 1 failed 1 seq 2,3
new status    wait_messages 0 wait_condition 1 fencing
new members   5

name          clvmd
id            0x4104eefa
flags         0x00000004 kern_stop
change        member 3 joined 1 remove 0 failed 0 seq 1,1
members       2 4 5
new change    member 1 joined 0 remove 1 failed 1 seq 2,3
new status    wait_messages 0 wait_condition 1 fencing
new members   5

gfs mountgroups
name          big
id            0x10e1ad7f
flags         0x0000008c recover,mounted,blocked
change        member 1 joined 0 remove 1 failed 1 seq 2,3
members       5

1298409894 receive_protocol from 2 max 1.1.1.0 run 0.0.0.0
1298409894 daemon node 2 max 0.0.0.0 run 0.0.0.0
1298409894 daemon node 2 join 1298409894 left 1298409570 local quorum 1298409889
1298409894 daemon_member 2 is clean
1298409894 reduce victim bull-02
1298409894 send_victim_done cg 3 flags 2 victim nodeid 2
1298409914 delay of 25s leaves 1 victims
1298409914 bull-04 not a cluster member after 20 sec post_join_delay
1298409914 receive_protocol from 5 max 1.1.1.0 run 1.1.1.1
1298409914 daemon node 5 max 1.1.1.0 run 1.1.1.0
1298409914 daemon node 5 join 1298409484 left 0 local quorum 1298409889
1298409914 receive_protocol from 2 max 1.1.1.0 run 1.1.1.0
1298409914 daemon node 2 max 1.1.1.0 run 0.0.0.0
1298409914 daemon node 2 join 1298409894 left 1298409570 local quorum 1298409889
1298409914 fencing node bull-04
1298409914 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298409914 fence bull-04 failed
1298409914 failed to get dbus connection
1298409917 fencing node bull-04
1298409917 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298409917 fence bull-04 failed
1298409917 failed to get dbus connection
1298409920 fencing node bull-04
1298409920 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298409920 fence bull-04 failed
1298409920 failed to get dbus connection
1298409921 fence bull-04 overridden by administrator intervention
1298409921 send_victim_done cg 3 flags 2 victim nodeid 4
1298409921 send_complete 5:3 flags 2 started 1 m 1 j 0 r 1 f 1
1298409921 receive_victim_done 5:3 flags 2 len 80
1298409921 receive_victim_done 5:3 no victim nodeid 2
1298409921 fenced:default conf 2 1 0 memb 2 5 join 2 left
1298409921 add_change cg 4 joined nodeid 2
1298409921 add_change cg 4 m 2 j 1 r 0 f 0
1298409921 check_ringid done cluster 468 cpg 2:468
1298409921 check_quorum done

Test run with patch, works correctly.
Output after last step of test:

[root@bull-05 ~]# group_tool
fence domain
member count  2
victim count  0
victim now    0
master nodeid 5
wait state    none
members       2 5

dlm lockspaces
name          big
id            0x2fd45270
flags         0x00000008 fs_reg
change        member 1 joined 0 remove 1 failed 1 seq 2,3
members       5

name          clvmd
id            0x4104eefa
flags         0x00000000
change        member 1 joined 0 remove 1 failed 1 seq 2,3
members       5

gfs mountgroups
name          big
id            0x10e1ad7f
flags         0x00000008 mounted
change        member 1 joined 0 remove 1 failed 1 seq 2,3
members       5

1298408708 receive_protocol from 2 max 1.1.1.0 run 0.0.0.0
1298408708 daemon node 2 max 0.0.0.0 run 0.0.0.0
1298408708 daemon node 2 join 1298408708 left 1298408453 local quorum 1298408704
1298408708 daemon_member 2 is clean
1298408708 reduce victim bull-02
1298408708 send_victim_done cg 3 flags 2 victim nodeid 2
1298408728 delay of 24s leaves 1 victims
1298408728 bull-04 not a cluster member after 20 sec post_join_delay
1298408728 receive_protocol from 5 max 1.1.1.0 run 1.1.1.1
1298408728 daemon node 5 max 1.1.1.0 run 1.1.1.0
1298408728 daemon node 5 join 1298408288 left 0 local quorum 1298408704
1298408728 receive_protocol from 2 max 1.1.1.0 run 1.1.1.0
1298408728 daemon node 2 max 1.1.1.0 run 0.0.0.0
1298408728 daemon node 2 join 1298408708 left 1298408453 local quorum 1298408704
1298408728 fencing node bull-04
1298408728 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298408728 fence bull-04 failed
1298408728 failed to get dbus connection
1298408731 fencing node bull-04
1298408731 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298408731 fence bull-04 failed
1298408731 failed to get dbus connection
1298408734 fencing node bull-04
1298408734 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298408734 fence bull-04 failed
1298408734 failed to get dbus connection
1298408749 fence bull-04 overridden by administrator intervention
1298408749 send_victim_done cg 3 flags 2 victim nodeid 4
1298408749 send_complete 5:3 flags 2 started 1 m 1 j 0 r 1 f 1
1298408749 receive_victim_done 5:3 flags 2 len 80
1298408749 receive_victim_done 5:3 no victim nodeid 2
1298408749 receive_victim_done 5:3 remove victim 2 time 1298408708 how 2
> The line above is missing in the debug log from the broken demonstration
1298408749 fenced:default conf 2 1 0 memb 2 5 join 2 left
1298408749 add_change cg 4 joined nodeid 2
1298408749 add_change cg 4 m 2 j 1 r 0 f 0
1298408749 check_ringid done cluster 440 cpg 2:440
1298408749 check_quorum done
pushed to RHEL6 branch http://git.fedorahosted.org/git/?p=cluster.git;a=commitdiff;h=4eb20ad9f8e62d0739c7897d79bc1bccb94e5250
Got through 40+ iterations of revolver over the weekend. Marking this verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0537.html