Bug 678704
| Summary: | dlm recovery did not complete after fencing | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Nate Straz <nstraz> |
| Component: | cluster | Assignee: | David Teigland <teigland> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.1 | CC: | ccaulfie, cluster-maint, fdinitto, lhh, rpeterso, teigland |
| Target Milestone: | rc | Keywords: | Regression |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | cluster-3.0.12-34.el6 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-05-19 12:54:39 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description (Nate Straz, 2011-02-18 22:21:32 UTC)
dlm_controld is waiting for fenced to report a fencing time for node1 that is more recent than the time node1 failed, but the last_fenced_time reported by fenced is older than that.
From dlm_controld:
> 1298044453 clvmd check_fencing 1 wait add 1298044112 fail 1298044453 last 1298044007
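The check dlm_controld is applying here can be modeled as a simple timestamp comparison (the real implementation is C inside dlm_controld's check_fencing; this is a hypothetical Python sketch using the values from the log line above):

```python
def fencing_done(fail_time: int, last_fenced_time: int) -> bool:
    """dlm_controld only treats a failed node as safely fenced when the
    fencing completion time reported by fenced is *after* the failure."""
    return last_fenced_time > fail_time

# From the log line above: node1 failed at 1298044453, but the last
# fencing time fenced reports is 1298044007, so dlm recovery waits.
assert not fencing_done(fail_time=1298044453, last_fenced_time=1298044007)
```

Recovery stalls indefinitely here because fenced never reports a newer time for node1.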
From fenced:
1298044531 fenced:daemon conf 2 1 0 memb 1 3 join 1 left
1298044531 receive_protocol from 1 max 1.1.1.0 run 0.0.0.0
1298044531 daemon node 1 max 0.0.0.0 run 0.0.0.0
1298044531 daemon node 1 join 1298044531 left 1298044453 local quorum 1298044527
1298044531 receive_protocol from 3 max 1.1.1.0 run 1.1.1.1
1298044531 daemon node 3 max 1.1.1.0 run 1.1.1.1
1298044531 daemon node 3 join 1298043142 left 0 local quorum 1298044527
1298044531 daemon_member 1 is clean
1298044531 reduce victim dash-01
1298044531 send_victim_done cg 7 flags 2 victim nodeid 1
1298044551 delay of 24s leaves 1 victims
1298044551 dash-02 not a cluster member after 20 sec post_join_delay
1298044551 receive_protocol from 1 max 1.1.1.0 run 1.1.1.0
1298044551 daemon node 1 max 1.1.1.0 run 0.0.0.0
1298044551 daemon node 1 join 1298044531 left 1298044453 local quorum 1298044527
1298044551 fencing node dash-02
1298044558 fence dash-02 dev 0.0 agent fence_apc result: success
1298044558 fence dash-02 success
1298044558 failed to get dbus connection
1298044558 send_victim_done cg 7 flags 2 victim nodeid 2
1298044558 send_complete 3:7 flags 2 started 5 m 1 j 0 r 1 f 1
1298044558 receive_victim_done 3:7 flags 2 len 80
1298044558 receive_victim_done 3:7 no victim nodeid 1
1298044558 fenced:default conf 2 1 0 memb 1 3 join 1 left
1298044558 add_change cg 8 joined nodeid 1
1298044558 add_change cg 8 m 2 j 1 r 0 f 0
1298044558 check_ringid done cluster 1328 cpg 1:1328
1298044558 check_quorum done
1298044558 send_start 3:8 flags 2 started 6 m 2 j 1 r 0 f 0
1298044558 receive_start 1:1 len 192
1298044558 match_change 1:1 matches cg 8
1298044558 wait_messages cg 8 need 1 of 2
1298044558 receive_victim_done 3:7 flags 2 len 80
1298044558 receive_victim_done 3:7 remove victim 2 time 1298044558 how 1
The problem appears to originate from the fact that the victim entry for dash-01 is removed because dash-01 was forgiven (after rejoining cleanly):
"no victim nodeid 1"
This is the point where fenced is supposed to set fence_time_local, which would be passed back to dlm_controld as the last_fenced_time. In this case it looks like it would have saved 1298044531 as fence_time_local (from the send_victim_done message for node1). That would have satisfied the dlm_controld fencing time check.
Created attachment 479623 [details]
possible fix
This is a quick hunch about what may fix this, basically removing a "return".
I'll want to step through the code again more closely and test it out.
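To illustrate the hunch, here is a hypothetical Python model of the victim-forgiveness path (the real code is C in fenced; `early_return` stands in for the "return" the attached patch removes):

```python
def forgive_victim(victims, nodeid, now, fence_times, early_return=True):
    """Remove a rejoined (clean) node from the victim list.

    With early_return=True (the buggy behavior), the function bails out
    before recording a fencing time, so dlm_controld never sees an
    updated last_fenced_time for the forgiven node and waits forever.
    With early_return=False (the proposed fix), the time is recorded
    even though no actual fencing was performed.
    """
    victims.discard(nodeid)
    if early_return:
        return          # bug: fence_time_local is never set
    fence_times[nodeid] = now  # fix: satisfy dlm_controld's time check

# Buggy path: victim is forgiven but no time is recorded.
times = {}
forgive_victim({1}, 1, 1298044531, times)
assert times == {}

# Patched path: the forgiveness time is recorded.
forgive_victim({1}, 1, 1298044531, times, early_return=False)
assert times == {1: 1298044531}
```

This is only a sketch of the suspected control flow, not the actual patch.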
I suspect that the patch for bug 624844 allows us to see this problem when we wouldn't have previously. Without that patch I don't believe we would have forgiven node1 when it rejoined, but would have fenced it (unnecessarily), and not arrived in the situation above.

Description of the patch for bz 624844:

commit d8a76837ecd24a5f7ab9a2ebf72745e6c6752c0b
Author: David Teigland <teigland>
Date: Tue Aug 17 16:42:30 2010 -0500

    fenced: use post_join_delay after cluster join

    When the cluster has lost quorum due to a node failure, the next
    event is generally a cluster node join which gives the cluster
    quorum again. With quorum, fenced begins fencing any failed nodes,
    applying post_fail_delay since the last cpg event was a node
    failure. In this case, however, post_join_delay is more appropriate
    since the chances are good that nodes being fenced will be joining.
    Detect this case where a node joins the cluster giving it quorum,
    and use post_join_delay.

Tested as follows:
- nodes 2,4,5: service cman start, service clvmd start, mount gfs
- nodes 2,4: ifdown eth0
- nodes 2,4: reboot -fin
- wait for reboot
- node 2: service cman start
- node 5: check status from group_tool
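The delay selection described in the bz 624844 commit message can be sketched as follows (a hypothetical Python model; the real logic is C in fenced, and the 20s value is the post_join_delay configured in the test cluster, not a default I am asserting):

```python
def pick_delay(last_cpg_event, quorum_regained_by_join,
               post_join_delay, post_fail_delay):
    """Choose the pre-fencing delay.

    Before the bz 624844 patch, a node failure as the last cpg event
    always selected post_fail_delay. The patch detects the case where
    a node join has just restored quorum and uses post_join_delay
    instead, giving rebooting nodes a chance to rejoin and be
    forgiven rather than fenced.
    """
    if last_cpg_event == "join" or quorum_regained_by_join:
        return post_join_delay
    return post_fail_delay

# Scenario from the test: last event was a failure, but quorum was
# just regained by a join, so the 20s post_join_delay applies.
assert pick_delay("fail", True, 20, 0) == 20
```

This is why the logs above show "not a cluster member after 20 sec post_join_delay" even though the last cpg event was a failure.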
Test run without patch, broken. Output after last step of test:
[root@bull-05 fenced]# group_tool
fence domain
member count 2
victim count 0
victim now 0
master nodeid 5
wait state none
members 2 5
dlm lockspaces
name big
id 0x2fd45270
flags 0x0000000c fs_reg,kern_stop
change member 3 joined 1 remove 0 failed 0 seq 1,1
members 2 4 5
new change member 1 joined 0 remove 1 failed 1 seq 2,3
new status wait_messages 0 wait_condition 1 fencing
new members 5
name clvmd
id 0x4104eefa
flags 0x00000004 kern_stop
change member 3 joined 1 remove 0 failed 0 seq 1,1
members 2 4 5
new change member 1 joined 0 remove 1 failed 1 seq 2,3
new status wait_messages 0 wait_condition 1 fencing
new members 5
gfs mountgroups
name big
id 0x10e1ad7f
flags 0x0000008c recover,mounted,blocked
change member 1 joined 0 remove 1 failed 1 seq 2,3
members 5
1298409894 receive_protocol from 2 max 1.1.1.0 run 0.0.0.0
1298409894 daemon node 2 max 0.0.0.0 run 0.0.0.0
1298409894 daemon node 2 join 1298409894 left 1298409570 local quorum 1298409889
1298409894 daemon_member 2 is clean
1298409894 reduce victim bull-02
1298409894 send_victim_done cg 3 flags 2 victim nodeid 2
1298409914 delay of 25s leaves 1 victims
1298409914 bull-04 not a cluster member after 20 sec post_join_delay
1298409914 receive_protocol from 5 max 1.1.1.0 run 1.1.1.1
1298409914 daemon node 5 max 1.1.1.0 run 1.1.1.0
1298409914 daemon node 5 join 1298409484 left 0 local quorum 1298409889
1298409914 receive_protocol from 2 max 1.1.1.0 run 1.1.1.0
1298409914 daemon node 2 max 1.1.1.0 run 0.0.0.0
1298409914 daemon node 2 join 1298409894 left 1298409570 local quorum 1298409889
1298409914 fencing node bull-04
1298409914 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298409914 fence bull-04 failed
1298409914 failed to get dbus connection
1298409917 fencing node bull-04
1298409917 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298409917 fence bull-04 failed
1298409917 failed to get dbus connection
1298409920 fencing node bull-04
1298409920 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298409920 fence bull-04 failed
1298409920 failed to get dbus connection
1298409921 fence bull-04 overridden by administrator intervention
1298409921 send_victim_done cg 3 flags 2 victim nodeid 4
1298409921 send_complete 5:3 flags 2 started 1 m 1 j 0 r 1 f 1
1298409921 receive_victim_done 5:3 flags 2 len 80
1298409921 receive_victim_done 5:3 no victim nodeid 2
1298409921 fenced:default conf 2 1 0 memb 2 5 join 2 left
1298409921 add_change cg 4 joined nodeid 2
1298409921 add_change cg 4 m 2 j 1 r 0 f 0
1298409921 check_ringid done cluster 468 cpg 2:468
1298409921 check_quorum done
Test run with patch, works correctly. Output after last step of test:
[root@bull-05 ~]# group_tool
fence domain
member count 2
victim count 0
victim now 0
master nodeid 5
wait state none
members 2 5
dlm lockspaces
name big
id 0x2fd45270
flags 0x00000008 fs_reg
change member 1 joined 0 remove 1 failed 1 seq 2,3
members 5
name clvmd
id 0x4104eefa
flags 0x00000000
change member 1 joined 0 remove 1 failed 1 seq 2,3
members 5
gfs mountgroups
name big
id 0x10e1ad7f
flags 0x00000008 mounted
change member 1 joined 0 remove 1 failed 1 seq 2,3
members 5
1298408708 receive_protocol from 2 max 1.1.1.0 run 0.0.0.0
1298408708 daemon node 2 max 0.0.0.0 run 0.0.0.0
1298408708 daemon node 2 join 1298408708 left 1298408453 local quorum 1298408704
1298408708 daemon_member 2 is clean
1298408708 reduce victim bull-02
1298408708 send_victim_done cg 3 flags 2 victim nodeid 2
1298408728 delay of 24s leaves 1 victims
1298408728 bull-04 not a cluster member after 20 sec post_join_delay
1298408728 receive_protocol from 5 max 1.1.1.0 run 1.1.1.1
1298408728 daemon node 5 max 1.1.1.0 run 1.1.1.0
1298408728 daemon node 5 join 1298408288 left 0 local quorum 1298408704
1298408728 receive_protocol from 2 max 1.1.1.0 run 1.1.1.0
1298408728 daemon node 2 max 1.1.1.0 run 0.0.0.0
1298408728 daemon node 2 join 1298408708 left 1298408453 local quorum 1298408704
1298408728 fencing node bull-04
1298408728 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298408728 fence bull-04 failed
1298408728 failed to get dbus connection
1298408731 fencing node bull-04
1298408731 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298408731 fence bull-04 failed
1298408731 failed to get dbus connection
1298408734 fencing node bull-04
1298408734 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298408734 fence bull-04 failed
1298408734 failed to get dbus connection
1298408749 fence bull-04 overridden by administrator intervention
1298408749 send_victim_done cg 3 flags 2 victim nodeid 4
1298408749 send_complete 5:3 flags 2 started 1 m 1 j 0 r 1 f 1
1298408749 receive_victim_done 5:3 flags 2 len 80
1298408749 receive_victim_done 5:3 no victim nodeid 2
1298408749 receive_victim_done 5:3 remove victim 2 time 1298408708 how 2
> The line above is missing from the debug log of the broken run.
1298408749 fenced:default conf 2 1 0 memb 2 5 join 2 left
1298408749 add_change cg 4 joined nodeid 2
1298408749 add_change cg 4 m 2 j 1 r 0 f 0
1298408749 check_ringid done cluster 440 cpg 2:440
1298408749 check_quorum done
Pushed to RHEL6 branch:
http://git.fedorahosted.org/git/?p=cluster.git;a=commitdiff;h=4eb20ad9f8e62d0739c7897d79bc1bccb94e5250

Got through 40+ iterations of revolver over the weekend. Marking this verified.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0537.html