Bug 678704 - dlm recovery did not complete after fencing
Summary: dlm recovery did not complete after fencing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-02-18 22:21 UTC by Nate Straz
Modified: 2011-05-19 12:54 UTC
CC List: 6 users

Fixed In Version: cluster-3.0.12-34.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-05-19 12:54:39 UTC
Target Upstream Version:


Attachments
possible fix (1.27 KB, text/plain), 2011-02-18 22:50 UTC, David Teigland


Links
Red Hat Product Errata RHBA-2011:0537 (normal, SHIPPED_LIVE): cluster and gfs2-utils bug fix update, last updated 2011-05-18 17:57:40 UTC

Description Nate Straz 2011-02-18 22:21:32 UTC
Description of problem:

During recovery testing (with revolver) the cluster got stuck on an lvm command.  Upon inspection I found DLM locks which indicated a dead node was their master.

 Scenario iteration 0.5 started at Fri Feb 18 09:48:36 CST 2011
 Sleeping 5 minute(s) to let the I/O get its lock count up...
 Senario: DLM kill Quorum plus one

 Those picked to face the revolver... dash-01 dash-02

dash-03 shows that -01 and -02 are lost

Feb 18 09:54:13 dash-03 corosync[1400]:   [QUORUM] Members[2]: 2 3
Feb 18 09:54:13 dash-03 corosync[1400]:   [CMAN  ] quorum lost, blocking activity
Feb 18 09:54:13 dash-03 corosync[1400]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Feb 18 09:54:13 dash-03 corosync[1400]:   [QUORUM] Members[1]: 3

dash-01 rejoins the cluster

Feb 18 09:55:27 dash-03 corosync[1400]:   [CMAN  ] quorum regained, resuming activity
Feb 18 09:55:27 dash-03 corosync[1400]:   [QUORUM] This node is within the primary component and will provide service.
Feb 18 09:55:27 dash-03 corosync[1400]:   [QUORUM] Members[2]: 1 3

dash-03 fences dash-02

Feb 18 09:55:51 dash-03 fenced[1459]: fencing node dash-02
Feb 18 09:55:58 dash-03 fenced[1459]: fence dash-02 success

dash-02 eventually rejoins

Feb 18 09:58:14 dash-03 corosync[1400]:   [QUORUM] Members[3]: 1 2 3

locks in the clvmd lockspace show dash-02 is still the master of those locks

[root@dash-03 ~]# dlm_tool lockdump clvmd
id 00ee0001 gr CR rq IV pid 1634 master 2 "4Q9X8YcvUxvflKuRPuiUbX62RA0Yt1hZVrABUV264vkcr0y6DV5ewsVs70ARiLCd"
id 02950001 gr CR rq IV pid 1634 master 2 "4Q9X8YcvUxvflKuRPuiUbX62RA0Yt1hZ27kOjC2raXlLwUDCme3nWBdK0Z2ZriLN"
id 01d20001 gr CR rq IV pid 1634 master 2 "4Q9X8YcvUxvflKuRPuiUbX62RA0Yt1hZkvOX59R5SXzW5Jp0RRnOgHtm8pRxOcOD"


Version-Release number of selected component (if applicable):
cman-3.0.12-33.el6.x86_64
kernel-2.6.32-114.0.1.el6.x86_64
openais-1.1.1-7.el6.x86_64
corosync-1.2.3-26.el6.x86_64


How reproducible:
Unknown

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 David Teigland 2011-02-18 22:46:11 UTC
dlm_controld is waiting for fenced to report a recent fencing time for node1 (more recent than the time it failed), but the last_fenced_time from fenced is older than that.

From dlm_controld:

> 1298044453 clvmd check_fencing 1 wait add 1298044112 fail 1298044453 last 1298044007
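
Reading the fields of that line (an interpretation of the debug output, since the field names are shorthand): the node was added to the lockspace at 1298044112, it was seen to fail at 1298044453, and the most recent fencing time fenced has reported for it is 1298044007.  A minimal sketch of the comparison this implies, with illustrative names rather than the actual dlm_controld source:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch only, not the real dlm_controld code: lockspace
 * recovery for a failed node stays blocked until fenced reports a
 * fencing time newer than the time the node was seen to fail. */
struct fence_wait {
    uint64_t add_time;     /* "add"  1298044112: node added to lockspace */
    uint64_t fail_time;    /* "fail" 1298044453: node seen to fail */
    uint64_t last_fenced;  /* "last" 1298044007: last_fenced_time from fenced */
};

static bool fencing_satisfied(const struct fence_wait *w)
{
    /* 1298044007 is older than 1298044453, so this stays false and the
     * clvmd lockspace keeps waiting ("check_fencing 1 wait"). */
    return w->last_fenced > w->fail_time;
}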

From fenced:

1298044531 fenced:daemon conf 2 1 0 memb 1 3 join 1 left
1298044531 receive_protocol from 1 max 1.1.1.0 run 0.0.0.0
1298044531 daemon node 1 max 0.0.0.0 run 0.0.0.0
1298044531 daemon node 1 join 1298044531 left 1298044453 local quorum 1298044527
1298044531 receive_protocol from 3 max 1.1.1.0 run 1.1.1.1
1298044531 daemon node 3 max 1.1.1.0 run 1.1.1.1
1298044531 daemon node 3 join 1298043142 left 0 local quorum 1298044527
1298044531 daemon_member 1 is clean
1298044531 reduce victim dash-01
1298044531 send_victim_done cg 7 flags 2 victim nodeid 1
1298044551 delay of 24s leaves 1 victims
1298044551 dash-02 not a cluster member after 20 sec post_join_delay
1298044551 receive_protocol from 1 max 1.1.1.0 run 1.1.1.0
1298044551 daemon node 1 max 1.1.1.0 run 0.0.0.0
1298044551 daemon node 1 join 1298044531 left 1298044453 local quorum 1298044527
1298044551 fencing node dash-02
1298044558 fence dash-02 dev 0.0 agent fence_apc result: success
1298044558 fence dash-02 success
1298044558 failed to get dbus connection
1298044558 send_victim_done cg 7 flags 2 victim nodeid 2
1298044558 send_complete 3:7 flags 2 started 5 m 1 j 0 r 1 f 1
1298044558 receive_victim_done 3:7 flags 2 len 80
1298044558 receive_victim_done 3:7 no victim nodeid 1
1298044558 fenced:default conf 2 1 0 memb 1 3 join 1 left
1298044558 add_change cg 8 joined nodeid 1
1298044558 add_change cg 8 m 2 j 1 r 0 f 0
1298044558 check_ringid done cluster 1328 cpg 1:1328
1298044558 check_quorum done
1298044558 send_start 3:8 flags 2 started 6 m 2 j 1 r 0 f 0
1298044558 receive_start 1:1 len 192
1298044558 match_change 1:1 matches cg 8
1298044558 wait_messages cg 8 need 1 of 2
1298044558 receive_victim_done 3:7 flags 2 len 80
1298044558 receive_victim_done 3:7 remove victim 2 time 1298044558 how 1

The problem appears to originate from the fact that the victim entry for dash-01 is removed due to dash-01 being forgiven (after joining cleanly):

"no victim nodeid 1"

This is the point where fenced is supposed to set fence_time_local, which would be passed back to dlm_controld as the last_fenced_time.  In this case, it looks like it would have saved 1298044531 as fence_time_local (from the send_victim_done message for node1).  This would have satisfied the dlm_controld fencing time check.
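
To make that data flow concrete, here is a minimal sketch with made-up names (not the actual fenced structures): whatever time is recorded when a victim is resolved is the value later handed back to dlm_controld as last_fenced_time.

#include <stdint.h>

/* Hedged sketch, illustrative names only: fence_time_local stands in for
 * the per-node time fenced stores when a victim is resolved, and that
 * stored value is what dlm_controld receives as last_fenced_time. */
struct node_history {
    uint64_t fence_time_local;
};

/* should run whenever a victim is resolved, whether it was actually
 * fenced or forgiven after rejoining; 1298044531 in the log above */
static void record_fence_time(struct node_history *h, uint64_t when)
{
    h->fence_time_local = when;
}

/* the value reported back to dlm_controld for its fencing time check */
static uint64_t last_fenced_time(const struct node_history *h)
{
    return h->fence_time_local;
}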

Comment 2 David Teigland 2011-02-18 22:50:44 UTC
Created attachment 479623 [details]
possible fix

This is a quick hunch about what may fix this, basically removing a "return".
I'll want to step through the code again more closely and test it out.
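
For context, a hedged illustration of the kind of change described here, not the actual patch in attachment 479623: if the handler for a victim_done message returns early when the victim entry has already been dropped (the node was forgiven after rejoining), it never records the fencing time, so the value fenced later reports to dlm_controld stays stale.  Removing that return lets the time be saved in both cases.

#include <stdint.h>

/* Illustrative only, not the attached patch. */
static void receive_victim_done_sketch(uint64_t *fence_time_local,
                                       int victim_entry_found,
                                       uint64_t done_time)
{
    if (!victim_entry_found) {
        /* node rejoined cleanly and was forgiven ("no victim nodeid 1") */
        /* return;   <- the suspected early return; without it we fall
         *              through and still save the time below */
    }
    *fence_time_local = done_time;
}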

Comment 3 David Teigland 2011-02-18 22:58:13 UTC
I suspect that the patch for bug 624844 exposes this problem where we wouldn't have seen it previously.  Without that patch I don't believe we would have forgiven node1 when it rejoined; we would have fenced it (unnecessarily) and never arrived in the situation above.  Description of the patch for bz 624844 (a sketch of the delay selection it describes follows the commit message):

commit d8a76837ecd24a5f7ab9a2ebf72745e6c6752c0b
Author: David Teigland <teigland>
Date:   Tue Aug 17 16:42:30 2010 -0500

    fenced: use post_join_delay after cluster join
    
    When the cluster has lost quorum due to a node failure,
    the next event is generally a cluster node join which
    gives the cluster quorum again.  With quorum, fenced
    begins fencing any failed nodes, applying post_fail_delay
    since the last cpg event was a node failure.  In this
    case, however, post_join_delay is more appropriate since
    the chances are good that nodes being fenced will be joining.
    Detect this case where a node joins the cluster giving it
    quorum, and use post_join_delay.
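
A minimal sketch of the delay selection the commit message describes, with illustrative names rather than the real fenced internals:

#include <stdbool.h>

/* Hedged sketch of the policy above: the delay before fencing is normally
 * chosen by the type of the last cpg event, but when a node join is what
 * restored quorum, post_join_delay is applied even though the preceding
 * event was a failure, since the remaining victims are likely rebooting
 * and about to rejoin. */
struct delay_state {
    bool last_event_was_fail;  /* most recent cpg event was a node failure */
    bool join_gave_quorum;     /* a node join is what regained quorum */
    int  post_fail_delay;      /* seconds */
    int  post_join_delay;      /* seconds */
};

static int choose_delay(const struct delay_state *s)
{
    if (s->join_gave_quorum)
        return s->post_join_delay;
    return s->last_event_was_fail ? s->post_fail_delay
                                  : s->post_join_delay;
}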

Comment 4 David Teigland 2011-02-22 21:45:29 UTC
Tested as follows:

- nodes 2,4,5: service cman start, service clvmd start, mount gfs
- nodes 2,4: ifdown eth0
- nodes 2,4: reboot -fin
- wait for reboot
- node 2: service cman start
- node 5: check status from group_tool


Test run without patch, broken.  Output after last step of test:

[root@bull-05 fenced]# group_tool
fence domain
member count  2
victim count  0
victim now    0
master nodeid 5
wait state    none
members       2 5 

dlm lockspaces
name          big
id            0x2fd45270
flags         0x0000000c fs_reg,kern_stop
change        member 3 joined 1 remove 0 failed 0 seq 1,1
members       2 4 5 
new change    member 1 joined 0 remove 1 failed 1 seq 2,3
new status    wait_messages 0 wait_condition 1 fencing
new members   5 

name          clvmd
id            0x4104eefa
flags         0x00000004 kern_stop
change        member 3 joined 1 remove 0 failed 0 seq 1,1
members       2 4 5 
new change    member 1 joined 0 remove 1 failed 1 seq 2,3
new status    wait_messages 0 wait_condition 1 fencing
new members   5 

gfs mountgroups
name          big
id            0x10e1ad7f
flags         0x0000008c recover,mounted,blocked
change        member 1 joined 0 remove 1 failed 1 seq 2,3
members       5 


1298409894 receive_protocol from 2 max 1.1.1.0 run 0.0.0.0
1298409894 daemon node 2 max 0.0.0.0 run 0.0.0.0
1298409894 daemon node 2 join 1298409894 left 1298409570 local quorum 1298409889
1298409894 daemon_member 2 is clean
1298409894 reduce victim bull-02
1298409894 send_victim_done cg 3 flags 2 victim nodeid 2
1298409914 delay of 25s leaves 1 victims
1298409914 bull-04 not a cluster member after 20 sec post_join_delay
1298409914 receive_protocol from 5 max 1.1.1.0 run 1.1.1.1
1298409914 daemon node 5 max 1.1.1.0 run 1.1.1.0
1298409914 daemon node 5 join 1298409484 left 0 local quorum 1298409889
1298409914 receive_protocol from 2 max 1.1.1.0 run 1.1.1.0
1298409914 daemon node 2 max 1.1.1.0 run 0.0.0.0
1298409914 daemon node 2 join 1298409894 left 1298409570 local quorum 1298409889
1298409914 fencing node bull-04
1298409914 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298409914 fence bull-04 failed
1298409914 failed to get dbus connection
1298409917 fencing node bull-04
1298409917 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298409917 fence bull-04 failed
1298409917 failed to get dbus connection
1298409920 fencing node bull-04
1298409920 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298409920 fence bull-04 failed
1298409920 failed to get dbus connection
1298409921 fence bull-04 overridden by administrator intervention
1298409921 send_victim_done cg 3 flags 2 victim nodeid 4
1298409921 send_complete 5:3 flags 2 started 1 m 1 j 0 r 1 f 1
1298409921 receive_victim_done 5:3 flags 2 len 80
1298409921 receive_victim_done 5:3 no victim nodeid 2
1298409921 fenced:default conf 2 1 0 memb 2 5 join 2 left
1298409921 add_change cg 4 joined nodeid 2
1298409921 add_change cg 4 m 2 j 1 r 0 f 0
1298409921 check_ringid done cluster 468 cpg 2:468
1298409921 check_quorum done


Test run with patch, works correctly.  Output after last step of test:

[root@bull-05 ~]# group_tool
fence domain
member count  2   
victim count  0   
victim now    0   
master nodeid 5   
wait state    none
members       2 5 

dlm lockspaces
name          big 
id            0x2fd45270
flags         0x00000008 fs_reg
change        member 1 joined 0 remove 1 failed 1 seq 2,3
members       5   

name          clvmd
id            0x4104eefa
flags         0x00000000
change        member 1 joined 0 remove 1 failed 1 seq 2,3
members       5   

gfs mountgroups   
name          big 
id            0x10e1ad7f
flags         0x00000008 mounted
change        member 1 joined 0 remove 1 failed 1 seq 2,3
members       5   


1298408708 receive_protocol from 2 max 1.1.1.0 run 0.0.0.0
1298408708 daemon node 2 max 0.0.0.0 run 0.0.0.0
1298408708 daemon node 2 join 1298408708 left 1298408453 local quorum 1298408704
1298408708 daemon_member 2 is clean
1298408708 reduce victim bull-02
1298408708 send_victim_done cg 3 flags 2 victim nodeid 2
1298408728 delay of 24s leaves 1 victims
1298408728 bull-04 not a cluster member after 20 sec post_join_delay
1298408728 receive_protocol from 5 max 1.1.1.0 run 1.1.1.1
1298408728 daemon node 5 max 1.1.1.0 run 1.1.1.0
1298408728 daemon node 5 join 1298408288 left 0 local quorum 1298408704
1298408728 receive_protocol from 2 max 1.1.1.0 run 1.1.1.0
1298408728 daemon node 2 max 1.1.1.0 run 0.0.0.0
1298408728 daemon node 2 join 1298408708 left 1298408453 local quorum 1298408704
1298408728 fencing node bull-04
1298408728 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298408728 fence bull-04 failed
1298408728 failed to get dbus connection
1298408731 fencing node bull-04
1298408731 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298408731 fence bull-04 failed
1298408731 failed to get dbus connection
1298408734 fencing node bull-04
1298408734 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298408734 fence bull-04 failed
1298408734 failed to get dbus connection
1298408749 fence bull-04 overridden by administrator intervention
1298408749 send_victim_done cg 3 flags 2 victim nodeid 4
1298408749 send_complete 5:3 flags 2 started 1 m 1 j 0 r 1 f 1
1298408749 receive_victim_done 5:3 flags 2 len 80
1298408749 receive_victim_done 5:3 no victim nodeid 2
1298408749 receive_victim_done 5:3 remove victim 2 time 1298408708 how 2
> The line above is missing in the debug log from the broken demonstration
1298408749 fenced:default conf 2 1 0 memb 2 5 join 2 left
1298408749 add_change cg 4 joined nodeid 2
1298408749 add_change cg 4 m 2 j 1 r 0 f 0
1298408749 check_ringid done cluster 440 cpg 2:440
1298408749 check_quorum done

Comment 7 Nate Straz 2011-04-11 14:54:54 UTC
Got through 40+ iterations of revolver over the weekend.  Marking this verified.

Comment 8 errata-xmlrpc 2011-05-19 12:54:39 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0537.html

