
Bug 678704

Summary: dlm recovery did not complete after fencing
Product: Red Hat Enterprise Linux 6
Component: cluster
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Target Milestone: rc
Target Release: ---
Keywords: Regression
Whiteboard:
Fixed In Version: cluster-3.0.12-34.el6
Doc Type: Bug Fix
Doc Text:
Reporter: Nate Straz <nstraz>
Assignee: David Teigland <teigland>
QA Contact: Cluster QE <mspqa-list>
Docs Contact:
CC: ccaulfie, cluster-maint, fdinitto, lhh, rpeterso, teigland
Clone Of:
Environment:
Story Points: ---
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Last Closed: 2011-05-19 12:54:39 UTC

Attachments:
  possible fix (flags: none)

Description Nate Straz 2011-02-18 22:21:32 UTC
Description of problem:

During recovery testing (with revolver) the cluster got stuck on an lvm command.  Upon inspection I found DLM locks which indicated a dead node was their master.

 Scenario iteration 0.5 started at Fri Feb 18 09:48:36 CST 2011
 Sleeping 5 minute(s) to let the I/O get its lock count up...
 Senario: DLM kill Quorum plus one

 Those picked to face the revolver... dash-01 dash-02

dash-03 shows that -01 and -02 are lost

Feb 18 09:54:13 dash-03 corosync[1400]:   [QUORUM] Members[2]: 2 3
Feb 18 09:54:13 dash-03 corosync[1400]:   [CMAN  ] quorum lost, blocking activity
Feb 18 09:54:13 dash-03 corosync[1400]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Feb 18 09:54:13 dash-03 corosync[1400]:   [QUORUM] Members[1]: 3

dash-01 rejoins the cluster

Feb 18 09:55:27 dash-03 corosync[1400]:   [CMAN  ] quorum regained, resuming activity
Feb 18 09:55:27 dash-03 corosync[1400]:   [QUORUM] This node is within the primary component and will provide service.
Feb 18 09:55:27 dash-03 corosync[1400]:   [QUORUM] Members[2]: 1 3

dash-03 fences dash-02

Feb 18 09:55:51 dash-03 fenced[1459]: fencing node dash-02
Feb 18 09:55:58 dash-03 fenced[1459]: fence dash-02 success

dash-02 eventually rejoins

Feb 18 09:58:14 dash-03 corosync[1400]:   [QUORUM] Members[3]: 1 2 3

locks in the clvmd lockspace show dash-02 is still the master of those locks

[root@dash-03 ~]# dlm_tool lockdump clvmd
id 00ee0001 gr CR rq IV pid 1634 master 2 "4Q9X8YcvUxvflKuRPuiUbX62RA0Yt1hZVrABUV264vkcr0y6DV5ewsVs70ARiLCd"
id 02950001 gr CR rq IV pid 1634 master 2 "4Q9X8YcvUxvflKuRPuiUbX62RA0Yt1hZ27kOjC2raXlLwUDCme3nWBdK0Z2ZriLN"
id 01d20001 gr CR rq IV pid 1634 master 2 "4Q9X8YcvUxvflKuRPuiUbX62RA0Yt1hZkvOX59R5SXzW5Jp0RRnOgHtm8pRxOcOD"


Version-Release number of selected component (if applicable):
cman-3.0.12-33.el6.x86_64
kernel-2.6.32-114.0.1.el6.x86_64
openais-1.1.1-7.el6.x86_64
corosync-1.2.3-26.el6.x86_64


How reproducible:
Unknown


Comment 1 David Teigland 2011-02-18 22:46:11 UTC
dlm_controld is waiting for fenced to report a recent fencing time for node1 (more recent than the time it failed), but the last_fenced_time from fenced is older than that.

From dlm_controld:

> 1298044453 clvmd check_fencing 1 wait add 1298044112 fail 1298044453 last 1298044007
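
(For reference, a minimal sketch of that check; the struct layout and names here are illustrative, not dlm_controld's actual code.  Recovery for the lockspace cannot proceed until fenced reports a fencing time newer than the node's failure time; with the values in the log line above, last 1298044007 is older than fail 1298044453, so the condition never becomes true.)

/* Illustrative sketch only, not dlm_controld's real code. */
#include <stdint.h>
#include <stdio.h>

struct fence_check {
	int nodeid;
	uint64_t add_time;		/* node added to the lockspace */
	uint64_t fail_time;		/* node seen to fail */
	uint64_t last_fenced_time;	/* fencing time reported by fenced */
};

/* dlm recovery may continue only once the node was fenced after it failed. */
static int check_fencing_done(const struct fence_check *c)
{
	return c->last_fenced_time > c->fail_time;
}

int main(void)
{
	/* values taken from the log line above */
	struct fence_check node1 = {
		.nodeid = 1,
		.add_time = 1298044112,
		.fail_time = 1298044453,
		.last_fenced_time = 1298044007,
	};

	printf("node %d fencing done: %d\n", node1.nodeid, check_fencing_done(&node1));
	return 0;
}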

From fenced:

1298044531 fenced:daemon conf 2 1 0 memb 1 3 join 1 left
1298044531 receive_protocol from 1 max 1.1.1.0 run 0.0.0.0
1298044531 daemon node 1 max 0.0.0.0 run 0.0.0.0
1298044531 daemon node 1 join 1298044531 left 1298044453 local quorum 1298044527
1298044531 receive_protocol from 3 max 1.1.1.0 run 1.1.1.1
1298044531 daemon node 3 max 1.1.1.0 run 1.1.1.1
1298044531 daemon node 3 join 1298043142 left 0 local quorum 1298044527
1298044531 daemon_member 1 is clean
1298044531 reduce victim dash-01
1298044531 send_victim_done cg 7 flags 2 victim nodeid 1
1298044551 delay of 24s leaves 1 victims
1298044551 dash-02 not a cluster member after 20 sec post_join_delay
1298044551 receive_protocol from 1 max 1.1.1.0 run 1.1.1.0
1298044551 daemon node 1 max 1.1.1.0 run 0.0.0.0
1298044551 daemon node 1 join 1298044531 left 1298044453 local quorum 1298044527
1298044551 fencing node dash-02
1298044558 fence dash-02 dev 0.0 agent fence_apc result: success
1298044558 fence dash-02 success
1298044558 failed to get dbus connection
1298044558 send_victim_done cg 7 flags 2 victim nodeid 2
1298044558 send_complete 3:7 flags 2 started 5 m 1 j 0 r 1 f 1
1298044558 receive_victim_done 3:7 flags 2 len 80
1298044558 receive_victim_done 3:7 no victim nodeid 1
1298044558 fenced:default conf 2 1 0 memb 1 3 join 1 left
1298044558 add_change cg 8 joined nodeid 1
1298044558 add_change cg 8 m 2 j 1 r 0 f 0
1298044558 check_ringid done cluster 1328 cpg 1:1328
1298044558 check_quorum done
1298044558 send_start 3:8 flags 2 started 6 m 2 j 1 r 0 f 0
1298044558 receive_start 1:1 len 192
1298044558 match_change 1:1 matches cg 8
1298044558 wait_messages cg 8 need 1 of 2
1298044558 receive_victim_done 3:7 flags 2 len 80
1298044558 receive_victim_done 3:7 remove victim 2 time 1298044558 how 1

The problem appears to originate from the fact that the victim entry for dash-01 is removed due to dash-01 being forgiven (after joining cleanly):

"no victim nodeid 1"

This is the point where fenced is supposed to set fence_time_local, which would be passed back to dlm_controld as the last_fenced_time.  In this case, it looks like it would have saved 1298044531 as fence_time_local (from the send_victim_done message for node1).  This would have satisfied the dlm_controld fencing time check.

Comment 2 David Teigland 2011-02-18 22:50:44 UTC
Created attachment 479623 [details]
possible fix

This is a quick hunch about what may fix this, basically removing a "return".
I'll want to step through the code again more closely and test it out.
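
(A minimal sketch of the failure mode and the shape of such a fix; the names below are illustrative, not fenced's actual code.  When the victim has already been forgiven, an early return skips the point where the fencing time would be recorded, so dlm_controld never sees a time newer than the failure.)

/* Illustrative sketch only, not fenced's real code. */
#include <stdint.h>
#include <stdio.h>

struct victim {
	int nodeid;
	int on_list;			/* still on the victims list? */
	uint64_t fence_time_local;	/* what dlm_controld reads back as last_fenced_time */
};

/* Record the time this victim was dealt with. */
static void record_fence_time(struct victim *v, uint64_t done_time)
{
	v->fence_time_local = done_time;
}

static void receive_victim_done(struct victim *v, uint64_t done_time)
{
	if (!v->on_list) {
		/* Node rejoined cleanly and was forgiven, so it is no longer
		 * a victim ("no victim nodeid 1").  Returning here skips
		 * record_fence_time(), leaving last_fenced_time stale. */
		return;		/* <- the suspect early return */
	}
	record_fence_time(v, done_time);
}

int main(void)
{
	/* previous fencing time matches "last 1298044007" in comment 1 */
	struct victim node1 = { .nodeid = 1, .on_list = 0, .fence_time_local = 1298044007 };

	receive_victim_done(&node1, 1298044531);	/* send_victim_done time for node1 */
	printf("node %d last_fenced_time %llu (still older than fail 1298044453)\n",
	       node1.nodeid, (unsigned long long)node1.fence_time_local);
	return 0;
}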

Comment 3 David Teigland 2011-02-18 22:58:13 UTC
I suspect that the patch for bug 624844 allows us to see this problem when we wouldn't have previously.  Without that patch I don't believe we would have forgiven node1 when it rejoined, but would have fenced it (unnecessarily), and not arrived in the situation above.  Description of patch for bz 624844:

commit d8a76837ecd24a5f7ab9a2ebf72745e6c6752c0b
Author: David Teigland <teigland>
Date:   Tue Aug 17 16:42:30 2010 -0500

    fenced: use post_join_delay after cluster join
    
    When the cluster has lost quorum due to a node failure,
    the next event is generally a cluster node join which
    gives the cluster quorum again.  With quorum, fenced
    begins fencing any failed nodes, applying post_fail_delay
    since the last cpg event was a node failure.  In this
    case, however, post_join_delay is more appropriate since
    the chances are good that nodes being fenced will be joining.
    Detect this case where a node joins the cluster giving it
    quorum, and use post_join_delay.
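
(To illustrate, a rough sketch of the behavior that commit describes; the names are made up rather than fenced's actual code.  The 20-second value matches the post_join_delay seen in the logs above; the post_fail_delay value is arbitrary.)

/* Illustrative sketch only, not fenced's real code. */
#include <stdio.h>

enum cpg_event { EVENT_JOIN, EVENT_FAIL };

struct delay_config {
	int post_join_delay;	/* seconds to wait before fencing after a join */
	int post_fail_delay;	/* seconds to wait before fencing after a failure */
};

static int pick_delay(const struct delay_config *cfg,
		      enum cpg_event last_event, int join_gave_quorum)
{
	/* A join that restores quorum is treated like a join even if the
	 * most recent cpg event was a failure, because the failed nodes
	 * awaiting fencing are likely about to rejoin on their own. */
	if (last_event == EVENT_JOIN || join_gave_quorum)
		return cfg->post_join_delay;
	return cfg->post_fail_delay;
}

int main(void)
{
	struct delay_config cfg = { .post_join_delay = 20, .post_fail_delay = 0 };

	printf("delay after quorum-restoring join: %ds\n",
	       pick_delay(&cfg, EVENT_FAIL, 1));
	return 0;
}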

Comment 4 David Teigland 2011-02-22 21:45:29 UTC
Tested as follows:

- nodes 2,4,5: service cman start, service clvmd start, mount gfs
- nodes 2,4: ifdown eth0
- nodes 2,4: reboot -fin
- wait for reboot
- node 2: service cman start
- node 5: check status from group_tool


Test run without patch, broken.  Output after last step of test:

[root@bull-05 fenced]# group_tool
fence domain
member count  2
victim count  0
victim now    0
master nodeid 5
wait state    none
members       2 5 

dlm lockspaces
name          big
id            0x2fd45270
flags         0x0000000c fs_reg,kern_stop
change        member 3 joined 1 remove 0 failed 0 seq 1,1
members       2 4 5 
new change    member 1 joined 0 remove 1 failed 1 seq 2,3
new status    wait_messages 0 wait_condition 1 fencing
new members   5 

name          clvmd
id            0x4104eefa
flags         0x00000004 kern_stop
change        member 3 joined 1 remove 0 failed 0 seq 1,1
members       2 4 5 
new change    member 1 joined 0 remove 1 failed 1 seq 2,3
new status    wait_messages 0 wait_condition 1 fencing
new members   5 

gfs mountgroups
name          big
id            0x10e1ad7f
flags         0x0000008c recover,mounted,blocked
change        member 1 joined 0 remove 1 failed 1 seq 2,3
members       5 


1298409894 receive_protocol from 2 max 1.1.1.0 run 0.0.0.0
1298409894 daemon node 2 max 0.0.0.0 run 0.0.0.0
1298409894 daemon node 2 join 1298409894 left 1298409570 local quorum 1298409889
1298409894 daemon_member 2 is clean
1298409894 reduce victim bull-02
1298409894 send_victim_done cg 3 flags 2 victim nodeid 2
1298409914 delay of 25s leaves 1 victims
1298409914 bull-04 not a cluster member after 20 sec post_join_delay
1298409914 receive_protocol from 5 max 1.1.1.0 run 1.1.1.1
1298409914 daemon node 5 max 1.1.1.0 run 1.1.1.0
1298409914 daemon node 5 join 1298409484 left 0 local quorum 1298409889
1298409914 receive_protocol from 2 max 1.1.1.0 run 1.1.1.0
1298409914 daemon node 2 max 1.1.1.0 run 0.0.0.0
1298409914 daemon node 2 join 1298409894 left 1298409570 local quorum 1298409889
1298409914 fencing node bull-04
1298409914 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298409914 fence bull-04 failed
1298409914 failed to get dbus connection
1298409917 fencing node bull-04
1298409917 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298409917 fence bull-04 failed
1298409917 failed to get dbus connection
1298409920 fencing node bull-04
1298409920 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298409920 fence bull-04 failed
1298409920 failed to get dbus connection
1298409921 fence bull-04 overridden by administrator intervention
1298409921 send_victim_done cg 3 flags 2 victim nodeid 4
1298409921 send_complete 5:3 flags 2 started 1 m 1 j 0 r 1 f 1
1298409921 receive_victim_done 5:3 flags 2 len 80
1298409921 receive_victim_done 5:3 no victim nodeid 2
1298409921 fenced:default conf 2 1 0 memb 2 5 join 2 left
1298409921 add_change cg 4 joined nodeid 2
1298409921 add_change cg 4 m 2 j 1 r 0 f 0
1298409921 check_ringid done cluster 468 cpg 2:468
1298409921 check_quorum done


Test run with patch, works correctly.  Output after last step of test:

[root@bull-05 ~]# group_tool
fence domain
member count  2   
victim count  0   
victim now    0   
master nodeid 5   
wait state    none
members       2 5 

dlm lockspaces
name          big 
id            0x2fd45270
flags         0x00000008 fs_reg
change        member 1 joined 0 remove 1 failed 1 seq 2,3
members       5   

name          clvmd
id            0x4104eefa
flags         0x00000000
change        member 1 joined 0 remove 1 failed 1 seq 2,3
members       5   

gfs mountgroups   
name          big 
id            0x10e1ad7f
flags         0x00000008 mounted
change        member 1 joined 0 remove 1 failed 1 seq 2,3
members       5   


1298408708 receive_protocol from 2 max 1.1.1.0 run 0.0.0.0
1298408708 daemon node 2 max 0.0.0.0 run 0.0.0.0
1298408708 daemon node 2 join 1298408708 left 1298408453 local quorum 1298408704
1298408708 daemon_member 2 is clean
1298408708 reduce victim bull-02
1298408708 send_victim_done cg 3 flags 2 victim nodeid 2
1298408728 delay of 24s leaves 1 victims
1298408728 bull-04 not a cluster member after 20 sec post_join_delay
1298408728 receive_protocol from 5 max 1.1.1.0 run 1.1.1.1
1298408728 daemon node 5 max 1.1.1.0 run 1.1.1.0
1298408728 daemon node 5 join 1298408288 left 0 local quorum 1298408704
1298408728 receive_protocol from 2 max 1.1.1.0 run 1.1.1.0
1298408728 daemon node 2 max 1.1.1.0 run 0.0.0.0
1298408728 daemon node 2 join 1298408708 left 1298408453 local quorum 1298408704
1298408728 fencing node bull-04
1298408728 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298408728 fence bull-04 failed
1298408728 failed to get dbus connection
1298408731 fencing node bull-04
1298408731 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298408731 fence bull-04 failed
1298408731 failed to get dbus connection
1298408734 fencing node bull-04
1298408734 fence bull-04 dev 0.0 agent /bin/false result: error from agent
1298408734 fence bull-04 failed
1298408734 failed to get dbus connection
1298408749 fence bull-04 overridden by administrator intervention
1298408749 send_victim_done cg 3 flags 2 victim nodeid 4
1298408749 send_complete 5:3 flags 2 started 1 m 1 j 0 r 1 f 1
1298408749 receive_victim_done 5:3 flags 2 len 80
1298408749 receive_victim_done 5:3 no victim nodeid 2
1298408749 receive_victim_done 5:3 remove victim 2 time 1298408708 how 2
> The line above is missing in the debug log from the broken demonstration
1298408749 fenced:default conf 2 1 0 memb 2 5 join 2 left
1298408749 add_change cg 4 joined nodeid 2
1298408749 add_change cg 4 m 2 j 1 r 0 f 0
1298408749 check_ringid done cluster 440 cpg 2:440
1298408749 check_quorum done

Comment 7 Nate Straz 2011-04-11 14:54:54 UTC
Got through 40+ iterations of revolver over the weekend.  Marking this verified.

Comment 8 errata-xmlrpc 2011-05-19 12:54:39 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0537.html