Bug 750314 - fenced/dlm_controld: fix handling of startup partition merge
Summary: fenced/dlm_controld: fix handling of startup partition merge
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Keywords:
Depends On:
Blocks: 756082
Reported: 2011-10-31 16:46 UTC by David Teigland
Modified: 2012-06-20 13:58 UTC (History)

Cause: a cluster partition and merge during startup fencing was not detected correctly.
Consequence: dlm lockspace operations are stuck.
Fix: detect and handle this event correctly.
Result: dlm lockspace operations are not stuck.
Clone Of:
Last Closed: 2012-06-20 13:58:27 UTC


Attachments
fenced patch (3.12 KB, text/plain), 2011-10-31 18:12 UTC, David Teigland
dlm_controld patch (1.97 KB, text/plain), 2011-10-31 18:12 UTC, David Teigland


External Trackers
Tracker: Red Hat Product Errata
ID: RHBA-2012:0861
Priority: normal
Status: SHIPPED_LIVE
Summary: cluster and gfs2-utils bug fix and enhancement update
Last Updated: 2013-05-30 19:58:20 UTC

Description David Teigland 2011-10-31 16:46:03 UTC
This problem was reported by Radek Steiger:
http://post-office.corp.redhat.com/archives/cluster-list/2011-October/msg00096.html
http://post-office.corp.redhat.com/archives/cluster-list/2011-October/msg00111.html

The problem is that a cluster partition+merge occurs during cluster startup before the first node completes startup fencing.  fenced and dlm_controld do not handle this scenario correctly.

This is the method I used to recreate the problem and test the fixes:

<?xml version="1.0"?>
<cluster name="bull" config_version="1">
<dlm log_debug="1"/>
<clusternodes>
<clusternode name="bull-01" nodeid="1"/>
<clusternode name="bull-02" nodeid="2"/>
<clusternode name="bull-04" nodeid="4"/>
</clusternodes>
</cluster>

1,2: service cman start setup
1,2: cman_tool join
2: fenced
1: fenced
2: dlm_controld
1: dlm_controld
2: fence_tool join (will enter loop failing startup fencing of bull-04)
1: fence_tool join
1,2: dlm_tool join foo (will block)
1,2: create partition between 1,2
1,2: wait for partition to be detected
1,2: remove partition between 1,2
2: fence_ack_manual -n bull-04
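
The "wait for partition to be detected" step can be scripted rather than watched by hand. The following is only a sketch: the helper names node_status and wait_for_partition are hypothetical, the one-second interval and 60 retries are arbitrary, and it assumes that `cman_tool nodes` prints the node ID in the first column and a status flag in the second ("M" for a member, "X" for a node that is no longer a member).

```shell
#!/bin/sh
# Hypothetical helpers, not part of the cluster package.

# Print the status flag for one node ID, reading `cman_tool nodes`
# output on stdin. Assumes column 1 = node ID, column 2 = status.
node_status() {
    awk -v id="$1" '$1 == id { print $2 }'
}

# Poll until the given node ID is reported with status "X".
# Returns 0 once that happens, 1 if it never does within 60 tries.
wait_for_partition() {
    tries=0
    while [ "$tries" -lt 60 ]; do
        if [ "$(cman_tool nodes 2>/dev/null | node_status "$1")" = "X" ]; then
            return 0
        fi
        sleep 1
        tries=$((tries + 1))
    done
    return 1
}
```

On node 1 this would be called as `wait_for_partition 2` after the partition is created, and on node 2 as `wait_for_partition 1`.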

At this point, without any fixes:
- fenced is ok, but for incorrect reasons
- dlm_controld is confused by the partition merge
- dlm_tool join is stuck due to dlm_controld confusion
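
The report does not say which commands were used to observe this state. One plausible way to capture it on each node, assuming the cman_tool, fence_tool and dlm_tool utilities from the cluster package are installed, is sketched below. The function name collect_state is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print membership, fence domain and lockspace
# state in one pass. Commands that are not installed are skipped, so
# the script is harmless on a machine without the cluster tools.
collect_state() {
    for cmd in "cman_tool nodes" "fence_tool ls" "dlm_tool ls"; do
        echo "== $cmd"
        # Split the command string into words so it can be executed.
        set -- $cmd
        if command -v "$1" >/dev/null 2>&1; then
            "$@" 2>&1 || echo "(exit status $?)"
        else
            echo "(skipped: $1 not installed)"
        fi
    done
}
```

Running `collect_state` on both nodes before and after the merge makes it easier to compare what fenced and dlm_controld each believe about membership. `fence_tool dump` and `dlm_tool dump` give the daemons' debug logs if more detail is needed.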

Comment 1 David Teigland 2011-10-31 18:12:04 UTC
Created attachment 531008
fenced patch

see comment in patch

Comment 2 David Teigland 2011-10-31 18:12:32 UTC
Created attachment 531009
dlm_controld patch

see comment in patch

Comment 3 David Teigland 2011-10-31 18:22:28 UTC
The two patches make the sequence above work correctly.

Also verified that the sequence works as expected when the fence_ack_manual is done before the partition (i.e. one node needs to be reset).

Also verified that two other historically difficult partition+merge tests still work as expected:

test 1
------
- nodes 1,2,3,4
- all: no fencing configured
- all: service cman start
- all: dlm_tool join foo
- use iptables to create network partition 1 | 2,3,4
- wait for partition to be detected
- remove network partition resulting in merge 1,2,3,4
- 2,3,4: should kill corosync on node 1 automatically
- 1: reboot
- 1: service cman start
- 1: dlm_tool join foo

test 2
------
- nodes 1,2,3,4
- all: no fencing configured
- all: service cman start
- all: dlm_tool join foo
- use iptables to create network partition 1,2 | 3,4
- wait for partition to be detected
- remove network partition resulting in merge 1,2,3,4
- 1,2: reboot  (or 3,4)
- 1,2: service cman start
- 1,2: dlm_tool join foo
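
The tests above say only "use iptables" without giving the rules. A minimal sketch of one way to do it follows. The helper partition_rules is hypothetical, the peer addresses are placeholders because the report does not list the nodes' IPs, and the helper only prints the commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) iptables commands that cut
# this node off from the listed peers, or that remove those rules.
#   $1       "add" to create the partition, "del" to remove it
#   $2...    peer addresses on the other side of the partition
partition_rules() {
    action=$1
    shift
    case "$action" in
        add) flag=-I ;;
        del) flag=-D ;;
        *)
            echo "usage: partition_rules add|del peer..." >&2
            return 1
            ;;
    esac
    for peer in "$@"; do
        # Drop traffic arriving from the peer and traffic sent to it.
        echo "iptables $flag INPUT -s $peer -j DROP"
        echo "iptables $flag OUTPUT -d $peer -j DROP"
    done
}
```

For the "1 | 2,3,4" partition in test 1, node 1 would use something like `partition_rules add 192.0.2.2 192.0.2.3 192.0.2.4` and nodes 2, 3 and 4 would each use `partition_rules add 192.0.2.1`, with the 192.0.2.x addresses standing in for the real ones. The same calls with `del` print the commands that remove the rules and allow the merge.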

Comment 5 David Teigland 2012-03-01 22:09:27 UTC
pushed to RHEL6 branch

Comment 7 David Teigland 2012-03-01 22:38:16 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: a cluster partition and merge during startup fencing was not detected correctly.
Consequence: dlm lockspace operations are stuck.
Fix: detect and handle this event correctly.
Result: dlm lockspace operations are not stuck.

Comment 9 Justin Payne 2012-05-22 19:03:48 UTC
Verified in cluster-3.0.12.1-32.el6.x86_64

Cluster.conf:

<?xml version="1.0"?>
<cluster name="dash" config_version="1">
<dlm log_debug="1"/>
<clusternodes>
<clusternode name="dash-01" nodeid="1"/>
<clusternode name="dash-02" nodeid="2"/>
<clusternode name="dash-03" nodeid="3"/>
</clusternodes>
</cluster>

Followed the steps in the Description.

Comment 11 errata-xmlrpc 2012-06-20 13:58:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0861.html

