Bug 750314

Summary: fenced/dlm_controld: fix handling of startup partition merge
Product: Red Hat Enterprise Linux 6
Reporter: David Teigland <teigland>
Component: cluster
Assignee: David Teigland <teigland>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: high
Docs Contact:
Priority: high
Version: 6.2
CC: ccaulfie, cluster-maint, djansa, jpayne, lhh, rpeterso, rsteiger, teigland
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: cluster-3.0.12.1-27.el6
Doc Type: Bug Fix
Doc Text:
Cause: a cluster partition and merge during startup fencing was not detected correctly. Consequence: dlm lockspace operations are stuck. Fix: detect and handle this event correctly. Result: dlm lockspace operations are not stuck.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-06-20 13:58:27 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 756082    
Attachments:
Description          Flags
fenced patch         none
dlm_controld patch   none

Description David Teigland 2011-10-31 16:46:03 UTC
This problem was reported by Radek Steiger:
http://post-office.corp.redhat.com/archives/cluster-list/2011-October/msg00096.html
http://post-office.corp.redhat.com/archives/cluster-list/2011-October/msg00111.html

The problem is that a cluster partition+merge occurs during cluster startup before the first node completes startup fencing.  fenced and dlm_controld do not handle this scenario correctly.

This is the method I used to recreate the problem and test the fixes:

<?xml version="1.0"?>
<cluster name="bull" config_version="1">
<dlm log_debug="1"/>
<clusternodes>
<clusternode name="bull-01" nodeid="1"/>
<clusternode name="bull-02" nodeid="2"/>
<clusternode name="bull-04" nodeid="4"/>
</clusternodes>
</cluster>

1,2: service cman start setup
1,2: cman_tool join
2: fenced
1: fenced
2: dlm_controld
1: dlm_controld
2: fence_tool join (will enter loop failing startup fencing of bull-04)
1: fence_tool join
1,2: dlm_tool join foo (will block)
1,2: create partition between 1,2
1,2: wait for partition to be detected
1,2: remove partition between 1,2
2: fence_ack_manual -n bull-04

At this point, without any fixes:
- fenced is ok, but for incorrect reasons
- dlm_controld is confused by the partition merge
- dlm_tool join is stuck due to dlm_controld confusion
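
The "create partition" and "remove partition" steps above do not say how the partition is made. One possible way is to drop traffic with iptables. The following is only a sketch under that assumption: the helper name partition_cmds and the address 192.0.2.2 (standing in for bull-02) are placeholders, not part of the original recipe. The function prints the commands instead of running them, so they can be reviewed before being piped to sh as root.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the iptables commands that cut
# this node off from the given peer addresses, or that undo the cut.
# Usage: partition_cmds add|del PEER_IP...
partition_cmds() {
    action=$1
    shift
    case "$action" in
        add) flag=-A ;;   # append DROP rules: create the partition
        del) flag=-D ;;   # delete the same rules: remove the partition
        *) echo "usage: partition_cmds add|del PEER_IP..." >&2; return 1 ;;
    esac
    for peer in "$@"; do
        # Drop traffic in both directions between this node and the peer.
        echo "iptables $flag INPUT -s $peer -j DROP"
        echo "iptables $flag OUTPUT -d $peer -j DROP"
    done
}

# Example on bull-01 (placeholder address for bull-02):
#   partition_cmds add 192.0.2.2 | sh    # create partition between 1,2
#   partition_cmds del 192.0.2.2 | sh    # remove partition between 1,2
```

Because both INPUT and OUTPUT are dropped, running this on one of the two nodes should be enough to separate them; running it on both is harmless. Passing several addresses covers larger splits such as 1 | 2,3,4.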

Comment 1 David Teigland 2011-10-31 18:12:04 UTC
Created attachment 531008
fenced patch

see comment in patch

Comment 2 David Teigland 2011-10-31 18:12:32 UTC
Created attachment 531009
dlm_controld patch

see comment in patch

Comment 3 David Teigland 2011-10-31 18:22:28 UTC
The two patches make the sequence above work correctly.

Also verified that the sequence works as expected when the fence_ack_manual is done before the partition (i.e. one node needs to be reset).

Also verified that two other historically difficult partition+merge tests still work as expected:

test 1
------
- nodes 1,2,3,4
- all: no fencing configured
- all: service cman start
- all: dlm_tool join foo
- use iptables to create network partition 1 | 2,3,4
- wait for partition to be detected
- remove network partition resulting in merge 1,2,3,4
- 2,3,4: should kill corosync on node 1 automatically
- 1: reboot
- 1: service cman start
- 1: dlm_tool join foo

test 2
------
- nodes 1,2,3,4
- all: no fencing configured
- all: service cman start
- all: dlm_tool join foo
- use iptables to create network partition 1,2 | 3,4
- wait for partition to be detected
- remove network partition resulting in merge 1,2,3,4
- 1,2: reboot  (or 3,4)
- 1,2: service cman start
- 1,2: dlm_tool join foo
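
The "wait for partition to be detected" step in these tests can be scripted rather than watched by hand. The sketch below polls `cman_tool nodes` and counts the nodes shown as members. It assumes the usual column layout of that command (one header line, then node id, status, incarnation, join time, name, with "M" in the status column for a member); the helper names count_members and wait_for_members are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count nodes reported as members ("M" in column 2).
# Reads `cman_tool nodes` style output on stdin; the first line is the header.
count_members() {
    awk 'NR > 1 && $2 == "M" { n++ } END { print n + 0 }'
}

# Hypothetical helper: poll once per second until the member count equals
# EXPECTED, giving up after TRIES attempts (default 60).
# Usage: wait_for_members EXPECTED [TRIES]
wait_for_members() {
    expected=$1
    tries=${2:-60}
    while [ "$tries" -gt 0 ]; do
        if [ "$(cman_tool nodes | count_members)" -eq "$expected" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

For test 1, node 1 would wait for 1 member and nodes 2,3,4 for 3; for test 2, each side would wait for 2. After the partition is removed, waiting for 4 again shows that the merge has been seen.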

Comment 5 David Teigland 2012-03-01 22:09:27 UTC
pushed to RHEL6 branch

Comment 7 David Teigland 2012-03-01 22:38:16 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: a cluster partition and merge during startup fencing was not detected correctly.
Consequence: dlm lockspace operations are stuck.
Fix: detect and handle this event correctly.
Result: dlm lockspace operations are not stuck.

Comment 9 Justin Payne 2012-05-22 19:03:48 UTC
Verified in cluster-3.0.12.1-32.el6.x86_64

Cluster.conf:

<?xml version="1.0"?>
<cluster name="dash" config_version="1">
<dlm log_debug="1"/>
<clusternodes>
<clusternode name="dash-01" nodeid="1"/>
<clusternode name="dash-02" nodeid="2"/>
<clusternode name="dash-03" nodeid="3"/>
</clusternodes>
</cluster>

Followed the steps in the Description.

Comment 11 errata-xmlrpc 2012-06-20 13:58:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0861.html