Bug 599747

Summary: groupd should work around incorrect cpg confchg reason
Product: Red Hat Enterprise Linux 5
Component: cman
Version: 5.7
Reporter: David Teigland <teigland>
Assignee: David Teigland <teigland>
Status: CLOSED NOTABUG
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, djansa, edamato, jkortus
Severity: medium
Priority: low
Target Milestone: rc
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-08-05 21:23:44 UTC

Attachments: patch

Description David Teigland 2010-06-03 20:16:34 UTC
Description of problem:

When another groupd process fails, e.g. killall -9 groupd,
the "reason" in the cpg confchg is LEAVE, instead of PROCDOWN.
(openais bz 599654)
This means groupd processes the removal of the node as a
clean shutdown instead of recovering it as a failure.

Work around this by detecting when a node is removed from the
groupd cpg but is still a member of other cpg's for fenced/dlm/gfs,
and treat the "leave" as a "procdown" in that case.

This workaround probably obviates the workaround for bz 521817,
but leaves it in place.


Comment 1 David Teigland 2010-06-03 20:20:36 UTC
Created attachment 419564 [details]
patch

    groupd: recover for groupd failure
    
    bz 599747
    
    When another groupd process fails, e.g. killall -9 groupd,
    the "reason" in the cpg confchg is LEAVE, instead of PROCDOWN.
    This means groupd processes the removal of the node as a
    clean shutdown instead of recovering it as a failure.
    
    Work around this by detecting when a node is removed from the
    groupd cpg but is still a member of other cpg's for fenced/dlm/gfs,
    and treat the "leave" as a "procdown" in that case.
    
    This workaround probably obviates the workaround for bz 521817,
    "groupd: clean up leaving failed node" but leaves it in place.

Comment 2 David Teigland 2010-06-03 20:32:11 UTC
The test I'm using is simply:

on nodes 1-4:
service cman start
service clvmd start
mount gfs

on node1: killall -9 groupd

Without the patch in comment 1, all daemons on node1 exit except for aisexec.
nodes 2-4 show:

[root@z2 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010002 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]
dlm              1     clvmd    00010004 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]
dlm              1     vedder0  00020001 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]
gfs              2     vedder0  00010001 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]

If you next killall -9 aisexec on node1, then nodes 2-4 just remove node1 without any recovery, due to the combination of two things:

1. openais reporting the groupd failure as a LEAVE rather than PROCDOWN, as reported in bug 599654

2. the groupd patch in bug 521817

The simple removal of node1 without recovery is a dangerous result that could lead to gfs corruption.

With the patch in comment 1, when nodes 2-4 see groupd killed on node1, they kill aisexec on node1 and recover everything properly.

Comment 3 David Teigland 2010-06-14 16:05:31 UTC
This patch should be expedited for any releases where bug 521817 has been
fixed.  The fix for bug 521817 on its own is dangerous because it opens the possibility of a failed node not being recovered.

Comment 4 David Teigland 2010-08-05 21:23:44 UTC
With bug 599654 fixed, this workaround is not needed.