Bug 599747 - groupd should work around incorrect cpg confchg reason
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.7
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: David Teigland
QA Contact: Cluster QE
Reported: 2010-06-03 20:16 UTC by David Teigland
Modified: 2010-11-09 13:29 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-08-05 21:23:44 UTC
Target Upstream Version:
Embargoed:


Attachments
patch (3.57 KB, text/plain)
2010-06-03 20:20 UTC, David Teigland

Description David Teigland 2010-06-03 20:16:34 UTC
Description of problem:

When another groupd process fails, e.g. killall -9 groupd,
the "reason" in the cpg confchg is LEAVE, instead of PROCDOWN.
(openais bz 599654)
This means groupd processes the removal of the node as a
clean shutdown instead of recovering it as a failure.

Work around this by detecting when a node is removed from the
groupd cpg but is still a member of other cpg's for fenced/dlm/gfs,
and treat the "leave" as a "procdown" in that case.

This workaround probably obviates the workaround for bz 521817,
but the patch leaves that earlier workaround in place.
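
The check described above can be modelled roughly as follows. This is only an illustrative sketch in Python, not the code from the attached patch (groupd itself is written in C); the names `Reason`, `effective_reason` and `other_cpg_members` are hypothetical and exist only for this example.

```python
from enum import Enum


class Reason(Enum):
    """Simplified stand-ins for the cpg confchg reasons discussed in this bug."""
    LEAVE = "leave"        # clean shutdown
    PROCDOWN = "procdown"  # process failure


def effective_reason(nodeid, reported, other_cpg_members):
    """Return the reason to act on for a node removed from the groupd cpg.

    other_cpg_members is a hypothetical mapping from a cpg name
    (e.g. "fenced", "dlm", "gfs") to the set of node ids still in that cpg.
    """
    if reported is not Reason.LEAVE:
        # Only a reported LEAVE is suspect; pass anything else through.
        return reported
    still_member = any(nodeid in members
                       for members in other_cpg_members.values())
    # A node that left the groupd cpg but is still in another cpg is
    # treated as failed rather than cleanly shut down.
    return Reason.PROCDOWN if still_member else Reason.LEAVE


# Node 1 is reported as LEAVE but is still in the other cpgs.
cpgs = {"fenced": {1, 2, 3, 4}, "dlm": {1, 2, 3, 4}, "gfs": {1, 2, 3, 4}}
result = effective_reason(1, Reason.LEAVE, cpgs)
```

Here `result` is `Reason.PROCDOWN`, so the removal would be handled as a failure. If node 1 were absent from every other cpg, the reported LEAVE would be kept.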


Comment 1 David Teigland 2010-06-03 20:20:36 UTC
Created attachment 419564
patch

    groupd: recover for groupd failure
    
    bz 599747
    
    When another groupd process fails, e.g. killall -9 groupd,
    the "reason" in the cpg confchg is LEAVE, instead of PROCDOWN.
    This means groupd processes the removal of the node as a
    clean shutdown instead of recovering it as a failure.
    
    Work around this by detecting when a node is removed from the
    groupd cpg but is still a member of other cpg's for fenced/dlm/gfs,
    and treat the "leave" as a "procdown" in that case.
    
    This workaround probably obviates the workaround for bz 521817,
    "groupd: clean up leaving failed node" but leaves it in place.

Comment 2 David Teigland 2010-06-03 20:32:11 UTC
The test I'm using is simply:

on nodes 1-4:
service cman start
service clvmd start
mount gfs

on node1: killall -9 groupd

Without the patch in comment 1, all daemons on node1 exit except for aisexec.
nodes 2-4 show:

[root@z2 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010002 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]
dlm              1     clvmd    00010004 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]
dlm              1     vedder0  00020001 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]
gfs              2     vedder0  00010001 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]
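
For anyone repeating this test, a small script can flag groups stuck in LEAVE_STOP_WAIT. The sketch below parses text shaped like the output above; the field positions are inferred from this one sample only, and the helper names `parse_group_tool` and `stuck_groups` are hypothetical.

```python
import re

# A member list line looks like "[1 2 3 4]".
MEMBERS = re.compile(r"\[([\d\s]*)\]")

SAMPLE = """\
[root@z2 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010002 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]
dlm              1     clvmd    00010004 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]
dlm              1     vedder0  00020001 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]
gfs              2     vedder0  00010001 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]
"""


def parse_group_tool(output):
    """Parse group lines and their member lists into a list of dicts."""
    groups = []
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = MEMBERS.fullmatch(line)
        if match:
            # Member list belongs to the group line just before it.
            if groups:
                groups[-1]["members"] = [int(n) for n in match.group(1).split()]
            continue
        fields = line.split()
        if fields[0] == "type" or line.startswith("["):
            continue  # header line or shell prompt
        if len(fields) >= 5:
            groups.append({"type": fields[0], "level": int(fields[1]),
                           "name": fields[2], "id": fields[3],
                           "state": fields[4], "members": []})
    return groups


def stuck_groups(groups, state="LEAVE_STOP_WAIT"):
    """Return the groups currently in the given state."""
    return [g for g in groups if g["state"] == state]


groups = parse_group_tool(SAMPLE)
stuck = [(g["type"], g["name"]) for g in stuck_groups(groups)]
```

With the sample above, `stuck` lists all four groups, and each still has node 1 in its member list.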

If you next killall -9 aisexec on node1, then nodes 2-4 just remove node1 without any recovery, due to the combination of two things:

1. openais reporting the groupd failure as a LEAVE rather than a PROCDOWN, as reported in bug 599654

2. the groupd patch in bug 521817

The simple removal of node1 without recovery is a dangerous result that could lead to gfs corruption.

Using the patch in comment 1:  when nodes 2-4 see groupd killed on node1, they will kill aisexec on node1 and recover everything properly.

Comment 3 David Teigland 2010-06-14 16:05:31 UTC
This patch should be expedited for any releases where bug 521817 has been
fixed.  The fix for bug 521817 on its own is dangerous because it opens the possibility of a failed node not being recovered.

Comment 4 David Teigland 2010-08-05 21:23:44 UTC
With bug 599654 fixed, this workaround is not needed.

