Bug 599747 - groupd should work around incorrect cpg confchg reason
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.7
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: David Teigland
QA Contact: Cluster QE
Reported: 2010-06-03 16:16 EDT by David Teigland
Modified: 2010-11-09 08:29 EST
CC: 4 users

Doc Type: Bug Fix
Last Closed: 2010-08-05 17:23:44 EDT

Attachments
patch (3.57 KB, text/plain)
2010-06-03 16:20 EDT, David Teigland

Description David Teigland 2010-06-03 16:16:34 EDT
Description of problem:

When another groupd process fails, e.g. killall -9 groupd,
the "reason" in the cpg confchg is LEAVE, instead of PROCDOWN.
(openais bz 599654)
This means groupd processes the removal of the node as a
clean shutdown instead of recovering it as a failure.

Work around this by detecting when a node is removed from the
groupd cpg while it is still a member of the other cpgs for fenced/dlm/gfs,
and treating the "leave" as a "procdown" in that case.

This workaround probably obviates the workaround for bz 521817,
but that earlier workaround is left in place.
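
In cpg terms, the workaround amounts to rewriting the confchg "reason"
before acting on it. Below is a minimal sketch of that idea against the
openais cpg API (whitetank-era signatures); node_still_in_subsystem_cpgs(),
recover_failed_node(), and remove_clean_node() are hypothetical stand-ins
for groupd's internal bookkeeping, not names from the attached patch.

    #include <openais/cpg.h>

    /* hypothetical helpers standing in for groupd's own state tracking */
    extern int node_still_in_subsystem_cpgs(int nodeid); /* still in any
                                                            fenced/dlm/gfs cpg? */
    extern void recover_failed_node(int nodeid);
    extern void remove_clean_node(int nodeid);

    static void groupd_confchg_cb(cpg_handle_t handle,
                    struct cpg_name *group_name,
                    struct cpg_address *member_list, int member_list_entries,
                    struct cpg_address *left_list, int left_list_entries,
                    struct cpg_address *joined_list, int joined_list_entries)
    {
        int i, reason;

        for (i = 0; i < left_list_entries; i++) {
            reason = left_list[i].reason;

            /* bz 599654: a killed groupd is reported as a clean LEAVE.
               If the node is still in the per-group cpgs, its groupd
               must have died, so treat the leave as a process failure. */
            if (reason == CPG_REASON_LEAVE &&
                node_still_in_subsystem_cpgs(left_list[i].nodeid))
                reason = CPG_REASON_PROCDOWN;

            if (reason == CPG_REASON_PROCDOWN ||
                reason == CPG_REASON_NODEDOWN)
                recover_failed_node(left_list[i].nodeid);
            else
                remove_clean_node(left_list[i].nodeid);
        }
    }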

Comment 1 David Teigland 2010-06-03 16:20:36 EDT
Created attachment 419564
patch

    groupd: recover for groupd failure
    
    bz 599747
    
    When another groupd process fails, e.g. killall -9 groupd,
    the "reason" in the cpg confchg is LEAVE, instead of PROCDOWN.
    This means groupd processes the removal of the node as a
    clean shutdown instead of recovering it as a failure.
    
    Work around this by detecting when a node is removed from the
    groupd cpg while it is still a member of the other cpgs for fenced/dlm/gfs,
    and treating the "leave" as a "procdown" in that case.
    
    This workaround probably obviates the workaround for bz 521817,
    "groupd: clean up leaving failed node", but leaves that workaround in place.
Comment 2 David Teigland 2010-06-03 16:32:11 EDT
The test I'm using is simply:

on nodes 1-4:
service cman start
service clvmd start
mount gfs

on node1: killall -9 groupd

Without the patch in comment 1, all daemons on node1 exit except for aisexec.
Nodes 2-4 show:

[root@z2 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010002 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]
dlm              1     clvmd    00010004 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]
dlm              1     vedder0  00020001 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]
gfs              2     vedder0  00010001 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]

If you next run killall -9 aisexec on node1, nodes 2-4 simply remove node1 without any recovery, due to the combination of two things:

1. openais reporting the groupd failure as a LEAVE rather than a PROCDOWN, as reported in bug 599654

2. the groupd patch in bug 521817

The simple removal of node1 without recovery is a dangerous result that could lead to gfs corruption.

With the patch from comment 1, when nodes 2-4 see groupd killed on node1, they kill aisexec on node1 and recover everything properly.
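
That "kill aisexec on node1" step goes through cman rather than a remote
shell: a surviving groupd asks the local cman to terminate the stale
member, which forces normal nodedown recovery (fencing, dlm/gfs recovery)
on the survivors. A minimal sketch, assuming the RHEL 5 libcman API;
kill_stale_member() is a hypothetical wrapper and error handling is
elided:

    #include <libcman.h>

    /* hypothetical wrapper: ask the local cman to kill a stale member
       whose groupd has died */
    static void kill_stale_member(int nodeid)
    {
        cman_handle_t ch;

        ch = cman_init(NULL);       /* connect to the local cman */
        if (!ch)
            return;
        cman_kill_node(ch, nodeid); /* cman tells nodeid's aisexec to exit */
        cman_finish(ch);
    }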
Comment 3 David Teigland 2010-06-14 12:05:31 EDT
This patch should be expedited for any releases where bug 521817 has been
fixed.  The fix for bug 521817 on its own is dangerous because it opens the possibility of a failed node not being recovered.
Comment 4 David Teigland 2010-08-05 17:23:44 EDT
With bug 599654 fixed, this workaround is not needed.
