Bug 599747

Summary: groupd should work around incorrect cpg confchg reason
Product: Red Hat Enterprise Linux 5
Component: cman
Version: 5.7
Reporter: David Teigland <teigland>
Assignee: David Teigland <teigland>
Status: CLOSED NOTABUG
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, djansa, edamato, jkortus
Severity: medium
Priority: low
Target Milestone: rc
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-08-05 21:23:44 UTC

Attachments: patch

Description David Teigland 2010-06-03 20:16:34 UTC
Description of problem:

When another groupd process fails, e.g. killall -9 groupd,
the "reason" in the cpg confchg is LEAVE, instead of PROCDOWN.
(openais bz 599654)
This means groupd processes the removal of the node as a
clean shutdown instead of recovering it as a failure.

Work around this by detecting when a node is removed from the
groupd cpg but is still a member of other cpg's for fenced/dlm/gfs,
and treat the "leave" as a "procdown" in that case.

This workaround probably obviates the workaround for bz 521817,
but leaves it in place.


Comment 1 David Teigland 2010-06-03 20:20:36 UTC
Created attachment 419564 [details]
patch

    groupd: recover for groupd failure
    
    bz 599747
    
    When another groupd process fails, e.g. killall -9 groupd,
    the "reason" in the cpg confchg is LEAVE, instead of PROCDOWN.
    This means groupd processes the removal of the node as a
    clean shutdown instead of recovering it as a failure.
    
    Work around this by detecting when a node is removed from the
    groupd cpg but is still a member of other cpg's for fenced/dlm/gfs,
    and treat the "leave" as a "procdown" in that case.
    
    This workaround probably obviates the workaround for bz 521817,
    "groupd: clean up leaving failed node" but leaves it in place.

Comment 2 David Teigland 2010-06-03 20:32:11 UTC
The test I'm using is simply:

on nodes 1-4:
service cman start
service clvmd start
mount gfs

on node1: killall -9 groupd

Without the patch in comment 1, all daemons on node1 exit except for aisexec.
nodes 2-4 show:

[root@z2 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010002 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]
dlm              1     clvmd    00010004 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]
dlm              1     vedder0  00020001 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]
gfs              2     vedder0  00010001 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]

If you next killall -9 aisexec on node1, then nodes 2-4 just remove node1 without any recovery, due to the combination of two things:

1. openais reporting the groupd failure as a LEAVE rather than PROCDOWN, as reported in bug 599654

2. the groupd patch in bug 521817

The simple removal of node1 without recovery is a dangerous result that could lead to gfs corruption.

With the patch in comment 1, when nodes 2-4 see groupd killed on node1, they kill aisexec on node1 and recover everything properly.

Comment 3 David Teigland 2010-06-14 16:05:31 UTC
This patch should be expedited for any releases where bug 521817 has been
fixed.  The fix for bug 521817 on its own is dangerous because it opens the possibility of a failed node not being recovered.

Comment 4 David Teigland 2010-08-05 21:23:44 UTC
With bug 599654 fixed, this workaround is not needed.