Bug 546082

Summary:

groupd stuck by partition merge

Product:

Red Hat Enterprise Linux 5

Reporter:

David Teigland <teigland>

Component:

cman

Assignee:

David Teigland <teigland>

Status:

CLOSED ERRATA

QA Contact:

Cluster QE <mspqa-list>

Severity:

medium

Docs Contact:

Priority:

low

Version:

5.4

CC:

ccaulfie, cluster-maint, djansa, edamato, jkortus

Target Milestone:

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

cman-2.0.115-24.el5.src.rpm

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2010-03-30 08:42:23 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
patch to work around	none

Description David Teigland 2009-12-09 22:45:50 UTC

Description of problem:

I thought that we were handling partition merging adequately in RHEL5; the disallowed feature from cman was a big part of developing that.  I'd really like to go back at some point and see if partition merging worked, and if so what broke it.  It may be that we never properly tested partition merging when working on it in the past, and it never fully worked. I'm now running the tests I used for developing partition merging handling in cluster3.

A simple partition merge test now causes groupd to become stuck.

1. memb=1,2,3
2. memb=1 / memb=2,3 (partition)
3. 2,3 begin fencing 1 due to failure
4. memb=1,2,3 (merge)
5. 2,3 kill 1 due to cman disallowed, or 1 rebooted due to fencing
6. memb=2,3

groupd on 2,3 does not get a cman callback about the disallowed node merged in step 4, or about it failing in step 6.  This was an intentional part of the disallowed state design.  groupd on 2,3 *does* get cpg callbacks about node 1 joining in step 4 and failing in step 6.  groupd waits for cman and cpg to be in sync on the same events before processing a recovery event, so after seeing the cpg node failure in step 6, it waits to see the same node failure from cman, which never arrives.

The only way to resolve the resulting groupd hang on 2,3 is to restart them.
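The mismatch described above can be illustrated with a toy model. This is only a sketch and not groupd code: the class `ToyGroupd`, its methods, and `run_merge_scenario` are hypothetical names, and the "in sync" rule is simplified to "every cpg node failure must be matched by a cman node failure for the same node".

```python
from collections import Counter


class ToyGroupd:
    """Toy stand-in for groupd on nodes 2,3 (hypothetical, for illustration)."""

    def __init__(self):
        self.cpg_fail = Counter()   # node failures reported by cpg
        self.cman_fail = Counter()  # node failures reported by cman

    def cpg_event(self, kind, node):
        if kind == "fail":
            self.cpg_fail[node] += 1

    def cman_event(self, kind, node):
        if kind == "fail":
            self.cman_fail[node] += 1

    def waiting_on(self):
        """Nodes whose cpg failure has no matching cman failure yet."""
        return sorted(n for n, c in self.cpg_fail.items()
                      if c > self.cman_fail[n])


def run_merge_scenario(cman_reports_disallowed):
    """Replay steps 2, 4 and 6 as seen from nodes 2,3."""
    g = ToyGroupd()
    # Step 2: partition. Both layers report node 1 as failed.
    g.cpg_event("fail", 1)
    g.cman_event("fail", 1)
    # Step 4: merge. cpg reports the join; cman masks the disallowed node.
    g.cpg_event("join", 1)
    if cman_reports_disallowed:
        g.cman_event("join", 1)
    # Step 6: node 1 is killed or fenced. Again only cpg reports it.
    g.cpg_event("fail", 1)
    if cman_reports_disallowed:
        g.cman_event("fail", 1)
    return g.waiting_on()


if __name__ == "__main__":
    print(run_merge_scenario(cman_reports_disallowed=False))
    print(run_merge_scenario(cman_reports_disallowed=True))
```

With `cman_reports_disallowed=False`, which mirrors the behaviour in this report, the model is left waiting on node 1 indefinitely. The `True` case is a counterfactual for comparison only; it is not a description of the fix.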

Comment 1 David Teigland 2009-12-09 23:12:24 UTC

Just to set expectations, we're not going to have solid, comprehensive handling
of partition merging in RHEL5.  What we can do is make a "best effort" attempt
at detecting some simple partition merging scenarios and making them
recoverable without a full cluster restart.

Incidentally, one of the goals of the "disallowed" cman feature was to detect
and mask partition merges in cman to make them invisible to the higher levels,
so we could ignore them altogether.  Cman is doing its job in that regard. The
problem in this bug is that the higher levels are still exposed to merges
through the cpg layer, and that's causing groupd confusion.  I'm not sure if
some of the changes to groupd through the releases have changed things so that
the cpg merge events have started to cause problems when they didn't before.
There isn't anything obvious to me looking at the commits.

Comment 2 David Teigland 2009-12-09 23:27:09 UTC

In the example above, nodes 2 and 3 will show group_tool -v output like the following after the merge and the killing of node 1:

fence            0     default  00010002 FAIL_ALL_STOPPED 1 100020003 -1
[1 2 3]

and group_tool dump will show:

1260402364 0:default process_current_event 100020003 1 FAIL_ALL_STOPPED
1260402364 no cman update for recovery_set 1 quorate 1

and if cman kills the merged node before it's fenced, then something like this will appear in /var/log/messages:

openais[3120]: [MAIN ] Killing node node-01 because it has rejoined the cluster with existing state
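The symptom in the group_tool dump output above can be checked for mechanically. The following is a minimal sketch, assuming the dump text has already been captured into a string and that stuck lines have exactly the shape of the sample in this comment. The function name `find_stuck_recovery` and the field names are hypothetical; treating the first number as a timestamp and the number after recovery_set as an identifier for the failed node is an assumption based on the sample, not something this report states.

```python
import re

# Matches lines shaped like the sample from this report:
#   1260402364 no cman update for recovery_set 1 quorate 1
STUCK_RE = re.compile(
    r"^(\d+) no cman update for recovery_set (\d+) quorate (\d+)$"
)

# The two dump lines quoted in this comment.
SAMPLE_DUMP = """\
1260402364 0:default process_current_event 100020003 1 FAIL_ALL_STOPPED
1260402364 no cman update for recovery_set 1 quorate 1
"""


def find_stuck_recovery(dump_text):
    """Return one dict per 'no cman update' line found in the dump text."""
    hits = []
    for line in dump_text.splitlines():
        m = STUCK_RE.match(line.strip())
        if m:
            hits.append({
                "timestamp": int(m.group(1)),     # assumed: first field
                "recovery_set": int(m.group(2)),  # assumed: failed node id
                "quorate": int(m.group(3)),
            })
    return hits


if __name__ == "__main__":
    for hit in find_stuck_recovery(SAMPLE_DUMP):
        print(hit)
```

A single matching line is not proof of a hang, since groupd may log it while legitimately waiting; the condition in this bug is that the matching cman update never arrives.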

Comment 3 David Teigland 2009-12-09 23:28:54 UTC

Created attachment 377342 [details]
patch to work around

This patch seems to resolve the problem in the simple partition merge test case I'm using.

Comment 5 David Teigland 2009-12-15 20:23:06 UTC

pushed to RHEL55 branch

http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=23d5cbe5dfcf20040814a09aafa33faf9f6f66e9

Comment 10 errata-xmlrpc 2010-03-30 08:42:23 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0266.html