Bug 208954 - groupd doesn't support mixed recovery with join/leave events
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: David Teigland
QA Contact: Cluster QE
 
Reported: 2006-10-02 19:18 UTC by Robert Peterson
Modified: 2009-04-16 22:49 UTC (History)

Fixed In Version: 5.0.0
Doc Type: Bug Fix
Last Closed: 2006-11-28 21:11:18 UTC



Description Robert Peterson 2006-10-02 19:18:32 UTC
Description of problem:
The groupd daemon does not currently support mixing
recovery with join and leave events.

Version-Release number of selected component (if applicable):
RHEL5 Beta 1 plus cluster development code from 01 Oct 2006.

How reproducible:
Difficult to recreate, but we've seen it occasionally with the
'revolver' test from QE.

Steps to Reproduce:
  
Actual results:

Expected results:


Additional info:

Comment 1 Robert Peterson 2006-10-02 19:28:31 UTC
Additional information:
After running revolver without gfs for several hours, system
'camel' reported:
Assertion failed on line 218 of file app.c
This means that add_recovery_set got an empty recovery set for
nodeid 2, which is system 'merit'.

Further analysis by Dave Teigland found this:

1159570695 0:default process_node_join 2
1159570695 0:default cpg add node 2 total 3
1159570695 0:default make_event_id 200030001 nodeid 2 memb_count 3 type 1
1159570695 0:default queue join event for nodeid 2
1159570696 0:default confchg left 0 joined 1 total 4
1159570696 0:default process_node_join 3
1159570696 0:default cpg add node 3 total 4
1159570696 0:default queue_app_join: current event 3 300030003 FAIL_START_WAIT
1159570696 0:default make_event_id 300040001 nodeid 3 memb_count 4 type 1
1159570696 0:default queue join event for nodeid 3
1159570696 0:default     queued ev 2 200030001 JOIN_BEGIN
1159570700 0:default confchg left 1 joined 0 total 3
1159570700 0:default confchg removed node 2 reason 3
1159570700 0:default process_node_down 2

This shows that node 2 joined and then failed, four seconds later, while
groupd was still processing the joins for it and others. 
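
The event ids in the log excerpt above appear to pack the values printed
beside them. A minimal sketch for checking that, assuming the id is printed
in hexadecimal with nodeid in the upper bits, memb_count in the next 16 bits
and type in the low 16 bits. This layout is inferred from the logged values
only, not taken from the groupd source, and decode_event_id and check_log
are hypothetical helper names:

```python
import re

# ASSUMPTION: field layout inferred from the log excerpt in this report,
# not from the groupd source. The id is treated as a hexadecimal string.
def decode_event_id(hex_id):
    """Split an event id into its (assumed) nodeid / memb_count / type fields."""
    value = int(hex_id, 16)
    return {
        "nodeid": value >> 32,
        "memb_count": (value >> 16) & 0xFFFF,
        "type": value & 0xFFFF,
    }

# Lines copied from the log excerpt in Comment 1.
LOG = """\
1159570695 0:default make_event_id 200030001 nodeid 2 memb_count 3 type 1
1159570696 0:default queue_app_join: current event 3 300030003 FAIL_START_WAIT
1159570696 0:default make_event_id 300040001 nodeid 3 memb_count 4 type 1
"""

MAKE_ID = re.compile(
    r"make_event_id ([0-9a-f]+) nodeid (\d+) memb_count (\d+) type (\d+)"
)

def check_log(text):
    """Compare each decoded id with the fields logged on the same line."""
    results = []
    for line in text.splitlines():
        m = MAKE_ID.search(line)
        if m is None:
            continue  # not a make_event_id line
        decoded = decode_event_id(m.group(1))
        logged = {
            "nodeid": int(m.group(2)),
            "memb_count": int(m.group(3)),
            "type": int(m.group(4)),
        }
        results.append((m.group(1), decoded, decoded == logged))
    return results

if __name__ == "__main__":
    for hex_id, decoded, ok in check_log(LOG):
        print(hex_id, decoded, "matches log" if ok else "MISMATCH")
```

Under the same assumption, the id 300030003 that is logged with
FAIL_START_WAIT decodes to nodeid 3, memb_count 3, type 3. The mapping of
type numbers to event kinds is not stated in this report.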


Comment 2 Kiersten (Kerri) Anderson 2006-10-03 16:57:44 UTC
Devel ACK for RHEL 5.0.0 Beta 2

Comment 3 RHEL Program Management 2006-10-03 17:03:21 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux release.  Product Management has requested further review
of this request by Red Hat Engineering.  This request is not yet committed for
inclusion in release.

Comment 4 Jay Turner 2006-10-10 19:48:28 UTC
QE ack for RHEL5.

Comment 5 David Teigland 2006-10-12 14:32:47 UTC
The changes I've been working on in this area are tested and
checked in now.  Recoveries mixed with joins do work in some
scenarios now.  There will be more work to do here.


Comment 7 Nate Straz 2007-12-13 17:22:22 UTC
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5, which never existed.

