Red Hat Bugzilla – Bug 208954
groupd doesn't support mixed recovery with join/leave events
Last modified: 2009-04-16 18:49:26 EDT
Description of problem:
The groupd daemon does not currently support mixing
recovery with join and leave events.
Version-Release number of selected component (if applicable):
RHEL5 Beta 1 plus cluster development code from 01 Oct 2006.
How reproducible:
Difficult to recreate, but we've seen it occasionally with the
'revolver' test from QE.
Steps to Reproduce:
After running revolver without gfs for several hours, a system hit:
Assertion failed on line 218 of file app.c
This means groupd got an empty recovery set in "add_recovery_set"
for nodeid 2, which is system 'merit'.
Further analysis by Dave Teigland found this:
1159570695 0:default process_node_join 2
1159570695 0:default cpg add node 2 total 3
1159570695 0:default make_event_id 200030001 nodeid 2 memb_count 3 type 1
1159570695 0:default queue join event for nodeid 2
1159570696 0:default confchg left 0 joined 1 total 4
1159570696 0:default process_node_join 3
1159570696 0:default cpg add node 3 total 4
1159570696 0:default queue_app_join: current event 3 300030003 FAIL_START_WAIT
1159570696 0:default make_event_id 300040001 nodeid 3 memb_count 4 type 1
1159570696 0:default queue join event for nodeid 3
1159570696 0:default queued ev 2 200030001 JOIN_BEGIN
1159570700 0:default confchg left 1 joined 0 total 3
1159570700 0:default confchg removed node 2 reason 3
1159570700 0:default process_node_down 2
This shows that node 2 joined and then failed about five seconds later
(1159570695 to 1159570700 in the log), while groupd was still processing
the joins for it and for node 3.
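To check the timeline in a longer groupd debug log, a small script along these lines may help. It is only a sketch, not part of groupd: the function names join_to_down_seconds and decode_event_id are made up for illustration, the line format is assumed to match the excerpt above, and the event id layout (nodeid << 32 | memb_count << 16 | type, printed in hex) is inferred from the two make_event_id lines in this log rather than taken from the groupd source. The gap it reports is between the logged process_node_join and process_node_down entries, i.e. when groupd processed them, which is not necessarily when the node itself failed.

```python
import re

# Excerpt of the groupd debug log quoted in this bug.
LOG = """\
1159570695 0:default process_node_join 2
1159570695 0:default make_event_id 200030001 nodeid 2 memb_count 3 type 1
1159570696 0:default process_node_join 3
1159570696 0:default make_event_id 300040001 nodeid 3 memb_count 4 type 1
1159570700 0:default process_node_down 2
"""

# "<timestamp> <group> process_node_join|process_node_down <nodeid>"
NODE_EVENT = re.compile(r"^(\d+) \S+ process_node_(join|down) (\d+)$")


def join_to_down_seconds(log):
    """Map nodeid -> seconds between its first join and its first later down."""
    joined = {}
    gaps = {}
    for line in log.splitlines():
        m = NODE_EVENT.match(line)
        if not m:
            continue  # skip confchg, make_event_id and other lines
        ts, kind, nodeid = int(m.group(1)), m.group(2), int(m.group(3))
        if kind == "join":
            joined.setdefault(nodeid, ts)
        elif nodeid in joined and nodeid not in gaps:
            gaps[nodeid] = ts - joined[nodeid]
    return gaps


def decode_event_id(hex_id):
    """Split an event id into (nodeid, memb_count, type).

    Assumed layout, inferred from the log only:
    nodeid << 32 | memb_count << 16 | type.
    """
    value = int(hex_id, 16)
    return value >> 32, (value >> 16) & 0xFFFF, value & 0xFFFF


if __name__ == "__main__":
    for nodeid, gap in join_to_down_seconds(LOG).items():
        print(f"node {nodeid}: down {gap}s after join was processed")
    print(decode_event_id("200030001"))
```

A short gap for a node that has a join event still queued (JOIN_BEGIN above) is the pattern to look for when hunting for other occurrences of this failure.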
Devel ACK for RHEL 5.0.0 Beta 2
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux release. Product Management has requested further review
of this request by Red Hat Engineering. This request is not yet committed for
inclusion in a release.
QE ack for RHEL5.
The changes I've been working on in this area are tested and
checked in now. Recoveries mixed with joins do work in some
scenarios now. There will be more work to do here.
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5, which never existed.