Description of problem:
The groupd daemon does not currently support mixing recovery with join and leave events.

Version-Release number of selected component (if applicable):
RHEL5 Beta 1 plus cluster development code from 01 Oct 2006.

How reproducible:
Difficult to recreate, but we've seen it occasionally with the 'revolver' test from QE.

Steps to Reproduce:

Actual results:

Expected results:

Additional info:
Additional information:
After running revolver without gfs for several hours, system 'camel' reported:

Assertion failed on line 218 of file app.c

which means it got an empty recovery set for "add_recovery_set" associated with nodeid 2, which is system 'merit'. Further analysis by Dave Teigland found this:

1159570695 0:default process_node_join 2
1159570695 0:default cpg add node 2 total 3
1159570695 0:default make_event_id 200030001 nodeid 2 memb_count 3 type 1
1159570695 0:default queue join event for nodeid 2
1159570696 0:default confchg left 0 joined 1 total 4
1159570696 0:default process_node_join 3
1159570696 0:default cpg add node 3 total 4
1159570696 0:default queue_app_join: current event 3 300030003 FAIL_START_WAIT
1159570696 0:default make_event_id 300040001 nodeid 3 memb_count 4 type 1
1159570696 0:default queue join event for nodeid 3
1159570696 0:default queued ev 2 200030001 JOIN_BEGIN
1159570700 0:default confchg left 1 joined 0 total 3
1159570700 0:default confchg removed node 2 reason 3
1159570700 0:default process_node_down 2

This shows that node 2 joined and then failed, four seconds later, while groupd was still processing the joins for it and others.
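For anyone sifting through longer groupd debug logs for the same pattern, the check above can be sketched as a small script. This is only an illustration and not part of groupd. The function name find_failed_joiners is hypothetical, and the line format is inferred solely from the excerpt above. The excerpt contains no message marking a join as finished, so the sketch assumes every queued join is still pending when the node goes down.

```python
import re

# Log lines copied from the excerpt in this report.
LOG = """\
1159570695 0:default process_node_join 2
1159570695 0:default cpg add node 2 total 3
1159570695 0:default make_event_id 200030001 nodeid 2 memb_count 3 type 1
1159570695 0:default queue join event for nodeid 2
1159570696 0:default confchg left 0 joined 1 total 4
1159570696 0:default process_node_join 3
1159570696 0:default cpg add node 3 total 4
1159570696 0:default queue_app_join: current event 3 300030003 FAIL_START_WAIT
1159570696 0:default make_event_id 300040001 nodeid 3 memb_count 4 type 1
1159570696 0:default queue join event for nodeid 3
1159570696 0:default queued ev 2 200030001 JOIN_BEGIN
1159570700 0:default confchg left 1 joined 0 total 3
1159570700 0:default confchg removed node 2 reason 3
1159570700 0:default process_node_down 2
"""

# Patterns inferred from the excerpt only (assumption): timestamp, group tag, message.
JOIN_RE = re.compile(r"^(\d+) \S+ queue join event for nodeid (\d+)$")
DOWN_RE = re.compile(r"^(\d+) \S+ process_node_down (\d+)$")


def find_failed_joiners(log_text):
    """Return (nodeid, join_queued_time, down_time) for each node that went
    down after its join event was queued.

    Assumption: a queued join is treated as still pending, because the
    excerpt shows no message that marks a join as complete.
    """
    pending = {}  # nodeid -> timestamp at which its join event was queued
    hits = []
    for raw in log_text.splitlines():
        line = raw.strip()
        m = JOIN_RE.match(line)
        if m:
            pending[int(m.group(2))] = int(m.group(1))
            continue
        m = DOWN_RE.match(line)
        if m:
            nodeid = int(m.group(2))
            if nodeid in pending:
                hits.append((nodeid, pending.pop(nodeid), int(m.group(1))))
    return hits


if __name__ == "__main__":
    for nodeid, queued_at, down_at in find_failed_joiners(LOG):
        print(f"node {nodeid}: join queued at {queued_at}, node down at {down_at}")
```

Run against the excerpt, this flags only node 2. Node 3 also has a queued join but no process_node_down line, so it is not reported.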
Devel ACK for RHEL 5.0.0 Beta 2
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering. This request is not yet committed for inclusion in a release.
QE ack for RHEL5.
The changes I've been working on in this area are tested and checked in now. Recoveries mixed with joins do work in some scenarios now. There will be more work to do here.
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5, which never existed.