Bug 436984 - groupd processes message from dead node
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.2
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Assigned To: David Teigland
QA Contact: GFS Bugs
Keywords: TestBlocker

Reported: 2008-03-11 10:42 EDT by David Teigland
Modified: 2009-04-16 19:03 EDT
CC List: 4 users

Fixed In Version: RHBA-2008-0347
Doc Type: Bug Fix
Last Closed: 2008-05-21 11:58:54 EDT

Attachments: None
Description David Teigland 2008-03-11 10:42:08 EDT
Description of problem:

In this fix for bug 258121:
http://sources.redhat.com/git/?p=cluster.git;a=commitdiff;h=70294dd8b717de89f2d168c0837c011648908558

we began taking nodedown events via the groupd cpg instead of via the
per-group cpg, while messages still come in via the per-group cpg.  I
believe this opened the possibility of processing a message from a node
after the nodedown for that node has already been processed.
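
In other words, the nodedown is observed on one cpg while the dead node's
last messages can still arrive on another, so the per-group deliver path
needs to check the sender against the nodes already marked down.  The
following is a minimal, self-contained sketch of that kind of guard; the
names and structures (note_nodedown, per_group_deliver, the dead_nodes
array) are hypothetical and only illustrate the idea, they are not the
actual groupd code.

/* Hypothetical sketch: filter per-group cpg deliveries against nodes
 * already marked down via the groupd cpg.  Illustration only. */

#include <stdio.h>

#define MAX_NODES 16

static int dead_nodes[MAX_NODES];
static int dead_count;

/* called from the groupd-cpg nodedown/confchg path */
static void note_nodedown(int nodeid)
{
    if (dead_count < MAX_NODES)
        dead_nodes[dead_count++] = nodeid;
}

static int node_is_dead(int nodeid)
{
    int i;
    for (i = 0; i < dead_count; i++)
        if (dead_nodes[i] == nodeid)
            return 1;
    return 0;
}

/* called from the per-group cpg deliver path */
static void per_group_deliver(int nodeid, const char *msg)
{
    if (node_is_dead(nodeid)) {
        /* message sent before the sender died but delivered after
         * its nodedown was processed; drop it instead of acting on it */
        printf("ignoring \"%s\" from dead node %d\n", msg, nodeid);
        return;
    }
    printf("processing \"%s\" from node %d\n", msg, nodeid);
}

int main(void)
{
    note_nodedown(1);
    note_nodedown(2);
    note_nodedown(3);
    per_group_deliver(2, "started");   /* stale message, dropped */
    per_group_deliver(4, "stopped");   /* survivor, processed */
    return 0;
}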

In Nate's revolver test, we saw it happen; revolver killed nodes 1,2,3,
leaving just node 4:

1205198713 cman: lost quorum
1205198713 cman: node 1 removed
1205198713 add_recovery_set_cman nodeid 1
1205198713 cman: node 2 removed
1205198713 add_recovery_set_cman nodeid 2
1205198713 cman: node 3 removed
1205198713 add_recovery_set_cman nodeid 3
1205198713 groupd confchg total 1 left 3 joined 0
...
1205198713 0:default process_node_down 1
1205198713 0:default cpg del node 1 total 3 - down
1205198713 0:default make_event_id 100030003 nodeid 1 memb_count 3 type 3
1205198713 0:default queue recover event for nodeid 1
...
1205198713 0:default process_node_down 2
1205198713 0:default cpg del node 2 total 2 - down
1205198713 0:default extend_recover_event for 1 with node 2
...
1205198713 0:default process_node_down 3
1205198713 0:default cpg del node 3 total 1 - down
1205198713 0:default extend_recover_event for 1 with node 3
...
1205198713 0:default confchg left 3 joined 0 total 1
1205198713 0:default confchg removed node 1 reason 3
1205198713 0:default confchg removed node 2 reason 3
1205198713 0:default confchg removed node 3 reason 3
...
1205198713 0:default set current event to recovery for 1
1205198713 0:default process_current_event 100030003 1 FAIL_BEGIN
1205198713 0:default action for app: stop default
...
1205198713 0:default mark_node_started: event not starting 12 from 2
1205198713 0:default mark node 2 started
1205198713 0:default waiting for 2 more stopped messages before FAIL_ALL_STOPPED 1

So, after the nodedown callbacks for nodes 1, 2 and 3, groupd gets a "started"
message for the 0:default group (the default fence domain) from node 2.  It
logs the "event not starting" error, but still processes the started message,
which prevents the group from ever reaching the proper state (a sketch of how
the recovery event should count such messages follows the output below).
group_tool -v shows:

fence            0     default         00010001 FAIL_STOP_WAIT 1 100030003 1
[1 2 3 4]
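
To make the stuck state concrete: the recovery event is waiting for "stopped"
messages, and only messages from the group's current members should advance
that count; a stale "started" or "stopped" from a node whose nodedown has
already been processed must be dropped.  Below is a small self-contained
sketch of that bookkeeping under those assumptions; struct recovery_event,
handle_stopped and the other names are invented for illustration and do not
mirror the real groupd event code.

/* Hypothetical sketch of the "wait for all stopped" step in a recovery
 * event.  Illustration only, not the real groupd structures. */

#include <stdio.h>

#define MAX_MEMBERS 16

struct recovery_event {
    int members[MAX_MEMBERS];   /* current (surviving) members */
    int member_count;
    int stopped[MAX_MEMBERS];   /* 1 once that member reported stopped */
};

static int member_index(struct recovery_event *ev, int nodeid)
{
    int i;
    for (i = 0; i < ev->member_count; i++)
        if (ev->members[i] == nodeid)
            return i;
    return -1;
}

/* count a "stopped" message only if it came from a current member */
static void handle_stopped(struct recovery_event *ev, int nodeid)
{
    int i = member_index(ev, nodeid);
    if (i < 0) {
        printf("ignoring stopped from non-member %d\n", nodeid);
        return;
    }
    ev->stopped[i] = 1;
}

static int all_stopped(struct recovery_event *ev)
{
    int i;
    for (i = 0; i < ev->member_count; i++)
        if (!ev->stopped[i])
            return 0;
    return 1;
}

int main(void)
{
    /* nodes 1-3 were killed; node 4 is the only current member */
    struct recovery_event ev = { .members = { 4 }, .member_count = 1 };

    handle_stopped(&ev, 2);   /* stale message from a dead node: ignored */
    handle_stopped(&ev, 4);   /* the real survivor reports stopped */

    printf("FAIL_ALL_STOPPED reachable: %s\n", all_stopped(&ev) ? "yes" : "no");
    return 0;
}

With a membership check like member_index() in place, the stale message from
node 2 is ignored and the event can still reach FAIL_ALL_STOPPED once node 4
reports stopped, instead of waiting forever in FAIL_STOP_WAIT.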


Comment 6 errata-xmlrpc 2008-05-21 11:58:54 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0347.html
