Bug 218103 - groupd doesn't handle nodes that fail again before first recovery is done
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2006-12-01 20:06 UTC by David Teigland
Modified: 2009-04-16 22:49 UTC

Fixed In Version: RC
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-02-08 01:14:17 UTC
Target Upstream Version:
Embargoed:



Description David Teigland 2006-12-01 20:06:54 UTC
Description of problem:

The management of "recovery sets" in groupd isn't smart enough to handle
at least a couple of situations where a node fails, rejoins, and fails
again before recovery for the first failure is complete.

Example 1, observed once on the smoke cluster:
salem killed by revolver
fenced fences salem
winston killed by revolver
fenced fences winston
merit killed by revolver
quorum lost, fencing delayed
winston rejoins cluster
quorum regained, fenced continues
fenced begins fencing merit
merit rejoins cluster
fence_apc against merit fails (reason unknown)
fence_apc against merit succeeds (on retry)
merit down, killed by fencing
fenced reports success against merit 

Recovery for merit is initiated while the earlier recovery for merit is
still in progress.

Example 2:
- The groupd process (with no current groups) is killed. This causes a
recovery set to be created (on another node) from the cpg callback, but
since the node didn't die there will never be a cman nodedown callback
to clear away the recovery set.
- groupd is restarted.
- The groupd process is killed again. This tries to create a recovery
set for the node a second time from the cpg callback, which triggers an
assertion failure.
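
A minimal sketch of the safer behavior, assuming a fixed-size table of
recovery sets (the names and layout here are hypothetical, not groupd's
actual data structures): when the same node fails again before its first
recovery finishes, reuse the existing set instead of asserting on the
duplicate.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 16

/* Hypothetical stand-in for groupd's per-node recovery state. */
struct recovery_set {
	int nodeid;
	int in_use;
};

static struct recovery_set sets[MAX_NODES];

static struct recovery_set *find_set(int nodeid)
{
	for (int i = 0; i < MAX_NODES; i++)
		if (sets[i].in_use && sets[i].nodeid == nodeid)
			return &sets[i];
	return NULL;
}

/* Create a recovery set for nodeid, or return the existing one if the
 * node failed again before the first recovery finished.  The buggy
 * behavior described above is equivalent to asserting that no set
 * exists yet at this point. */
struct recovery_set *get_recovery_set(int nodeid)
{
	struct recovery_set *rs = find_set(nodeid);
	if (rs)
		return rs;	/* second failure: continue existing recovery */
	for (int i = 0; i < MAX_NODES; i++) {
		if (!sets[i].in_use) {
			sets[i].in_use = 1;
			sets[i].nodeid = nodeid;
			return &sets[i];
		}
	}
	return NULL;		/* table full */
}
```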

We need groupd to distinguish between groupd on a remote node failing
because the node went down and the groupd process merely exiting.  If
the process exits and the node wasn't in any groups, we don't care
about it and don't want a recovery set; if the node _was_ in any
groups, we need to kill the node via cman_kill_node() so we'll get a
proper cman nodedown callback and can go through recovery for it.
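
The decision described above can be sketched as a small pure function.
This is an illustration only; the function and enum names are
hypothetical, and in real groupd the inputs would come from cman
membership state and the group lists at the time of the cpg confchg
callback.

```c
#include <stdbool.h>

/* What to do when groupd on a remote node leaves the cpg. */
enum leave_action {
	IGNORE_EXIT,		/* process exit, node not in any groups */
	KILL_NODE,		/* call cman_kill_node() to force a nodedown */
	NODEDOWN_RECOVERY	/* real nodedown: cman callback drives recovery */
};

/* node_is_member: cman still reports the node as a cluster member,
 * i.e. only the groupd process exited while the node stayed up.
 * in_groups: the node was a member of one or more groups. */
enum leave_action on_groupd_leave(bool node_is_member, bool in_groups)
{
	if (!node_is_member)
		return NODEDOWN_RECOVERY;
	if (!in_groups)
		return IGNORE_EXIT;
	return KILL_NODE;
}
```

The point of KILL_NODE is that forcing a real nodedown lets the normal
cman nodedown path create and later clear the recovery set, instead of
leaving a stale set behind as in Example 2.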


Comment 2 RHEL Program Management 2007-02-08 01:14:17 UTC
A package has been built which should help the problem described in 
this bug report. This report is therefore being closed with a resolution 
of CURRENTRELEASE. You may reopen this bug report if the solution does 
not work for you.


Comment 3 Nate Straz 2007-12-13 17:22:01 UTC
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5, which never existed.

