Bug 218103 - groupd doesn't handle nodes that fail again before first recovery is done
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: David Teigland
QA Contact: Cluster QE
Reported: 2006-12-01 15:06 EST by David Teigland
Modified: 2009-04-16 18:49 EDT
Fixed In Version: RC
Doc Type: Bug Fix
Last Closed: 2007-02-07 20:14:17 EST

Attachments: None
Description David Teigland 2006-12-01 15:06:54 EST
Description of problem:

The management of "recovery sets" in groupd is not smart enough to handle
at least two situations where a node fails, rejoins, and fails again
before recovery for the first failure has completed.

Example 1, observed once on the smoke cluster:
salem killed by revolver
fenced fences salem
winston killed by revolver
fenced fences winston
merit killed by revolver
quorum lost, fencing delayed
winston rejoins cluster
quorum regained, fenced continues
fenced begins fencing merit
merit rejoins cluster
fence_apc against merit fails (reason unknown)
fence_apc against merit succeeds (on retry)
merit down, killed by fencing
fenced reports success against merit 

Recovery for merit is thus initiated while the earlier recovery for merit
is still in progress.
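
One way to handle this, sketched below with a simplified recovery_set
structure rather than groupd's actual data structures: look up an existing
recovery set for the node before creating a new one, and fold the repeat
failure into the set that is already in progress.

/* Simplified sketch, not groupd's actual structures: coalesce a
 * repeat failure into the recovery set already in progress. */

#include <stdlib.h>
#include <string.h>

struct recovery_set {
        int nodeid;                 /* node being recovered */
        int in_progress;            /* recovery not yet complete */
        struct recovery_set *next;
};

static struct recovery_set *recovery_sets;

struct recovery_set *find_recovery_set(int nodeid)
{
        struct recovery_set *rs;

        for (rs = recovery_sets; rs; rs = rs->next)
                if (rs->nodeid == nodeid)
                        return rs;
        return NULL;
}

void add_recovery_set(int nodeid)
{
        struct recovery_set *rs = find_recovery_set(nodeid);

        if (rs) {
                /* Second failure before the first recovery finished:
                 * reuse the existing set instead of starting another
                 * recovery for the same node. */
                rs->in_progress = 1;
                return;
        }

        rs = malloc(sizeof(struct recovery_set));
        if (!rs)
                return;
        memset(rs, 0, sizeof(*rs));
        rs->nodeid = nodeid;
        rs->in_progress = 1;
        rs->next = recovery_sets;
        recovery_sets = rs;
}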

Example 2:
- The groupd process (with no current groups) is killed. This causes a
recovery set to be created (on another node) from the cpg callback,
but since the node didn't die there will never be a cman nodedown
callback to clear away the recovery set.
- groupd is restarted.
- The groupd process is killed again. This tries to create a recovery
set for the node again from the cpg callback, which triggers an
assertion failure (see the sketch below).
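
The failing path is plausibly of this shape (a hypothetical
reconstruction, not groupd's actual code): the cpg leave path assumes no
recovery set can already exist for the node.

#include <assert.h>

struct recovery_set;
extern struct recovery_set *find_recovery_set(int nodeid); /* from the sketch above */

static void node_left_cpg(int nodeid)
{
        /* Fires on the second groupd kill: the set created by the
         * first kill was never cleared, because the node stayed up
         * and no cman nodedown callback ever arrived. */
        assert(find_recovery_set(nodeid) == NULL);
        /* ... create the recovery set ... */
}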

We need groupd to distinguish between groupd on a remote node failing
because the node went down and the groupd process merely exiting.  If
the process exits and the node wasn't in any groups, we don't care
about it and don't want a recovery set; if the node _was_ in any
groups, we need to kill the node via cman_kill_node() so we get a
proper cman nodedown and can go through recovery for it.
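
The cpg configuration-change callback reports why each member left, so
the distinction can be made there.  Below is a rough sketch of that
logic.  node_in_any_group() and the cman handle setup are assumed
helpers, not actual groupd code; CPG_REASON_PROCDOWN/CPG_REASON_NODEDOWN
and cman_kill_node() are the real openais cpg and libcman interfaces.

/* Sketch only: distinguishing a dead node from a dead groupd process
 * in the cpg confchg callback. */

#include <openais/cpg.h>   /* header path varies by openais/corosync version */
#include <libcman.h>

extern cman_handle_t ch;                  /* cman handle, initialized elsewhere */
extern int node_in_any_group(int nodeid); /* assumed helper */
extern void add_recovery_set(int nodeid); /* see the earlier sketch */

static void confchg_cb(cpg_handle_t handle, struct cpg_name *group_name,
                       struct cpg_address *member_list, int member_list_entries,
                       struct cpg_address *left_list, int left_list_entries,
                       struct cpg_address *joined_list, int joined_list_entries)
{
        int i;

        for (i = 0; i < left_list_entries; i++) {
                struct cpg_address *left = &left_list[i];

                switch (left->reason) {
                case CPG_REASON_NODEDOWN:
                        /* The node itself failed; a cman nodedown will
                         * follow, so a recovery set is appropriate. */
                        add_recovery_set(left->nodeid);
                        break;
                case CPG_REASON_PROCDOWN:
                        /* Only the groupd process exited; the node is
                         * still up, so no cman nodedown will arrive. */
                        if (node_in_any_group(left->nodeid))
                                /* Force a real nodedown so recovery
                                 * runs for the groups it was in. */
                                cman_kill_node(ch, left->nodeid);
                        /* Otherwise the node was in no groups: ignore
                         * it and create no recovery set. */
                        break;
                default:
                        break;
                }
        }
}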

Comment 2 RHEL Product and Program Management 2007-02-07 20:14:17 EST
A package has been built which should help the problem described in 
this bug report. This report is therefore being closed with a resolution 
of CURRENTRELEASE. You may reopen this bug report if the solution does 
not work for you.
Comment 3 Nate Straz 2007-12-13 12:22:01 EST
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5, which never existed.
