Bug 218103

Summary: groupd doesn't handle nodes that fail again before first recovery is done
Product: Red Hat Enterprise Linux 5
Component: cman
Version: 5.0
Reporter: David Teigland <teigland>
Assignee: David Teigland <teigland>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Hardware: All
OS: Linux
Fixed In Version: RC
Doc Type: Bug Fix
Last Closed: 2007-02-08 01:14:17 UTC

Description David Teigland 2006-12-01 20:06:54 UTC
Description of problem:

The management of "recovery sets" in groupd isn't smart enough to handle
at least a couple of situations where a node fails, returns, and fails again
before recovery for the first failure is done.

Example 1, observed once on the smoke cluster:
salem killed by revolver
fenced fences salem
winston killed by revolver
fenced fences winston
merit killed by revolver
quorum lost, fencing delayed
winston rejoins cluster
quorum regained, fenced continues
fenced begins fencing merit
merit rejoins cluster
fence_apc against merit fails (reason unknown)
fence_apc against merit succeeds (on retry)
merit down, killed by fencing
fenced reports success against merit 

Recovery for merit is thus initiated while the earlier recovery for merit
is still in progress.
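
One purely illustrative way to avoid starting a second, conflicting recovery
for the same node is to look up an existing recovery set by nodeid before
creating a new one, so a repeat failure extends the in-progress recovery
instead of racing with it.  The structure, list, and helper below are
hypothetical sketches, not groupd's actual data structures:

    #include <stdlib.h>

    struct recovery_set {
            int nodeid;
            int in_progress;
            struct recovery_set *next;
    };

    static struct recovery_set *recovery_sets;  /* hypothetical global list */

    struct recovery_set *get_recovery_set(int nodeid)
    {
            struct recovery_set *rs;

            /* reuse the set created for the earlier failure, if any */
            for (rs = recovery_sets; rs; rs = rs->next)
                    if (rs->nodeid == nodeid)
                            return rs;

            rs = calloc(1, sizeof(*rs));
            if (!rs)
                    return NULL;
            rs->nodeid = nodeid;
            rs->in_progress = 1;
            rs->next = recovery_sets;
            recovery_sets = rs;
            return rs;
    }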

Example 2:
- The groupd process (with no current groups) is killed.  This causes
a recovery set to be created (on another node) from the cpg callback,
but since the node didn't die there will never be a cman nodedown
callback to clear away the recovery set.
- groupd is restarted.
- The groupd process is killed again.  This tries to create a recovery
set for the node again from the cpg callback, which triggers an
assertion failure.

We need groupd to distinguish between groupd on a remote node
failing due to the node going down vs just the groupd process
exiting.  If the process exits and the node wasn't in any groups
we don't care about it and don't want a recovery set; if the node
_was_ in any groups we need to kill the node via cman_kill_node()
so we'll get a proper cman nodedown and can go through recovery
for it.
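
A minimal sketch of that distinction, assuming an openais-era cpg callback
and groupd's existing cman connection.  This is not the actual patch; the
include path, the handle ch, and node_in_any_groups() are assumptions
standing in for groupd's internals.  cman_kill_node() is the libcman call
named above.

    #include <openais/cpg.h>   /* assumed openais path; corosync uses <corosync/cpg.h> */
    #include <libcman.h>

    extern cman_handle_t ch;                    /* cman connection groupd already holds */
    extern int node_in_any_groups(int nodeid);  /* hypothetical helper */

    static void confchg_cb(cpg_handle_t handle, struct cpg_name *group_name,
                           struct cpg_address *member_list, int member_list_entries,
                           struct cpg_address *left_list, int left_list_entries,
                           struct cpg_address *joined_list, int joined_list_entries)
    {
            int i;

            for (i = 0; i < left_list_entries; i++) {
                    struct cpg_address *a = &left_list[i];

                    if (a->reason == CPG_REASON_NODEDOWN)
                            /* real node failure: the cman nodedown callback
                               drives the recovery set, nothing extra here */
                            continue;

                    if (a->reason != CPG_REASON_PROCDOWN)
                            continue;

                    if (!node_in_any_groups(a->nodeid))
                            /* groupd exited but the node is up and held no
                               groups: nothing to recover, no recovery set */
                            continue;

                    /* groupd died while the node was in groups: kill the node
                       so cman delivers a proper nodedown and normal recovery
                       runs for it */
                    cman_kill_node(ch, a->nodeid);
            }
    }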



Comment 2 RHEL Program Management 2007-02-08 01:14:17 UTC
A package has been built which should help the problem described in 
this bug report. This report is therefore being closed with a resolution 
of CURRENTRELEASE. You may reopen this bug report if the solution does 
not work for you.


Comment 3 Nate Straz 2007-12-13 17:22:01 UTC
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5, which never existed.