Bug 706464

Summary: groupd waiting for recovery sets
Product: Red Hat Enterprise Linux 5
Reporter: David Teigland <teigland>
Component: cman
Assignee: David Teigland <teigland>
Status: CLOSED WONTFIX
QA Contact: Cluster QE <mspqa-list>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 5.5
CC: cluster-maint, edamato, fdinitto, nstraz
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-07-23 15:13:22 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
Description            Flags
groupd log excerpts    none
all groupd logs        none

Description David Teigland 2011-05-20 17:20:38 UTC
Description of problem:

Another problem with groupd recovery in the situation where:
node X fails
node X rejoins while the others are recovering it 
node X fails again while others are recovering it from its first failure

As we've seen with previous bugs like this, the groupd handling is so fragile and imperfect in these scenarios that a fix can easily just cause different, even worse, problems than the original.  We'll have to see once I have a patch we can consider.

The problem here is around the per-failure recovery sets, which try to ensure all groups are stopped for a given failure before any are started.
A recovery set for the failed node is being processed when the node fails again.  All groups in the recovery set finish recovery, and the recovery set (rs) is removed.  The second failure has caused group 0:default to go back to step 1 of recovery, where it tries to wait for all groups in the recovery set to be stopped, but no recovery set exists any longer, so it waits forever on that check.  Specifically, it is stuck on the "if (!found) return 0;" in all_levels_all_stopped().
I don't know in what other cases that !found condition may occur or be needed.  If we can figure that out, and be certain there are none, then removing it (and returning 1 instead of 0) should fix the problem.  If there are cases where the current behavior is needed, then we would need to figure out some other way to distinguish this specific scenario from any others.
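
To make the stuck check concrete, here is a minimal sketch of the situation described above. It is not groupd's code: the struct, the find_rs() helper, the all_stopped() function and its not_found_result parameter are all hypothetical names invented for illustration. The only detail taken from this report is that a missing recovery set currently yields 0, and that the proposed change would yield 1 instead.

```c
#include <stddef.h>

/* Hypothetical, simplified model of a per-failure recovery set.
 * These are NOT groupd's real data structures. */
#define MAX_GROUPS 8

struct recovery_set {
    int nodeid;                    /* failed node this set tracks */
    int in_use;                    /* 0 once the set has been removed */
    int group_count;               /* number of groups in the set */
    int group_stopped[MAX_GROUPS]; /* 1 if that group has stopped */
};

/* Return the active recovery set for nodeid, or NULL if none exists. */
static struct recovery_set *find_rs(struct recovery_set *sets, size_t n,
                                    int nodeid)
{
    for (size_t i = 0; i < n; i++) {
        if (sets[i].in_use && sets[i].nodeid == nodeid)
            return &sets[i];
    }
    return NULL;
}

/* Model of the "have all groups stopped?" check.
 * not_found_result is an illustrative knob for the !found branch:
 *   0 = behavior described in this report (caller keeps waiting),
 *   1 = proposed change (nothing left to wait for). */
static int all_stopped(struct recovery_set *sets, size_t n, int nodeid,
                       int not_found_result)
{
    struct recovery_set *rs = find_rs(sets, n, nodeid);

    if (!rs)
        return not_found_result;

    for (int i = 0; i < rs->group_count; i++) {
        if (!rs->group_stopped[i])
            return 0; /* at least one group is still running */
    }
    return 1;
}
```

In this model, once the set for node X is removed (in_use = 0), a caller that polls all_stopped(..., 0) until it returns 1 never makes progress, which mirrors the hang on group 0:default. Passing 1 lets that caller proceed, but whether that is safe in every other !found case is exactly the open question raised above.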



Comment 1 David Teigland 2011-05-20 17:22:32 UTC
Created attachment 500103 [details]
groupd log excerpts

group_tool output and relevant portions of groupd logs.

Comment 2 David Teigland 2011-05-20 17:25:06 UTC
Created attachment 500104 [details]
all groupd logs

Comment 3 RHEL Program Management 2012-05-15 18:57:33 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.