Bug 706464 - groupd waiting for recovery sets
Summary: groupd waiting for recovery sets
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-05-20 17:20 UTC by David Teigland
Modified: 2012-07-23 15:13 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-07-23 15:13:22 UTC
Target Upstream Version:


Attachments
groupd log excerpts (6.24 KB, text/plain)
2011-05-20 17:22 UTC, David Teigland
all groupd logs (817.23 KB, text/plain)
2011-05-20 17:25 UTC, David Teigland

Description David Teigland 2011-05-20 17:20:38 UTC
Description of problem:

Another problem with groupd recovery in the situation where:
node X fails
node X rejoins while the others are recovering it 
node X fails again while others are recovering it from its first failure

As we've seen with previous bugs like this, the groupd handling is so fragile
and imperfect in these scenarios that a fix can easily just cause different, even worse, problems than the original.  We'll have to see once I have a patch we can consider.

In this case, the problem is around the per-failure recovery sets, which try to ensure that all groups are stopped for a given failure before any are started.
Here, a recovery set (rs) for the failed node is being processed when the node fails again.  All groups in the recovery set finish recovery, and the rs is removed.  The second failure has caused group 0:default to go back to step 1 of recovery, where it tries to wait for all groups in the recovery set to be stopped, but no recovery set exists any longer, so it waits forever on that check.  Specifically, the culprit is the "if (!found) return 0;" in all_levels_all_stopped().
I don't know in what other cases that !found condition may occur or be needed.  If we can figure that out, and be certain there are none, then removing it (and returning 1 instead of 0) should fix the problem.  If there are cases where the current behavior is needed, then we would need to figure out some other way to distinguish this specific scenario from any others.
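
To illustrate the check described above, here is a minimal, self-contained model in C. It is not the groupd source: struct recovery_set, find_recovery_set() and all_stopped_model() are hypothetical names invented for this sketch, and the real all_levels_all_stopped() works on different data. The missing_result parameter only exists so that both behaviors can be compared: 0 models the current "if (!found) return 0;", and 1 models the proposed change.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in; not groupd's real data structure. */
struct recovery_set {
    int nodeid;          /* failed node this set was created for */
    int groups_total;    /* groups that belong to the set */
    int groups_stopped;  /* groups that have reported "stopped" */
};

/* Return the recovery set for nodeid, or NULL if none exists (any longer). */
static const struct recovery_set *
find_recovery_set(const struct recovery_set *sets, size_t count, int nodeid)
{
    for (size_t i = 0; i < count; i++) {
        if (sets[i].nodeid == nodeid)
            return &sets[i];
    }
    return NULL;
}

/*
 * Model of the step-1 wait condition: returns 1 when recovery may proceed,
 * 0 when the caller must keep waiting.
 *
 * missing_result is the value returned when no recovery set is found:
 *   0 = current behavior (caller keeps waiting),
 *   1 = proposed behavior (nothing left to wait for).
 */
static int
all_stopped_model(const struct recovery_set *sets, size_t count,
                  int nodeid, int missing_result)
{
    const struct recovery_set *rs = find_recovery_set(sets, count, nodeid);

    assert(missing_result == 0 || missing_result == 1);

    if (!rs)
        return missing_result;

    return rs->groups_stopped == rs->groups_total;
}
```

In this model, once the recovery set for the failed node has been removed, a call with missing_result 0 returns 0 on every retry, which corresponds to the endless wait seen here; with missing_result 1 the caller proceeds. Whether returning 1 is safe in the real code still depends on the open question above about other cases where !found can occur.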


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 David Teigland 2011-05-20 17:22:32 UTC
Created attachment 500103
groupd log excerpts

group_tool output and relevant portions of groupd logs.

Comment 2 David Teigland 2011-05-20 17:25:06 UTC
Created attachment 500104
all groupd logs

Comment 3 RHEL Program Management 2012-05-15 18:57:33 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

