Bug 469212 - fence group stuck during recovery testing
Summary: fence group stuck during recovery testing
Keywords:
Status: CLOSED DUPLICATE of bug 575952
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.3
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On: 258121
Blocks:
 
Reported: 2008-10-30 16:21 UTC by Nate Straz
Modified: 2016-04-26 13:34 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-08-17 19:59:32 UTC
Target Upstream Version:
Embargoed:


Attachments
The output from `group_tool dump fence` from all nodes. (6.28 KB, text/plain) - 2008-10-30 16:21 UTC, Nate Straz
group_tool output from all nodes, case #2 (12.49 KB, text/plain) - 2008-11-03 14:28 UTC, Nate Straz
full group_tool dump output from all nodes, case #2 (62.17 KB, application/x-gzip) - 2008-11-03 19:08 UTC, Nate Straz
analysis of another instance (8.38 KB, text/plain) - 2010-03-22 19:03 UTC, David Teigland
log of previous debugging (1.98 KB, text/plain) - 2010-03-22 19:10 UTC, David Teigland
patch (2.40 KB, text/plain) - 2010-08-05 22:10 UTC, David Teigland

Description Nate Straz 2008-10-30 16:21:40 UTC
Created attachment 321955 [details]
The output from `group_tool dump fence` from all nodes.

Description of problem:

While running revolver I ran into a case where the fence group did not recover.
The cluster had six nodes and four were shot.  The remaining two nodes are stuck in FAIL_ALL_STOPPED.

I'll attach the output of `group_tool dump fence` from all nodes.


Version-Release number of selected component (if applicable):
cman-2.0.94-1.el5

How reproducible:
Unknown
  
Actual results:


Expected results:


Additional info:

Comment 1 Nate Straz 2008-11-03 14:28:34 UTC
Created attachment 322305 [details]
group_tool output from all nodes, case #2

I was able to reproduce this again during recovery testing.  It took many iterations to hit.  I'm attaching the output of `group_tool` and `group_tool dump fence` for all nodes.  tank-01, tank-03, tank-04, and morph-01 were all shot during this iteration of revolver.

Comment 2 Nate Straz 2008-11-03 19:08:33 UTC
Created attachment 322353 [details]
full group_tool dump output from all nodes, case #2

Comment 3 David Teigland 2009-03-24 20:57:54 UTC
I looked at this back in November, but only up to the point of finding that it was a revolver double-kill situation, after which I started looking at how revolver could more effectively avoid these uncontrolled kills.

Like bug 258121, this one is also triggered by uncontrolled or duplicate kills by revolver, where a node is killed via a revolver reboot, and then comes back, rejoining the cluster just before fencing gets around to power cycling it in response to the initial reboot kill.  Revolver tries to prevent this, but can't avoid it reliably as long as it uses reboot to kill nodes.  Using an iptables rule to block cluster manager traffic would be a good way to avoid the double-kill.

In bug 258121, the double-kill can cause groupd to process back-to-back events in reverse order on different nodes.  It's a rather fundamental design flaw in groupd that we will probably be unable to fix.

This bug is caused by the same fundamental problems, but is a bit different.  We might be able to do some workarounds to handle this one, at least some of the time.  The changes would sit right in the middle of a big pile of other existing attempts at working around variations of the same problem.  The key question is whether a patch can be made to pinpoint this specific issue without jeopardizing the common cases.

Comment 4 David Teigland 2009-03-30 22:14:25 UTC
For now let's NAK this bz for 5.4.  I suspect it may not be fixable, like the similar bz 258121.  If I do come up with a possible fix, we may not want it -- the change is likely to carry a high regression risk for common cases, which isn't justified by such a rare recovery scenario (seen once in QE revolver testing).  Furthermore, we couldn't reliably QE the result because the scenario actually occurred when revolver didn't run as intended (the node rebooted too quickly for revolver).

Comment 7 David Teigland 2010-03-22 19:03:45 UTC
Created attachment 401848 [details]
analysis of another instance

Nate hit this again on 8 west nodes.

4 of 8 nodes were killed and the cluster lost quorum.  When a fifth node rejoined, quorum was restored, and the 4 nodes in the fence domain immediately fenced one of the 4 killed nodes that was just about to rejoin, causing a double kill in the midst of all the joins.  Node 2, which was joining the domain, expected its join to be processed first, followed by the second failure (of node 8), but the other nodes removed node 8 before processing the join of node 2.  This reversed order of event processing is what groupd gets confused and stuck on.
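
As a toy illustration of the reversed event ordering described above (this is not groupd code; the surviving node IDs and the ack bookkeeping are invented for the example, only nodes 2 and 8 come from the analysis), the small C program below shows how a "stopped" barrier can hang when two nodes apply the same join and failure in opposite orders and therefore disagree on which members must acknowledge the stop:

/* Toy model, not groupd code: each node snapshots the membership it
 * believes in at the moment it processes the failure of node 8, and
 * waits for stop acks from exactly that set.  If one node has already
 * admitted node 2 and the other has not, the two expected ack sets
 * differ and a barrier like FAIL_ALL_STOPPED can never complete. */

#include <stdio.h>

#define MAX_NODES 16

struct node_view {
    int member[MAX_NODES];      /* 1 if this node believes nodeid is a member */
    int expect_ack[MAX_NODES];  /* members this node waits on after the failure */
};

static void apply_join(struct node_view *v, int id)
{
    v->member[id] = 1;
}

static void apply_fail(struct node_view *v, int id)
{
    v->member[id] = 0;
    /* snapshot of membership at the time the failure is processed */
    for (int i = 0; i < MAX_NODES; i++)
        v->expect_ack[i] = v->member[i];
}

static void print_expected(const char *name, struct node_view *v)
{
    printf("%s waits for stop acks from:", name);
    for (int i = 0; i < MAX_NODES; i++)
        if (v->expect_ack[i])
            printf(" %d", i);
    printf("\n");
}

int main(void)
{
    struct node_view a = {0}, b = {0};

    /* invented survivors 1, 3, 5, 7 plus node 8, which has not failed yet */
    for (int i = 1; i <= 7; i += 2)
        a.member[i] = b.member[i] = 1;
    a.member[8] = b.member[8] = 1;

    /* node A sees join(2) first, then fail(8) */
    apply_join(&a, 2);
    apply_fail(&a, 8);

    /* node B sees fail(8) first, then join(2) */
    apply_fail(&b, 8);
    apply_join(&b, 2);

    /* node A expects an ack from node 2 that node B never waits for,
     * so in this toy the stop round cannot complete consistently. */
    print_expected("node A", &a);
    print_expected("node B", &b);
    return 0;
}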

I believe past testing included a positive post_fail_delay, which would probably mask this specific double kill.  Ideally post_join_delay would be used here since a node had just joined the cluster, but that delay is based on the last join/fail event in the *group*, not the cluster.  I may try to address that in a separate bug.

Comment 8 David Teigland 2010-03-22 19:10:10 UTC
Created attachment 401851 [details]
log of previous debugging

These are the notes I took when I debugged this issue back in Mar 2009, but it looks like I never recorded them in the bz.

Comment 10 David Teigland 2010-08-05 21:42:35 UTC
The core groupd bug here is not realistically fixable in RHEL5, but I think we can avoid it by making the fenced post_join_delay more intelligent.  I'll try to put together an improved post_join_delay patch (i.e. make it apply after both cpg joins and cluster joins), and probably just repurpose this bz for it.

Comment 11 David Teigland 2010-08-05 22:10:51 UTC
Created attachment 436990 [details]
patch

This makes fenced use post_join_delay after a node joins the cluster and thereby gives it quorum.  I've not done any testing of it.
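
To make the intent concrete, here is a minimal C sketch of the decision the patch is aiming at; it is not the attached patch, and the variable and function names are invented for the illustration (only the post_join_delay/post_fail_delay settings from cluster.conf's fence_daemon section are real):

#include <stdio.h>
#include <time.h>

/* Illustrative defaults; the real values come from cluster.conf,
   e.g. <fence_daemon post_join_delay="6" post_fail_delay="0"/>. */
static int post_join_delay = 6;   /* seconds */
static int post_fail_delay = 0;

/* Hypothetical state, not actual fenced variables: when the fence
   domain last saw a member join, and when a node last joined the
   cluster and restored quorum. */
static time_t last_domain_join;
static time_t last_cluster_join_quorum;

/* Pick the delay to apply before fencing a failed node.  The idea from
   comments 10 and 11: honor post_join_delay not only after a fence
   domain (cpg) join but also after a cluster join that restored quorum,
   so a node that was just rebooted and is about to rejoin is not
   fenced immediately. */
static int fencing_delay(time_t now)
{
    if (difftime(now, last_domain_join) < post_join_delay ||
        difftime(now, last_cluster_join_quorum) < post_join_delay)
        return post_join_delay;
    return post_fail_delay;
}

int main(void)
{
    time_t now = time(NULL);

    last_domain_join = now - 60;          /* last domain join was long ago */
    last_cluster_join_quorum = now - 2;   /* a join restored quorum 2s ago */

    printf("delay before fencing: %d seconds\n", fencing_delay(now));
    return 0;
}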

Comment 12 David Teigland 2010-08-05 22:17:05 UTC
I'd forgotten that I'd created bug 575952 to track the patch attached in comment 11 and mentioned in comment 10.  I'll use that bz for the fenced post_join_delay work, and if it's successful, I'll probably suggest closing this bug.

Comment 13 David Teigland 2010-08-17 19:59:32 UTC
This is not a duplicate of bug 575952, but the fix in that bug should generally avoid the specific sequence of events that produced this bug.

*** This bug has been marked as a duplicate of bug 575952 ***

