Description of problem:

This is a bug in the way SM manages multiple recovery events. A specific arrangement is required to see it:

- nodes A,B,C,D,E are in the cluster
- nodes A,B,C,D,E are in the fence domain (FD)
- A,B,C are using gfs X
- C,D,E are using gfs Y

The bug is possible on node C if A fails, creating a recovery event for X (rev1), and just after that D fails, creating a recovery event for Y (rev2). If the two nodes fail at once there won't be a problem, because a single recovery event will be created. The timing of the consecutive failures would need to be just right.

The problem arises when the group representing the fence domain (FD) is moved from rev1 into rev2. This makes the groups in rev2 depend on FD recovery, but removes the dependency of the rev1 groups on FD recovery. In fact, both the rev1 and rev2 groups depend on FD recovery, but the code currently has no way to make two revs depend on the same group.

When the FD dependency is removed from rev1, recovery for the higher-level groups in rev1 (the dlm and gfs groups for X) goes ahead without waiting for FD recovery to finish. Both A and D will still be fenced, and given how recovery works, that is likely to happen before gfs recovery on X begins. But if gfs-X recovery happens to start before A is fenced, and A isn't really dead and comes back to life and writes to X, then X could be corrupted. If manual fencing is used, it becomes very likely that recovery for gfs-X happens before A is fenced, and you have to hope A won't come back to life and write to X.

Version-Release number of selected component (if applicable):

How reproducible:
I doubt anyone has seen this in practice. The arrangement of fs mounts is unusual, and there are multiple places in the process where a special timing of events is needed.

Steps to Reproduce:
1. see above; using fence_manual helps a lot
2.
3.

Actual results:
you'll see gfs-X recovery happen before A is fenced

Expected results:
gfs-X recovery won't happen until after A is fenced

Additional info:
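To make the dependency problem concrete, here is a minimal standalone sketch (hypothetical structs and names, not the actual SM/groupd code) in which each recovery event owns a set of groups and a higher-level group only waits on lower-level groups in the same rev. Moving the FD group from rev1 into rev2 then silently drops rev1's dependency on fencing, so the readiness check for gfs-X passes even though A has not been fenced, while gfs-Y in rev2 correctly waits.

    /*
     * Sketch only: illustrates how a rev-owned dependency list loses the
     * FD dependency when the FD group is moved between revs.
     */
    #include <stdio.h>

    #define MAX_GROUPS 8

    struct group {
            const char *name;
            int level;              /* 0 = fence domain, 1 = dlm, 2 = gfs */
            int recovered;
    };

    struct rev {
            struct group *groups[MAX_GROUPS];
            int ngroups;
    };

    static void rev_add(struct rev *r, struct group *g)
    {
            r->groups[r->ngroups++] = g;
    }

    static void rev_remove(struct rev *r, struct group *g)
    {
            int i, j;
            for (i = 0; i < r->ngroups; i++) {
                    if (r->groups[i] == g) {
                            for (j = i; j < r->ngroups - 1; j++)
                                    r->groups[j] = r->groups[j + 1];
                            r->ngroups--;
                            return;
                    }
            }
    }

    /* a higher-level group may recover once every lower-level group in the
       SAME rev has recovered -- the rev is the only place the dependency
       is recorded, which is the heart of the bug */
    static int deps_done(struct rev *r, struct group *g)
    {
            int i;
            for (i = 0; i < r->ngroups; i++)
                    if (r->groups[i]->level < g->level &&
                        !r->groups[i]->recovered)
                            return 0;
            return 1;
    }

    int main(void)
    {
            struct group fd    = { "fence-domain", 0, 0 };
            struct group gfs_x = { "gfs-X",        2, 0 };
            struct group gfs_y = { "gfs-Y",        2, 0 };

            struct rev rev1 = { { 0 }, 0 };
            struct rev rev2 = { { 0 }, 0 };

            /* A fails: rev1 covers the fence domain and gfs-X */
            rev_add(&rev1, &fd);
            rev_add(&rev1, &gfs_x);

            /* D fails shortly after: rev2 covers gfs-Y, and the FD group
               is moved from rev1 into rev2 */
            rev_add(&rev2, &gfs_y);
            rev_remove(&rev1, &fd);
            rev_add(&rev2, &fd);

            /* fencing has not completed yet */
            printf("gfs-X in rev1 may recover now: %s\n",
                   deps_done(&rev1, &gfs_x) ? "yes (BUG: A not fenced)" : "no");
            printf("gfs-Y in rev2 may recover now: %s\n",
                   deps_done(&rev2, &gfs_y) ? "yes" : "no");
            return 0;
    }

Running this prints that gfs-X may recover even though the fence domain has not, which is the window in which a not-quite-dead A could still write to X.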
I haven't come up with any simple fixes to this problem. We'll have to see how complex the solution I have in mind ends up being.
Added this description as a comment in the code it affects. This is fixed in the RHEL5 code.