Bug 180538 - gfs recovery could happen before fencing
Status: CLOSED NEXTRELEASE
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cman
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: David Teigland
QA Contact: Cluster QE
Reported: 2006-02-08 16:01 EST by David Teigland
Modified: 2009-04-16 16:30 EDT
Doc Type: Bug Fix
Last Closed: 2006-05-04 14:19:38 EDT

Description David Teigland 2006-02-08 16:01:32 EST
Description of problem:

This is a bug in the way SM manages multiple recovery events.
A specific arrangement would be required to see this:

- nodes A,B,C,D,E are in the cluster
- nodes A,B,C,D,E are in the fence domain (FD)
- A,B,C are using gfs X
- C,D,E are using gfs Y

The bug is possible on node C if A fails, creating a
recovery event for X (rev1), and just after that D fails,
creating a recovery event for Y (rev2).  If the two nodes
fail at once there won't be a problem because a single
recovery event will be created.  The timing of the
consecutive failures would need to be just right.

The problem arises when the group representing the
fence domain (FD) is moved from rev1 into rev2.  This
makes the groups in rev2 depend on FD recovery, but
removes the dependency of rev1 groups on FD recovery.
In fact, both rev1 and rev2 groups depend on
FD recovery, but the code currently has no way to
make two revs depend on the same group.

When the FD dependency is removed from rev1, recovery
for the higher level groups in rev1 (which are the dlm
and gfs groups for X) goes ahead without waiting for
FD recovery to finish.
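
A toy model of the dependency loss described above, as a sketch only.
It is not the SM code: the Group and RecoveryEvent classes, the level
numbers and the move_group helper are all hypothetical names invented
for illustration.  The one assumption carried over from the description
is that a group only waits for lower-level groups in its own recovery
event.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    name: str
    level: int              # assumed ordering: 0 = fence domain, 1 = dlm, 2 = gfs
    recovered: bool = False


@dataclass
class RecoveryEvent:
    name: str
    groups: list = field(default_factory=list)

    def ready(self, group):
        # A group may recover once every lower-level group in the SAME
        # event has recovered.  Groups held by other events are not seen.
        return all(g.recovered for g in self.groups if g.level < group.level)


def move_group(group, src, dst):
    # Hypothetical helper: move a group from one recovery event to another,
    # as happens to FD when rev2 is created.
    src.groups.remove(group)
    dst.groups.append(group)


fd = Group("FD", 0)
dlm_x, gfs_x = Group("dlm-X", 1), Group("gfs-X", 2)
dlm_y, gfs_y = Group("dlm-Y", 1), Group("gfs-Y", 2)

# A fails: rev1 covers FD plus the dlm and gfs groups for X.
rev1 = RecoveryEvent("rev1", [fd, dlm_x, gfs_x])
blocked_before = not rev1.ready(dlm_x)    # dlm-X waits for FD (fencing)

# D fails just afterwards: rev2 covers Y, and FD is moved into it.
rev2 = RecoveryEvent("rev2", [dlm_y, gfs_y])
move_group(fd, rev1, rev2)

ready_after = rev1.ready(dlm_x)           # rev1 no longer sees FD at all
rev2_blocked = not rev2.ready(dlm_y)      # rev2 still waits for FD
```

In this model dlm-X is blocked while FD is in rev1 and becomes ready as
soon as FD moves to rev2, even though FD has not recovered.  That is the
window in which recovery for X can run ahead of fencing.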

Both A and D will still be fenced, and given how recovery
works the fencing is likely to happen before gfs recovery
on X begins.  But, if gfs-X recovery happens to start before
A is fenced, and A isn't really dead and comes back to
life and writes to X, then X could be corrupted.  If
manual fencing is used, then it becomes very likely that
recovery for gfs-X happens before A is fenced, and you
have to hope A won't come back to life and write to X.

Version-Release number of selected component (if applicable):


How reproducible:

I doubt anyone has seen this in practice.  The arrangement
of fs mounts is unusual and there are multiple places in the
process where a special timing of events is needed.

Steps to Reproduce:
1. see above, using fence_manual helps a lot
  
Actual results:
you'll see gfs-X recovery happen before A is fenced

Expected results:
gfs-X recovery won't happen until after A is fenced
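
The difference between the actual and expected results can be stated as
an ordering check over a sequence of events.  This is a sketch only: the
two traces below are made up for illustration and are not log output from
a real cluster, and the function name and the (kind, target) tuple format
are assumptions.

```python
def recovers_before_fencing(trace, node, fs):
    """Return True if recovery of `fs` appears in the trace before `node` is fenced."""
    for kind, target in trace:
        if (kind, target) == ("fence", node):
            return False    # fencing came first: the expected ordering
        if (kind, target) == ("recover", fs):
            return True     # recovery came first: the bug
    return False            # neither event appears in the trace


# Hypothetical trace for the buggy ordering (actual results).
actual = [("fail", "A"), ("fail", "D"), ("recover", "gfs-X"),
          ("fence", "A"), ("fence", "D"), ("recover", "gfs-Y")]

# Hypothetical trace for the correct ordering (expected results).
expected = [("fail", "A"), ("fail", "D"), ("fence", "A"),
            ("fence", "D"), ("recover", "gfs-X"), ("recover", "gfs-Y")]

bug_seen = recovers_before_fencing(actual, "A", "gfs-X")
bug_absent = not recovers_before_fencing(expected, "A", "gfs-X")
```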

Additional info:
Comment 1 David Teigland 2006-02-20 15:28:58 EST
I haven't come up with any simple fixes to this problem.
We'll have to see how complex the solution I have in mind
ends up being.
Comment 2 David Teigland 2006-05-04 14:19:38 EDT
Added this description as a comment in the code it affects.
This is fixed in the RHEL5 code.
