Red Hat Bugzilla – Bug 180538
gfs recovery could happen before fencing
Last modified: 2009-04-16 16:30:56 EDT
Description of problem:
This is a bug in the way SM manages multiple recovery events.
A specific arrangement would be required to see this:
- nodes A,B,C,D,E are in the cluster
- nodes A,B,C,D,E are in the fence domain (FD)
- A,B,C are using gfs X
- C,D,E are using gfs Y
The bug is possible on node C if A fails, creating a
recovery event for X (rev1), and just after that D fails,
creating a recovery event for Y (rev2). If the two nodes
fail at once there won't be a problem, because a single
recovery event will be created. The timing of the
consecutive failures would need to be just right.
The problem arises when the group representing the
fence domain (FD) is moved from rev1 into rev2. This
makes the groups in rev2 depend on FD recovery, but
removes the dependency of rev1 groups on FD recovery.
In actual fact, both rev1 and rev2 groups depend on
FD recovery, but the code currently has no way to
make two recovery events depend on the same group.
When the FD dependency is removed from rev1, recovery
for the higher level groups in rev1 (which are the dlm
and gfs groups for X) goes ahead without waiting for
FD recovery to finish.
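
The lost dependency described above can be sketched with a toy model.
This is not the SM code: the RecoveryEvent class, move_group() and the
group names are hypothetical, and the only rule modelled is the one
stated in this report, namely that a recovery event waits for fencing
only while the FD group is part of that same event.

```python
class RecoveryEvent:
    """Hypothetical stand-in for an SM recovery event (not the real struct)."""

    def __init__(self, name, groups):
        self.name = name
        self.groups = list(groups)  # ordered from lowest to highest level

    def must_wait_for_fencing(self):
        # Assumption: higher-level groups wait for FD recovery only when
        # the FD group belongs to this same event.
        return "FD" in self.groups


def move_group(group, src, dst):
    """Move a group between events; a group can live in only one event."""
    src.groups.remove(group)
    dst.groups.insert(0, group)  # FD is the lowest-level group


def simulate():
    # A fails: rev1 covers the fence domain plus the dlm and gfs groups for X.
    rev1 = RecoveryEvent("rev1", ["FD", "dlm-X", "gfs-X"])
    waits_before = rev1.must_wait_for_fencing()

    # D fails just afterwards: rev2 is created and FD is moved into it.
    rev2 = RecoveryEvent("rev2", ["dlm-Y", "gfs-Y"])
    move_group("FD", rev1, rev2)

    return waits_before, rev1.must_wait_for_fencing(), rev2.must_wait_for_fencing()


if __name__ == "__main__":
    print(simulate())
```

In this model rev1 waits for fencing before the move but not after it,
while rev2 does wait, which is the asymmetry described above.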
Both A and D will still be fenced, and given how recovery
works, fencing is likely to happen before gfs recovery on X
begins. But, if gfs-X recovery happens to start before
A is fenced, and A isn't really dead and comes back to
life and writes to X, then X could be corrupted. If
manual fencing is used, then it becomes very likely that
recovery for gfs-X happens before A is fenced, and you
have to hope A won't come back to life and write to X.
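
The unsafe ordering can be stated as a small check over a sequence of
events. This is only an illustration of the condition described above;
the event names and the corruption_possible() helper are hypothetical
and do not correspond to anything in the cluster code.

```python
FENCE_A = "fence A"
GFS_X_RECOVERY = "gfs-X recovery starts"
A_WRITES_X = "A writes to X"


def corruption_possible(timeline):
    """Return True if A writes to X after gfs-X recovery has started
    and before A has been fenced (the window this bug opens)."""
    recovery_started = False
    for event in timeline:
        if event == FENCE_A:
            return False  # once fenced, A can no longer write to X
        if event == GFS_X_RECOVERY:
            recovery_started = True
        elif event == A_WRITES_X and recovery_started:
            return True
    return False


if __name__ == "__main__":
    # Correct ordering: fencing completes before gfs-X recovery.
    print(corruption_possible([FENCE_A, GFS_X_RECOVERY]))
    # Buggy ordering: recovery starts first and A comes back to life.
    print(corruption_possible([GFS_X_RECOVERY, A_WRITES_X, FENCE_A]))
```

With manual fencing the gap between the start of gfs-X recovery and the
fencing of A can be long, which is why the second ordering becomes much
more likely in that setup.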
Version-Release number of selected component (if applicable):
How reproducible:
I doubt anyone has seen this in practice. The arrangement
of fs mounts is unusual and there are multiple places in the
process where a special timing of events is needed.
Steps to Reproduce:
1. See above; using fence_manual helps a lot.
Actual results:
You'll see gfs-X recovery happen before A is fenced.
Expected results:
gfs-X recovery won't happen until after A is fenced.
Additional info:
I haven't come up with any simple fixes to this problem.
We'll have to see how complex the solution I have in mind
ends up being.
Added this description as a comment in the code it affects.
This is fixed in the RHEL5 code.