Red Hat Bugzilla – Bug 162725
Full gfs access allowed to node in inquorate state
Last modified: 2009-04-16 16:30:34 EDT
Description of problem:
With a three node cluster configured for lock_dlm and manual fencing and a single gfs file system mounted on each of the three nodes, we powered off two of the nodes. We are still able to read and write files on the gfs file system although clustat shows the cluster inquorate with only one node remaining in the cluster. This is with the GA release of GFS 6.1/Cluster Manager 4.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Bring up the cluster with three nodes and a gfs file system mounted
2. Power off two of the nodes
3. Attempt to read or write a file on the gfs file system on the remaining node
Actual Results: Read and write successful
Expected Results: Read or at least write access blocked
My understanding is that write access, at least, should be blocked if the cluster is inquorate, to guard against a possible "split-brain" condition.
Apparently, this happens (and is more easily reproducible?) if you shut two of
the nodes down cleanly.
That is correct, Lon. I used the shutdown command. If I pull the power cord
on one node, the remaining two nodes cannot access GFS until a
fence_ack_manual command is issued. I will try pulling the power cord on two
nodes so that the remaining node is inquorate and see what happens.
This looks correct. GFS activity is only blocked if a node that
has the fs mounted fails. This can result in the somewhat unexpected
situation seen here where the cluster loses quorum but fs's mounted by
nodes that haven't failed continue running normally.
If there was one fs mounted by the three nodes, and two failed,
then that fs would indeed be "blocked" until quorum was regained.
[GFS activity is never fully blocked, though, even in this case,
as gfs will continue to use locks it already holds -- only new
lock requests are blocked.]
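The semantics described in the bracketed note can be sketched as a toy model (all names here are made up for illustration; this is not the real lock_dlm API): locks a node already holds keep working, while new lock requests stall once a mounting node has failed.

```python
# Toy model of the behaviour described above: when a mounting node fails,
# only *new* lock requests are blocked; locks already granted stay usable.
# Hypothetical names -- this is not the real lock_dlm interface.

class ToyLockSpace:
    def __init__(self):
        self.blocked = False      # set while recovery/fencing is pending
        self.granted = set()      # lock names this node already holds

    def request(self, name):
        """Try to acquire a new lock; stalls (here: refuses) while blocked."""
        if self.blocked:
            return False          # new requests wait until quorum/recovery
        self.granted.add(name)
        return True

    def use(self, name):
        """I/O under an already-granted lock keeps working even when blocked."""
        return name in self.granted

ls = ToyLockSpace()
ls.request("inode-42")            # granted while the cluster is healthy
ls.blocked = True                 # a mounting node failed; recovery pending
print(ls.use("inode-42"))         # True  -- held lock still usable
print(ls.request("inode-99"))     # False -- new request blocked
```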
If you get a split-brain cluster, gfs being blocked or not doesn't
matter. The quorate cluster partition will always fence the inquorate
partition. This guarantees that no inquorate nodes will modify any fs's
after fencing, so enabling gfs on the quorate nodes can happen.
Dave - thanks for the explanation. That helps a lot. I will change the
status to NOTABUG.
*** Bug 170926 has been marked as a duplicate of this bug. ***
There is a scenario where this behaviour may be harmful.
Consider a node having mounted GFS and kicking off other nodes due to a
partially malfunctioning network connection (no fencing, just CMAN).
Sep 29 12:06:42 zs01 kernel: CMAN: removing node zs03 from the cluster : Missed
too many heartbeats
Sep 29 12:06:43 zs03 kernel: CMAN: Being told to leave the cluster by node 1
(Taken from bug 169693)
At the very end this node has removed the others and keeps the filesystem
mounted in an inquorate state. Now the other nodes form a new cluster and
remount GFS, while the sick node has GFS mounted in another cluster partition.
I believe this is what happened in the quoted bug. The true cause of failure
seems to be the broken cluster network connectivity, but GFS should have a
safeguard against that.
Would it make sense to add an option to totally block the filesystem when quorum
is lost? After hitting the mentioned bug it would certainly make me sleep better
again. After all, when a cluster has lost its quorum, there is some havoc going
on, and GFS trying to save whatever is left to save doesn't seem right.
A more specific example:
1. nodes in cluster: A,B,C
2. A has GFS file system "foo" mounted
3. B,C have no GFS file systems mounted
4. A has a failed network connection
5. cluster partitions
6. A remains in original, now inquorate, cluster
7. A continues to use GFS foo, unaffected
8. B,C form a new, quorate cluster
9. B,C start fenced prior to mounting GFS foo
10. A is fenced by B or C
11. B,C can mount GFS foo
To avoid the problems in steps 4-5 you could use
bonding with two network connections (although there's
some doubt about how well this works at the moment.)
There could be problems that cause CMAN to see a failed
network connection when in fact the connection is healthy.
To work around these you can play with the CMAN tunables
(hello_timer, deadnode_timeout, max_retries) or possibly
run network-intensive applications over a separate network.
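For what it's worth, my understanding is that on the RHEL 4 kernel-based CMAN these tunables are exposed under /proc/cluster/config/cman/ and can be inspected or adjusted at runtime. The paths and values below are a sketch only; verify against the release documentation before changing anything:

```shell
# Illustrative only -- confirm paths and sane values for your release first.
cat /proc/cluster/config/cman/hello_timer       # seconds between heartbeats
cat /proc/cluster/config/cman/deadnode_timeout  # silence before a node is declared dead
cat /proc/cluster/config/cman/max_retries
# Raising deadnode_timeout makes CMAN more tolerant of a congested network:
echo 60 > /proc/cluster/config/cman/deadnode_timeout
```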
I'm having difficulty connecting the last paragraph of
comment 6 to the situation I've described:
- What does "blocking" a GFS file system mean specifically
in this context and what would it accomplish?
- Losing quorum should be considered a perfectly normal event
in the course of running a cluster, it doesn't imply any kind of
bug or malfunction -- it's just an operating mode in which there's
a greater limit on the actions the cluster may perform.
For one, I'm looking for explanations/fixes to bug 169693. And if this bug is
related, then if node A's activities on the file system had been blocked, the
filesystem wouldn't have diverged into two different views.
Also the scenario above assumes that in step A you have fenced configured w/o
clean_start. If fencing non-participating nodes at start-up is a manifest part
of the cluster integrity concept, then that degree of freedom should be removed
from the user.
Losing quorum should not be considered a perfectly normal event. By definition
the cluster does not exist anymore, either through hardware/software fault or
human misconfiguration. If a cluster needs to be degraded in its number of
members there are mechanisms to do so w/o dropping into inquorate state.
Just to mention bug 169693 again: if the cluster concept took the inquorate
state more seriously and blocked activities immediately, instead of relying on
being fenced in the future, the split view of GFS would not have happened.
Even if the true cause for the GFS corruption in bug 169693 might turn out to be
bug 164331, blocking upon entering the inquorate state would have saved the
I see this kind of safeguard as an important asset for an enterprise-grade
filesystem, or not?
in comment #8: "in step A" -> "in step 10"
clean_start should never be set to 1 in cluster.conf, it
can lead to fs corruption. References to it will probably
be removed. When a new cluster gains quorum and the fence
domain is first activated, nodes in an unknown state (not
in the quorate cluster) must be fenced because we don't know
if they are in a cluster partition of their own using the fs.
If they are and they are not fenced, the fs can easily be
corrupted. So, if nodes on both sides of a partitioned cluster
ever have the same fs mounted at once, then the fencing
system was not used correctly (e.g. clean_start was 1)
or there's a bug somewhere in the fencing system. I suspect
that this clean_start=1 is the root cause of your problems.
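For reference, clean_start is an attribute on the <fence_daemon> element in cluster.conf; per the comment above it should stay at its default of 0. The other attributes shown are illustrative, not recommendations:

```xml
<!-- Fragment of /etc/cluster/cluster.conf (illustrative values).
     clean_start="0" (the default) makes fenced fence any node in an
     unknown state when the fence domain first starts; do not set it to 1. -->
<fence_daemon clean_start="0" post_join_delay="3" post_fail_delay="0"/>
```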
While gfs is mounted on a node (whether the cluster is
quorate or not) there is no way for us to "block the file
system" as you suggest.
Quorum and fencing solve two different problems. Quorum
adds "sanity" to a split-brain scenario by allowing one
partition to go ahead and do work. Fencing forcibly
prevents nodes that are not participating in the quorate
cluster and could potentially still be writing to the
fs from doing so. If we could be certain that a node in
an inquorate cluster (or a node that's been removed from
the cluster) would not write to the fs, then we wouldn't
need to fence it.
And there are improvements that we could add along those
lines. In our example above, some new userland daemon
on node A could recognize that it's the only remaining node
in an inquorate cluster. Based on this knowledge it could
decide to do something akin to 'gfs_tool withdraw' to forcibly
shut down local access to the fs and return errors to any processes
using the fs. If that was successful, it could then record
somewhere (probably on shared storage) that it has safely and
completely shutdown/unmounted the fs. When B,C start up they
could safely bypass fencing A if they saw this record that
A had in fact safely shut down fs access. That kind of feature
would be really neat to work on given the time...
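The proposed daemon could be sketched roughly as follows. Everything here is hypothetical for illustration ('gfs_tool withdraw' is the only real command referenced); the point is the decision logic: withdraw only when we are the sole node left in an inquorate partition, and record the clean shutdown only after the withdraw succeeds.

```python
# Sketch of the hypothetical userland daemon described above. It decides,
# from cluster state, whether to withdraw the local GFS mount and record a
# "clean shutdown" marker that peers could use to safely bypass fencing.
# All type and function names are made up for illustration.

from dataclasses import dataclass

@dataclass
class ClusterState:
    quorate: bool        # does our partition have quorum?
    members: int         # nodes still in our partition
    gfs_mounted: bool    # do we have a GFS filesystem mounted?

def should_withdraw(state: ClusterState) -> bool:
    """Withdraw when we are the only node left in an inquorate partition
    and still have the filesystem mounted."""
    return state.gfs_mounted and not state.quorate and state.members == 1

def handle(state: ClusterState, shared_storage: dict) -> None:
    if should_withdraw(state):
        # A real daemon would run 'gfs_tool withdraw <mountpoint>' here and,
        # only on success, record the clean shutdown on shared storage.
        shared_storage["clean_shutdown"] = True

shared = {}
handle(ClusterState(quorate=False, members=1, gfs_mounted=True), shared)
print(shared.get("clean_shutdown"))   # True
```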