Red Hat Bugzilla – Bug 143487
fencing doesn't happen without quorum
Last modified: 2009-08-31 14:16:01 EDT
Description of problem:
When fenced needs to fence a node, it must read the information in
CCS. If at that time the cluster is inquorate, the CCS read request
will be refused by the ccsd server. This becomes an issue when node
failures cause the cluster to lose quorum and the failed nodes have no
mechanism for resetting themselves. In such a case, the failed nodes
will not be able to reset themselves, nor will any of the other nodes
be able to fence them, since access to CCS is refused. The result is a
hung cluster until an administrator notices and takes action.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. start cluster
2. kill nodes until quorum is lost
3. watch cluster do nothing
Actual results:
cluster hangs until quorum is reestablished
Expected results:
inquorate nodes should be able to fence failed nodes in order to
automatically recover from failures
Jon Brassow, Mike Tilstra and I had a conversation on a potential
workaround. My idea was to make fenced access ccsd with the force option
so that it disregards whether the cluster is quorate. Jon suggested
that if we did that, we should try to make fenced first test if it was
quorate before fencing. If it was, it would be allowed to fence the
failed node immediately. If the cluster was inquorate, then fenced
would sleep a configurable amount of time before forcefully reading
its information from ccs to fence the node. The reason for this
delay is to give the quorate portion of the cluster a better chance at
fencing the failed nodes in the event of a netsplit.
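The proposal above can be sketched roughly as follows. This is only an
illustration of the decision logic, not fenced code: the function name,
the is_quorate and fence callables, and the inquorate_delay parameter
are all hypothetical stand-ins for whatever fenced would actually use.

```python
import time
from typing import Callable


def fence_with_quorum_delay(
    is_quorate: Callable[[], bool],
    fence: Callable[[bool], bool],
    inquorate_delay: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Fence at once when quorate; otherwise wait, then force the CCS read.

    `fence(force)` is a hypothetical callable that performs the fencing
    and takes a flag saying whether the CCS read should be forced.
    """
    if is_quorate():
        # Quorate partition: a normal (unforced) CCS read is allowed.
        return fence(False)
    # Inquorate partition: give a quorate partition, if one exists,
    # a head start before we force our way into CCS.
    sleep(inquorate_delay)
    return fence(True)
```

The sleep function is passed in so that the delay can be replaced in
tests; a real daemon would simply use the default.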
This still might not be enough. I don't know when fencing actions
take place. In the event of multiple node failures, how are the nodes
fenced? If they are fenced sequentially, when is the second node
fenced? If it's upon success of the first node getting fenced, the
second node will be fenced without problem. If the second node is
fenced upon recovery of the entire service manager stack, then quorum
needs to be reestablished before the second node can be fenced. This
is something that is not doable if more than one node is required to
rejoin the cluster to reestablish quorum.
See also, comment #4 in bugzilla #138858.
bug #138858 comment #4 would indicate that this is a design flaw.
There needs to be a way to automatically recover from failures where
quorum is lost. The critical step that is missing in Dave's
explanation of how things work in his comment is just before "- X is
brought back into the cluster". Without a step that can reset a node
when quorum is lost, the cluster is dead in the water until an
administrator intervenes.
Sounds like we might want some kind of "node resetting system". I'm
not sure how to define this system just yet but have some ideas. The
fencing system was specifically designed _not_ to be a node resetting
system, although some of the agents could be used by such a thing.
The fencing system could coexist quite easily with a node resetting
system if one was created.
[Note that the original "expected results" above would be quite bad.
A single node with a failed network would fence all the remaining
nodes in an otherwise perfectly running cluster. Quorum is precisely
the thing you need to make these fencing decisions.]
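To make the point about quorum concrete, here is a minimal sketch
assuming a simple-majority rule (more than half of the expected votes).
The function names are illustrative only and are not taken from the
cluster code.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest vote count that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def is_quorate(votes_present: int, expected_votes: int) -> bool:
    """True when the partition holding votes_present has a majority."""
    return votes_present >= quorum_threshold(expected_votes)


def may_fence(votes_present: int, expected_votes: int) -> bool:
    """Under the design described above, only a quorate partition fences."""
    return is_quorate(votes_present, expected_votes)
```

In a five-node cluster with one vote per node, a node that loses its
network sees only its own vote, so may_fence(1, 5) is False, while the
four nodes that can still see each other have may_fence(4, 5) True.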
Obviously watchdog/hangcheck timers reduce the likelihood that nodes
get themselves into a hung and unreset state.
I'm not sure what state this should be in, but it's not a bug.
Maybe a feature request for a node resetting system.
your statement in comment #3 is wrong... either that or I'm just not
understanding how fencing works in the new design.
In the event of a failed network, fencing would never (or at least
should never) succeed with our current agents:
1. All network power fencing agents require a network. Failure to
communicate with the network results in a failure to successfully
fence. The result is that the daemon should retry until it succeeds.
2. SAN fencing agents require a network; see the results above.
3. manual fencing agents require either a network or manual
intervention. If the user screws up, it's their own fault.
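The "retry until it succeeds" behaviour in item 1 could look something
like the sketch below. It is an assumption-laden illustration: the
run_agent callable (returning the agent's exit status), retry_delay and
max_attempts are hypothetical names, and the optional attempt limit
exists only so the loop can be bounded in tests.

```python
import time
from typing import Callable, Optional


def fence_until_success(
    run_agent: Callable[[], int],
    retry_delay: float = 1.0,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run the fencing agent until it exits 0; return the attempt count."""
    attempts = 0
    while True:
        attempts += 1
        if run_agent() == 0:
            # Exit status 0 means the agent reported a successful fence.
            return attempts
        if max_attempts is not None and attempts >= max_attempts:
            raise RuntimeError(f"fencing failed after {attempts} attempts")
        # The agent could not reach the fencing device; wait and retry.
        sleep(retry_delay)
```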
In the event that nodes are still able to communicate with the fencing
device, but not the other nodes, the locks protect the nodes as new
lock operations should not occur unless the cluster is quorate. If
power fencing is used, the nodes will more than likely be able to
recover from the failure, although extra fencing operations might
occur. Isn't this why we have always pushed power fencing? This is
how fencing has worked in our production releases. Changes to this
behavior are going to be perceived as a bug, especially if user
intervention now becomes required on systems that were once fully
The example isn't really the main point, but the failed network may be
only momentary and there are a variety of failure cases that could be
used. The real point is that the fencing design is based on new and
different concepts than those we're familiar with. These new concepts
and effects are found throughout the new infrastructure and definitely
require some serious adjustment in thinking and understanding. All
this is inherited from the fundamental change from single-point
server-based management to symmetric, fully-distributed management
which requires doing things very differently.
If someone views the fencing system as something that always resets
a node as soon as it fails, they will need to adjust their
expectations -- that's not the definition of fencing. If it's
something they _want_ (as you say, people have appreciated taking
advantage of this effect in the past when using power fencing) we
should definitely find a way to do what they want. That's why I
suggested a "node resetting system" which shouldn't be difficult
to write up (or recommend watchdogs). I want to make the users
happy, we just need to find the correct way to do it.
I should explain in depth about how and why the architecture is
designed this way and how that's related to the specific effects we
do or don't see. Much of it is in the SCA document, but the emphasis
there is on fundamental concepts and inner workings and only touches
on outward effects a user will notice.
The fencing concepts have not changed. When a node transitions to an
unknown state, it is isolated from the storage to prevent corruption.
What has changed is the implementation of when the fencing occurs.
Moving to ASSIGNED to look into this.
Users needing this feature can continue to use gulm. Alternatively,
they could add weight to nodes in such a manner that gulm's fencing
behavior is emulated.
Corey has brought it to my attention that I'm wrong in comment #8.
I forgot that the original point of the bug is that fence_node (and
fenced) can not access the information in ccs to fence a node unless
ccs is quorate.
In the event that gulm loses quorum, it will no longer be able to
fence nodes since fence_node will fail due to its failure to read
ccs. If gulm should lose quorum because a server fails, that server
must be fenced before it can log back in and reestablish quorum.
Right now, there is no way to do this cleanly and it will cause gulm
clusters to hang. In the event that a server failure causes gulm to
lose quorum, the following steps need to be taken:
# 1. Make sure that all failed expired nodes have been rebooted
# (this will require manual operator action) Only after the nodes
# are in a working good state should we proceed to the next step
# 2. The arbitrating node will more than likely have pending
# fence_actions since ccsd is no longer able to serve fencing
# 3. on the arbitrating node, rename /sbin/fence_node:
mv /sbin/fence_node /sbin/fence_node.orig
# 4. make symlink to /bin/true so that when lock_gulmd calls
# fence_node, it succeeds:
ln -s /bin/true /sbin/fence_node
# 5. kill any remaining fence_node processes. lock_gulmd will exec()
# any fence operation that failed (reported non-zero exit status)
# 6. make sure that all servers have been fenced by looking at the
# gulm_tool nodelist
gulm_tool nodelist localhost
# 7. once all servers have been fenced, remove the /sbin/fence_node
# symlink and rename the backup fence_node to /sbin/fence_node
mv /sbin/fence_node.orig /sbin/fence_node
# 8. start lock_gulmd on the newly fenced nodes
This workaround is a gross kludge. One of two things will probably
need to be done:
1. Make fence_node access ccsd with the force flag
2. Add a gulm_tool command to lock_gulmd that allows a node to
acknowledge that a node has been successfully fenced even though
fence_node wasn't able to report success.
( 3. A third option that is quite impractical due to the code change
required is to allow gulm to log in when it is in the expired state. This
would introduce a major change in critical gulm code paths,
introducing a very high degree of risk. So high that I think it
shouldn't even be considered a viable option)
(4. make the above fence_node symlink hack the preferred method... so
ugly it's laughable)
Option #2 is quite dangerous in that we create a mechanism for the
user to override a failing fencing operation. My guess is that most
users who encounter a failed fence will not think that the reason it
happened was their fault, and will therefore try to override the
fencing operation, leading to corruption.
I think that leaves us with option #1 as being the safest option to
I'm adding this back on the blocker list because the current state of
things is totally unacceptable. Moving to NEEDINFO because we need to
decide how to go about fixing this.
(Changing Priority and Severity to be HIGH so that it shows up as a
different color in the query results page ;)
fence_node now has the capability to fence without the cluster being
quorate. It does, however, require extra args:
prompt> fence_node -O -c <cluster_name> node_name
The idea of adding -O to have fence_node force a connect to ccs is
fine. I'm not keen on the implementation, though, so I've redone
it cleaner and simpler.
In my new version I've removed the cluster_name arg and removed
all the convoluted connecting/checking fence_node was doing. If
any of that is _really_ necessary I need to hear specifically what
the problem is it's solving and then we'll handle it from there.
The reason the cluster name is required is that when CCS is used with the force option, the
cluster.conf file may (1) not exist or (2) be incorrect. If the cluster is non-quorate, ccsd
will broadcast for a cluster.conf - which, if the above conditions exist, may result in
getting a cluster.conf from a different cluster. Giving a cluster name ensures you don't get
a cluster.conf for a different cluster. This is desirable especially in the case where
customers are vetting their fencing setup initially.
The convoluted connecting/checking code is there to allow good error reporting in the
event that (1) users specify a cluster name that is different from the machine's current
view, (2) ccsd has not been started, or (3) the cluster is non-quorate and the user is not
using the override flag.
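A rough sketch of the three checks listed above, for illustration only.
It assumes the cluster name is the name attribute of the top-level
cluster element in cluster.conf; the function names and the error
strings are hypothetical and are not the actual fence_node code.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def conf_cluster_name(conf_xml: str) -> Optional[str]:
    """Return the name attribute of the top-level <cluster> element."""
    root = ET.fromstring(conf_xml)
    return root.get("name") if root.tag == "cluster" else None


def check_fence_request(
    expected_name: Optional[str],
    conf_xml: str,
    ccsd_running: bool,
    quorate: bool,
    override: bool,
) -> Optional[str]:
    """Return an error message for the first failed check, or None."""
    if not ccsd_running:
        return "ccsd has not been started"
    if not quorate and not override:
        return "cluster is not quorate and the override flag was not given"
    if expected_name is not None:
        actual = conf_cluster_name(conf_xml)
        if actual != expected_name:
            # Guards against acting on a cluster.conf from another cluster.
            return f"cluster name mismatch: expected {expected_name!r}, found {actual!r}"
    return None
```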
Originally, I had intended to just have '-c <cluster_name>', as its presence would allow a
force connect (to the correct cluster). In essence, the force is implied. However, in
discussions with Adam, we determined that it would be more useful to the users to have a
-O option. The combination of options will lead the user to a more complete
understanding of who a cluster node is allowed to fence and when, and with good error
reporting, the user will easily figure out what needs to be done.
I'm sure Adam can provide his reasons for this requirement as well.
I'm most interested in learning the specific scenarios the bug
was raised over, i.e. gulm cluster loses quorum and needs to
run fence_node. If we start there I think we'll get a clearer
picture of what the real problem is we're solving -- e.g. I don't
think it's reasonable to consider that a cluster.conf file won't
exist on a node where lock_gulmd is running, has lost quorum,
and now needs to fence someone.
If there are other fence_node usages we want to consider, that's
fine, but let's raise them as a new RFE.
without --clustername to fence_node the fencing system is unreliable
gulm needs to be able to have a fence operation complete whenever it
gulm does not require quorum to fence
fence_node looks up node information in ccs to fence the node
ccs requires quorum to let fence_node complete
problem: ccs's requirement for quorum breaks gulm's requirement for
fence_node to complete regardless of cluster state.
solution: add option to fence_node to allow forceful read of ccs.
why does fence_node need the clustername? because cluster may not be
quorate when fencing operation is required. forcing ccs to connect
without a cluster name means that it can grab a new cluster.conf from
a quorate cluster, correct? If so... that's exactly the behavior we
are trying to prevent by including the --clustername parameter.
I believe that the parameters for fence_node should look as follows:
fence_node -O -c clustername nodename
-O is needed to reinforce, in the minds of people calling fence_node,
that it is forcing a read from ccs. Implicitly defining that "-c
clustername" forces a read may be confusing and dangerous. Hence why I
thought -O should be required. That then begs the question: is there
any value in "fence_node -c clustername nodename"? I think that the
answer is yes in that, if it doesn't try to force a read from ccs, it
can be useful in determining some common error conditions.
clustername is required to prevent ccs grabbing a new cluster.conf
file from the network when it is not quorate.
ccsd is already running with a cluster.conf
lock_gulmd has already connected to ccsd and probably
has read some information from cluster.conf
quorum is lost
lock_gulmd calls fence_node which also needs to connect
to ccsd to get info from cluster.conf
fence_node -O uses ccs_force_connect to connect despite
the lack of quorum
are you saying that at this point ccsd may go out and look
for a new cluster.conf file? and that ccsd may grab a new
cluster.conf file from an entirely different cluster if we
don't include a clustername arg? (Even though there's an
existing cluster.conf with a clustername in it?)
- If not, then it seems we don't need a clustername arg, we just
need the -O to cause a force_connect (which we now have.)
- If so, then I believe we need to step back and look at the
situation more closely, because I don't think that's what we want.
Are we going to be patching over a deeper problem by adding a
clustername arg to fence_node instead of tackling the real issue?
On the surface, it looks to me like this may call for at least two
variations of ccs_connect:
1. force connect and ignore quorum requirements, but nothing more.
this is probably all we want for this bug
2. force connect, ignore quorum requirements and look for a new
cluster.conf. This is what we want when cman_tool starts up the
cluster manager and it makes sense to provide an optional
3. force connect, ignore all ccs logic and give me direct access to
whatever exists in the local cluster.conf file. I've asked for
this before as it would be _extremely_ helpful for testing/
debugging. This would also probably work for fixing this bug.
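The three variations can be summarised as a small table of behaviours.
The sketch below is purely illustrative: ForceMode, ConnectPolicy and
policy_for are hypothetical names and do not correspond to any existing
ccs interface.

```python
from dataclasses import dataclass
from enum import Enum


class ForceMode(Enum):
    IGNORE_QUORUM = 1              # variation 1
    IGNORE_QUORUM_AND_REFRESH = 2  # variation 2
    LOCAL_FILE_ONLY = 3            # variation 3


@dataclass(frozen=True)
class ConnectPolicy:
    check_quorum: bool           # refuse the connection when inquorate?
    may_fetch_remote_conf: bool  # may ccsd look for a new cluster.conf?
    use_ccs_logic: bool          # go through normal ccs processing?


POLICIES = {
    ForceMode.IGNORE_QUORUM: ConnectPolicy(False, False, True),
    ForceMode.IGNORE_QUORUM_AND_REFRESH: ConnectPolicy(False, True, True),
    ForceMode.LOCAL_FILE_ONLY: ConnectPolicy(False, False, False),
}


def policy_for(mode: ForceMode) -> ConnectPolicy:
    """Look up the behaviour implied by a force-connect variation."""
    return POLICIES[mode]
```

Laid out this way, the only variation that may fetch a cluster.conf from
the network is the second one, which is the behaviour the clustername
argument was meant to guard against.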
Code has been checked into CVS that addresses the issue where gulm
can't fence a node because the cluster is lacking quorum.
Technically, this can be moved to MODIFIED. However...
If it is moved to MODIFIED without adding the option to add the
cluster name to fence_node, a new bug will need to be opened
addressing the case where quorum is lost and the cluster.conf is in an
invalid state. An invalid cluster.conf might arise when a user
realizes that the cluster.conf file contains incorrect (yet valid)
information and, while trying to fix it, corrupts the file (something
that shouldn't happen if the gui is used, but still a possibility that
needs addressing). In
such a case, a differing cluster.conf is possible. If for some reason
the node is also listed in that cluster.conf, bad things could arise.
> 3. force connect, ignore all ccs logic and give me direct access to
> whatever exists in the local cluster.conf file. I've asked for
> this before as it would be _extremely_ helpful for testing/
> debugging. This would also probably work for fixing this bug.
I agree that this is something that would be useful; a new RFE should
be opened for it. In which case, this bug should be left in ASSIGNED
and set to depend on the RFE.
No matter how this bug is resolved, it should not be possible for the
cluster.conf to be overwritten, regardless of what the user has done
to the cluster.conf file. As it is now, the code in CVS has no
protection against this sort of error, which is why I reopened it to
ASSIGNED in the first place. I'll leave it up to Jon and Dave to
decide on how they want to move forward with this bug.
The problem with gulm calling fence_node should be fixed. I
think this has already been verified.
If there are problems with updating cluster.conf in a running cluster
they should be new bugs, probably opened against ccs or the gui.