Red Hat Bugzilla – Bug 183430
fenced will try to fence a node forever if there is a problem fencing
Last modified: 2009-04-16 16:11:27 EDT
Description of problem:
If your fence script is broken or your fence device is unreachable when fenced
goes to fence a node, fenced will loop forever trying to fence that node. If
the fence device is still down or the script is still broken when the failed
node reboots, that node will be unable to join the fence domain, because the
fencing node is still stuck in its loop.
Version-Release number of selected component (if applicable):
I wrote my patch against fence-1.32.6
Steps to Reproduce:
1. Get a script that does not work, or take your fence device off line
2. Cause one of your nodes to be fenced (e.g. by unplugging its network cable)
The fencing node will loop trying to fence that node.
This patch allows the user to set a limit on the number of times the fencing
node will fail to fence a node before simply removing that node from its list.
Created attachment 125432 [details]
This patch allows the user to set the number of times fenced will try to fence a node
This patch adds a fence_fail_count option, which if set, will let fenced try to
fence a particular node fence_fail_count times before it simply removes the
node from its node list. By default this is zero which leaves the behavior as
Hmm, I thought I fixed this already. Instead of count = 0, it should be count = 1. When I get in tomorrow
I'll repost with the proper value.
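For illustration only, the retry limit described for the patch could look roughly like the Python sketch below. This is not the patch itself (fenced is written in C). The names fence_with_limit and run_agent are hypothetical; only the fence_fail_count option name comes from this report. Zero is treated as "no limit", on the assumption that this is what "by default does nothing" means.

```python
from typing import Callable


def fence_with_limit(
    node: str,
    run_agent: Callable[[str], bool],
    fence_fail_count: int = 0,
) -> str:
    """Retry fencing `node` until it succeeds or the failure limit is reached.

    run_agent is a hypothetical callback that returns True when the fence
    agent succeeded. fence_fail_count == 0 means "no limit" (retry forever),
    which is the assumed default behavior.
    """
    failures = 0
    while True:
        if run_agent(node):
            return "fenced"
        failures += 1
        # Give up only when a positive limit was configured and reached.
        if fence_fail_count > 0 and failures >= fence_fail_count:
            return "gave_up"
```

A real daemon would pause between attempts and log each failure; both are left out here to keep the sketch short. Note that "gave_up" means the node was dropped without being fenced, which is exactly the safety concern raised later in this thread.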
This is a problem other people have had, too.
The problem with the patch is that it's not safe to go on if the
node really does need to be fenced. You need a human to confirm that
fencing really can be skipped for the given node and then run a
command to tell fenced to give up.
If fenced is stuck trying to fence node1 for whatever reason and a
human recognizes that node1 is off or has been properly reset, then
the human can run a command like "fence_tool skip node1" to cause
fenced to give up and assume node1 doesn't need to be fenced.
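As a rough sketch of that idea (in Python rather than fenced's C, and not an existing interface), the loop would keep retrying until either the fence agent succeeds or an operator has explicitly marked the node as safe to skip. The names fence_until_done_or_skipped, run_agent and skip_requested are hypothetical; the "fence_tool skip" command above is likewise only a suggestion at this point.

```python
from typing import Callable


def fence_until_done_or_skipped(
    node: str,
    run_agent: Callable[[str], bool],
    skip_requested: Callable[[str], bool],
) -> str:
    """Retry fencing `node` until it succeeds or a human says to give up.

    run_agent is a hypothetical callback returning True on a successful
    fence. skip_requested is a hypothetical callback returning True once an
    operator has confirmed that the node does not need to be fenced.
    """
    while True:
        # The operator's decision is checked before every attempt, so the
        # loop never gives up on its own.
        if skip_requested(node):
            return "skipped"
        if run_agent(node):
            return "fenced"
```

The difference from a retry counter is that nothing here times out: the only way to leave the loop without a successful fence is an explicit human confirmation.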
A separate issue is if some nodes have been shut down and the admin
wants to run the cluster without them for some time. You don't want
the cluster to be fencing these nodes when it starts up. The proper
thing to do would be to remove unused nodes from cluster.conf, but
that can be a pain. There may be a nicer way of telling the system
to ignore these unused nodes... not sure what that might be yet.
I'm not saying this is a production-worthy option to have, but for drastic
situations where you can't get your cluster to come back up, or if you know the
fencing script does not work and you are waiting on support to get it fixed, it's
a good thing to drop in there temporarily, kind of like the clean_start option.
This is why I made it just an option that by default does nothing, so
everything works the way it normally does, and it only comes into play in the
extreme case where it's needed.
What is the main problem you're trying to work around?
Is it a situation where a new cluster forms and tries to fence nodes
that are shut down (i.e. they haven't and won't join the cluster)?
If that's the case, then you can use the clean start option by doing
'fence_tool join -c' and that should work around the problem for now.
Doing this causes fenced to not even try fencing the nodes that are
Or is this a situation where cluster members fail and fenced tries
to fence them, but a script or fence device is broken so fenced gets
into a loop failing to fence? If that's the case, then I still have
the concern described above and suggest we develop a command that
a user can run to tell fenced to quit trying.
Oh, I missed that suggestion. OK, well, I guess that covers all the situations
where this pops up, so there's no need for this patch and I won't bother
reposting with the fixes. Thanks much, Dave.
Um, from the comments above, I believe this behavior is needed. I am closing
this as notabug -- please feel free to reopen if you disagree.