Bug 183430 - fenced will try to fence a node forever if there is a problem fencing
Status: CLOSED NOTABUG
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: fence
Version: 4
Hardware: All Linux
Priority: medium
Severity: medium
Assigned To: Jim Parsons
QA Contact: Cluster QE
Reported: 2006-02-28 19:26 EST by Josef Bacik
Modified: 2009-04-16 16:11 EDT
Last Closed: 2006-04-25 08:02:58 EDT

Attachments:
This patch allows the user to set the number of times fenced will try to fence a node (4.18 KB, patch)
2006-02-28 19:29 EST, Josef Bacik

Description Josef Bacik 2006-02-28 19:26:42 EST
Description of problem:
If your fence script is broken, or your fence device is unreachable, and fenced
tries to fence a node, fenced will loop forever trying to fence that node. When
the failed node reboots, if the fence device is still down or the script is
still not working, the node cannot join the fence domain because the fencing
node is still stuck in its loop.

Version-Release number of selected component (if applicable):
I wrote my patch against fence-1.32.6


How reproducible:
Every time

Steps to Reproduce:
1. Get a fence script that does not work, or take your fence device offline
2. Cause one of your nodes to be fenced (e.g. by unplugging its network cable)
  
Actual results:
The fencing node will loop trying to fence that node

Expected results:
This patch allows the user to set a limit on the number of times the fencing
node will fail to fence a node before it simply removes that node from its list.

Additional info:
Comment 1 Josef Bacik 2006-02-28 19:29:24 EST
Created attachment 125432
This patch allows the user to set the number of times fenced will try to fence a node

This patch adds a fence_fail_count option which, if set, lets fenced try to
fence a particular node fence_fail_count times before it simply removes the
node from its node list.  By default this is zero, which leaves the existing
behavior unchanged.
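
As a rough illustration of the behavior described above (this is not the attached patch, which is written against fenced itself), the bounded retry can be sketched in Python. The function name fence_with_limit and the agent callable are hypothetical; only the fence_fail_count option name comes from this report.

```python
from typing import Callable


def fence_with_limit(node: str,
                     agent: Callable[[str], bool],
                     fence_fail_count: int = 0) -> bool:
    """Try to fence `node` using `agent` (hypothetical callable that
    returns True on success).

    fence_fail_count == 0 keeps the existing behavior: retry forever.
    fence_fail_count  > 0 gives up after that many failed attempts.
    Returns True if the node was fenced, False if the caller should
    drop the node from its list without fencing it.
    """
    failures = 0
    while True:
        if agent(node):
            return True
        failures += 1
        if fence_fail_count > 0 and failures >= fence_fail_count:
            return False


if __name__ == "__main__":
    # An agent that never succeeds, standing in for a broken fence script.
    def broken_agent(node: str) -> bool:
        return False

    print(fence_with_limit("node1", broken_agent, fence_fail_count=3))
```

Note that a False return here means the node was never confirmed fenced, which is the safety concern raised later in this thread.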
Comment 2 Josef Bacik 2006-02-28 21:02:06 EST
Hmm, I thought I fixed this already.  Instead of count = 0, it should be count = 1.  When I get in tomorrow
I'll repost with the proper value.
Comment 3 David Teigland 2006-03-01 11:55:03 EST
This is a problem other people have had, too.

The problem with the patch is that it's not safe to go on if the
node really does need to be fenced.  You need a human to confirm that
fencing really can be skipped for the given node and then run a
command to tell fenced to give up.

If fenced is stuck trying to fence node1 for whatever reason and a
human recognizes that node1 is off or has been properly reset, then
the human can run a command like "fence_tool skip node1" to cause
fenced to give up and assume node1 doesn't need to be fenced.
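
A minimal sketch of that idea, assuming a single-threaded model: fenced keeps retrying until either the fence agent succeeds or an operator has explicitly marked the node as safe to skip. The class FenceRetry, its methods, and the agent callable are all hypothetical; "fence_tool skip" is only the command proposed above, not an existing one.

```python
from typing import Callable, Set


class FenceRetry:
    """Retry fencing until success or an explicit operator override."""

    def __init__(self, agent: Callable[[str], bool]) -> None:
        self.agent = agent            # hypothetical: returns True on success
        self._skip: Set[str] = set()  # nodes a human has confirmed are safe

    def skip(self, node: str) -> None:
        # What a hypothetical "fence_tool skip <node>" would trigger.
        self._skip.add(node)

    def fence(self, node: str) -> str:
        while True:
            # The override is checked before every attempt, so fenced only
            # gives up when a human has said so, never on its own.
            if node in self._skip:
                self._skip.discard(node)
                return "skipped"
            if self.agent(node):
                return "fenced"
```

The design point is that the decision to stop comes from a person who has verified the node's state, rather than from a counter.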

A separate issue is if some nodes have been shut down and the admin
wants to run the cluster without them for some time.  You don't want
the cluster to be fencing these nodes when it starts up.  The proper
thing to do would be to remove unused nodes from cluster.conf, but
that can be a pain.  There may be a nicer way of telling the system
to ignore these unused nodes... not sure what that might be yet.
Comment 4 Josef Bacik 2006-03-01 12:05:40 EST
I'm not saying this is a production-worthy option to have, but for drastic
situations where you can't get your cluster to come back up, or if you know the
fencing script does not work and you are waiting on support to get it fixed, it's
a good thing to drop in there temporarily, kind of like the clean_start option.
This is why I made it an option that does nothing by default, so everything
works the way it normally does, and it only comes into play in the extreme case
where it's needed.
Comment 5 David Teigland 2006-03-01 12:59:07 EST
What is the main problem you're trying to work around?

Is it a situation where a new cluster forms and tries to fence nodes
that are shut down (i.e. they haven't and won't join the cluster)?
If that's the case, then you can use the clean start option by doing
'fence_tool join -c' and that should work around the problem for now.
Doing this causes fenced to not even try fencing the nodes that are
shut down.
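
The effect of a clean start can be sketched roughly as follows. This is a simplified model, not fenced's actual code: the function name and its arguments are hypothetical, and it assumes that startup fencing targets are the configured nodes that are not current cluster members.

```python
from typing import Iterable, List


def startup_fence_targets(configured_nodes: Iterable[str],
                          current_members: Iterable[str],
                          clean_start: bool = False) -> List[str]:
    """Return the nodes that would be fenced when the fence domain starts.

    With clean_start set (as with 'fence_tool join -c'), no startup
    fencing is attempted at all. Otherwise every configured node that
    is not a current member is a fencing target.
    """
    if clean_start:
        return []
    return sorted(set(configured_nodes) - set(current_members))
```

For example, with node1 as the only member of a three-node configuration, node2 and node3 would be targets without clean start, and nothing would be fenced with it.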

Or is this a situation where cluster members fail and fenced tries
to fence them, but a script or fence device is broken so fenced gets
into a loop failing to fence?  If that's the case, then I still have
the concern described above and suggest we develop a command that
a user can run to tell fenced to quit trying.
Comment 6 Josef Bacik 2006-03-01 13:17:19 EST
Oh, I missed that suggestion.  OK, well, I guess that covers all the situations
where this pops up, so there is no need for this patch and I won't bother
reposting with the fixes.  Thx much, Dave.
Comment 7 Jim Parsons 2006-04-25 08:02:58 EDT
Um, from the comments above, I believe this behavior is needed. I am closing
this as notabug -- please feel free to reopen if you disagree.
