Bug 183430 - fenced will try to fence a node forever if there is a problem fencing
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: fence
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Jim Parsons
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2006-03-01 00:26 UTC by Josef Bacik
Modified: 2009-04-16 20:11 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2006-04-25 12:02:58 UTC
Embargoed:


Attachments
This patch allows the user to set the number of times fenced will try to fence a node (4.18 KB, patch)
2006-03-01 00:29 UTC, Josef Bacik

Description Josef Bacik 2006-03-01 00:26:42 UTC
Description of problem:
If there is a problem with your fence script, or your fence device is not
reachable when fenced goes to fence something, fenced will loop forever trying
to fence that particular node. When you then reboot that node, if the fence
device has still not been brought up or the script is still not working, the
node will be unable to join the fence domain because the fencing node is still
stuck in its loop.

Version-Release number of selected component (if applicable):
I wrote my patch against fence-1.32.6


How reproducible:
Every time

Steps to Reproduce:
1. Use a fence script that does not work, or take your fence device offline
2. Cause one of your nodes to be fenced (e.g. by unplugging its network cable)
  
Actual results:
The fencing node loops forever trying to fence that node

Expected results:
This patch allows the user to set a limit on the number of times the fencing
node may fail to fence a node before it just removes that node from its list.

Additional info:

Comment 1 Josef Bacik 2006-03-01 00:29:24 UTC
Created attachment 125432
This patch allows the user to set the number of times fenced will try to fence a node

This patch adds a fence_fail_count option. If it is set, fenced will try to
fence a particular node fence_fail_count times before it simply removes the
node from its node list.  The default is zero, which leaves the existing
behavior (retry forever) unchanged.
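
For readers without the attachment, the retry limit described above can be
sketched roughly as follows. This is a minimal illustration and not the code
from the attached patch: fence_with_limit, always_fail and succeed_on_third are
hypothetical names, and the agent callback stands in for however fenced
actually invokes a fence agent. A limit of zero keeps the retry-forever
behavior.

```c
#include <stddef.h>

/* Hypothetical stand-in for running a fence agent: returns 0 on success. */
typedef int (*fence_agent_fn)(const char *node);

/*
 * Try to fence `node` until the agent succeeds.
 * fail_limit == 0 means retry forever (the existing behavior).
 * fail_limit  > 0 means give up after that many failed attempts.
 * Returns 0 if the node was fenced, -1 if we gave up.
 */
static int fence_with_limit(const char *node, fence_agent_fn agent,
                            int fail_limit)
{
    int failures = 0;

    for (;;) {
        if (agent(node) == 0)
            return 0;                 /* fenced successfully */

        failures++;
        if (fail_limit > 0 && failures >= fail_limit)
            return -1;                /* caller drops node from its list */
    }
}

/* Illustrative agents only: one that never works, one that works on try 3. */
static int attempts_seen = 0;

static int always_fail(const char *node)
{
    (void)node;
    return 1;
}

static int succeed_on_third(const char *node)
{
    (void)node;
    attempts_seen++;
    return attempts_seen >= 3 ? 0 : 1;
}
```

Note that giving up here only stops the loop; it says nothing about whether
the node is actually safe to ignore.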

Comment 2 Josef Bacik 2006-03-01 02:02:06 UTC
Hmm, I thought I fixed this already.  Instead of count = 0, it should be
count = 1.  When I get in tomorrow I'll repost with the proper value.

Comment 3 David Teigland 2006-03-01 16:55:03 UTC
This is a problem other people have had, too.

The problem with the patch is that it's not safe to go on if the
node really does need to be fenced.  You need a human to confirm that
fencing really can be skipped for the given node and then run a
command to tell fenced to give up.

If fenced is stuck trying to fence node1 for whatever reason and a
human recognizes that node1 is off or has been properly reset, then
the human can run a command like "fence_tool skip node1" to cause
fenced to give up and assume node1 doesn't need to be fenced.
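
The manual override suggested above could be modeled along these lines. This
is only a sketch under stated assumptions: "fence_tool skip" is a proposed
command in this comment, not an existing one, and struct victim, fence_victim
and the callbacks below are hypothetical names rather than fenced internals.
The point is that the loop ends only on a successful fence or an explicit
human confirmation, never on a failure count.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record for a node that fenced is trying to fence. */
struct victim {
    const char *name;
    bool skip;        /* set once an admin confirms the node is safely down */
};

enum fence_result { FENCED, SKIPPED_BY_ADMIN };

/*
 * Keep trying to fence `v` until the agent succeeds or an admin has asked
 * for the node to be skipped. `poll_admin` is called between attempts so a
 * pending skip request can be noticed; it may set v->skip.
 */
static enum fence_result fence_victim(struct victim *v,
                                      int (*agent)(const char *),
                                      void (*poll_admin)(struct victim *))
{
    for (;;) {
        if (v->skip)
            return SKIPPED_BY_ADMIN;  /* human took responsibility */
        if (agent(v->name) == 0)
            return FENCED;
        if (poll_admin != NULL)
            poll_admin(v);
    }
}

/* Illustrative stubs: a broken agent, a working agent, and an admin who
 * confirms on the second poll. */
static int polls = 0;

static int broken_agent(const char *node)
{
    (void)node;
    return 1;
}

static int working_agent(const char *node)
{
    (void)node;
    return 0;
}

static void admin_confirms_on_second_poll(struct victim *v)
{
    polls++;
    if (polls >= 2)
        v->skip = true;
}
```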

A separate issue is if some nodes have been shut down and the admin
wants to run the cluster without them for some time.  You don't want
the cluster to be fencing these nodes when it starts up.  The proper
thing to do would be to remove unused nodes from cluster.conf, but
that can be a pain.  There may be a nicer way of telling the system
to ignore these unused nodes... not sure what that might be yet.


Comment 4 Josef Bacik 2006-03-01 17:05:40 UTC
I'm not saying this is a production-worthy option to have, but for drastic
situations where you can't get your cluster to come back up, or if you know the
fencing script does not work and you are waiting on support to get it fixed,
it's a good thing to drop in there temporarily, kind of like the clean_start
option.  This is why I made it just an option that by default does nothing, so
everything works the way it normally does, and it only comes into play in the
extreme case where it's needed.

Comment 5 David Teigland 2006-03-01 17:59:07 UTC
What is the main problem you're trying to work around?

Is it a situation where a new cluster forms and tries to fence nodes
that are shut down (i.e. they haven't and won't join the cluster)?
If that's the case, then you can use the clean start option by doing
'fence_tool join -c' and that should work around the problem for now.
Doing this causes fenced to not even try fencing the nodes that are
shut down.

Or is this a situation where cluster members fail and fenced tries
to fence them, but a script or fence device is broken so fenced gets
into a loop failing to fence?  If that's the case, then I still have
the concern described above and suggest we develop a command that
a user can run to tell fenced to quit trying.


Comment 6 Josef Bacik 2006-03-01 18:17:19 UTC
Oh, I missed that suggestion.  OK, well, I guess that covers all the situations
where this pops up, so there's no need for this patch and I won't bother
reposting with the fixes.  Thanks much, Dave.

Comment 7 Jim Parsons 2006-04-25 12:02:58 UTC
Um, from the comments above, I believe this behavior is needed. I am closing
this as notabug -- please feel free to reopen if you disagree.

