Description of problem:
If there is a problem with your fence script, or your fence device is not reachable when fenced goes to fence something, fenced will loop forever trying to fence that particular node. When you then reboot, if the fence device has not been brought back up, or if the script is still broken, the node will be unable to join the fence domain because the fencing node is still stuck in its loop.

Version-Release number of selected component (if applicable):
I wrote my patch against fence-1.32.6

How reproducible:
Every time

Steps to Reproduce:
1. Use a fence script that does not work, or take your fence device offline
2. Cause one of your nodes to be fenced (e.g. by unplugging its network cable)

Actual results:
The fencing node will loop forever trying to fence that node

Expected results:
This patch lets the user set a limit on the number of times fencing can fail for a node before fenced simply removes that node from its list.

Additional info:
Created attachment 125432 [details]
This patch allows the user to set the number of times fenced will try to fence a node

This patch adds a fence_fail_count option which, if set, lets fenced try to fence a particular node fence_fail_count times before it simply removes the node from its node list. By default the option is zero, which leaves the behavior unchanged.
hmm, i thought i fixed this already. instead of count = 0, it should be count = 1. when i get in tomorrow i'll repost with the proper value.
This is a problem other people have had, too. The problem with the patch is that it's not safe to go on if the node really does need to be fenced. You need a human to confirm that fencing really can be skipped for the given node, and then run a command to tell fenced to give up.

If fenced is stuck trying to fence node1 for whatever reason, and a human recognizes that node1 is off or has been properly reset, then the human can run a command like "fence_tool skip node1" to cause fenced to give up and assume node1 doesn't need to be fenced.

A separate issue is when some nodes have been shut down and the admin wants to run the cluster without them for some time. You don't want the cluster to be fencing those nodes when it starts up. The proper thing to do would be to remove the unused nodes from cluster.conf, but that can be a pain. There may be a nicer way of telling the system to ignore these unused nodes... not sure what that might be yet.
I'm not saying this is a production-worthy option to have, but for drastic situations where you can't get your cluster to come back up, or where you know the fencing script does not work and you are waiting on support to get it fixed, it's a good thing to drop in temporarily, kind of like the clean_start option. That's why I made it an option that by default does nothing: everything works the way it normally does, and it only comes into play in the extreme case where it's needed.
What is the main problem you're trying to work around?

Is it a situation where a new cluster forms and tries to fence nodes that are shut down (i.e. they haven't and won't join the cluster)? If that's the case, then you can use the clean start option by doing 'fence_tool join -c', and that should work around the problem for now. Doing this causes fenced to not even try fencing the nodes that are shut down.

Or is this a situation where cluster members fail and fenced tries to fence them, but a script or fence device is broken, so fenced gets into a loop failing to fence? If that's the case, then I still have the concern described above and suggest we develop a command that a user can run to tell fenced to quit trying.
Oh, I missed that suggestion. OK, I guess that covers all the situations where this pops up, so there's no need for this patch; I won't bother reposting with the fixes. Thanks much, Dave.
From the comments above, I believe the current behavior is needed (fenced should not give up on its own), so I am closing this as NOTABUG. Please feel free to reopen if you disagree.