Description of problem:
In RHEL5, we introduced a socket whereby administrators could issue commands to
unstick cluster nodes where fencing has failed. It looks like this:
echo "nodename.mydomain.com" > /var/run/cluster/fenced_override
The problem is that this is highly timing-dependent: an administrator must
issue the command within the 5-second fence retry window.
In the head branch of CVS, fence_ack_manual is a script which waits for
/var/run/cluster/fenced_override to exist before issuing the command.
fence_ack_manual in the RHEL5 branch should also be able to do this. This will
enable administrators to fix broken clusters with less difficulty.
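For illustration only, the wait-then-write behavior described above could be sketched in shell roughly as follows. This is not the actual fence_ack_manual code from CVS. The function name wait_and_override, the OVERRIDE_PATH variable, the one-second polling interval and the timeout are all assumptions made for this sketch. Only the default path comes from this report.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the real fence_ack_manual implementation.
# OVERRIDE_PATH is an assumed variable; its default is the path from this report.
OVERRIDE_PATH="${OVERRIDE_PATH:-/var/run/cluster/fenced_override}"

# wait_and_override NODE [TIMEOUT_SECONDS]
# Polls once per second until OVERRIDE_PATH exists, then writes NODE to it.
# Returns 0 once the node name has been written, 1 on timeout.
wait_and_override() {
    node="$1"
    timeout="${2:-300}"   # assumed default; pick what suits your cluster
    waited=0
    while [ ! -e "$OVERRIDE_PATH" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for $OVERRIDE_PATH" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # Same write as the manual "echo" method, minus the guesswork on timing.
    echo "$node" > "$OVERRIDE_PATH"
}
```

Something like `wait_and_override nodename.mydomain.com 60` would then replace the hand-timed echo, assuming the override path only appears while fenced is waiting for an override.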
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update release.
Created attachment 289271
Makes fence_ack_manual work as override (needs -e flag)
Worked for me.
Dec 14 11:53:53 molly fenced: frederick not a cluster member after 6 sec
Dec 14 11:53:53 molly fenced: fencing node "frederick"
Dec 14 11:53:53 molly fenced: fence "frederick" failed
Dec 14 11:53:54 molly fenced: fence "frederick" overridden by administrator
Patch in CVS
Checking in agents/manual/Makefile;
/cvs/cluster/cluster/fence/agents/manual/Makefile,v <-- Makefile
new revision: 220.127.116.11; previous revision: 1.7
Checking in agents/manual/ack.c;
/cvs/cluster/cluster/fence/agents/manual/Attic/ack.c,v <-- ack.c
new revision: 18.104.22.168; previous revision: 1.3
I have tested this using the following command with cman-2.0.81:
fence_ack_manual -e -n frederick
It works as expected: fence_ack_manual waits for fencing to fail (I simply did
a 'reboot -fn' with the fencing device disabled) and then issues the override
for us. It's far easier to use than timing the "echo" method.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.