> Description of problem:
The /var/run/cluster/fenced_override FIFO is removed during fence attempts. To manually acknowledge a fence request, one must do so *after* a fence attempt fails and *before* the next attempt is initiated.

> Version-Release number of selected component (if applicable):
cman-2.0.115-85.el5_7.2

> How reproducible:
Always.

> Steps to Reproduce:
1. Disable the fencing mechanism in node A.
2. Remove the network or shut down node A.
3. Allow node B to attempt to fence node A and fail.

> Actual results:
The FIFO appears and shortly afterwards disappears again:

[root@cl5764a ~]# ls -l /var/run/cluster/
total 4
-rw-r--r-- 1 root root 5 Dec 18 23:48 ccsd.pid
srwx------ 1 root root 0 Dec 18 23:48 ccsd.sock
srw-rw---- 1 root root 0 Dec 18 23:48 rgmanager.sk

[root@cl5764a ~]# ls -l /var/run/cluster/
total 4
-rw-r--r-- 1 root root 5 Dec 18 23:48 ccsd.pid
srwx------ 1 root root 0 Dec 18 23:48 ccsd.sock
prw------- 1 root root 0 Dec 21 15:29 fenced_override <-----
srw-rw---- 1 root root 0 Dec 18 23:48 rgmanager.sk

[root@cl5764a ~]# ls -l /var/run/cluster/
total 4
-rw-r--r-- 1 root root 5 Dec 18 23:48 ccsd.pid
srwx------ 1 root root 0 Dec 18 23:48 ccsd.sock
srw-rw---- 1 root root 0 Dec 18 23:48 rgmanager.sk

/var/run/cluster/fenced_override seems to be removed while another attempt is ongoing:

Dec 21 15:36:54 cl5764a fenced[2137]: fencing node "cl5764b"
Dec 21 15:37:24 cl5764a fenced[2137]: agent "fence_xvm" reports: Timed out waiting for response
Dec 21 15:37:24 cl5764a fenced[2137]: fence "cl5764b" failed
Dec 21 15:37:29 cl5764a fenced[2137]: fencing node "cl5764b"

/var/run/cluster/fenced_override appears at 15:37:24 (after the failure) but is removed again at 15:37:29 (when the next attempt begins). That leaves only a 5-second window after each 30-second fence attempt in which to acknowledge a fence request.

> Expected results:
/var/run/cluster/fenced_override stays in place until a fence request succeeds or is manually acknowledged.
https://bugzilla.redhat.com/show_bug.cgi?id=418541

Since RHEL 5.2.0, fence_ack_manual waits for the fenced_override FIFO.
[root@rhel5-2 ~]# fence_ack_manual -h
Usage:

fence_ack_manual [options]

Options:
  -h               usage
  -O               override
  -n <nodename>    Name of node that was manually fenced
  -s <ip>          IP address of machine that was manually fenced (deprecated)
  -e               Emergency Override (use if fencing has failed)
  -V               Version information

From man page:

  -e  Emergency fencing override. This may be used when fencing a given host is failing in order to restore the cluster to operation. Syntax: fence_ack_manual -e -n <foo> Where "foo" is the name of the host where fencing is failing.
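Given how short the acknowledgement window is, one possible workaround is to poll for the FIFO and run the emergency override as soon as it shows up. The following is only a minimal sketch, not a tested fix: wait_for_fifo is a hypothetical helper name, the one-second polling interval and the 60-second timeout are assumptions, and the FIFO can still disappear between the check and the call to fence_ack_manual.

```shell
#!/bin/sh
# wait_for_fifo PATH TIMEOUT_SECONDS   (hypothetical helper, not part of cman)
# Polls once per second until PATH exists as a FIFO (named pipe).
# Returns 0 when the FIFO is present, non-zero if the timeout expires first.
wait_for_fifo() {
    fifo_path=$1
    remaining=$2
    while [ "$remaining" -gt 0 ]; do
        if [ -p "$fifo_path" ]; then
            return 0
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    # Final check so a FIFO created during the last sleep is not missed.
    [ -p "$fifo_path" ]
}

# Example invocation (assumed usage; run on the surviving node only after
# confirming that the failed node really is down):
#   wait_for_fifo /var/run/cluster/fenced_override 60 \
#       && fence_ack_manual -e -n cl5764b
```

The test uses [ -p ] rather than [ -e ] so that a stale regular file at the same path is not mistaken for the override FIFO.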
On RHEL6, for reference:

[root@rhel6-1 ~]# fence_ack_manual
usage:
  /usr/sbin/fence_ack_manual <nodename>
  /usr/sbin/fence_ack_manual -n <nodename>

The -n flag exists to preserve compatibility with previous releases of /usr/sbin/fence_ack_manual, and is no longer required.