Bug 769658 - /var/run/cluster/fenced_override only exists during 5 seconds intervals
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.7
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Lon Hohberger
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-12-21 16:06 UTC by Julio Entrena Perez
Modified: 2011-12-21 22:52 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-12-21 22:42:45 UTC
Target Upstream Version:



Description Julio Entrena Perez 2011-12-21 16:06:26 UTC
> Description of problem:
The /var/run/cluster/fenced_override FIFO is removed during fence attempts.

To acknowledge a fence request manually, one has to do so *after* a fence attempt fails and *before* the next attempt starts.

> Version-Release number of selected component (if applicable):
cman-2.0.115-85.el5_7.2

> How reproducible:
Always.

> Steps to Reproduce:
1. Disable the fencing mechanism in node A.
2. Disconnect node A from the network or shut it down.
3. Allow node B to attempt to fence node A and fail.
  
> Actual results:
The FIFO appears and then disappears again shortly afterwards:
[root@cl5764a ~]# ls -l /var/run/cluster/
total 4
-rw-r--r-- 1 root root 5 Dec 18 23:48 ccsd.pid
srwx------ 1 root root 0 Dec 18 23:48 ccsd.sock
srw-rw---- 1 root root 0 Dec 18 23:48 rgmanager.sk
[root@cl5764a ~]# ls -l /var/run/cluster/
total 4
-rw-r--r-- 1 root root 5 Dec 18 23:48 ccsd.pid
srwx------ 1 root root 0 Dec 18 23:48 ccsd.sock
prw------- 1 root root 0 Dec 21 15:29 fenced_override     <-----
srw-rw---- 1 root root 0 Dec 18 23:48 rgmanager.sk
[root@cl5764a ~]# ls -l /var/run/cluster/
total 4
-rw-r--r-- 1 root root 5 Dec 18 23:48 ccsd.pid
srwx------ 1 root root 0 Dec 18 23:48 ccsd.sock
srw-rw---- 1 root root 0 Dec 18 23:48 rgmanager.sk

/var/run/cluster/fenced_override seems to be removed while another attempt is ongoing:
Dec 21 15:36:54 cl5764a fenced[2137]: fencing node "cl5764b"
Dec 21 15:37:24 cl5764a fenced[2137]: agent "fence_xvm" reports: Timed out waiting for response 
Dec 21 15:37:24 cl5764a fenced[2137]: fence "cl5764b" failed
Dec 21 15:37:29 cl5764a fenced[2137]: fencing node "cl5764b"

/var/run/cluster/fenced_override appears at 15:37:24 (after the failure) but is removed again at 15:37:29 (when the next attempt begins).
That leaves only a 5-second window after each 30-second fence attempt in which to acknowledge a fence request.
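
The window can be measured directly from the fenced log lines. The following is a minimal sketch, assuming syslog lines in the same format as the excerpt above (time in the third field, all entries on the same day); fence_windows is a hypothetical helper name, not part of cman.

```shell
#!/bin/bash
# fence_windows (hypothetical helper): print, in seconds, the gap between each
# fenced failure message and the next "fencing node" message.
# Assumes syslog-style lines with HH:MM:SS in field 3 and no midnight rollover.
fence_windows() {
    awk '
        # Convert HH:MM:SS to seconds since midnight.
        function secs(t,   p) { split(t, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }
        # Remember when a fence attempt failed.
        /fenced\[[0-9]+\]: fence ".*" failed/ { failed = secs($3); have = 1; next }
        # On the next attempt, print the time elapsed since that failure.
        /fenced\[[0-9]+\]: fencing node/ && have { print secs($3) - failed; have = 0 }
    ' "$@"
}
```

Fed the four log lines quoted above, this prints 5, the gap between the failure at 15:37:24 and the next attempt at 15:37:29. It could also be run against a log file, for example fence_windows /var/log/messages, assuming that is where fenced logs on the node in question.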

> Expected results:
/var/run/cluster/fenced_override stays in place until a fence request succeeds or is manually acknowledged.

Comment 1 Lon Hohberger 2011-12-21 22:42:45 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=418541

Since RHEL 5.2.0, fence_ack_manual waits for the fenced_override FIFO to appear.
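
To illustrate what waiting for the FIFO amounts to, here is a rough sketch of a polling loop. It is not how fence_ack_manual is implemented; wait_for_fifo and its timeout argument are hypothetical names used only for this example.

```shell
#!/bin/bash
# wait_for_fifo (hypothetical helper): poll once per second until a FIFO
# exists at the given path, or give up after a timeout (default 60 seconds).
# Illustration only; this is not the fence_ack_manual implementation.
wait_for_fifo() {
    local path=$1 timeout=${2:-60} waited=0
    while [ ! -p "$path" ]; do          # -p is true only for a named pipe
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

For example, wait_for_fifo /var/run/cluster/fenced_override 60 would return 0 as soon as the FIFO shows up between fence attempts, and 1 if it never appears within a minute.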

Comment 2 Lon Hohberger 2011-12-21 22:44:31 UTC
[root@rhel5-2 ~]# fence_ack_manual -h
Usage:

fence_ack_manual [options]

Options:
  -h               usage
  -O               override
  -n <nodename>    Name of node that was manually fenced
  -s <ip>          IP address of machine that was manually fenced (deprecated)
  -e               Emergency Override (use if fencing has failed)
  -V               Version information

From man page:

       -e     Emergency fencing override.  This may be used  when
              fencing a given host is failing in order to restore
              the cluster to operation.

Syntax:

fence_ack_manual -e -n <foo>

Where "foo" is the name of the host where fencing is failing.

Comment 3 Lon Hohberger 2011-12-21 22:52:33 UTC
On RHEL6, for reference:

[root@rhel6-1 ~]# fence_ack_manual
usage:
        /usr/sbin/fence_ack_manual <nodename>
        /usr/sbin/fence_ack_manual -n <nodename>

The -n flag exists to preserve compatibility with previous 
releases of /usr/sbin/fence_ack_manual, and is no longer required.
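
The two syntaxes from the comments above can be put side by side. The sketch below only prints the command for review rather than running it; ack_command and its release argument are hypothetical, and cl5764b in the usage note is the node name from the original report.

```shell
#!/bin/bash
# ack_command (hypothetical helper): print, but do not run, the manual
# acknowledge command for a node. The first argument selects the syntax:
# 5 for the RHEL 5 form (-e -n <nodename>), 6 for the RHEL 6 form (<nodename>).
ack_command() {
    local release=$1 node=$2
    case "$release" in
        5) printf 'fence_ack_manual -e -n %s\n' "$node" ;;
        6) printf 'fence_ack_manual %s\n' "$node" ;;
        *) echo "unsupported release: $release" >&2; return 1 ;;
    esac
}
```

Calling ack_command 5 cl5764b prints the RHEL 5 form with the emergency override flag, while ack_command 6 cl5764b prints the shorter RHEL 6 form.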

