Bug 613064

Summary: Method to cause one node to delay fencing in a two node cluster
Product: Red Hat Enterprise Linux 5
Component: cman
Version: 5.5
Reporter: Lon Hohberger <lhh>
Assignee: Marek Grac <mgrac>
QA Contact: Cluster QE <mspqa-list>
CC: clasohm, cluster-maint, djansa, djuran, edamato, fdinitto, jha, mgrac, slevine, teigland
Status: CLOSED ERRATA
Severity: medium
Priority: low
Keywords: FutureFeature
Target Milestone: rc
Hardware: All
OS: Linux
Doc Type: Enhancement
Fixed In Version: cman-2.0.115-47.el5
Cloned to: 614046
Bug Blocks: 614046
Last Closed: 2011-01-13 22:35:25 UTC
Attachments: proposed patch (flag: fdinitto: review? (teigland))

Description Lon Hohberger 2010-07-09 15:54:03 UTC
Description of problem:

Currently, there is no easy way to enable one host to "delay" fencing.  Users often simply craft a "fence_sleep", which works just fine, but requires a whole lot more work than should be necessary.  (Obviously, the agent "fence_sleep" itself is not suitable for inclusion in linux-cluster because it doesn't actually take any fencing actions; it simply sleeps and returns 0...)

Background:

Some fencing devices, such as HP iLO, take a long time to process requests.  On a cluster which partitions but still has access to the iLO devices, this can be problematic.  Because it takes a long time for iLO to process the requests, there is a window where two in-flight 'power-off' requests can cause the cluster to turn itself off - but not back on.

While this is great from a power-saving perspective, it is not very good from an availability perspective.

Ordinarily, this can be resolved using a quorum disk; however, a quorum disk adds a fair bit of complexity and is wholly unnecessary (or even undesirable) in many instances. For example, in clusters which serve data via NFS instead of a SAN, a quorum disk may not even be an option.

Proposal:

The proposal here is to add a method to make one node delay fencing for a period of time, in order to allow the other node to "win" in the case of a network partition of the cluster interconnect.  In the event that the "primary" node goes down, the "backup" node will indeed take longer to fence - but with the benefit of reduced complexity and highly deterministic behavior (which can't currently be achieved using qdiskd).

Fortunately, all of the core code required exists in the cman package today.  All we have to do is enable it on a per-host basis.

The specific proposal here, after talking with others, is to simply expose post_fail_delay via /etc/sysconfig/cman, and add it to the list of options when we start fenced.

For example, adding the following to /etc/sysconfig/cman on one host:

   POST_FAIL_DELAY=30

... and then, in the cman initscript, calling fenced with the corresponding -f option:

   fenced -f $POST_FAIL_DELAY

... should have the desired effect.
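
As a rough sketch only (not the actual initscript diff), the logic in the cman initscript could look something like the following, assuming it already sources /etc/sysconfig/cman:

   # Sketch: pass the optional delay to fenced only when it is set.
   [ -f /etc/sysconfig/cman ] && . /etc/sysconfig/cman

   if [ -n "$POST_FAIL_DELAY" ]; then
       fenced -f "$POST_FAIL_DELAY"
   else
       fenced
   fi

With POST_FAIL_DELAY unset (the default), nothing changes; with POST_FAIL_DELAY=30 set on only one node, that node waits an extra 30 seconds after a member fails before fencing it, which gives the other node time to win the race in a partition.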

Comment 2 Fabio Massimo Di Nitto 2010-07-10 04:20:29 UTC
The only problem I see with this suggestion is that the delay is not immediately visible in cluster.conf.

My suggestion would be to have a generic/reserved keyword that fenced would process and treat as a sleep($time) directly.

We only need to make sure the keyword is not currently used by any fence agents.

fenced already does some parsing of fence agents options, so adding one keyword should be fairly simple and non-intrusive.

Comment 6 David Teigland 2010-07-12 18:09:18 UTC
Making post_fail_delay configurable in /etc/sysconfig doesn't preclude also adding delay args to agents where they are useful, like iLO.  Both seem fine to me.

The /etc/sysconfig settings are obvious when you run ps, so they are not hidden.

Comment 7 Fabio Massimo Di Nitto 2010-07-13 11:20:34 UTC
Created attachment 431421 [details]
proposed patch

proposed patch in attachment.

    <clusternode name="rhel6-node1" votes="1" nodeid="1">
      <fence>
        <method name="single">
          <device name="virsh_fence" port="rhel6-node1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="rhel6-node2" votes="1" nodeid="2">
      <fence>
        <method name="single">
          <device name="virsh_fence" port="rhel6-node2" delay="20"/>
        </method>
      </fence>
    </clusternode>


[root@rhel6-node2 libfence]# fence_node rhel6-node1
fence rhel6-node1 success
[root@rhel6-node2 libfence]# fence_node rhel6-node2
Delay execution by 20 seconds
fence rhel6-node2 success
[root@rhel6-node2 libfence]# 

The keyword "delay" is currently unused and I briefly spoke to Marek on IRC that agrees it can be used as reserved word (since it won´t hit any agent).

David, I commented out the test code I used; I don't plan to commit it in the final patch (assuming the patch is OK with you).  It is there mostly to prove that it works as we expect.
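
As a rough sanity check (illustration only), the extra delay is also easy to see just by timing the two calls from the same node:

    time fence_node rhel6-node1    # no delay attribute: returns almost immediately
    time fence_node rhel6-node2    # delay="20": takes roughly 20 seconds longer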

Comment 8 David Teigland 2010-07-13 13:35:43 UTC
Oh, dear, sorry, I completely misunderstood.  I thought you were talking about adding "delay" as a fence agent arg.  That would be ok with me.  I don't like at all hijacking one of the node args like comment 7 does.

So the two options which are both ok with me are
1. using post_fail_delay, with local config in /etc/sysconfig
2. adding delay args to fence agents where it's useful, like ilo

Comment 9 Marek Grac 2010-07-13 13:40:39 UTC
I agree with "delay" as a reserved word.

Comment 10 Perry Myers 2010-07-13 15:04:48 UTC
OK, reassigning to Marek since we'll do this as a delay option to the core python fencing library, and then, on an as-needed basis, extend it to other fence agents that are outside of the core fencing library.

Comment 17 errata-xmlrpc 2011-01-13 22:35:25 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0036.html