Bug 741339 - fence_scsi: fence_scsi.dev file gets unlinked on each unfence operation
Summary: fence_scsi: fence_scsi.dev file gets unlinked on each unfence operation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: fence-agents
Version: 6.3
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: rc
Target Release: ---
Assignee: Ryan O'Hara
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 756082
 
Reported: 2011-09-26 15:54 UTC by Ryan O'Hara
Modified: 2012-06-20 14:40 UTC (History)

Fixed In Version: fence-agents-3.1.5-11
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-06-20 14:40:05 UTC


Attachments
Remove unlink for fence_scsi.dev file (1.53 KB, patch)
2011-09-27 22:29 UTC, Ryan O'Hara


Links
System: Red Hat Product Errata
ID: RHBA-2012:0943
Priority: normal
Status: SHIPPED_LIVE
Summary: fence-agents bug fix and enhancement update
Last Updated: 2012-06-19 21:00:16 UTC

Description Ryan O'Hara 2011-09-26 15:54:27 UTC
The fence_scsi agent creates a file (/var/run/cluster/fence_scsi.dev) that contains a list of the devices the node registered with during the unfence operation. This file is unlinked on every unfence action, which is a problem when cluster.conf uses multiple fence device entries: the fence_scsi.dev file will contain only the devices that the node registered with during the most recent unfence operation.

This is best explained with an example. Consider the following cluster.conf file:

<?xml version="1.0"?>
<cluster config_version="1" name="foobar">
    <cman two_node="1" expected_votes="1" cluster_id="77"/>
    <fence_daemon post_fail_delay="0" post_join_delay="30"/>
    <clusternodes>
        <clusternode name="foo" votes="1" nodeid="3">
            <fence>
            <method name="scsi">
                <device name="scsi_1" key="3"/>
                <device name="scsi_2" key="3"/>
                <device name="scsi_3" key="3"/>
            </method>
            </fence>
            <unfence>
                <device name="scsi_1" key="3" action="on"/>
                <device name="scsi_2" key="3" action="on"/>
                <device name="scsi_3" key="3" action="on"/>
            </unfence>
        </clusternode>
        <clusternode name="bar" votes="1" nodeid="4">
            <fence>
            <method name="scsi">
                <device name="scsi_1" key="4"/>
                <device name="scsi_2" key="4"/>
                <device name="scsi_3" key="4"/>
            </method>
            </fence>
            <unfence>
                <device name="scsi_1" key="4" action="on"/>
                <device name="scsi_2" key="4" action="on"/>
                <device name="scsi_3" key="4" action="on"/>
            </unfence>
        </clusternode>
    </clusternodes>
    <fencedevices>
        <fencedevice agent="fence_scsi" name="scsi_1"
         devices="/dev/sdb,/dev/sdc"
         logfile="/tmp/fence_scsi.log"/>
        <fencedevice agent="fence_scsi" name="scsi_2"
         devices="/dev/sdd,/dev/sde"
         logfile="/tmp/fence_scsi.log"/>
        <fencedevice agent="fence_scsi" name="scsi_3"
         devices="/dev/sdf,/dev/sdg"
         logfile="/tmp/fence_scsi.log"/>
    </fencedevices>
    <rm>
        <failoverdomains/>
        <resources/>
    </rm>
</cluster>

This is a valid cluster.conf file in which multiple fencedevice entries exist for the fence_scsi agent, each containing a different list of devices. When unfencing occurs, the fence_scsi agent will be called three times, once per fencedevice entry. Each call unlinks the fence_scsi.dev file before recording the devices it registers, so once unfencing is complete the file contains only the devices from the last call:

/dev/sdf
/dev/sdg

The expected result is that fence_scsi.dev will contain all the devices:

/dev/sdb
/dev/sdc
/dev/sdd
/dev/sde
/dev/sdf
/dev/sdg

Note that this problem only occurs when devices are manually defined and they are listed in multiple fencedevice entries.
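
The sequence described above can be modelled in a few lines of Python. This is only an illustrative sketch and not the agent's actual code: the names unfence_buggy and read_devices are hypothetical, and a temporary directory stands in for /var/run/cluster/.

```python
import os
import tempfile

# Device lists taken from the three fencedevice entries in the example above.
FENCE_DEVICES = [
    ["/dev/sdb", "/dev/sdc"],  # scsi_1
    ["/dev/sdd", "/dev/sde"],  # scsi_2
    ["/dev/sdf", "/dev/sdg"],  # scsi_3
]


def unfence_buggy(dev_file, devices):
    """Hypothetical model of the pre-patch behaviour for one 'on' action."""
    # The file is unlinked on every unfence call, discarding earlier entries.
    if os.path.exists(dev_file):
        os.unlink(dev_file)
    with open(dev_file, "a") as fh:
        for dev in devices:
            fh.write(dev + "\n")


def read_devices(dev_file):
    """Return the non-empty lines of the device file as a list."""
    with open(dev_file) as fh:
        return [line.strip() for line in fh if line.strip()]


def simulate_buggy_unfence():
    """Run one unfence call per fencedevice entry and return the file content."""
    with tempfile.TemporaryDirectory() as tmp:
        dev_file = os.path.join(tmp, "fence_scsi.dev")
        for devices in FENCE_DEVICES:
            unfence_buggy(dev_file, devices)
        return read_devices(dev_file)


print(simulate_buggy_unfence())  # ['/dev/sdf', '/dev/sdg']
```

Only the devices from the last call survive, which matches the file content shown above.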

The fence_scsi.dev file is only used by the fence_scsi_check watchdog script. This file provides a list of devices that fence_scsi_check should check periodically for registrations. If the fence_scsi_check watchdog script is not being used, this problem has no effect.

Comment 3 Ryan O'Hara 2011-09-27 22:29:27 UTC
Created attachment 525221
Remove unlink for fence_scsi.dev file

This patch removes the unlink call that deletes the fence_scsi.dev file on each unfence (action=on) operation. As explained in comment #1, there is a specific case where unfencing can result in multiple calls of 'fence_scsi -o on ...', where each of those calls would unlink the fence_scsi.dev file.

The fence_scsi.dev file is used to keep track of what devices the local node is currently registered with. Currently it is only used by the fence_scsi_check watchdog script.

Rather than remove the fence_scsi.dev file, this patch checks whether the device currently being registered is already listed in it. If it is not, the device is written to the file.

Note that removing the unlink is safe because the fence_scsi.dev file lives in the /var/run/cluster/ directory and therefore will be removed on reboot.
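
For illustration, the check-then-append approach can be sketched in Python as follows. This is a model of the idea, not the attached patch itself: the names record_device and simulate_patched_unfence are hypothetical, and a temporary directory stands in for /var/run/cluster/.

```python
import os
import tempfile

# Device lists from the three fencedevice entries in the description.
FENCE_DEVICES = [
    ["/dev/sdb", "/dev/sdc"],
    ["/dev/sdd", "/dev/sde"],
    ["/dev/sdf", "/dev/sdg"],
]


def record_device(dev_file, device):
    """Append device to dev_file only if it is not already listed there."""
    listed = set()
    if os.path.exists(dev_file):
        with open(dev_file) as fh:
            listed = {line.strip() for line in fh}
    if device not in listed:
        with open(dev_file, "a") as fh:
            fh.write(device + "\n")


def simulate_patched_unfence(rounds=1):
    """Repeat the three unfence calls 'rounds' times; return the file content."""
    with tempfile.TemporaryDirectory() as tmp:
        dev_file = os.path.join(tmp, "fence_scsi.dev")
        for _ in range(rounds):
            for devices in FENCE_DEVICES:
                for dev in devices:
                    record_device(dev_file, dev)
        with open(dev_file) as fh:
            return [line.strip() for line in fh if line.strip()]


# Two rounds model a start followed by a restart without a reboot.
print(simulate_patched_unfence(rounds=2))
```

Because a device is appended only when it is missing, repeating the unfence sequence leaves the file with the same six entries and no duplicates.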

Comment 4 Ryan O'Hara 2011-09-27 22:42:40 UTC
Test result:

* Without the patch, use a cluster.conf file similar to the one in comment #1. The key is to have multiple fencedevice entries for fence_scsi that contain different devices.

# service cman start
Starting cluster: 
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]
   Waiting for quorum...                                   [  OK  ]
   Starting fenced...                                      [  OK  ]
   Starting dlm_controld...                                [  OK  ]
   Starting gfs_controld...                                [  OK  ]
   Unfencing self...                                       [  OK  ]
   Joining fence domain...                                 [  OK  ]

# cat /var/run/cluster/fence_scsi.dev
/dev/sdf
/dev/sdg

Here we can see the problem -- only /dev/sdf and /dev/sdg exist in the file because the file was unlinked each time fence_scsi was called with action=on.

* With the patch, same configuration. Before running this test, the fence_scsi.dev file can be removed manually or by rebooting the machine.

# rm -f /var/run/cluster/fence_scsi.dev

# service cman start
Starting cluster: 
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]
   Waiting for quorum...                                   [  OK  ]
   Starting fenced...                                      [  OK  ]
   Starting dlm_controld...                                [  OK  ]
   Starting gfs_controld...                                [  OK  ]
   Unfencing self...                                       [  OK  ]
   Joining fence domain...                                 [  OK  ]

# cat /var/run/cluster/fence_scsi.dev
/dev/sdb
/dev/sdc
/dev/sdd
/dev/sde
/dev/sdf
/dev/sdg

* Now we should be able to run 'service cman restart' or 'service cman start' without getting duplicate entries in the /var/run/cluster/fence_scsi.dev file.

# service cman restart
Stopping cluster: 
   Leaving fence domain...                                 [  OK  ]
   Stopping gfs_controld...                                [  OK  ]
   Stopping dlm_controld...                                [  OK  ]
   Stopping fenced...                                      [  OK  ]
   Stopping cman...                                        [  OK  ]
   Waiting for corosync to shutdown:                       [  OK  ]
   Unloading kernel modules...                             [  OK  ]
   Unmounting configfs...                                  [  OK  ]
Starting cluster: 
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]
   Waiting for quorum...                                   [  OK  ]
   Starting fenced...                                      [  OK  ]
   Starting dlm_controld...                                [  OK  ]
   Starting gfs_controld...                                [  OK  ]
   Unfencing self...                                       [  OK  ]
   Joining fence domain...                                 [  OK  ]

# cat /var/run/cluster/fence_scsi.dev
/dev/sdb
/dev/sdc
/dev/sdd
/dev/sde
/dev/sdf
/dev/sdg

Comment 7 Ryan O'Hara 2011-12-19 16:43:44 UTC
Pushed to RHEL6 branch upstream.

commit 909d5b2c40b7f9b233a0aa5e19f3d5c83d0577c4

Comment 11 Nate Straz 2012-05-25 14:59:42 UTC
Verified against fence-agents-3.1.5-17.el6.x86_64


[root@smoke-02 ~]# cat /var/run/cluster/fence_scsi.dev
/dev/sdb
/dev/sdc

cluster.conf fragments:

  <clusternodes>
    <clusternode name="smoke-02" votes="1" nodeid="2">
      <fence>
        <method name="scsi">
                <device name="scsi_1" key="2"/>
                <device name="scsi_2" key="2"/>
        </method>
      </fence>
      <unfence>
        <device name="scsi_2" key="1" action="on"/>
        <device name="scsi_1" key="2" action="on"/>
      </unfence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_scsi" name="scsi_1" devices="/dev/sdb"/>
    <fencedevice agent="fence_scsi" name="scsi_2" devices="/dev/sdc"/>
  </fencedevices>
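
When verifying a configuration like this one, the expected content of fence_scsi.dev can be derived from cluster.conf. The following Python sketch does that for manually defined devices only. It is an assumption-laden helper, not part of fence-agents: the function name expected_devices is hypothetical, the fragments are wrapped in a <cluster> root element so they parse as one document, and the result is sorted, so it says nothing about the order of lines in the real file.

```python
import xml.etree.ElementTree as ET

# Unfence and fencedevice fragments from this comment, wrapped in a
# <cluster> root (an addition made here so the XML is well-formed).
CLUSTER_CONF = """
<cluster>
  <clusternodes>
    <clusternode name="smoke-02" votes="1" nodeid="2">
      <unfence>
        <device name="scsi_2" key="1" action="on"/>
        <device name="scsi_1" key="2" action="on"/>
      </unfence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_scsi" name="scsi_1" devices="/dev/sdb"/>
    <fencedevice agent="fence_scsi" name="scsi_2" devices="/dev/sdc"/>
  </fencedevices>
</cluster>
"""


def expected_devices(conf_xml, node_name):
    """Return the sorted devices a node's unfence entries refer to."""
    root = ET.fromstring(conf_xml.strip())
    # Map each fence_scsi fencedevice name to its manually defined devices.
    by_name = {
        fd.get("name"): [d for d in fd.get("devices", "").split(",") if d]
        for fd in root.iter("fencedevice")
        if fd.get("agent") == "fence_scsi"
    }
    expected = set()
    for node in root.iter("clusternode"):
        if node.get("name") != node_name:
            continue
        for dev in node.findall("./unfence/device"):
            expected.update(by_name.get(dev.get("name"), []))
    return sorted(expected)


print(expected_devices(CLUSTER_CONF, "smoke-02"))  # ['/dev/sdb', '/dev/sdc']
```

The result agrees with the two devices listed in /var/run/cluster/fence_scsi.dev above.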

Comment 12 errata-xmlrpc 2012-06-20 14:40:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0943.html

