The fence_scsi agent creates a file (/var/run/cluster/fence_scsi.dev) that contains a list of devices that the node registered with during the unfence operation. This file is unlinked on every unfence action. That creates a problem if you use multiple fence device entries in cluster.conf: the fence_scsi.dev file will contain only the devices that the node registered with during the most recent unfence operation.

This is best explained with an example. Consider the following cluster.conf file:

<?xml version="1.0"?>
<cluster config_version="1" name="foobar">
  <cman two_node="1" expected_votes="1" cluster_id="77"/>
  <fence_daemon post_fail_delay="0" post_join_delay="30"/>
  <clusternodes>
    <clusternode name="foo" votes="1" nodeid="3">
      <fence>
        <method name="scsi">
          <device name="scsi_1" key="3"/>
          <device name="scsi_2" key="3"/>
          <device name="scsi_3" key="3"/>
        </method>
      </fence>
      <unfence>
        <device name="scsi_1" key="3" action="on"/>
        <device name="scsi_2" key="3" action="on"/>
        <device name="scsi_3" key="3" action="on"/>
      </unfence>
    </clusternode>
    <clusternode name="bar" votes="1" nodeid="4">
      <fence>
        <method name="scsi">
          <device name="scsi_1" key="4"/>
          <device name="scsi_2" key="4"/>
          <device name="scsi_3" key="4"/>
        </method>
      </fence>
      <unfence>
        <device name="scsi_1" key="4" action="on"/>
        <device name="scsi_2" key="4" action="on"/>
        <device name="scsi_3" key="4" action="on"/>
      </unfence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_scsi" name="scsi_1" devices="/dev/sdb,/dev/sdc" logfile="/tmp/fence_scsi.log"/>
    <fencedevice agent="fence_scsi" name="scsi_2" devices="/dev/sdd,/dev/sde" logfile="/tmp/fence_scsi.log"/>
    <fencedevice agent="fence_scsi" name="scsi_3" devices="/dev/sdf,/dev/sdg" logfile="/tmp/fence_scsi.log"/>
  </fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>

This is a valid cluster.conf file in which multiple fencedevice entries exist for the fence_scsi agent, each containing a different list of devices.
When unfencing occurs, the fence_scsi agent will be called three times. Each time fence_scsi registers some devices, the fence_scsi.dev file will be unlinked. The result is that once unfencing is complete, the fence_scsi.dev file will contain:

/dev/sdf
/dev/sdg

The expected result is that fence_scsi.dev will contain all the devices:

/dev/sdb
/dev/sdc
/dev/sdd
/dev/sde
/dev/sdf
/dev/sdg

Note that this problem only occurs when devices are manually defined and they are listed in multiple fencedevice entries.

The fence_scsi.dev file is only used by the fence_scsi_check watchdog script. This file provides a list of devices that fence_scsi_check should check periodically for registrations. If the fence_scsi_check watchdog script is not being used, this problem has no effect.
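The failure mode described above can be reproduced with a small model. This is only a sketch in Python, not the agent's actual code: the function names (unfence_old, simulate_old) are hypothetical, and a temporary directory stands in for /var/run/cluster/. It assumes each unfence call unlinks the file and then records the devices it registered, as described in this report.

```python
import os
import tempfile


def unfence_old(dev_file, devices):
    """Model of the pre-patch behaviour: unlink the file, then record
    only the devices registered by this call."""
    if os.path.exists(dev_file):
        os.unlink(dev_file)
    with open(dev_file, "a") as f:
        for dev in devices:
            f.write(dev + "\n")


def read_devices(dev_file):
    """Return the non-empty lines of the device file, in order."""
    with open(dev_file) as f:
        return [line.strip() for line in f if line.strip()]


def simulate_old(fencedevices):
    """Call unfence_old once per fencedevice entry and return the
    final contents of the (temporary) fence_scsi.dev file."""
    with tempfile.TemporaryDirectory() as tmp:
        dev_file = os.path.join(tmp, "fence_scsi.dev")
        for devices in fencedevices:
            unfence_old(dev_file, devices)
        return read_devices(dev_file)


# Device lists taken from the three fencedevice entries in the example.
FENCEDEVICES = [
    ["/dev/sdb", "/dev/sdc"],
    ["/dev/sdd", "/dev/sde"],
    ["/dev/sdf", "/dev/sdg"],
]

if __name__ == "__main__":
    print(simulate_old(FENCEDEVICES))  # ['/dev/sdf', '/dev/sdg']
```

Only the devices from the last call survive, which matches the observed contents of fence_scsi.dev.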
Created attachment 525221
Remove unlink for fence_scsi.dev file

This patch removes the unlink call that deletes the fence_scsi.dev file on each unfence (action=on) operation. As explained in comment #1, there is a specific case where unfencing can result in multiple calls of 'fence_scsi -o on ...', where each of those calls would unlink the fence_scsi.dev file.

The fence_scsi.dev file is used to keep track of which devices the local node is currently registered with. Currently it is only used by the fence_scsi_check watchdog script.

Rather than remove the fence_scsi.dev file, this patch checks whether the device currently being registered already exists in the fence_scsi.dev file. If it does not, the device is written to the file.

Note that removing the unlink is safe because the fence_scsi.dev file lives in the /var/run/cluster/ directory and therefore will be removed on reboot.
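The check-then-append logic the patch describes might look roughly like the following. This is a Python sketch for illustration, not the patch itself: record_device, unfence_patched and simulate_patched are hypothetical names, and a temporary directory stands in for /var/run/cluster/.

```python
import os
import tempfile


def record_device(dev_file, device):
    """Append device to dev_file unless it is already listed.
    Returns True if the device was written, False if it was present."""
    existing = set()
    if os.path.exists(dev_file):
        with open(dev_file) as f:
            existing = {line.strip() for line in f if line.strip()}
    if device in existing:
        return False
    with open(dev_file, "a") as f:
        f.write(device + "\n")
    return True


def unfence_patched(dev_file, devices):
    """Model of the patched behaviour: no unlink, one check per device."""
    for dev in devices:
        record_device(dev_file, dev)


def simulate_patched(fencedevices, rounds=1):
    """Run all unfence calls `rounds` times (rounds=2 models a restart
    without a reboot) and return the final file contents."""
    with tempfile.TemporaryDirectory() as tmp:
        dev_file = os.path.join(tmp, "fence_scsi.dev")
        for _ in range(rounds):
            for devices in fencedevices:
                unfence_patched(dev_file, devices)
        with open(dev_file) as f:
            return [line.strip() for line in f if line.strip()]


# Device lists taken from the three fencedevice entries in the example.
FENCEDEVICES = [
    ["/dev/sdb", "/dev/sdc"],
    ["/dev/sdd", "/dev/sde"],
    ["/dev/sdf", "/dev/sdg"],
]

if __name__ == "__main__":
    # Two rounds still yield each device exactly once.
    print(simulate_patched(FENCEDEVICES, rounds=2))
```

Because the file is never unlinked, the membership check is what prevents duplicate entries when unfencing runs again before a reboot clears /var/run/cluster/.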
Test result:

* Without the patch, use a cluster.conf file similar to the one in comment #1. The key is to have multiple fencedevice entries for fence_scsi that contain different devices.

# service cman start
Starting cluster:
   Checking if cluster has been disabled at boot... [ OK ]
   Checking Network Manager... [ OK ]
   Global setup... [ OK ]
   Loading kernel modules... [ OK ]
   Mounting configfs... [ OK ]
   Starting cman... [ OK ]
   Waiting for quorum... [ OK ]
   Starting fenced... [ OK ]
   Starting dlm_controld... [ OK ]
   Starting gfs_controld... [ OK ]
   Unfencing self... [ OK ]
   Joining fence domain... [ OK ]

# cat /var/run/cluster/fence_scsi.dev
/dev/sdf
/dev/sdg

Here we can see the problem: only /dev/sdf and /dev/sdg exist in the file because the file was unlinked each time fence_scsi was called with action=on.

* With the patch, same configuration. Before running this test, the fence_scsi.dev file can be removed manually or by rebooting the machine.

# rm -f /var/run/cluster/fence_scsi.dev
# service cman start
Starting cluster:
   Checking if cluster has been disabled at boot... [ OK ]
   Checking Network Manager... [ OK ]
   Global setup... [ OK ]
   Loading kernel modules... [ OK ]
   Mounting configfs... [ OK ]
   Starting cman... [ OK ]
   Waiting for quorum... [ OK ]
   Starting fenced... [ OK ]
   Starting dlm_controld... [ OK ]
   Starting gfs_controld... [ OK ]
   Unfencing self... [ OK ]
   Joining fence domain... [ OK ]

# cat /var/run/cluster/fence_scsi.dev
/dev/sdb
/dev/sdc
/dev/sdd
/dev/sde
/dev/sdf
/dev/sdg

* Now we should be able to run 'service cman restart' or 'service cman start' without getting duplicate entries in the /var/run/cluster/fence_scsi.dev file.

# service cman restart
Stopping cluster:
   Leaving fence domain... [ OK ]
   Stopping gfs_controld... [ OK ]
   Stopping dlm_controld... [ OK ]
   Stopping fenced... [ OK ]
   Stopping cman... [ OK ]
   Waiting for corosync to shutdown: [ OK ]
   Unloading kernel modules... [ OK ]
   Unmounting configfs... [ OK ]
Starting cluster:
   Checking if cluster has been disabled at boot... [ OK ]
   Checking Network Manager... [ OK ]
   Global setup... [ OK ]
   Loading kernel modules... [ OK ]
   Mounting configfs... [ OK ]
   Starting cman... [ OK ]
   Waiting for quorum... [ OK ]
   Starting fenced... [ OK ]
   Starting dlm_controld... [ OK ]
   Starting gfs_controld... [ OK ]
   Unfencing self... [ OK ]
   Joining fence domain... [ OK ]

# cat /var/run/cluster/fence_scsi.dev
/dev/sdb
/dev/sdc
/dev/sdd
/dev/sde
/dev/sdf
/dev/sdg
Pushed to RHEL6 branch upstream. commit 909d5b2c40b7f9b233a0aa5e19f3d5c83d0577c4
Verified against fence-agents-3.1.5-17.el6.x86_64

[root@smoke-02 ~]# cat /var/run/cluster/fence_scsi.dev
/dev/sdb
/dev/sdc

cluster.conf fragments:

<clusternodes>
  <clusternode name="smoke-02" votes="1" nodeid="2">
    <fence>
      <method name="scsi">
        <device name="scsi_1" key="2"/>
        <device name="scsi_2" key="2"/>
      </method>
    </fence>
    <unfence>
      <device name="scsi_2" key="1" action="on"/>
      <device name="scsi_1" key="2" action="on"/>
    </unfence>
  </clusternode>
</clusternodes>
<fencedevices>
  <fencedevice agent="fence_scsi" name="scsi_1" devices="/dev/sdb"/>
  <fencedevice agent="fence_scsi" name="scsi_2" devices="/dev/sdc"/>
</fencedevices>
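For this kind of verification, the set of devices that should appear in fence_scsi.dev can be derived from cluster.conf. The following Python sketch is an illustration only: expected_devices is a hypothetical helper, and the two fragments above are wrapped in a <cluster> root so they parse as one document. It assumes manually defined devices (the devices attribute), which is the case this bug covers, and it compares sorted lists rather than relying on the order of entries in the file.

```python
import xml.etree.ElementTree as ET

# The fragments from the verification above, wrapped in a <cluster>
# root element (an assumption made so they form one XML document).
CONF = """
<cluster>
  <clusternodes>
    <clusternode name="smoke-02" votes="1" nodeid="2">
      <unfence>
        <device name="scsi_2" key="1" action="on"/>
        <device name="scsi_1" key="2" action="on"/>
      </unfence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_scsi" name="scsi_1" devices="/dev/sdb"/>
    <fencedevice agent="fence_scsi" name="scsi_2" devices="/dev/sdc"/>
  </fencedevices>
</cluster>
"""


def expected_devices(conf_xml, node_name):
    """Return the sorted list of devices referenced by the unfence
    section of node_name, resolved through its fence_scsi fencedevices."""
    root = ET.fromstring(conf_xml)
    # Map each fence_scsi fencedevice name to its comma-separated devices.
    by_name = {
        fd.get("name"): [d for d in fd.get("devices", "").split(",") if d]
        for fd in root.iter("fencedevice")
        if fd.get("agent") == "fence_scsi"
    }
    devices = set()
    for node in root.iter("clusternode"):
        if node.get("name") != node_name:
            continue
        for dev in node.findall("./unfence/device"):
            devices.update(by_name.get(dev.get("name"), []))
    return sorted(devices)


if __name__ == "__main__":
    print(expected_devices(CONF, "smoke-02"))  # ['/dev/sdb', '/dev/sdc']
```

The result can then be compared against the sorted lines of /var/run/cluster/fence_scsi.dev on the node being checked.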
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2012-0943.html