Description of problem:

The fence_node tool does not always detect that fencing failed (especially when fence_xvm + fence_virtd are used). There are situations where 'fence_node <node-name>' reports success even though fencing did not happen.

Situation:
- three-node VM cluster based on RHEL 6.3 (i686, x86_64, x86_64)
- cluster.conf specifying three fence_xvm fence devices
- VM / cluster host on RHEL 6.3 x86_64
- fence_virtd running on the host to serve fence requests

[root@dhcp-x-y ~]# clustat
Cluster Status for mycluster_el6vm @ Mon Sep 3 12:08:48 2012
Member Status: Quorate

 Member Name                   ID   Status
 ------ ----                   ---- ------
 192.168.10.11                    1 Online
 192.168.10.12                    2 Online
 192.168.10.13                    3 Online, Local

Fencing the node appears to succeed, but the fencing action was not actually performed, due to a libvirt error:

[root@dhcp-x-y ~]# fence_node 192.168.10.11
fence 192.168.10.11 success

But on the VM host fence_virtd logs:

Request 2 seqno 289568 domain 192.168.10.11
Plain TCP request
ipv4_connect: Connecting to client
ipv4_connect: Success; fd = 8
Request 2 seqno 289568 src 192.168.10.11 target 192.168.10.11
libvirt_reboot 192.168.10.11
libvir: QEMU error : Domain not found: no domain with matching name '192.168.10.11'
[libvirt:REBOOT] Nothing to do - domain does not exist
Sending response to caller...

fence_node does not correctly report whether fencing completed without errors.

Version-Release number of selected component (if applicable):

# guests
cman-3.0.12.1-32.el6_3.1.i686
corosync-1.4.1-7.el6.i686
clusterlib-3.0.12.1-32.el6_3.1.i686
ricci-0.16.2-55.el6.i686

# host
fence-virt-0.2.3-9.el6.x86_64
fence-virtd-0.2.3-9.el6.x86_64
fence-virtd-checkpoint-0.2.3-9.el6.x86_64
fence-virtd-libvirt-0.2.3-9.el6.x86_64
fence-virtd-multicast-0.2.3-9.el6.x86_64
fence-virtd-serial-0.2.3-9.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. set up VM fencing
2. run 'fence_xvm -o list' to verify fence_virt functionality
3. run 'fence_node <cluster-node>'

Actual results:
fence_node reports success (exit code 0) even though fencing failed.

Expected results:
When fencing fails, fence_node should exit with a nonzero code.

Additional info:

[root@dhcp-lab-x ~]# clustat
Cluster Status for mycluster_el6vm @ Mon Sep 3 12:08:52 2012
Member Status: Quorate

 Member Name                   ID   Status
 ------ ----                   ---- ------
 192.168.10.11                    1 Online
 192.168.10.12                    2 Online, Local
 192.168.10.13                    3 Online

[root@dhcp-x-y ~]# fence_node 192.168.10.11
fence 192.168.10.11 success
[root@dhcp-x-y ~]# echo $?
0

# after that, 192.168.10.11 does not reboot

# host fence_virtd logs:
...
Request 2 seqno 289568 domain 192.168.10.11
Plain TCP request
ipv4_connect: Connecting to client
ipv4_connect: Success; fd = 8
Request 2 seqno 289568 src 192.168.10.11 target 192.168.10.11
libvirt_reboot 192.168.10.11
libvir: QEMU error : Domain not found: no domain with matching name '192.168.10.11'
[libvirt:REBOOT] Nothing to do - domain does not exist
Sending response to caller...
...
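For completeness: the libvirt error above can be confirmed independently on the VM host. This is only a minimal cross-check with stock virsh commands (the URI is the one from the fence_virt.conf shown below, and the domain name is the value taken from cluster.conf):

# on the VM host: domain names libvirt actually knows about
[root@HOST ~]# virsh --connect qemu:///system list --all --name

# ask libvirt directly about the name fence_virtd was handed; with the
# configuration used here this fails, because no domain is named after the IP
[root@HOST ~]# virsh --connect qemu:///system domstate 192.168.10.11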
[root@dhcp-lab-x ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="7" name="mycluster_el6vm">
    <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="30"/>
    <fence_xvmd debug="10" multicast_interface="eth3"/>
    <clusternodes>
        <clusternode name="192.168.10.11" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device domain="192.168.10.11" key_file="/etc/cluster/fence_xvm.key" name="fence_1"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="192.168.10.12" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device domain="192.168.10.12" key_file="/etc/cluster/fence_xvm.key" name="fence_2"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="192.168.10.13" nodeid="3" votes="1">
            <fence>
                <method name="1">
                    <device domain="192.168.10.13" key_file="/etc/cluster/fence_xvm.key" name="fence_3"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman port="1229">
        <multicast addr="225.0.0.12"/>
    </cman>
    <rm log_level="7">
        <failoverdomains>
            <failoverdomain name="domain_qpidd_1" restricted="1">
                <failoverdomainnode name="192.168.10.11" priority="1"/>
            </failoverdomain>
            <failoverdomain name="domain_qpidd_2" restricted="1">
                <failoverdomainnode name="192.168.10.12" priority="1"/>
            </failoverdomain>
            <failoverdomain name="domain_qpidd_3" restricted="1">
                <failoverdomainnode name="192.168.10.13" priority="1"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <script file="/etc/init.d/qpidd" name="qpidd"/>
        </resources>
        <service domain="domain_qpidd_1" name="qpidd_1">
            <script ref="qpidd"/>
        </service>
        <service domain="domain_qpidd_2" name="qpidd_2">
            <script ref="qpidd"/>
        </service>
        <service domain="domain_qpidd_3" name="qpidd_3">
            <script ref="qpidd"/>
        </service>
    </rm>
    <fencedevices>
        <fencedevice action="reboot" agent="fence_xvm" key_file="/etc/cluster/fence_xvm.key" name="fence_1"/>
        <fencedevice action="reboot" agent="fence_xvm" key_file="/etc/cluster/fence_xvm.key" name="fence_2"/>
        <fencedevice action="reboot" agent="fence_xvm" key_file="/etc/cluster/fence_xvm.key" name="fence_3"/>
    </fencedevices>
</cluster>

[root@HOST ~]# cat /etc/fence_virt.conf
fence_virtd {
    listener = "multicast";
    backend = "libvirt";
    module_path = "/usr/lib64/fence-virt";
}

listeners {
    multicast {
        key_file = "/etc/cluster/fence_xvm.key";
        address = "225.0.0.12";
        port = "1229";
        family = "ipv4";
        interface = "virbr4";
    }
}

backends {
    libvirt {
        uri = "qemu:///system";
    }
}
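As a side note on the configuration above: the key file has to be identical on the host and on every guest, and the multicast listener must be reachable from the guests. A minimal sketch of setting that up and verifying it (the key size and the scp target are only examples; adjust to your environment):

# on the VM host: generate a shared key and copy it to each guest
[root@HOST ~]# dd if=/dev/urandom of=/etc/cluster/fence_xvm.key bs=4096 count=1
[root@HOST ~]# scp /etc/cluster/fence_xvm.key root@192.168.10.11:/etc/cluster/
[root@HOST ~]# service fence_virtd restart

# on a guest: confirm that the listener and backend answer over multicast
[root@dhcp-x-y ~]# fence_xvm -o list -k /etc/cluster/fence_xvm.key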
The issue I was seeing above was caused by a misconfiguration: the fence device's "domain" attribute was set to the cluster node name instead of the libvirt/QEMU domain name. Regardless of that, fence_node should give the user consistent information about how the requested operation finished.

# fence_xvm -o list
cluster-rhel6i0      9747d8b2-9e04-6e84-b920-953651e32251 on
cluster-rhel6x0      62079d69-33c4-7133-65a6-7ae0db52131e on
cluster-rhel6x1      b8d18c15-bbed-7496-1af4-90afa0cdf95f on

# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="8" name="mycluster_el6vm">
    <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="30"/>
    <fence_xvmd debug="10" multicast_interface="eth3"/>
    <clusternodes>
        <clusternode name="192.168.10.11" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device domain="cluster-rhel6i0" key_file="/etc/cluster/fence_xvm.key" name="fence_1"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="192.168.10.12" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device domain="cluster-rhel6x0" key_file="/etc/cluster/fence_xvm.key" name="fence_2"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="192.168.10.13" nodeid="3" votes="1">
            <fence>
                <method name="1">
                    <device domain="cluster-rhel6x1" key_file="/etc/cluster/fence_xvm.key" name="fence_3"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman port="1229">
        <multicast addr="225.0.0.12"/>
    </cman>
    <rm log_level="7">
        <failoverdomains>
            <failoverdomain name="domain_qpidd_1" restricted="1">
                <failoverdomainnode name="192.168.10.11" priority="1"/>
            </failoverdomain>
            <failoverdomain name="domain_qpidd_2" restricted="1">
                <failoverdomainnode name="192.168.10.12" priority="1"/>
            </failoverdomain>
            <failoverdomain name="domain_qpidd_3" restricted="1">
                <failoverdomainnode name="192.168.10.13" priority="1"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <script file="/etc/init.d/qpidd" name="qpidd"/>
        </resources>
        <service domain="domain_qpidd_1" name="qpidd_1">
            <script ref="qpidd"/>
        </service>
        <service domain="domain_qpidd_2" name="qpidd_2">
            <script ref="qpidd"/>
        </service>
        <service domain="domain_qpidd_3" name="qpidd_3">
            <script ref="qpidd"/>
        </service>
    </rm>
    <fencedevices>
        <fencedevice action="reboot" agent="fence_xvm" key_file="/etc/cluster/fence_xvm.key" name="fence_1"/>
        <fencedevice action="reboot" agent="fence_xvm" key_file="/etc/cluster/fence_xvm.key" name="fence_2"/>
        <fencedevice action="reboot" agent="fence_xvm" key_file="/etc/cluster/fence_xvm.key" name="fence_3"/>
    </fencedevices>
</cluster>
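In other words, the device's "domain" attribute has to match one of the names (or UUIDs) that the hypervisor reports. A quick way to cross-check the corrected configuration before relying on it (the names are the ones from the listing above):

# names as libvirt sees them on the host
[root@HOST ~]# virsh --connect qemu:///system list --all --name

# names as the guests see them through fence_virtd
[root@dhcp-x-y ~]# fence_xvm -o list

# verify a single node-to-domain mapping without actually fencing anything
[root@dhcp-x-y ~]# fence_xvm -H cluster-rhel6i0 -o status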
Created attachment 611546 [details] Fix
*** Bug 870549 has been marked as a duplicate of this bug. ***
Marking Verified because I got through fencing and skeet testing, which were tripping over this earlier in the release cycle.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-0419.html