Red Hat Bugzilla – Bug 570373
fencing virtual machine using fence_xvm/fence_xvmd fails when physical host for the virtual machine goes down
Last modified: 2016-04-26 12:43:46 EDT
Description of problem:
When a physical host node goes down, DRAC fencing of the physical host completes successfully. However, virtual fencing fails for the virtual guests that went down together with that host.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Ensure cluster services do not start automatically on physical host
2. Ensure physical host does not start virtual machines automatically
3. Simulate physical host unavailable during normal operation
Actual results:
The log shows that DRAC fencing for the physical machine succeeded, but fencing for the virtual machines fails.
Expected results:
Fencing should complete successfully for the virtual machines so that the remaining virtual machines can continue normal operation.
I'm not sure if this is your bug or not, Lon.
The way fence_xvmd is designed:
1) VM ownership is stored in checkpoints.
2) fence_xvm uses multicast, so all hosts receive the fencing request.
3) When a fencing request comes in, the node with the lowest node ID reads the checkpoint.
4) If the last-known owner is dead (and fenced), then fence_xvmd on the low node ID responds to the original host with a successful fencing operation.
5) If the last-known owner is alive (or not fenced), then fence_xvmd does nothing.
6) If _we_ receive the packet and _we_ are the owner of the VM, then we take fencing action (virDomainDestroy()).
My guess is that (2) is not working: not all hosts are receiving the request.
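The decision flow in the list above can be sketched roughly as follows. This is an illustrative Python model, not the real fence_xvmd (which is C); the names `Node`, `handle_request`, and `destroy_vm` are invented for the sketch.

```python
# Illustrative sketch of the fence_xvmd decision flow described above.
# All names here are hypothetical; the real daemon is C code.
import dataclasses

@dataclasses.dataclass
class Node:
    node_id: int
    alive: bool
    fenced: bool

def handle_request(vm_uuid, my_node, nodes, checkpoint, destroy_vm):
    """React to a multicast fencing request for vm_uuid.

    checkpoint maps vm_uuid -> last-known-owner node ID (steps 1 and 3).
    Returns "destroyed", "success", or None (no action, step 5).
    """
    owner_id = checkpoint.get(vm_uuid)
    # Step 6: if we own the VM, kill it (virDomainDestroy() in the real code).
    if owner_id == my_node.node_id:
        destroy_vm(vm_uuid)
        return "destroyed"
    # Steps 3-5: only the lowest live node ID answers for a dead owner.
    low_id = min(n.node_id for n in nodes if n.alive)
    if my_node.node_id != low_id:
        return None
    owner = next((n for n in nodes if n.node_id == owner_id), None)
    if owner is not None and not owner.alive and owner.fenced:
        return "success"   # step 4: owner is dead and fenced, ack the request
    return None            # step 5: owner alive or not yet fenced
```

Under this model, a request for a VM whose owner was DRAC-fenced is acknowledged by the low node ID, which is exactly the path that appears not to fire in this report.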
Consequently, can you run:
# fence_xvmd -fddddddddddddddddddddd &> fence_xvmd.log
on all nodes, reproduce, and upload the log file when the fencing request fails?
Sorry for the late reply guys.
I can't perform the tests now, as these are production servers. But I did run the command before, and the output indicated that communication between nodes is taking place.
With regard to multicast settings, I have the settings below in cluster.conf on both the physical-host and virtual-guest clusters (the excerpt below is from the physical host's cluster.conf).
<clusternode name="node_name.domain_name" nodeid="5" votes="1">
<device modulename="" name="drac5"/>
<multicast addr="126.96.36.199" interface="eth3"/>
In addition, a static route for multicast was added to ensure the traffic goes through the private Ethernet interface eth3. The firewall (iptables) is also configured to allow this traffic to pass.
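To check Lon's guess that the multicast request is not reaching all hosts, one low-level sanity test is to join the cluster's multicast group on each host and see whether test datagrams arrive. The sketch below is illustrative Python, not part of the cluster suite; the group address and port are placeholders, not this cluster's real settings.

```python
# Illustrative multicast receive check: run a listener on each host, send a
# datagram from one host, and every host should receive it. GROUP and PORT
# are placeholders; substitute the values from cluster.conf.
import socket

GROUP = "225.0.0.12"   # placeholder multicast group
PORT = 12345           # placeholder port

def membership_request(group: str, iface_ip: str = "0.0.0.0") -> bytes:
    """Pack an ip_mreq struct (group address + local interface address)."""
    return socket.inet_aton(group) + socket.inet_aton(iface_ip)

def make_listener(group: str = GROUP, port: int = PORT) -> socket.socket:
    """Bind a UDP socket and join the multicast group on it."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", port))
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                 membership_request(group))
    return s
```

If a listener on some host never sees the datagrams, the problem is in the route, firewall, or switch configuration rather than in fence_xvmd itself.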
Lastly, I remember that when I encountered this situation previously, I had to run the command below on one of the remaining virtual guests for the cluster to continue operating:
"echo failed_virtual_node_name > /var/run/cluster/fenced_override"
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release. Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products. This request is not yet committed for inclusion in an
Update release.
We've attempted unsuccessfully to reproduce this bug many times.