Bug 570373 - fencing virtual machine using fence_xvm/fence_xvmd fails when physical host for the virtual machine goes down
Status: CLOSED WORKSFORME
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.2
Hardware: All Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Ryan McCabe
QA Contact: Chris Mackowski
Depends On:
Blocks: 807971
 
Reported: 2010-03-03 22:43 EST by Bernard Chew
Modified: 2016-04-26 12:43 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-10-11 17:07:31 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Description Bernard Chew 2010-03-03 22:43:00 EST
Description of problem:

When a physical host node goes down, DRAC fencing succeeds for the physical host. However, fence_xvm/fence_xvmd fencing fails for the virtual guests that were running on that host and went down together with it.

Version-Release number of selected component (if applicable): 

6.0.1

How reproducible:

100%

Steps to Reproduce:

1. Ensure that cluster services do not start automatically on the physical host.
2. Ensure that the physical host does not start its virtual machines automatically.
3. Simulate the physical host becoming unavailable during normal operation.
  
Actual results:

The log shows that DRAC fencing of the physical machine succeeds, but fencing of the virtual machine fails.

Expected results:

Fencing should succeed for the virtual machine so that the remaining virtual machines can continue normal operation.

Additional info:

NA
Comment 1 Christine Caulfield 2010-03-08 05:25:20 EST
I'm not sure if this is your bug or not, Lon.
Comment 4 Lon Hohberger 2010-03-09 09:52:10 EST
fence_xvmd is designed to work as follows:

1) VMs are stored in checkpoints.
2) fence_xvm uses multicast, so all hosts receive the fencing request.
3) When a fencing request comes in, the host with the lowest node ID reads the checkpoint.
4) If the last-known owner is dead (and fenced), fence_xvmd on the host with the lowest node ID responds to the original host with a successful fencing operation.
5) If the last-known owner is alive (or not fenced), fence_xvmd does nothing.
6) If _we_ receive the packet and _we_ are the owner of the VM, we take fencing action (virDomainDestroy()).
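
The decision each host makes in steps 3 to 6 can be sketched roughly as below. This is only an illustration of the logic described above, not the fence_xvmd source: `Action`, `OwnerState`, and `handle_request` are hypothetical names, and the checkpoint lookup is assumed to have already produced the last-known owner's state.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    DESTROY_VM = "destroy"      # step 6: we own the VM, so we fence it
    REPLY_SUCCESS = "reply"     # step 4: owner is dead and fenced
    IGNORE = "ignore"           # step 5, or we are not the lowest node ID


@dataclass
class OwnerState:
    """Last-known owner of the VM, as read from the checkpoint (hypothetical shape)."""
    node_id: int
    alive: bool
    fenced: bool


def handle_request(my_id: int, low_id: int, owner: OwnerState) -> Action:
    """Decide what this host does with a fencing request it received."""
    # Step 6: the owning host acts directly (virDomainDestroy() in the real daemon).
    if owner.node_id == my_id:
        return Action.DESTROY_VM
    # Steps 3-4: only the lowest node ID answers on behalf of a dead, fenced owner.
    owner_gone = (not owner.alive) and owner.fenced
    if my_id == low_id and owner_gone:
        return Action.REPLY_SUCCESS
    # Step 5: owner still alive or not yet fenced, or we are not the lowest node ID.
    return Action.IGNORE
```

Under this sketch, a request for a VM whose owner is dead but not yet recorded as fenced gets no reply from any host, and the same is true if the host with the lowest node ID never receives the multicast packet.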

My guess is that (2) is not working: not all hosts are receiving the request.

Consequently, can you run:

# fence_xvmd -fddddddddddddddddddddd &> fence_xvmd.log

on all nodes, reproduce, and upload the log file when the fencing request fails?
Comment 5 Bernard Chew 2010-03-15 00:04:42 EDT
Sorry for the late reply guys.

I can't perform the tests now as these are production servers. However, I did run the command before, and the output indicated that communication between the nodes was taking place.

With regard to multicast, I have the settings below in cluster.conf in both the physical host and virtual guest clusters (the excerpt below is taken from the physical host cluster.conf).

	<clusternode name="node_name.domain_name" nodeid="5" votes="1">
		<fence>
			<method name="1">
				<device modulename="" name="drac5"/>
			</method>
		</fence>
		<multicast addr="225.0.0.12" interface="eth3"/>
	</clusternode>

<cman>
	<multicast addr="225.0.0.12"/>
</cman>

In addition, a static route for multicast is also added to ensure the traffic goes through the private ethernet interface eth3. Firewall (iptables) is also configured to allow such traffic to pass.
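
A quick way to check whether packets for the multicast group actually arrive on a given node is a small listener such as the sketch below. It is an illustration only: the group address is the one from the configuration above, while the port (1229), the interface address, and the function names are assumptions that should be adjusted to the actual setup. If fence_xvmd is already bound to the same port, the bind may behave differently, so this is best run while the daemon is stopped.

```python
import socket
import struct


def membership_request(group: str, iface_ip: str) -> bytes:
    """Build the ip_mreq payload (group address + local interface address)."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface_ip))


def listen_once(group: str = "225.0.0.12", port: int = 1229,
                iface_ip: str = "0.0.0.0", timeout: float = 30.0):
    """Join the group and wait for one datagram.

    Returns (sender_ip, payload_length), or None if nothing arrives in time.
    The default port is an assumption, not taken from the configuration above.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        # Pass the IP address of eth3 as iface_ip to join on the private interface.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        membership_request(group, iface_ip))
        sock.settimeout(timeout)
        data, sender = sock.recvfrom(4096)
        return sender[0], len(data)
    except socket.timeout:
        return None
    finally:
        sock.close()
```

Running `listen_once(iface_ip="<address of eth3>")` on every physical host while a fencing request is issued from a guest would show which hosts see the packet and which do not.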

Lastly, I remember that when I encountered this situation previously, I had to run the command below on one of the remaining virtual guests for the cluster to continue operating:

"echo failed_virtual_node_name > /var/run/cluster/fenced_override"

Regards,
Bernard Chew
Comment 11 RHEL Product and Program Management 2012-04-02 06:35:26 EDT
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.
Comment 21 Lon Hohberger 2012-10-11 17:07:31 EDT
We've attempted unsuccessfully to reproduce this bug many times.
