Bug 570373

Summary: fencing virtual machine using fence_xvm/fence_xvmd fails when physical host for the virtual machine goes down
Product: Red Hat Enterprise Linux 5
Component: cman
Version: 5.2
Status: CLOSED WORKSFORME
Severity: medium
Priority: low
Reporter: Bernard Chew <bernardchew>
Assignee: Ryan McCabe <rmccabe>
QA Contact: Chris Mackowski <cmackows>
CC: cluster-maint, cmackows, djansa, iannis, jentrena, mnovacek
Target Milestone: rc
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2012-10-11 21:07:31 UTC
Bug Blocks: 807971

Description Bernard Chew 2010-03-04 03:43:00 UTC
Description of problem:

When a physical host node goes down, DRAC fencing completes successfully for the physical host. However, virtual fencing fails for the virtual guests that went down together with that host.

Version-Release number of selected component (if applicable): 

6.0.1

How reproducible:

100%

Steps to Reproduce:

1. Ensure cluster services do not start automatically on the physical host
2. Ensure the physical host does not start its virtual machines automatically
3. Simulate the physical host becoming unavailable during normal operation
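One way to carry out step 3 is to crash the host abruptly rather than shut it down cleanly, so the rest of the cluster sees a sudden failure. This is an illustrative sketch, not the reporter's exact procedure; it uses the standard Linux magic-sysrq interface and must be run as root on the host being "failed" (the machine will go down hard):

```shell
# Enable the magic sysrq interface, then trigger an immediate kernel
# crash. This drops the host off the network without any clean
# shutdown, approximating a real hardware failure.
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger
```

Pulling power via the DRAC controller would exercise the same failure path from outside the host.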
  
Actual results:

The log shows that DRAC fencing for the physical machine succeeds, but fencing for the virtual machines fails.

Expected results:

Fencing should complete successfully for the failed virtual machines so that the remaining virtual machines can continue normal operation.

Additional info:

N/A

Comment 1 Christine Caulfield 2010-03-08 10:25:20 UTC
I'm not sure if this is your bug or not, Lon.

Comment 4 Lon Hohberger 2010-03-09 14:52:10 UTC
The way fence_xvmd is designed:

1) VMs are stored in checkpoints.
2) fence_xvm uses multicast, so all hosts receive the fencing request.
3) When a fencing request comes in, the node with the low node ID reads the checkpoint.
4) If the last-known owner is dead (and fenced), then fence_xvmd on the low-node-ID node responds to the originating host with a successful fencing operation.
5) If the last-known owner is alive (or not fenced), then fence_xvmd does nothing.
6) If _we_ receive the packet and _we_ are the owner of the VM, then we take fencing action (virDomainDestroy()).

My guess is that (2) is not working: all hosts are not receiving the request.
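One hedged way to check whether step (2) is actually happening is to watch for the multicast request on every host while a fencing request is issued. This sketch assumes the 225.0.0.12 address and eth3 interface that appear in the cluster.conf later in this bug; adjust both to the real configuration:

```shell
# Run on every cluster host, then trigger (or wait for) a fencing
# request from one node. Each host should print the multicast
# request packets; a silent host is not receiving them.
tcpdump -i eth3 -n host 225.0.0.12
```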

Consequently, can you run:

# fence_xvmd -fddddddddddddddddddddd &> fence_xvmd.log

on all nodes, reproduce, and upload the log file when the fencing request fails?
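A manual test request can also help isolate the failure. The following is an illustrative invocation, with a hypothetical guest name; the flags shown (-H for the target domain, -a for the multicast address, repeated -d for verbosity) should be checked against the local fence_xvm man page:

```shell
# Send a fencing request for domain "guest1" by hand from a cluster
# member, with debugging enabled, using the multicast address from
# cluster.conf. Substitute a real domain name before running.
fence_xvm -H guest1 -a 225.0.0.12 -ddd
```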

Comment 5 Bernard Chew 2010-03-15 04:04:42 UTC
Sorry for the late reply guys.

I can't perform the tests now as these are production servers, but I did run the command before, and the output indicated that communication between the nodes was taking place.

Regarding the multicast settings, I have the settings below in cluster.conf for both the physical host and virtual guest clusters (the snippet below is taken from the physical host's cluster.conf).

	<clusternode name="node_name.domain_name" nodeid="5" votes="1">
		<fence>
			<method name="1">
				<device modulename="" name="drac5"/>
			</method>
		</fence>
		<multicast addr="225.0.0.12" interface="eth3"/>
	</clusternode>

<cman>
	<multicast addr="225.0.0.12"/>
</cman>

In addition, a static route for multicast traffic was added to ensure it goes through the private Ethernet interface eth3. The firewall (iptables) is also configured to allow this traffic through.
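The comment above does not show the exact route or firewall rules used. As a hedged sketch only, a configuration along these lines would match the description, assuming the 225.0.0.12 address and eth3 interface from the cluster.conf above and the documented fence_xvm default port of 1229:

```shell
# Illustrative only -- not the reporter's actual rules.
# Route all multicast out the private interface:
ip route add 224.0.0.0/4 dev eth3
# Permit the fencing request (multicast UDP) and the reply
# connection (TCP) on the default fence_xvm port:
iptables -I INPUT -i eth3 -d 225.0.0.12 -p udp --dport 1229 -j ACCEPT
iptables -I INPUT -i eth3 -p tcp --dport 1229 -j ACCEPT
```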

Lastly, I remember that when I encountered this situation previously, I had to run the command below on one of the remaining virtual guests for the cluster to resume operation:

"echo failed_virtual_node_name > /var/run/cluster/fenced_override"

Regards,
Bernard Chew

Comment 11 RHEL Program Management 2012-04-02 10:35:26 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 21 Lon Hohberger 2012-10-11 21:07:31 UTC
We've attempted unsuccessfully to reproduce this bug many times.