Bug 674508

Summary: rgmanager does not detect when a remote NFS share resource is not available
Product: Red Hat Enterprise Linux 5
Component: rgmanager
Version: 5.8
Hardware: All
OS: All
Severity: medium
Priority: unspecified
Status: CLOSED NOTABUG
Reporter: Pierre Amadio <pamadio>
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, edamato
Target Milestone: rc
Doc Type: Bug Fix
Last Closed: 2011-02-02 16:16:45 UTC

Description Pierre Amadio 2011-02-02 08:50:22 UTC
* Description of problem:

When netfs is used in a cluster to mount an NFS export from a third server, and the NFS share gets disconnected or the NFS server crashes, rgmanager does not notice that the export is no longer available: the mountpoint remains mounted, and the "status" check only verifies that it is listed in the output of the "mount" command.
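
To illustrate the failure mode, here is a minimal sketch of the kind of check described above (illustrative only, not the actual agent code; it uses the mountpoint /data/nfstest from the configuration below):

    # Succeeds as long as the mountpoint appears in the mount table,
    # even when the NFS server has been unreachable for hours:
    mount | grep -q "/data/nfstest" && echo "status OK"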


* How reproducible:

Two-node cluster mounting an NFS share as a service from a third server.
All servers are virtualized and updated to 5.6.

<?xml version="1.0"?>
<cluster config_version="5" name="rhcs5">
	<fence_daemon post_fail_delay="0" post_join_delay="3"/>
	<clusternodes>
		<clusternode name="jc_mycluster1" nodeid="1" votes="1">
			<fence/>
		</clusternode>
		<clusternode name="jc_mycluster2" nodeid="2" votes="1">
			<fence/>
		</clusternode>
	</clusternodes>
	<cman expected_votes="1" two_node="1"/>
	<fencedevices/>
	<rm>
		<failoverdomains/>
		<resources>
			<netfs export="/nfstest" force_unmount="1" fstype="nfs" host="192.168.122.129" mountpoint="/data/nfstest" name="nfstest_data" options="rw,sync"/>
		</resources>
		<service autostart="1" exclusive="0" name="nfs_client_test" recovery="relocate">
			<netfs ref="nfstest_data"/>
		</service>
	</rm>
</cluster>


1. Start the service that contains the netfs resource.
2. Disconnect the NFS server that exports the mountpoint from the cluster nodes, e.g. with iptables, by taking its interface down with ifconfig, or by powering it off (see the example after this list).
3. On the nodes, check the status of the service with the 'clustat' command.
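
One concrete way to perform step 2 with iptables (a sketch, assuming the NFS server address 192.168.122.129 from the configuration above; run on each cluster node):

    # Drop all traffic to the NFS server, simulating a crash or
    # network disconnection:
    iptables -A OUTPUT -d 192.168.122.129 -j DROP
    # Remove the rule afterwards to restore connectivity:
    iptables -D OUTPUT -d 192.168.122.129 -j DROP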
  
Actual results:
The 'clustat' command shows the service with State "started", and no failover occurs because the "status/monitor" check does not return an error.
Any command trying to access the mountpoint will hang.

Expected results:
When the server gets disconnected from the nodes, rgmanager should be able to detect that the export is inaccessible and stop the service.

Comment 1 Lon Hohberger 2011-02-02 16:16:45 UTC
NFS by default never returns I/O errors to userspace applications; this means that anything accessing the mount point (including rgmanager) will hang.  As such, you have two options:

  Add these options to the netfs *resource definition*:

    options="soft,tcp"

  Add this directive to the netfs *reference*:

    __enforce_timeouts="1"

You can tune the soft I/O timeout using the retrans=X and timeo=X NFS mount options (see nfs(5)).  The rgmanager __enforce_timeouts directive is unlikely to resolve the issue by itself; it will cause the service to go into recovery, but the mount point may then fail to unmount when rgmanager brings the service down.
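
Applied to the cluster.conf from the report, the two changes would look like this (a sketch; the retrans/timeo values are only examples showing where the tuning goes, see nfs(5) for appropriate values):

    <resources>
        <!-- soft,tcp makes the NFS client return I/O errors instead of
             hanging forever; retrans/timeo control how soon that happens -->
        <netfs export="/nfstest" force_unmount="1" fstype="nfs"
               host="192.168.122.129" mountpoint="/data/nfstest"
               name="nfstest_data" options="rw,sync,soft,tcp,retrans=2,timeo=100"/>
    </resources>
    <service autostart="1" exclusive="0" name="nfs_client_test" recovery="relocate">
        <!-- make rgmanager enforce a timeout on status checks for this
             resource instead of hanging along with it -->
        <netfs ref="nfstest_data" __enforce_timeouts="1"/>
    </service>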