* Description of problem:

When using a netfs resource in a cluster to mount an NFS export from a third server, if the NFS share gets disconnected or the NFS server crashes, rgmanager does not notice that the export is no longer available: the mountpoint remains mounted, and the "status" check only verifies that it is listed in the output of the "mount" command.

* How reproducible:

Always. Two-node cluster mounting an NFS share as a service from a third server. All servers are virtualized and updated to 5.6.

<?xml version="1.0"?>
<cluster config_version="5" name="rhcs5">
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="jc_mycluster1" nodeid="1" votes="1">
      <fence/>
    </clusternode>
    <clusternode name="jc_mycluster2" nodeid="2" votes="1">
      <fence/>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices/>
  <rm>
    <failoverdomains/>
    <resources>
      <netfs export="/nfstest" force_unmount="1" fstype="nfs" host="192.168.122.129" mountpoint="/data/nfstest" name="nfstest_data" options="rw,sync"/>
    </resources>
    <service autostart="1" exclusive="0" name="nfs_client_test" recovery="relocate">
      <netfs ref="nfstest_data"/>
    </service>
  </rm>
</cluster>

Steps to Reproduce:
1. Start the service that contains the netfs resource.
2. Disconnect the NFS server that exports the mountpoint from the cluster nodes, either using iptables, ifconfig down, or by switching it off.
3. Check the status of the service on the nodes with the command 'clustat'.

Actual results:
The command 'clustat' shows the service with "State" "started", and no failover occurs because the "status"/"monitor" check does not return an error. Any command trying to access the mountpoint hangs.

Expected results:
When the server gets disconnected from the nodes, rgmanager should detect that the export is inaccessible and stop the service.
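The weakness described above can be illustrated with a minimal sketch (assuming a Linux host; the `is_mounted` function name and the path are hypothetical, the real check lives in rgmanager's netfs resource agent):

```shell
# Simplified sketch of the kind of "is it mounted?" test the netfs status
# check relies on. Presence in the mount table says nothing about whether
# the NFS server is still reachable, which is why the check keeps passing
# after the server goes away.
is_mounted() {
    # /proc/mounts fields: device mountpoint fstype options dump pass
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' /proc/mounts
}

if is_mounted /data/nfstest; then
    echo "status OK"        # what rgmanager concludes: service "started"
else
    echo "status FAILED"
fi
```

Note that the check never touches the mountpoint itself, so it cannot hang; but for the same reason it cannot detect a dead server either.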
By default (hard mounts), NFS never returns I/O errors to userspace applications; this means that everything accessing the mount point (including rgmanager) will hang. As such, you have two options:

* Add these options to the netfs *resource definition*: options="soft,tcp"
* Add this directive to the netfs *reference*: __enforce_timeouts="1"

You can tune the soft I/O timeout using the retrans=x and timeo=x NFS options (see nfs(5)).

The rgmanager __enforce_timeouts directive is not likely to actually resolve the issue: it will cause the service to go into recovery, but the mount point may fail to unmount when rgmanager brings the service down.
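Applied to the example cluster.conf above, the two suggestions would look roughly like this (a sketch only; the retrans/timeo values are illustrative, not recommendations):

<rm>
  <failoverdomains/>
  <resources>
    <!-- soft,tcp makes I/O return errors instead of hanging;
         retrans/timeo values here are illustrative examples -->
    <netfs export="/nfstest" force_unmount="1" fstype="nfs"
           host="192.168.122.129" mountpoint="/data/nfstest"
           name="nfstest_data" options="rw,sync,soft,tcp,retrans=3,timeo=100"/>
  </resources>
  <service autostart="1" exclusive="0" name="nfs_client_test" recovery="relocate">
    <!-- __enforce_timeouts goes on the resource *reference*, not the definition -->
    <netfs ref="nfstest_data" __enforce_timeouts="1"/>
  </service>
</rm>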