Description of problem:

An NFS mount whose server goes away can hang literally forever. Even with the soft mount option, some operations such as sync and umount can hang (depending on an endless matrix of kernel versions). We recently exposed self_fence for netfs, but that kicks in too late in the process. In some cases, with many NFS umount operations in progress (generally 5 are enough to reproduce), we may never reach self_fence.

We address the problem by checking whether the RPC, NFS, and MOUNTD services are alive via rpcinfo calls (mountd is checked only for NFS < version 4, since version 4 does not use mountd). If none of them respond, we proceed directly to self_fence, as there is no point in retrying. rpcinfo attempts to contact the server 4 times, once every 15 seconds, so if the network or server is merely losing packets this code path could theoretically add an extra 3 minutes to a stop operation.

How reproducible:
100%

Steps to Reproduce:
1. Mount an NFS client.
2. Hard-disconnect the NFS server (power off, or pull the cable).
3. Attempt to unmount the NFS client.

Actual results:
The NFS client hangs indefinitely.

Expected results:
The NFS client detects that the server has gone; if umount cannot succeed, self-fencing occurs.
Upstream patch related to this issue:
https://github.com/davidvossel/resource-agents/commit/617e52862264e07dce5c0a1b2c693a9073458341
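The liveness check described above (generic RPC access first, then the nfs service, then mountd for pre-v4 mounts, self-fencing only when nothing responds) can be sketched roughly as follows. This is not the actual netfs.sh code: SERVER, NFS_VERSION, RPCINFO, and probe() are hypothetical names introduced for illustration, and the RPCINFO variable exists only so the sketch can be dry-run without a live server.

```shell
#!/bin/sh
# Hedged sketch of the pre-unmount liveness check -- not the actual
# netfs.sh code. SERVER, NFS_VERSION, RPCINFO and probe() are
# illustrative names; only the rpcinfo tool itself is real.

SERVER="${SERVER:-localhost}"
NFS_VERSION="${NFS_VERSION:-3}"
RPCINFO="${RPCINFO:-rpcinfo}"   # overridable so the sketch can be dry-run

probe() {
    # Null-call the named RPC service over TCP; rpcinfo performs its own
    # retries, so an unreachable server fails after rpcinfo's timeout.
    "$RPCINFO" -t "$SERVER" "$1" >/dev/null 2>&1
}

server_alive() {
    # Generic RPC reachability first (query the server's portmapper).
    "$RPCINFO" -p "$SERVER" >/dev/null 2>&1 || return 1
    probe nfs || return 1
    # mountd only matters for NFS protocol versions before 4.
    if [ "$NFS_VERSION" -lt 4 ]; then
        probe mountd || return 1
    fi
    return 0
}

if server_alive; then
    echo "NFS server $SERVER is alive, proceeding with normal unmount"
else
    echo "NFS server $SERVER not responding - self-fencing" >&2
    # The real agent reboots the node at this point; echoed here so the
    # sketch is safe to run.
    echo "would self-fence (reboot) now"
fi
```

Because the agent only fences after all probes fail, a merely slow server still gets the full rpcinfo retry window before the node reboots, which matches the "up to 3 extra minutes" estimate in the description.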
I have verified that with resource-agents-3.9.5-11.el6.x86_64 the node self-fences when trying to umount an unreachable NFS mount.

# export \
    OCF_FUNCTIONS_DIR=/usr/lib/ocf/lib/heartbeat \
    OCF_RESKEY_name=nfsmount \
    OCF_RESKEY_host=10.34.70.155 \
    OCF_RESKEY_mountpoint=/mnt \
    OCF_RESKEY_export=/mnt/shared0 \
    OCF_RESKEY_fstype=nfs OCF_RESKEY_self_fence=yes
# /usr/share/cluster/netfs.sh start
# mount | grep shared
10.34.70.155:/mnt/shared0 on /mnt type nfs (rw,sync,soft,noac,vers=4,addr=10.34.70.155,clientaddr=10.34.71.133)
# ssh 10.34.70.155 "iptables -I INPUT 1 -s $(hostname -f) -j DROP"
# date; /usr/share/cluster/netfs.sh stop
Wed Jul 23 15:18:25 CEST 2014
<info> pre unmount: checking if nfs server 10.34.70.155 is alive [netfs.sh] pre unmount: checking if nfs server 10.34.70.155 is alive
<debug> Testing generic rpc access on server 10.34.70.155 with protocol tcp [netfs.sh] Testing generic rpc access on server 10.34.70.155 with protocol tcp
<alert> RPC server on 10.34.70.155 with tcp is not responding [netfs.sh] RPC server on 10.34.70.155 with tcp is not responding
<alert> NFS server not responding - REBOOTING [netfs.sh] NFS server not responding - REBOOTING

/var/log/messages shows the following:
...
Jul 23 15:17:55 virt-133 kernel: FS-Cache: Loaded
Jul 23 15:17:55 virt-133 kernel: NFS: Registering the id_resolver key type
Jul 23 15:17:55 virt-133 kernel: FS-Cache: Netfs 'nfs' registered for caching
Jul 23 15:21:52 virt-133 kernel: nfs: server 10.34.70.155 not responding, timed out
Jul 23 15:21:52 virt-133 rgmanager[2291]: [netfs.sh] pre unmount: checking if nfs server 10.34.70.155 is alive
Jul 23 15:22:55 virt-133 rgmanager[2349]: [netfs.sh] RPC server on 10.34.70.155 with tcp is not responding
Jul 23 15:22:55 virt-133 rgmanager[2353]: [netfs.sh] NFS server not responding - REBOOTING
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1428.html