Description of problem:
An NFS mount whose server goes away can hang indefinitely.
Even with the soft NFS mount option, some operations such as sync and
umount can hang (depending on an endless matrix of kernel versions).
We recently exposed self_fence for netfs, but that comes too late
in the process. In some cases, with many NFS umount operations
in progress (generally 5 are enough to reproduce), it's possible
that we will never hit self_fence.
We address the problem by checking whether the RPC/NFS/MOUNTD
services are alive via rpcinfo calls (mountd is checked only for
NFS versions below 4, since NFSv4 doesn't use mountd). If there
are no responses from them, we proceed straight to self_fence, as
there is no point in continuing.
Each rpcinfo call attempts to contact the server 4 times, once
every 15 seconds (about 60 seconds per check), so this code path
could theoretically, if the network or server is only losing
packets, add an extra 3 minutes (three checks) to a stop operation.
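The pre-unmount check described above could be sketched roughly as follows. This is an illustration only, not the code from netfs.sh: the function name nfs_server_alive and the RPCINFO override variable are hypothetical, and the use of `rpcinfo -t` with the standard RPC program numbers is an assumption about how such a probe might be written.

```shell
#!/bin/bash
# Hypothetical sketch of a pre-unmount liveness check (not the netfs.sh code).
# RPCINFO is an assumed override hook so the probe can be stubbed in tests.
RPCINFO=${RPCINFO:-rpcinfo}

# nfs_server_alive HOST NFSVERS
# Returns 0 if every required RPC service answers over TCP, 1 otherwise.
nfs_server_alive() {
    local host=$1 vers=$2 prog
    # Standard RPC program numbers:
    # 100000 = portmapper/rpcbind, 100003 = nfs, 100005 = mountd
    local progs="100000 100003"

    # Use only the major version ("4.1" -> "4") for the comparison.
    # mountd is only relevant for NFS versions below 4.
    if [ "${vers%%.*}" -lt 4 ]; then
        progs="$progs 100005"
    fi

    for prog in $progs; do
        # rpcinfo -t calls procedure 0 of the program over TCP.
        if ! "$RPCINFO" -t "$host" "$prog" >/dev/null 2>&1; then
            echo "RPC program $prog on $host is not responding" >&2
            return 1
        fi
    done
    return 0
}
```

A caller in the stop path would then run something like `nfs_server_alive "$host" "$vers" || self_fence`, so that a dead server leads directly to fencing instead of a blocked umount.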
Steps to Reproduce:
1. mount an NFS share on the client
2. do a hard disconnect of the NFS server (power off, or pull cable)
3. attempt to unmount the NFS share on the client
Actual results:
NFS client hangs indefinitely
Expected results:
NFS client detects that the server has gone; if umount cannot succeed, self-fencing occurs.
upstream patch related to this issue.
I have verified that with resource-agents-3.9.5-11.el6.x86_64 the node
self-fences when trying to umount an unreachable NFS mount.
# export \
# /usr/share/cluster/netfs.sh start
# mount | grep shared
10.34.70.155:/mnt/shared0 on /mnt type nfs (rw,sync,soft,noac,vers=4,addr=10.34.70.155,clientaddr=10.34.71.133)
# ssh 10.34.70.155 "iptables -I INPUT 1 -s $(hostname -f) -j DROP"
# date; /usr/share/cluster/netfs.sh stop
Wed Jul 23 15:18:25 CEST 2014
<info> pre unmount: checking if nfs server 10.34.70.155 is alive
[netfs.sh] pre unmount: checking if nfs server 10.34.70.155 is alive
<debug> Testing generic rpc access on server 10.34.70.155 with protocol tcp
[netfs.sh] Testing generic rpc access on server 10.34.70.155 with protocol tcp
<alert> RPC server on 10.34.70.155 with tcp is not responding
[netfs.sh] RPC server on 10.34.70.155 with tcp is not responding
<alert> NFS server not responding - REBOOTING
[netfs.sh] NFS server not responding - REBOOTING
</var/log/messages shows the following>
Jul 23 15:17:55 virt-133 kernel: FS-Cache: Loaded
Jul 23 15:17:55 virt-133 kernel: NFS: Registering the id_resolver key type
Jul 23 15:17:55 virt-133 kernel: FS-Cache: Netfs 'nfs' registered for caching
Jul 23 15:21:52 virt-133 kernel: nfs: server 10.34.70.155 not responding, timed out
Jul 23 15:21:52 virt-133 rgmanager: [netfs.sh] pre unmount: checking if nfs server 10.34.70.155 is alive
Jul 23 15:22:55 virt-133 rgmanager: [netfs.sh] RPC server on 10.34.70.155 with tcp is not responding
Jul 23 15:22:55 virt-133 rgmanager: [netfs.sh] NFS server not responding - REBOOTING
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.