Bug 1024065

Summary: netfs unmount/self_fence integration
Product: Red Hat Enterprise Linux 6
Reporter: David Vossel <dvossel>
Component: resource-agents
Assignee: David Vossel <dvossel>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 6.6
CC: agk, cluster-maint, djansa, fdinitto, jcastillo, jharriga, jpokorny, jruemker, jsvarova, luvilla, michele, mnovacek
Target Milestone: rc
Keywords: Reopened, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: 0day
Fixed In Version: resource-agents-3.9.2-41.el6
Doc Type: Bug Fix
Doc Text:
Prior to this update, the netfs agent could hang during a stop operation, even with the self_fence option enabled. With this update, the self-fence operation is executed earlier in the stop process: the NFS client detects that the server has gone away and, if umount cannot succeed, self-fencing occurs.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-10-14 04:59:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1010423, 1027410, 1055424, 1117032

Description David Vossel 2013-10-28 18:00:35 UTC
Description of problem:

An NFS mount whose server goes away can hang literally
forever.

Even with the soft NFS mount option, some operations such as sync and
umount can hang (depending on an endless matrix of kernel versions).

We recently exposed self_fence for netfs, but it runs too late
in the process. In some cases, with many NFS umount operations
in progress (generally 5 are enough to reproduce), it is possible
that we never reach self_fence.

We address the problem by checking whether the RPC, NFS and MOUNTD
services are alive via rpcinfo calls (MOUNTD is checked only for
NFS versions below 4, since NFSv4 does not use mountd). If they do
not respond, we proceed straight to self_fence, as there is no
point in continuing to try.
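
The check described above can be sketched roughly as follows. This is
an illustration only, not the code from the upstream patch: the
function names (rpc_programs_for, nfs_server_alive,
pre_unmount_action) and the RPCINFO override are hypothetical, and the
use of "rpcinfo -t" / "rpcinfo -u" with the program names portmapper,
nfs and mountd is an assumption about how such a probe could be done.

```shell
#!/bin/sh
# Sketch only: not the code from the upstream patch.
# RPCINFO can be overridden (for example with a stub) for testing.
RPCINFO=${RPCINFO:-rpcinfo}

# Print the RPC programs to probe for a given NFS major version.
# mountd is only probed below version 4, since NFSv4 does not use it.
rpc_programs_for() {
    if [ "$1" -lt 4 ]; then
        echo "portmapper nfs mountd"
    else
        echo "portmapper nfs"
    fi
}

# Return 0 if every program answers, 1 on the first one that does not.
# Usage: nfs_server_alive HOST NFS_VERSION [tcp|udp]
nfs_server_alive() {
    nsa_host=$1
    nsa_vers=$2
    nsa_proto=${3:-tcp}
    case $nsa_proto in
        udp) nsa_flag=-u ;;
        *)   nsa_proto=tcp; nsa_flag=-t ;;
    esac
    for nsa_prog in $(rpc_programs_for "$nsa_vers"); do
        if ! "$RPCINFO" "$nsa_flag" "$nsa_host" "$nsa_prog" >/dev/null 2>&1; then
            echo "$nsa_prog on $nsa_host with $nsa_proto is not responding" >&2
            return 1
        fi
    done
    return 0
}

# Decide what a stop operation should do before attempting umount.
# Prints "umount" or "self_fence"; the caller acts on the result.
pre_unmount_action() {
    if nfs_server_alive "$@"; then
        echo umount
    else
        echo self_fence
    fi
}
```

For example, "pre_unmount_action 10.34.70.155 4 tcp" would print
"self_fence" when the server does not answer, and the stop path would
then fence the node instead of waiting on umount. The sketch only
makes the decision; the fencing action itself is left to the caller.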

rpcinfo calls attempt to contact the server 4 times, once
every 15 seconds, so this code path could theoretically add an
extra 3 minutes to a stop operation if the network or server is
only losing packets.
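
The 3-minute figure can be reproduced with a little arithmetic,
assuming the RPC, NFS and MOUNTD checks run one after another and
each one uses up all of its attempts. The helper name
worst_case_delay is hypothetical.

```shell
#!/bin/sh
# Worst-case extra time, in seconds, added to a stop operation.
# Usage: worst_case_delay ATTEMPTS INTERVAL_SECONDS CHECKS
worst_case_delay() {
    echo $(( $1 * $2 * $3 ))
}

# 4 attempts, 15 seconds apart, for 3 services (RPC, NFS, MOUNTD):
worst_case_delay 4 15 3    # prints 180, i.e. 3 minutes
```

Under the same assumption, an NFSv4 mount, which skips the MOUNTD
check, would add at most 4 * 15 * 2 = 120 seconds.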

How reproducible:
100%

Steps to Reproduce:
1. Mount an NFS export on a client.
2. Do a hard disconnect of the NFS server (power off, or pull the cable).
3. Attempt to unmount the NFS mount on the client.

Actual results:
The NFS client hangs indefinitely.


Expected results:
The NFS client detects that the server is gone; if umount cannot succeed, self-fencing occurs.

Comment 1 David Vossel 2013-10-28 18:01:12 UTC
Upstream patch related to this issue:

https://github.com/davidvossel/resource-agents/commit/617e52862264e07dce5c0a1b2c693a9073458341

Comment 8 michal novacek 2014-07-23 13:28:07 UTC
I have verified that with resource-agents-3.9.5-11.el6.x86_64 the node
self-fences when trying to unmount an unreachable NFS mount.

# export \
OCF_FUNCTIONS_DIR=/usr/lib/ocf/lib/heartbeat \
OCF_RESKEY_name=nfsmount \
OCF_RESKEY_host=10.34.70.155 \
OCF_RESKEY_mountpoint=/mnt \
OCF_RESKEY_export=/mnt/shared0 \
OCF_RESKEY_fstype=nfs OCF_RESKEY_self_fence=yes

# /usr/share/cluster/netfs.sh start
# mount | grep shared
10.34.70.155:/mnt/shared0 on /mnt type nfs (rw,sync,soft,noac,vers=4,addr=10.34.70.155,clientaddr=10.34.71.133)

# ssh 10.34.70.155 "iptables -I INPUT 1 -s $(hostname -f) -j DROP" 

# date; /usr/share/cluster/netfs.sh stop
Wed Jul 23 15:18:25 CEST 2014
<info>   pre unmount: checking if nfs server 10.34.70.155 is alive
[netfs.sh] pre unmount: checking if nfs server 10.34.70.155 is alive
<debug>  Testing generic rpc access on server 10.34.70.155 with protocol tcp
[netfs.sh] Testing generic rpc access on server 10.34.70.155 with protocol tcp
<alert>  RPC server on 10.34.70.155 with tcp is not responding
[netfs.sh] RPC server on 10.34.70.155 with tcp is not responding
<alert>  NFS server not responding - REBOOTING
[netfs.sh] NFS server not responding - REBOOTING

</var/log/messages shows the following>
...
Jul 23 15:17:55 virt-133 kernel: FS-Cache: Loaded
Jul 23 15:17:55 virt-133 kernel: NFS: Registering the id_resolver key type
Jul 23 15:17:55 virt-133 kernel: FS-Cache: Netfs 'nfs' registered for caching
Jul 23 15:21:52 virt-133 kernel: nfs: server 10.34.70.155 not responding, timed out
Jul 23 15:21:52 virt-133 rgmanager[2291]: [netfs.sh] pre unmount: checking if nfs server 10.34.70.155 is alive

Jul 23 15:22:55 virt-133 rgmanager[2349]: [netfs.sh] RPC server on 10.34.70.155 with tcp is not responding
Jul 23 15:22:55 virt-133 rgmanager[2353]: [netfs.sh] NFS server not responding - REBOOTING

Comment 9 errata-xmlrpc 2014-10-14 04:59:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1428.html