Bug 1024065 - netfs unmount/self_fence integration
Summary: netfs unmount/self_fence integration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: resource-agents
Version: 6.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: David Vossel
QA Contact: Cluster QE
URL:
Whiteboard: 0day
Depends On:
Blocks: 1010423 1027410 1055424 1117032
Reported: 2013-10-28 18:00 UTC by David Vossel
Modified: 2018-12-04 16:09 UTC

Fixed In Version: resource-agents-3.9.2-41.el6
Doc Type: Bug Fix
Doc Text:
Prior to this update, the netfs agent could hang during a stop operation, even with the self_fence option enabled. With this update, the self-fence check is executed earlier in the stop process, which ensures that the NFS client detects that the server is gone and, if umount cannot succeed, self-fencing occurs.
Clone Of:
Environment:
Last Closed: 2014-10-14 04:59:38 UTC
Target Upstream Version:


Links
System | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2014:1428 | normal | SHIPPED_LIVE | resource-agents bug fix and enhancement update | 2014-10-14 01:06:18 UTC

Description David Vossel 2013-10-28 18:00:35 UTC
Description of problem:

An nfs mount whose nfs server goes away can hang literally
forever.

Even with the soft nfs mount option, some operations such as sync
and umount can hang (depending on an endless matrix of kernel
versions).

We recently exposed self_fence for netfs, but it runs too late
in the process. In some cases, with many nfs umount operations
in progress (generally 5 are enough to reproduce), it's possible
that we will never hit self_fence.

We address the problem by checking via rpcinfo calls whether the
RPC, NFS and MOUNTD services are alive (mountd only for nfs
versions below 4, since version 4 doesn't use mountd). If there
are no responses from them, we proceed straight to self_fence,
as there is no point in continuing to try.
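
The check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the upstream patch: the function name nfs_server_alive, its arguments and the RPCINFO override variable are hypothetical, and it assumes the rpcbind flavour of rpcinfo, whose -T option selects the transport. Services are given by their standard RPC program numbers (100000 portmapper, 100003 nfs, 100005 mountd).

```shell
#!/bin/sh
# Sketch only: names below are hypothetical, not taken from netfs.sh.
# RPCINFO may be overridden, e.g. to stub the command out in a test.
RPCINFO=${RPCINFO:-rpcinfo}

# nfs_server_alive HOST PROTO NFS_MAJOR
#   HOST       nfs server address
#   PROTO      transport to probe (tcp or udp)
#   NFS_MAJOR  major nfs version of the mount (e.g. 3 or 4)
# Returns 0 if every probed service answers, 1 on the first failure.
nfs_server_alive() {
    host=$1
    proto=$2
    nfsmajor=$3

    # 100000 = portmapper/rpcbind: generic rpc access
    "$RPCINFO" -T "$proto" "$host" 100000 >/dev/null 2>&1 || return 1

    # 100003 = nfs
    "$RPCINFO" -T "$proto" "$host" 100003 >/dev/null 2>&1 || return 1

    # 100005 = mountd: only probed below nfs version 4
    if [ "$nfsmajor" -lt 4 ]; then
        "$RPCINFO" -T "$proto" "$host" 100005 >/dev/null 2>&1 || return 1
    fi

    return 0
}
```

A caller would run something like `nfs_server_alive 10.34.70.155 tcp 4` before attempting umount and treat a non-zero return as "server gone". Returning on the first failed probe keeps the added delay as short as possible when the server is really down.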

Each rpcinfo call attempts to contact the server 4 times, once
every 15 seconds, so if the network or server is only losing
packets, this code path could theoretically add an extra
3 minutes to a stop operation.
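
As a rough worked example of that figure: four attempts at 15-second intervals bound a single rpcinfo call at about 60 seconds. If three services are probed in sequence (RPC, NFS, mountd), the total is 180 seconds. That the three probes simply add up is an assumption made here for illustration, and the helper name worst_case_delay is hypothetical.

```shell
#!/bin/sh
# Rough worst-case arithmetic; worst_case_delay is a hypothetical helper.
# Retry figures come from the description above: 4 attempts, 15 s apart.
ATTEMPTS=4
INTERVAL=15   # seconds

# worst_case_delay PROBES -> prints the upper bound in seconds
worst_case_delay() {
    echo $(( ATTEMPTS * INTERVAL * $1 ))
}

worst_case_delay 3   # RPC + NFS + mountd: prints 180 (3 minutes)
worst_case_delay 2   # no mountd probe (nfs version 4): prints 120
```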

How reproducible:
100%

Steps to Reproduce:
1. mount an nfs export on a client
2. do a hard disconnect of the nfs server (power off, or pull cable)
3. attempt to unmount the nfs mount on the client

Actual results:
nfs client hangs indefinitely


Expected results:
nfs client detects that the server is gone; if umount cannot succeed, self fencing occurs.
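
The expected ordering can be sketched as below: probe the server first, and only attempt umount if it answers. This is an illustration under stated assumptions, not the agent's implementation. The hooks server_alive, do_umount and do_fence and the function stop_mount are hypothetical names, do_fence only prints a message instead of rebooting, and returning failure when self_fence is disabled is a choice made for the sketch.

```shell
#!/bin/sh
# Sketch only. The three hooks are hypothetical placeholders, not functions
# from netfs.sh: a real agent would probe the server (e.g. with rpcinfo),
# run umount, and reboot the node.
server_alive() { return 0; }
do_umount()    { umount "$1"; }
do_fence()     { echo "NFS server not responding - REBOOTING (placeholder)"; }

# stop_mount MOUNTPOINT SELF_FENCE
# SELF_FENCE is "yes" to allow self fencing, anything else to forbid it.
stop_mount() {
    mountpoint=$1
    self_fence=$2

    # Probe the server before umount, so a dead server is noticed
    # before any umount call gets the chance to hang.
    if ! server_alive; then
        if [ "$self_fence" = "yes" ]; then
            do_fence
        fi
        # Assumption: without self_fence the sketch just reports failure.
        return 1
    fi

    do_umount "$mountpoint"
}
```

The point of the ordering is that the fencing decision no longer depends on umount returning, which is the step that can block indefinitely.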

Comment 1 David Vossel 2013-10-28 18:01:12 UTC
upstream patch related to this issue.

https://github.com/davidvossel/resource-agents/commit/617e52862264e07dce5c0a1b2c693a9073458341

Comment 8 michal novacek 2014-07-23 13:28:07 UTC
I have verified that with resource-agents-3.9.5-11.el6.x86_64 the node
self-fences when trying to umount an unreachable nfs mount.

# export \
OCF_FUNCTIONS_DIR=/usr/lib/ocf/lib/heartbeat \
OCF_RESKEY_name=nfsmount \
OCF_RESKEY_host=10.34.70.155 \
OCF_RESKEY_mountpoint=/mnt \
OCF_RESKEY_export=/mnt/shared0 \
OCF_RESKEY_fstype=nfs OCF_RESKEY_self_fence=yes

# /usr/share/cluster/netfs.sh start
# mount | grep shared
10.34.70.155:/mnt/shared0 on /mnt type nfs (rw,sync,soft,noac,vers=4,addr=10.34.70.155,clientaddr=10.34.71.133)

# ssh 10.34.70.155 "iptables -I INPUT 1 -s $(hostname -f) -j DROP" 

# date; /usr/share/cluster/netfs.sh stop
Wed Jul 23 15:18:25 CEST 2014
<info>   pre unmount: checking if nfs server 10.34.70.155 is alive
[netfs.sh] pre unmount: checking if nfs server 10.34.70.155 is alive
<debug>  Testing generic rpc access on server 10.34.70.155 with protocol tcp
[netfs.sh] Testing generic rpc access on server 10.34.70.155 with protocol tcp
<alert>  RPC server on 10.34.70.155 with tcp is not responding
[netfs.sh] RPC server on 10.34.70.155 with tcp is not responding
<alert>  NFS server not responding - REBOOTING
[netfs.sh] NFS server not responding - REBOOTING

</var/log/messages shows the following>
...
Jul 23 15:17:55 virt-133 kernel: FS-Cache: Loaded
Jul 23 15:17:55 virt-133 kernel: NFS: Registering the id_resolver key type
Jul 23 15:17:55 virt-133 kernel: FS-Cache: Netfs 'nfs' registered for caching
Jul 23 15:21:52 virt-133 kernel: nfs: server 10.34.70.155 not responding, timed out
Jul 23 15:21:52 virt-133 rgmanager[2291]: [netfs.sh] pre unmount: checking if nfs server 10.34.70.155 is alive

Jul 23 15:22:55 virt-133 rgmanager[2349]: [netfs.sh] RPC server on 10.34.70.155 with tcp is not responding
Jul 23 15:22:55 virt-133 rgmanager[2353]: [netfs.sh] NFS server not responding - REBOOTING

Comment 9 errata-xmlrpc 2014-10-14 04:59:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1428.html

