Description of problem:

An NFS mount whose server goes away can hang literally forever. Even with the soft mount option, some operations such as sync and umount can hang (depending on an endless matrix of kernel versions). We recently exposed self_fence for netfs, but that kicks in too late in the process. In some cases, with many NFS umount operations in progress (generally 5 are enough to reproduce), we may never reach self_fence.

We address the problem by checking whether the RPC, NFS, and MOUNTD services are alive via rpcinfo calls (mountd is checked only for NFS < version 4, since version 4 does not use mountd). If none of them respond, we proceed directly to self_fence, as there is no point in retrying. rpcinfo attempts to contact the server 4 times, once every 15 seconds, so if the network or server is merely losing packets this code path could theoretically add an extra 3 minutes to a stop operation.

How reproducible:
100%

Steps to Reproduce:
1. Mount an NFS client.
2. Hard-disconnect the NFS server (power off, or pull the cable).
3. Attempt to unmount the NFS client.

Actual results:
The NFS client hangs indefinitely.

Expected results:
The NFS client detects that the server has gone; if umount cannot succeed, self-fencing occurs.
Upstream patch related to this issue:
https://github.com/davidvossel/resource-agents/commit/617e52862264e07dce5c0a1b2c693a9073458341
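The liveness check described above (generic RPC access first, then the nfs service, then mountd for pre-v4 mounts, self-fencing only when nothing responds) can be sketched roughly as follows. This is not the actual netfs.sh code: SERVER, NFS_VERSION, RPCINFO, and probe() are hypothetical names introduced for illustration, and the RPCINFO variable exists only so the sketch can be dry-run without a live server.

```shell
#!/bin/sh
# Hedged sketch of the pre-unmount liveness check -- not the actual
# netfs.sh code. SERVER, NFS_VERSION, RPCINFO and probe() are
# illustrative names; only the rpcinfo tool itself is real.

SERVER="${SERVER:-localhost}"
NFS_VERSION="${NFS_VERSION:-3}"
RPCINFO="${RPCINFO:-rpcinfo}"   # overridable so the sketch can be dry-run

probe() {
    # Null-call the named RPC service over TCP; rpcinfo performs its own
    # retries, so an unreachable server fails after rpcinfo's timeout.
    "$RPCINFO" -t "$SERVER" "$1" >/dev/null 2>&1
}

server_alive() {
    # Generic RPC reachability first (query the server's portmapper).
    "$RPCINFO" -p "$SERVER" >/dev/null 2>&1 || return 1
    probe nfs || return 1
    # mountd only matters for NFS protocol versions before 4.
    if [ "$NFS_VERSION" -lt 4 ]; then
        probe mountd || return 1
    fi
    return 0
}

if server_alive; then
    echo "NFS server $SERVER is alive, proceeding with normal unmount"
else
    echo "NFS server $SERVER not responding - self-fencing" >&2
    # The real agent reboots the node at this point; echoed here so the
    # sketch is safe to run.
    echo "would self-fence (reboot) now"
fi
```

Because the agent only fences after all probes fail, a merely slow server still gets the full rpcinfo retry window before the node reboots, which matches the "up to 3 extra minutes" estimate in the description.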
I have verified that with resource-agents-3.9.5-11.el6.x86_64 the node self-fences when trying to umount an unreachable NFS mount.

# export \
    OCF_FUNCTIONS_DIR=/usr/lib/ocf/lib/heartbeat \
    OCF_RESKEY_name=nfsmount \
    OCF_RESKEY_host=10.34.70.155 \
    OCF_RESKEY_mountpoint=/mnt \
    OCF_RESKEY_export=/mnt/shared0 \
    OCF_RESKEY_fstype=nfs OCF_RESKEY_self_fence=yes
# /usr/share/cluster/netfs.sh start
# mount | grep shared
10.34.70.155:/mnt/shared0 on /mnt type nfs (rw,sync,soft,noac,vers=4,addr=10.34.70.155,clientaddr=10.34.71.133)
# ssh 10.34.70.155 "iptables -I INPUT 1 -s $(hostname -f) -j DROP"
# date; /usr/share/cluster/netfs.sh stop
Wed Jul 23 15:18:25 CEST 2014
<info> pre unmount: checking if nfs server 10.34.70.155 is alive [netfs.sh] pre unmount: checking if nfs server 10.34.70.155 is alive
<debug> Testing generic rpc access on server 10.34.70.155 with protocol tcp [netfs.sh] Testing generic rpc access on server 10.34.70.155 with protocol tcp
<alert> RPC server on 10.34.70.155 with tcp is not responding [netfs.sh] RPC server on 10.34.70.155 with tcp is not responding
<alert> NFS server not responding - REBOOTING [netfs.sh] NFS server not responding - REBOOTING

/var/log/messages shows the following:
...
Jul 23 15:17:55 virt-133 kernel: FS-Cache: Loaded
Jul 23 15:17:55 virt-133 kernel: NFS: Registering the id_resolver key type
Jul 23 15:17:55 virt-133 kernel: FS-Cache: Netfs 'nfs' registered for caching
Jul 23 15:21:52 virt-133 kernel: nfs: server 10.34.70.155 not responding, timed out
Jul 23 15:21:52 virt-133 rgmanager[2291]: [netfs.sh] pre unmount: checking if nfs server 10.34.70.155 is alive
Jul 23 15:22:55 virt-133 rgmanager[2349]: [netfs.sh] RPC server on 10.34.70.155 with tcp is not responding
Jul 23 15:22:55 virt-133 rgmanager[2353]: [netfs.sh] NFS server not responding - REBOOTING
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1428.html