1504112 – nfsserver resource fails to stop due to active threads

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1504112 - nfsserver resource fails to stop due to active threads

Summary: nfsserver resource fails to stop due to active threads

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	resource-agents
Sub Component:
Version:	7.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Oyvind Albrigtsen
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2017-10-19 13:59 UTC by Josef Zimek
Modified:	2021-12-10 15:20 UTC (History)
CC List:	9 users (show)
Fixed In Version:	resource-agents-3.9.5-112.el7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-04-10 12:09:28 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	5911371	0	None	None	None	2021-03-29 09:39:06 UTC
Red Hat Product Errata	RHBA-2018:0757	0	None	None	None	2018-04-10 12:10:40 UTC

Description Josef Zimek 2017-10-19 13:59:28 UTC

Description of problem:


nfsserver resource fails to stop due to present threads:

Jun  9 18:58:34 node1 crmd[6350]:  notice: Initiating stop operation nfsserver1_stop_0 locally on node1.dot.com
Jun  9 18:58:34 node1 nfsserver(nfsserver1)[1020]: INFO: Stopping NFS server ...
Jun  9 18:58:34 node1 systemd: Stopping NFS server and services...
Jun  9 18:58:34 node1 systemd: Stopped NFS server and services.
Jun  9 18:58:34 node1 systemd: Stopping NFS Mount Daemon...
Jun  9 18:58:34 node1 systemd: Stopping NFSv4 ID-name mapping service...
Jun  9 18:58:34 node1 systemd: Stopped NFSv4 ID-name mapping service.
Jun  9 18:58:34 node1 nfsserver(nfsserver1)[1020]: INFO: Stop: threads
Jun  9 18:58:34 node1 rpc.mountd[14696]: Caught signal 15, un-registering and exiting.
Jun  9 18:58:34 node1 systemd: Stopped NFS Mount Daemon.
Jun  9 18:58:34 node1 nfsserver(nfsserver1)[1020]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Jun  9 18:58:34 node1 lrmd[6341]:  notice: nfsserver1_stop_0:1020:stderr [ ocf-exit-reason:NFS server failed to stop: /proc/fs/nfsd/threads ]
Jun  9 18:58:34 node1 crmd[6350]:  notice: Result of stop operation for nfsserver1 on node1.dot.com: 1 (unknown error)
Jun  9 18:58:34 node1 crmd[6350]:  notice: node1.dot.com-nfsserver1_stop_0:467 [ ocf-exit-reason:NFS server failed to stop: /proc/fs/nfsd/threads\n ] 
Jun  9 18:58:34 node1 crmd[6350]: warning: Action 171 (nfsserver1_stop_0) on node1.dot.com failed (target: 0 vs. rc: 1): Error
Jun  9 18:58:34 node1 crmd[6350]:  notice: Transition aborted by operation nfsserver1_stop_0 'modify' on node1.dot.com: Event failed
Jun  9 18:58:34 node1 crmd[6350]: warning: Action 171 (nfsserver1_stop_0) on node1.dot.com failed (target: 0 vs. rc: 1): Error



This is piece of nfsserver resource agent code with corresponding task during stop function:
 

	case $EXEC_MODE in
            [23]) ocf_log info "Stop: threads"
		  tfn="/proc/fs/nfsd/threads"
		  if [ -f "$tfn" ] && [ "$(cat $tfn)" -gt "0" ]; then
			ocf_exit_reason "NFS server failed to stop: /proc/fs/nfsd/threads"
			return $OCF_ERR_GENERIC
fi




Version-Release number of selected component (if applicable):
resource-agents-3.9.5-82.el7_3.9.x86_64

How reproducible:
randomly - depends on presence of threads


Actual results:
nfsserver resource immediately fails to stop  - this typically leads to unnecessary fencing of node due to on-fail=fence policy of stop operation

Expected results:
nfsserver resource handles active threads in more meaningful way i.e. taking into account stop operation timeout to let threads to clean

Comment 3 Oyvind Albrigtsen 2017-11-02 11:54:35 UTC

https://github.com/ClusterLabs/resource-agents/pull/1045

Comment 6 Monika Muzikovska 2018-01-04 11:06:41 UTC

I have verified that the patch is present in resource-agents-3.9.5-118.el7.src.rpm as bz1504112-nfsserver-allow-stop-to-timeout.patch and that the package compiles with that patch.

Comment 10 errata-xmlrpc 2018-04-10 12:09:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0757

Comment 11 Wei Guo 2018-12-07 17:01:26 UTC

We recently have the same problem with our active/standby nfs cluster using pacemaker. 
Although we are using the patched code level, the problem still occurs and causes fencing and failover.
Please advise. 

We are using RHEL 7.3 OS but installing HA packages with CentOS repo. 

from rhel-x86_64-server-7
kernel: 3.10.0-514.el7.x86_64
OS: Red Hat Enterprise Linux Server release 7.3 (Maipo)
nfs-utils-1.3.0-0.54.el7.x86_64

from centos-7-base:
resource-agents-3.9.5-124.el7.x86_64
pacemaker-cli-1.1.18-11.el7.x86_64
pacemaker-libs-1.1.18-11.el7.x86_64
pacemaker-cluster-libs-1.1.18-11.el7.x86_64
pacemaker-1.1.18-11.el7.x86_64
corosynclib-2.4.3-2.el7.x86_64
corosync-2.4.3-2.el7.x86_64

The error message: 
Dec  4 18:53:41 nfs1 nfsserver(nfsDaemon)[118008]: INFO: Status: rpcbind
Dec  4 18:53:42 nfs1 nfsserver(nfsDaemon)[118008]: INFO: Status: nfs-mountd
Dec  4 18:53:52 nfs1 nfsserver(nfsDaemon)[118008]: ERROR: nfs-mountd is not running
Dec  4 18:53:52 nfs1 lrmd[8316]:  notice: nfsDaemon_monitor_10000:118008:stderr [ ocf-exit-reason:nfs-mountd is not running ]
Dec  4 18:53:52 nfs1 crmd[8319]:  notice: nfs1-nfsDaemon_monitor_10000:41 [ ocf-exit-reason:nfs-mountd is not running\n ]
Dec  4 18:53:52 nfs1 crmd[8319]:  notice: State transition S_IDLE -> S_POLICY_ENGINE
Dec  4 18:53:52 nfs1 pengine[8318]: warning: Processing failed op monitor for nfsDaemon on nfs1: not running (7)
Dec  4 18:53:52 nfs1 pengine[8318]:  notice:  * Recover    nfsDaemon         (         nfs1 )
Dec  4 18:53:52 nfs1 pengine[8318]:  notice: Calculated transition 34, saving inputs in /var/lib/pacemaker/pengine/pe-input-191.bz2
Dec  4 18:53:52 nfs1 pengine[8318]: warning: Processing failed op monitor for nfsDaemon on nfs1: not running (7)
Dec  4 18:53:52 nfs1 pengine[8318]:  notice:  * Recover    nfsDaemon         (         nfs1 )
Dec  4 18:53:52 nfs1 pengine[8318]:  notice: Calculated transition 35, saving inputs in /var/lib/pacemaker/pengine/pe-input-192.bz2
Dec  4 18:53:52 nfs1 crmd[8319]:  notice: Initiating stop operation nfsDaemon_stop_0 locally on nfs1
Dec  4 18:53:52 nfs1 crmd[8319]:  notice: nfs1-nfsDaemon_monitor_10000:41 [ ocf-exit-reason:nfs-mountd is not running\n ]
Dec  4 18:53:52 nfs1 nfsserver(nfsDaemon)[118138]: INFO: Stopping NFS server ...
Dec  4 18:53:52 nfs1 systemd: Stopping NFS server and services...
Dec  4 18:53:52 nfs1 systemd: Stopped NFS server and services.
Dec  4 18:53:52 nfs1 systemd: Stopping NFS Mount Daemon...
Dec  4 18:53:52 nfs1 systemd: Stopping NFSv4 ID-name mapping service...
Dec  4 18:53:52 nfs1 systemd: Stopped NFSv4 ID-name mapping service.
Dec  4 18:53:52 nfs1 nfsserver(nfsDaemon)[118138]: INFO: Stop: threads
Dec  4 18:53:52 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Dec  4 18:53:53 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Dec  4 18:53:54 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Dec  4 18:53:55 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Dec  4 18:53:56 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Dec  4 18:53:57 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Dec  4 18:53:58 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Dec  4 18:53:59 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Dec  4 18:54:00 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Dec  4 18:54:01 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Dec  4 18:54:02 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Dec  4 18:54:03 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Dec  4 18:54:04 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Dec  4 18:54:05 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Dec  4 18:54:06 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Dec  4 18:54:07 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Dec  4 18:54:08 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Dec  4 18:54:09 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Dec  4 18:54:10 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Dec  4 18:54:11 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads
Dec  4 18:54:12 nfs1 lrmd[8316]: warning: nfsDaemon_stop_0 process (PID 118138) timed out
Dec  4 18:54:12 nfs1 lrmd[8316]: warning: nfsDaemon_stop_0:118138 - timed out after 20000ms
Dec  4 18:54:12 nfs1 crmd[8319]:   error: Result of stop operation for nfsDaemon on nfs1: Timed Out
Dec  4 18:54:12 nfs1 crmd[8319]: warning: Action 8 (nfsDaemon_stop_0) on nfs1 failed (target: 0 vs. rc: 1): Error
Dec  4 18:54:12 nfs1 crmd[8319]:  notice: Transition aborted by operation nfsDaemon_stop_0 'modify' on nfs1: Event failed
Dec  4 18:54:12 nfs1 crmd[8319]: warning: Action 8 (nfsDaemon_stop_0) on nfs1 failed (target: 0 vs. rc: 1): Error
Dec  4 18:54:12 nfs1 crmd[8319]:  notice: Transition 35 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-input-192.bz2): Complete
Dec  4 18:54:12 nfs1 pengine[8318]: warning: Processing failed op stop for nfsDaemon on nfs1: unknown error (1)
Dec  4 18:54:12 nfs1 pengine[8318]: warning: Processing failed op stop for nfsDaemon on nfs1: unknown error (1)
Dec  4 18:54:12 nfs1 pengine[8318]: warning: Cluster node nfs1 will be fenced: nfsDaemon failed there
Dec  4 18:54:12 nfs1 pengine[8318]: warning: Scheduling Node nfs1 for STONITH
Dec  4 18:54:12 nfs1 pengine[8318]:  notice: Stop of failed resource nfsDaemon is implicit after nfs1 is fenced
Dec  4 18:54:12 nfs1 pengine[8318]:  notice:  * Fence (reboot) nfs1 'nfsDaemon failed there'
Dec  4 18:54:12 nfs1 pengine[8318]:  notice:  * Move       clusterVIP        ( nfs1 -> nfs0 )
Dec  4 18:54:12 nfs1 pengine[8318]:  notice:  * Move       clusterLVM        ( nfs1 -> nfs0 )
Dec  4 18:54:12 nfs1 pengine[8318]:  notice:  * Move       clusterhome       ( nfs1 -> nfs0 )
Dec  4 18:54:12 nfs1 pengine[8318]:  notice:  * Recover    nfsDaemon         ( nfs1 -> nfs0 )
Dec  4 18:54:12 nfs1 pengine[8318]:  notice:  * Move       clusterVIPtcp     ( nfs1 -> nfs0 )
Dec  4 18:54:12 nfs1 pengine[8318]:  notice:  * Stop       nfs1Fencenfs0     (         nfs1 )   due to node availability
Dec  4 18:54:12 nfs1 pengine[8318]: warning: Calculated transition 36 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-3.bz2
Dec  4 18:54:12 nfs1 pengine[8318]: warning: Processing failed op stop for nfsDaemon on nfs1: unknown error (1)
Dec  4 18:54:12 nfs1 pengine[8318]: warning: Processing failed op stop for nfsDaemon on nfs1: unknown error (1)
Dec  4 18:54:12 nfs1 pengine[8318]: warning: Cluster node nfs1 will be fenced: nfsDaemon failed there
Dec  4 18:54:12 nfs1 pengine[8318]: warning: Forcing nfsDaemon away from nfs1 after 1000000 failures (max=1000000)
Dec  4 18:54:12 nfs1 pengine[8318]: warning: Scheduling Node nfs1 for STONITH
Dec  4 18:54:12 nfs1 pengine[8318]:  notice: Stop of failed resource nfsDaemon is implicit after nfs1 is fenced
Dec  4 18:54:12 nfs1 pengine[8318]:  notice:  * Fence (reboot) nfs1 'nfsDaemon failed there'
Dec  4 18:54:12 nfs1 pengine[8318]:  notice:  * Move       clusterVIP        ( nfs1 -> nfs0 )
Dec  4 18:54:12 nfs1 pengine[8318]:  notice:  * Move       clusterLVM        ( nfs1 -> nfs0 )
Dec  4 18:54:12 nfs1 pengine[8318]:  notice:  * Move       clusterhome       ( nfs1 -> nfs0 )
Dec  4 18:54:12 nfs1 pengine[8318]:  notice:  * Recover    nfsDaemon         ( nfs1 -> nfs0 )
Dec  4 18:54:12 nfs1 pengine[8318]:  notice:  * Move       clusterVIPtcp     ( nfs1 -> nfs0 )
Dec  4 18:54:12 nfs1 pengine[8318]:  notice:  * Stop       nfs1Fencenfs0     (         nfs1 )   due to node availability
Dec  4 18:54:12 nfs1 pengine[8318]: warning: Calculated transition 37 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-4.bz2
Dec  4 18:54:12 nfs1 crmd[8319]:  notice: Requesting fencing (reboot) of node nfs1
Dec  4 18:54:12 nfs1 stonith-ng[8315]:  notice: Client crmd.8319.0d1bcfe1 wants to fence (reboot) 'nfs1' with device '(any)'
Dec  4 18:54:12 nfs1 stonith-ng[8315]:  notice: Requesting peer fencing (reboot) of nfs1
Dec  4 18:54:13 nfs1 systemd-logind: Power key pressed.
Dec  4 18:54:13 nfs1 systemd-logind: Powering Off...
Dec  4 18:54:13 nfs1 systemd-logind: System is powering down.

Comment 12 Oyvind Albrigtsen 2019-01-02 11:05:00 UTC

The stop-action times out, so you can try increasing the stop timeout for the resource.

Comment 13 Wei Guo 2019-01-02 19:52:08 UTC

I will give it a shot with op stop timeout=35. But the nfsserver_stop is not the cause but the outcome, right?Thanks.

Comment 14 Oyvind Albrigtsen 2019-01-03 08:11:51 UTC

It is because there are still active NFS threads, so you could try debugging why NFS is slow to stop.

You could also try the newest version of resource-agents available in RHEL7/CentOS7 to see if some of the additional bugfixes/improvements there might solve the issue for you.

Note You need to log in before you can comment on or make changes to this bug.