Bug 1504112
| Summary: | nfsserver resource fails to stop due to active threads | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Josef Zimek <pzimek> |
| Component: | resource-agents | Assignee: | Oyvind Albrigtsen <oalbrigt> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 7.5 | CC: | agk, cluster-maint, fdinitto, hbiswas, mmuzikov, mnovacek, oalbrigt, phagara, wei1.guo |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | resource-agents-3.9.5-112.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2018-04-10 12:09:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
I have verified that the patch is present in resource-agents-3.9.5-118.el7.src.rpm as bz1504112-nfsserver-allow-stop-to-timeout.patch and that the package compiles with that patch. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0757 We recently have the same problem with our active/standby nfs cluster using pacemaker. Although we are using the patched code level, the problem still occurs and causes fencing and failover. Please advise. We are using RHEL 7.3 OS but installing HA packages with CentOS repo. from rhel-x86_64-server-7 kernel: 3.10.0-514.el7.x86_64 OS: Red Hat Enterprise Linux Server release 7.3 (Maipo) nfs-utils-1.3.0-0.54.el7.x86_64 from centos-7-base: resource-agents-3.9.5-124.el7.x86_64 pacemaker-cli-1.1.18-11.el7.x86_64 pacemaker-libs-1.1.18-11.el7.x86_64 pacemaker-cluster-libs-1.1.18-11.el7.x86_64 pacemaker-1.1.18-11.el7.x86_64 corosynclib-2.4.3-2.el7.x86_64 corosync-2.4.3-2.el7.x86_64 The error message: Dec 4 18:53:41 nfs1 nfsserver(nfsDaemon)[118008]: INFO: Status: rpcbind Dec 4 18:53:42 nfs1 nfsserver(nfsDaemon)[118008]: INFO: Status: nfs-mountd Dec 4 18:53:52 nfs1 nfsserver(nfsDaemon)[118008]: ERROR: nfs-mountd is not running Dec 4 18:53:52 nfs1 lrmd[8316]: notice: nfsDaemon_monitor_10000:118008:stderr [ ocf-exit-reason:nfs-mountd is not running ] Dec 4 18:53:52 nfs1 crmd[8319]: notice: nfs1-nfsDaemon_monitor_10000:41 [ ocf-exit-reason:nfs-mountd is not running\n ] Dec 4 18:53:52 nfs1 crmd[8319]: notice: State transition S_IDLE -> S_POLICY_ENGINE Dec 4 18:53:52 nfs1 pengine[8318]: warning: Processing failed op monitor for nfsDaemon on nfs1: not running (7) Dec 4 18:53:52 nfs1 pengine[8318]: notice: * Recover nfsDaemon ( nfs1 ) Dec 4 18:53:52 nfs1 pengine[8318]: notice: Calculated transition 34, saving inputs in /var/lib/pacemaker/pengine/pe-input-191.bz2 Dec 4 18:53:52 nfs1 pengine[8318]: warning: Processing failed op monitor for nfsDaemon on nfs1: not running (7) Dec 4 18:53:52 nfs1 pengine[8318]: notice: * Recover nfsDaemon ( nfs1 ) Dec 4 18:53:52 nfs1 pengine[8318]: notice: Calculated transition 35, saving inputs in /var/lib/pacemaker/pengine/pe-input-192.bz2 Dec 4 18:53:52 nfs1 crmd[8319]: notice: Initiating stop operation nfsDaemon_stop_0 locally on nfs1 Dec 4 18:53:52 nfs1 crmd[8319]: notice: nfs1-nfsDaemon_monitor_10000:41 [ ocf-exit-reason:nfs-mountd is not running\n ] Dec 4 18:53:52 nfs1 nfsserver(nfsDaemon)[118138]: INFO: Stopping NFS server ... Dec 4 18:53:52 nfs1 systemd: Stopping NFS server and services... Dec 4 18:53:52 nfs1 systemd: Stopped NFS server and services. Dec 4 18:53:52 nfs1 systemd: Stopping NFS Mount Daemon... Dec 4 18:53:52 nfs1 systemd: Stopping NFSv4 ID-name mapping service... Dec 4 18:53:52 nfs1 systemd: Stopped NFSv4 ID-name mapping service. Dec 4 18:53:52 nfs1 nfsserver(nfsDaemon)[118138]: INFO: Stop: threads Dec 4 18:53:52 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Dec 4 18:53:53 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Dec 4 18:53:54 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Dec 4 18:53:55 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Dec 4 18:53:56 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Dec 4 18:53:57 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Dec 4 18:53:58 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Dec 4 18:53:59 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Dec 4 18:54:00 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Dec 4 18:54:01 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Dec 4 18:54:02 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Dec 4 18:54:03 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Dec 4 18:54:04 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Dec 4 18:54:05 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Dec 4 18:54:06 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Dec 4 18:54:07 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Dec 4 18:54:08 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Dec 4 18:54:09 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Dec 4 18:54:10 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Dec 4 18:54:11 nfs1 nfsserver(nfsDaemon)[118138]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Dec 4 18:54:12 nfs1 lrmd[8316]: warning: nfsDaemon_stop_0 process (PID 118138) timed out Dec 4 18:54:12 nfs1 lrmd[8316]: warning: nfsDaemon_stop_0:118138 - timed out after 20000ms Dec 4 18:54:12 nfs1 crmd[8319]: error: Result of stop operation for nfsDaemon on nfs1: Timed Out Dec 4 18:54:12 nfs1 crmd[8319]: warning: Action 8 (nfsDaemon_stop_0) on nfs1 failed (target: 0 vs. rc: 1): Error Dec 4 18:54:12 nfs1 crmd[8319]: notice: Transition aborted by operation nfsDaemon_stop_0 'modify' on nfs1: Event failed Dec 4 18:54:12 nfs1 crmd[8319]: warning: Action 8 (nfsDaemon_stop_0) on nfs1 failed (target: 0 vs. rc: 1): Error Dec 4 18:54:12 nfs1 crmd[8319]: notice: Transition 35 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-input-192.bz2): Complete Dec 4 18:54:12 nfs1 pengine[8318]: warning: Processing failed op stop for nfsDaemon on nfs1: unknown error (1) Dec 4 18:54:12 nfs1 pengine[8318]: warning: Processing failed op stop for nfsDaemon on nfs1: unknown error (1) Dec 4 18:54:12 nfs1 pengine[8318]: warning: Cluster node nfs1 will be fenced: nfsDaemon failed there Dec 4 18:54:12 nfs1 pengine[8318]: warning: Scheduling Node nfs1 for STONITH Dec 4 18:54:12 nfs1 pengine[8318]: notice: Stop of failed resource nfsDaemon is implicit after nfs1 is fenced Dec 4 18:54:12 nfs1 pengine[8318]: notice: * Fence (reboot) nfs1 'nfsDaemon failed there' Dec 4 18:54:12 nfs1 pengine[8318]: notice: * Move clusterVIP ( nfs1 -> nfs0 ) Dec 4 18:54:12 nfs1 pengine[8318]: notice: * Move clusterLVM ( nfs1 -> nfs0 ) Dec 4 18:54:12 nfs1 pengine[8318]: notice: * Move clusterhome ( nfs1 -> nfs0 ) Dec 4 18:54:12 nfs1 pengine[8318]: notice: * Recover nfsDaemon ( nfs1 -> nfs0 ) Dec 4 18:54:12 nfs1 pengine[8318]: notice: * Move clusterVIPtcp ( nfs1 -> nfs0 ) Dec 4 18:54:12 nfs1 pengine[8318]: notice: * Stop nfs1Fencenfs0 ( nfs1 ) due to node availability Dec 4 18:54:12 nfs1 pengine[8318]: warning: Calculated transition 36 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-3.bz2 Dec 4 18:54:12 nfs1 pengine[8318]: warning: Processing failed op stop for nfsDaemon on nfs1: unknown error (1) Dec 4 18:54:12 nfs1 pengine[8318]: warning: Processing failed op stop for nfsDaemon on nfs1: unknown error (1) Dec 4 18:54:12 nfs1 pengine[8318]: warning: Cluster node nfs1 will be fenced: nfsDaemon failed there Dec 4 18:54:12 nfs1 pengine[8318]: warning: Forcing nfsDaemon away from nfs1 after 1000000 failures (max=1000000) Dec 4 18:54:12 nfs1 pengine[8318]: warning: Scheduling Node nfs1 for STONITH Dec 4 18:54:12 nfs1 pengine[8318]: notice: Stop of failed resource nfsDaemon is implicit after nfs1 is fenced Dec 4 18:54:12 nfs1 pengine[8318]: notice: * Fence (reboot) nfs1 'nfsDaemon failed there' Dec 4 18:54:12 nfs1 pengine[8318]: notice: * Move clusterVIP ( nfs1 -> nfs0 ) Dec 4 18:54:12 nfs1 pengine[8318]: notice: * Move clusterLVM ( nfs1 -> nfs0 ) Dec 4 18:54:12 nfs1 pengine[8318]: notice: * Move clusterhome ( nfs1 -> nfs0 ) Dec 4 18:54:12 nfs1 pengine[8318]: notice: * Recover nfsDaemon ( nfs1 -> nfs0 ) Dec 4 18:54:12 nfs1 pengine[8318]: notice: * Move clusterVIPtcp ( nfs1 -> nfs0 ) Dec 4 18:54:12 nfs1 pengine[8318]: notice: * Stop nfs1Fencenfs0 ( nfs1 ) due to node availability Dec 4 18:54:12 nfs1 pengine[8318]: warning: Calculated transition 37 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-4.bz2 Dec 4 18:54:12 nfs1 crmd[8319]: notice: Requesting fencing (reboot) of node nfs1 Dec 4 18:54:12 nfs1 stonith-ng[8315]: notice: Client crmd.8319.0d1bcfe1 wants to fence (reboot) 'nfs1' with device '(any)' Dec 4 18:54:12 nfs1 stonith-ng[8315]: notice: Requesting peer fencing (reboot) of nfs1 Dec 4 18:54:13 nfs1 systemd-logind: Power key pressed. Dec 4 18:54:13 nfs1 systemd-logind: Powering Off... Dec 4 18:54:13 nfs1 systemd-logind: System is powering down. The stop-action times out, so you can try increasing the stop timeout for the resource. I will give it a shot with op stop timeout=35. But the nfsserver_stop is not the cause but the outcome, right?Thanks. It is because there are still active NFS threads, so you could try debugging why NFS is slow to stop. You could also try the newest version of resource-agents available in RHEL7/CentOS7 to see if some of the additional bugfixes/improvements there might solve the issue for you. |
Description of problem: nfsserver resource fails to stop due to present threads: Jun 9 18:58:34 node1 crmd[6350]: notice: Initiating stop operation nfsserver1_stop_0 locally on node1.dot.com Jun 9 18:58:34 node1 nfsserver(nfsserver1)[1020]: INFO: Stopping NFS server ... Jun 9 18:58:34 node1 systemd: Stopping NFS server and services... Jun 9 18:58:34 node1 systemd: Stopped NFS server and services. Jun 9 18:58:34 node1 systemd: Stopping NFS Mount Daemon... Jun 9 18:58:34 node1 systemd: Stopping NFSv4 ID-name mapping service... Jun 9 18:58:34 node1 systemd: Stopped NFSv4 ID-name mapping service. Jun 9 18:58:34 node1 nfsserver(nfsserver1)[1020]: INFO: Stop: threads Jun 9 18:58:34 node1 rpc.mountd[14696]: Caught signal 15, un-registering and exiting. Jun 9 18:58:34 node1 systemd: Stopped NFS Mount Daemon. Jun 9 18:58:34 node1 nfsserver(nfsserver1)[1020]: ERROR: NFS server failed to stop: /proc/fs/nfsd/threads Jun 9 18:58:34 node1 lrmd[6341]: notice: nfsserver1_stop_0:1020:stderr [ ocf-exit-reason:NFS server failed to stop: /proc/fs/nfsd/threads ] Jun 9 18:58:34 node1 crmd[6350]: notice: Result of stop operation for nfsserver1 on node1.dot.com: 1 (unknown error) Jun 9 18:58:34 node1 crmd[6350]: notice: node1.dot.com-nfsserver1_stop_0:467 [ ocf-exit-reason:NFS server failed to stop: /proc/fs/nfsd/threads\n ] Jun 9 18:58:34 node1 crmd[6350]: warning: Action 171 (nfsserver1_stop_0) on node1.dot.com failed (target: 0 vs. rc: 1): Error Jun 9 18:58:34 node1 crmd[6350]: notice: Transition aborted by operation nfsserver1_stop_0 'modify' on node1.dot.com: Event failed Jun 9 18:58:34 node1 crmd[6350]: warning: Action 171 (nfsserver1_stop_0) on node1.dot.com failed (target: 0 vs. rc: 1): Error This is piece of nfsserver resource agent code with corresponding task during stop function: case $EXEC_MODE in [23]) ocf_log info "Stop: threads" tfn="/proc/fs/nfsd/threads" if [ -f "$tfn" ] && [ "$(cat $tfn)" -gt "0" ]; then ocf_exit_reason "NFS server failed to stop: /proc/fs/nfsd/threads" return $OCF_ERR_GENERIC fi Version-Release number of selected component (if applicable): resource-agents-3.9.5-82.el7_3.9.x86_64 How reproducible: randomly - depends on presence of threads Actual results: nfsserver resource immediately fails to stop - this typically leads to unnecessary fencing of node due to on-fail=fence policy of stop operation Expected results: nfsserver resource handles active threads in more meaningful way i.e. taking into account stop operation timeout to let threads to clean