1337565 – "nfs-grace-monitor" timed out messages observed

Bug 1337565 - "nfs-grace-monitor" timed out messages observed

Summary: "nfs-grace-monitor" timed out messages observed

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	nfs-ganesha
Sub Component:
Version:	rhgs-3.1
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	RHGS 3.2.0
Assignee:	Kaleb KEITHLEY
QA Contact:	Manisha Saini
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1351522
TreeView+	depends on / blocked

Reported:	2016-05-19 13:24 UTC by Shashank Raj
Modified:	2017-03-23 05:32 UTC (History)
CC List:	9 users (show)
Fixed In Version:	glusterfs-3.8.4-6
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1394224 (view as bug list)
Environment:
Last Closed:	2017-03-23 05:32:07 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2017:0486	0	normal	SHIPPED_LIVE	Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update	2017-03-23 09:18:45 UTC

Description Shashank Raj 2016-05-19 13:24:53 UTC

Description of problem:

"nfs-grace-monitor" timed out messages observed in pcs status.

Version-Release number of selected component (if applicable):

glusterfs-3.7.9-5.el7rhgs.x86_64
nfs-ganesha-2.3.1-7.el7rhgs.x86_64

How reproducible:
2/2

Steps to Reproduce:
1.Create a 4 node ganesha cluster.
2.Create a volume, enable ganesha and mount the volume with vers=4.
3.Start executing iozone test suite on the mount point.
4.During the test, stop nfs-ganesha service on the mounted node and once the IO resumes from other node, after sometime start the ganesha service on that node.
5.Check the pcs status and observe that there are timed out messages for nfs-grace-monitor and nfs-mon-monitor as below:

Failed Actions:
* nfs-mon_monitor_10000 on dhcp46-26.lab.eng.blr.redhat.com 'unknown error' (1): call=21, status=Timed Out, exitreason='none',
    last-rc-change='Thu May 19 18:35:08 2016', queued=0ms, exec=0ms
* nfs-grace_monitor_5000 on dhcp46-202.lab.eng.blr.redhat.com 'unknown error' (1): call=22, status=Timed Out, exitreason='none',
    last-rc-change='Thu May 19 00:53:52 2016', queued=0ms, exec=0ms
* nfs-mon_monitor_10000 on dhcp46-202.lab.eng.blr.redhat.com 'unknown error' (1): call=21, status=Timed Out, exitreason='none',
    last-rc-change='Thu May 19 00:53:56 2016', queued=0ms, exec=0ms
* nfs-grace_monitor_5000 on dhcp46-247.lab.eng.blr.redhat.com 'unknown error' (1): call=46, status=Timed Out, exitreason='none',
    last-rc-change='Wed May 18 23:54:22 2016', queued=0ms, exec=0ms
* nfs-mon_monitor_10000 on dhcp46-247.lab.eng.blr.redhat.com 'unknown error' (1): call=21, status=Timed Out, exitreason='none',
    last-rc-change='Wed May 18 23:54:27 2016', queued=0ms, exec=0ms
* nfs-mon_monitor_10000 on dhcp47-139.lab.eng.blr.redhat.com 'unknown error' (1): call=21, status=Timed Out, exitreason='none',
    last-rc-change='Thu May 19 00:53:57 2016', queued=0ms, exec=0ms


6. Below messages are observed in ganesha.log

8/05/2016 23:49:15 : epoch 4cd30000 : dhcp46-247.lab.eng.blr.redhat.com : ganesha.nfsd-31387[work-6] nfs4_add_clid :CLIENT ID :EVENT :Failed to create client in recovery dir (/var/lib/nfs/ganesha/v4recov/::ffff:10.70.37.206-(86:0x4c696e7578204e465376342e302031302e37302e33372e3230362f31302e37302e34342e393220746370)), errno=107
18/05/2016 23:49:35 : epoch 4cd30000 : dhcp46-247.lab.eng.blr.redhat.com : ganesha.nfsd-31387[work-1] nfs4_add_clid :CLIENT ID :EVENT :Failed to create client in recovery dir (/var/lib/nfs/ganesha/v4recov/::ffff:10.70.37.136-(86:0x4c696e7578204e465376342e302031302e37302e33372e3133362f31302e37302e34342e393220746370)), errno=107
18/05/2016 23:50:43 : epoch 4cd30000 : dhcp46-247.lab.eng.blr.redhat.com : ganesha.nfsd-31387[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE
18/05/2016 23:50:43 : epoch 4cd30000 : dhcp46-247.lab.eng.blr.redhat.com : ganesha.nfsd-31387[reaper] nfs4_clean_old_recov_dir :CLIENT ID :EVENT :Failed to open old v4 recovery dir (/var/lib/nfs/ganesha/v4old), errno=107


18/05/2016 23:57:00 : epoch 00670000 : dhcp46-202.lab.eng.blr.redhat.com : ganesha.nfsd-25624[reaper] nfs4_rm_revoked_handles :CLIENT ID :EVENT :opendir /var/lib/nfs/ganesha/v4recov/::ffff:10.70.37.136-(86:0x4c696e7578204e465376342e302031302e37302e33372e3133362f31302e37302e34342e393220746370) failed errno=107
18/05/2016 23:57:00 : epoch 00670000 : dhcp46-202.lab.eng.blr.redhat.com : ganesha.nfsd-25624[reaper] nfs4_rm_clid :CLIENT ID :EVENT :Failed to remove client recovery dir (/var/lib/nfs/ganesha/v4recov/::ffff:10.70.37.136-(86:0x4c696e7578204e465376342e302031302e37302e33372e3133362f31302e37302e34342e393220746370)), errno=107
18/05/2016 23:57:10 : epoch 00670000 : dhcp46-202.lab.eng.blr.redhat.com : ganesha.nfsd-25624[reaper] nfs4_rm_revoked_handles :CLIENT ID :EVENT :opendir /var/lib/nfs/ganesha/v4recov/::ffff:10.70.37.206-(86:0x4c696e7578204e465376342e302031302e37302e33372e3230362f31302e37302e34342e393220746370) failed errno=107
18/05/2016 23:57:10 : epoch 00670000 : dhcp46-202.lab.eng.blr.redhat.com : ganesha.nfsd-25624[reaper] nfs4_rm_clid :CLIENT ID :EVENT :Failed to remove client recovery dir (/var/lib/nfs/ganesha/v4recov/::ffff:10.70.37.206-(86:0x4c696e7578204e465376342e302031302e37302e33372e3230362f31302e37302e34342e393220746370)), errno=107


18/05/2016 23:46:58 : epoch 00670000 : dhcp46-202.lab.eng.blr.redhat.com : ganesha.nfsd-25624[dbus_heartbeat] nfs4_load_recov_clids_nolock :CLIENT ID :EVENT :Recovery for nodeid -1 dir (/var/lib/nfs/ganesha/dhcp46-247.lab.eng.blr.redhat.com/v4recov)
18/05/2016 23:46:58 : epoch 00670000 : dhcp46-202.lab.eng.blr.redhat.com : ganesha.nfsd-25624[dbus_heartbeat] nfs4_load_recov_clids_nolock :CLIENT ID :EVENT :Failed to open v4 recovery dir (/var/lib/nfs/ganesha/dhcp46-247.lab.eng.blr.redhat.com/v4recov), errno=107
18/05/2016 23:47:00 : epoch 00670000 : dhcp46-202.lab.eng.blr.redhat.com : ganesha.nfsd-25624[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
18/05/2016 23:47:06 : epoch 00670000 : dhcp46-202.lab.eng.blr.redhat.com : ganesha.nfsd-25624[work-5] nfs4_add_clid :CLIENT ID :EVENT :Failed to create client in recovery dir (/var/lib/nfs/ganesha/v4recov/::ffff:10.70.37.136-(86:0x4c696e7578204e465376342e302031302e37302e33372e3133362f31302e37302e34342e393220746370)), errno=107
18/05/2016 23:47:08 : epoch 00670000 : dhcp46-202.lab.eng.blr.redhat.com : ganesha.nfsd-25624[work-9] nfs4_add_clid :CLIENT ID :EVENT :Failed to create client in recovery dir (/var/lib/nfs/ganesha/v4recov/::ffff:10.70.37.206-(86:0x4c696e7578204e465376342e302031302e37302e33372e3230362f31302e37302e34342e393220746370)), errno=107

Actual results:

"nfs-grace-monitor" timed out messages observed in pcs status.

Expected results:

there should not be any failures in pcs status.

Additional info:

sosreports are attached for the reference.

Comment 2 Shashank Raj 2016-05-19 13:28:44 UTC

sosreports are placed at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1337565

Comment 5 Shashank Raj 2016-06-01 09:41:33 UTC

With the latest glusterfs-3.7.9-7 and nfs-ganesha-2.3.1-7 build, observed that after setting up ganesha on a RHEL 7 platform, below messages are observed in pcs status:

Failed Actions:
* nfs-grace_monitor_0 on dhcp46-26.lab.eng.blr.redhat.com 'unknown error' (1): call=16, status=complete, exitreason='none',
    last-rc-change='Wed Jun  1 12:01:00 2016', queued=0ms, exec=86ms
* nfs-grace_monitor_0 on dhcp47-139.lab.eng.blr.redhat.com 'unknown error' (1): call=16, status=complete, exitreason='none',
    last-rc-change='Wed Jun  1 12:00:59 2016', queued=0ms, exec=102ms
* nfs-grace_monitor_0 on dhcp46-202.lab.eng.blr.redhat.com 'unknown error' (1): call=16, status=complete, exitreason='none',
    last-rc-change='Wed Jun  1 12:00:59 2016', queued=0ms, exec=84ms
* nfs-grace_monitor_0 on dhcp46-247.lab.eng.blr.redhat.com 'unknown error' (1): call=16, status=complete, exitreason='none',
    last-rc-change='Wed Jun  1 11:53:59 2016', queued=0ms, exec=92ms

This doesn't effect any functionality, initial tests for failover and failback works fine.

Reporting this because these messages were not seen after ganesha setup with earlier glusterfs/ganesha builds.

Comment 6 Shashank Raj 2016-09-29 13:33:52 UTC

Kaleb,

Can we confirm this behavior?

Comment 7 Niels de Vos 2016-10-14 10:21:17 UTC

The error messages in comment #0 mention "errno=107". This is:

    ENOTCONN: Transport endpoint is not connected

We need to understand what side-effects this problem can have.

Comment 9 Kaleb KEITHLEY 2016-11-08 17:00:29 UTC

I'm not sure what I was being asked to confirm. That these warnings were benign?

I installed the latest downstream: glusterfs*-3.8.4-3.rhgs and nfs-ganesha-2.4.1-1.rhgs on RHEL7.3.  Did not observe any nfs-grace-monitor timed out messages. From this I conclude that the fix provided in rhgs-3.1.3 has in fact resolved the issue.  If this is seen again in testing please open a new BZ.

Comment 12 Atin Mukherjee 2016-11-17 13:15:45 UTC

downstream patch : https://code.engineering.redhat.com/gerrit/#/c/90460/
Upstream change (master): http://review.gluster.org/#/c/15831 
Upstream change (3.9): http://review.gluster.org/#/c/15832
Upstream change (3.8): http://review.gluster.org/#/c/15833

Comment 15 RHEL Program Management 2016-12-15 04:22:46 UTC

Development Management has reviewed and declined this request.
You may appeal this decision by reopening this request.

Comment 16 Manisha Saini 2016-12-15 09:25:00 UTC

Verified this on 

# rpm -qa | grep ganesha
nfs-ganesha-gluster-2.4.1-2.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-8.el7rhgs.x86_64
nfs-ganesha-2.4.1-2.el7rhgs.x86_64

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.3 (Maipo)

As the issue is no more observed,Hence marking this bug as verified.

Comment 18 errata-xmlrpc 2017-03-23 05:32:07 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

Note You need to log in before you can comment on or make changes to this bug.