1394224 – "nfs-grace-monitor" timed out messages observed

Bug 1394224 - "nfs-grace-monitor" timed out messages observed

Summary: "nfs-grace-monitor" timed out messages observed

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	common-ha
Sub Component:
Version:	mainline
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Kaleb KEITHLEY
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1394225 1394226
TreeView+	depends on / blocked

Reported:	2016-11-11 12:45 UTC by Kaleb KEITHLEY
Modified:	2017-03-06 17:33 UTC (History)
CC List:	9 users (show)
Fixed In Version:	glusterfs-3.10.0
Clone Of:	1337565
Clones:	1394225 (view as bug list)
Environment:
Last Closed:	2017-01-24 11:50:23 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Kaleb KEITHLEY 2016-11-11 12:45:06 UTC

+++ This bug was initially created as a clone of Bug #1337565 +++

Description of problem:

"nfs-grace-monitor" timed out messages observed in pcs status.

Version-Release number of selected component (if applicable):

glusterfs-3.7.9-5.el7rhgs.x86_64
nfs-ganesha-2.3.1-7.el7rhgs.x86_64

How reproducible:
2/2

Steps to Reproduce:
1.Create a 4 node ganesha cluster.
2.Create a volume, enable ganesha and mount the volume with vers=4.
3.Start executing iozone test suite on the mount point.
4.During the test, stop nfs-ganesha service on the mounted node and once the IO resumes from other node, after sometime start the ganesha service on that node.
5.Check the pcs status and observe that there are timed out messages for nfs-grace-monitor and nfs-mon-monitor as below:

Failed Actions:
* nfs-mon_monitor_10000 on dhcp46-26.lab.eng.blr.redhat.com 'unknown error' (1): call=21, status=Timed Out, exitreason='none',
    last-rc-change='Thu May 19 18:35:08 2016', queued=0ms, exec=0ms
* nfs-grace_monitor_5000 on dhcp46-202.lab.eng.blr.redhat.com 'unknown error' (1): call=22, status=Timed Out, exitreason='none',
    last-rc-change='Thu May 19 00:53:52 2016', queued=0ms, exec=0ms
* nfs-mon_monitor_10000 on dhcp46-202.lab.eng.blr.redhat.com 'unknown error' (1): call=21, status=Timed Out, exitreason='none',
    last-rc-change='Thu May 19 00:53:56 2016', queued=0ms, exec=0ms
* nfs-grace_monitor_5000 on dhcp46-247.lab.eng.blr.redhat.com 'unknown error' (1): call=46, status=Timed Out, exitreason='none',
    last-rc-change='Wed May 18 23:54:22 2016', queued=0ms, exec=0ms
* nfs-mon_monitor_10000 on dhcp46-247.lab.eng.blr.redhat.com 'unknown error' (1): call=21, status=Timed Out, exitreason='none',
    last-rc-change='Wed May 18 23:54:27 2016', queued=0ms, exec=0ms
* nfs-mon_monitor_10000 on dhcp47-139.lab.eng.blr.redhat.com 'unknown error' (1): call=21, status=Timed Out, exitreason='none',
    last-rc-change='Thu May 19 00:53:57 2016', queued=0ms, exec=0ms


6. Below messages are observed in ganesha.log

8/05/2016 23:49:15 : epoch 4cd30000 : dhcp46-247.lab.eng.blr.redhat.com : ganesha.nfsd-31387[work-6] nfs4_add_clid :CLIENT ID :EVENT :Failed to create client in recovery dir (/var/lib/nfs/ganesha/v4recov/::ffff:10.70.37.206-(86:0x4c696e7578204e465376342e302031302e37302e33372e3230362f31302e37302e34342e393220746370)), errno=107
18/05/2016 23:49:35 : epoch 4cd30000 : dhcp46-247.lab.eng.blr.redhat.com : ganesha.nfsd-31387[work-1] nfs4_add_clid :CLIENT ID :EVENT :Failed to create client in recovery dir (/var/lib/nfs/ganesha/v4recov/::ffff:10.70.37.136-(86:0x4c696e7578204e465376342e302031302e37302e33372e3133362f31302e37302e34342e393220746370)), errno=107
18/05/2016 23:50:43 : epoch 4cd30000 : dhcp46-247.lab.eng.blr.redhat.com : ganesha.nfsd-31387[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE
18/05/2016 23:50:43 : epoch 4cd30000 : dhcp46-247.lab.eng.blr.redhat.com : ganesha.nfsd-31387[reaper] nfs4_clean_old_recov_dir :CLIENT ID :EVENT :Failed to open old v4 recovery dir (/var/lib/nfs/ganesha/v4old), errno=107


18/05/2016 23:57:00 : epoch 00670000 : dhcp46-202.lab.eng.blr.redhat.com : ganesha.nfsd-25624[reaper] nfs4_rm_revoked_handles :CLIENT ID :EVENT :opendir /var/lib/nfs/ganesha/v4recov/::ffff:10.70.37.136-(86:0x4c696e7578204e465376342e302031302e37302e33372e3133362f31302e37302e34342e393220746370) failed errno=107
18/05/2016 23:57:00 : epoch 00670000 : dhcp46-202.lab.eng.blr.redhat.com : ganesha.nfsd-25624[reaper] nfs4_rm_clid :CLIENT ID :EVENT :Failed to remove client recovery dir (/var/lib/nfs/ganesha/v4recov/::ffff:10.70.37.136-(86:0x4c696e7578204e465376342e302031302e37302e33372e3133362f31302e37302e34342e393220746370)), errno=107
18/05/2016 23:57:10 : epoch 00670000 : dhcp46-202.lab.eng.blr.redhat.com : ganesha.nfsd-25624[reaper] nfs4_rm_revoked_handles :CLIENT ID :EVENT :opendir /var/lib/nfs/ganesha/v4recov/::ffff:10.70.37.206-(86:0x4c696e7578204e465376342e302031302e37302e33372e3230362f31302e37302e34342e393220746370) failed errno=107
18/05/2016 23:57:10 : epoch 00670000 : dhcp46-202.lab.eng.blr.redhat.com : ganesha.nfsd-25624[reaper] nfs4_rm_clid :CLIENT ID :EVENT :Failed to remove client recovery dir (/var/lib/nfs/ganesha/v4recov/::ffff:10.70.37.206-(86:0x4c696e7578204e465376342e302031302e37302e33372e3230362f31302e37302e34342e393220746370)), errno=107


18/05/2016 23:46:58 : epoch 00670000 : dhcp46-202.lab.eng.blr.redhat.com : ganesha.nfsd-25624[dbus_heartbeat] nfs4_load_recov_clids_nolock :CLIENT ID :EVENT :Recovery for nodeid -1 dir (/var/lib/nfs/ganesha/dhcp46-247.lab.eng.blr.redhat.com/v4recov)
18/05/2016 23:46:58 : epoch 00670000 : dhcp46-202.lab.eng.blr.redhat.com : ganesha.nfsd-25624[dbus_heartbeat] nfs4_load_recov_clids_nolock :CLIENT ID :EVENT :Failed to open v4 recovery dir (/var/lib/nfs/ganesha/dhcp46-247.lab.eng.blr.redhat.com/v4recov), errno=107
18/05/2016 23:47:00 : epoch 00670000 : dhcp46-202.lab.eng.blr.redhat.com : ganesha.nfsd-25624[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
18/05/2016 23:47:06 : epoch 00670000 : dhcp46-202.lab.eng.blr.redhat.com : ganesha.nfsd-25624[work-5] nfs4_add_clid :CLIENT ID :EVENT :Failed to create client in recovery dir (/var/lib/nfs/ganesha/v4recov/::ffff:10.70.37.136-(86:0x4c696e7578204e465376342e302031302e37302e33372e3133362f31302e37302e34342e393220746370)), errno=107
18/05/2016 23:47:08 : epoch 00670000 : dhcp46-202.lab.eng.blr.redhat.com : ganesha.nfsd-25624[work-9] nfs4_add_clid :CLIENT ID :EVENT :Failed to create client in recovery dir (/var/lib/nfs/ganesha/v4recov/::ffff:10.70.37.206-(86:0x4c696e7578204e465376342e302031302e37302e33372e3230362f31302e37302e34342e393220746370)), errno=107

Actual results:

"nfs-grace-monitor" timed out messages observed in pcs status.

Expected results:

there should not be any failures in pcs status.

Additional info:

sosreports are attached for the reference.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-05-19 09:24:57 EDT ---

This bug is automatically being proposed for the current z-stream release of Red Hat Gluster Storage 3 by setting the release flag 'rhgs‑3.1.z' to '?'. 

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Shashank Raj on 2016-05-19 09:28:44 EDT ---

sosreports are placed at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1337565

--- Additional comment from Kaleb KEITHLEY on 2016-05-19 09:48:53 EDT ---

something is timing out, but it's not `pcs status`.

None of the /usr/lib/ocfs/resource.d/heartbeat/ganesha_* resource agents run a `pcs status`

--- Additional comment from Kaleb KEITHLEY on 2016-05-19 10:16:47 EDT ---

sorry, I misread the original. 

Now I see that a `pcs status` run by admin is showing Failed Actions for nfs-mon_monitor and nfs-grace_monitor with status=Timed Out.

The last ones occurred at May 19 00:53:57 (UTC?)  It is now 14:00:00 UTC (19:30:00 IST), so they are not happening continuously.

It would certainly be better if these did not happen, but I think they're mainly benign and can be reasonably safely ignored.

--- Additional comment from Shashank Raj on 2016-06-01 05:41:33 EDT ---

With the latest glusterfs-3.7.9-7 and nfs-ganesha-2.3.1-7 build, observed that after setting up ganesha on a RHEL 7 platform, below messages are observed in pcs status:

Failed Actions:
* nfs-grace_monitor_0 on dhcp46-26.lab.eng.blr.redhat.com 'unknown error' (1): call=16, status=complete, exitreason='none',
    last-rc-change='Wed Jun  1 12:01:00 2016', queued=0ms, exec=86ms
* nfs-grace_monitor_0 on dhcp47-139.lab.eng.blr.redhat.com 'unknown error' (1): call=16, status=complete, exitreason='none',
    last-rc-change='Wed Jun  1 12:00:59 2016', queued=0ms, exec=102ms
* nfs-grace_monitor_0 on dhcp46-202.lab.eng.blr.redhat.com 'unknown error' (1): call=16, status=complete, exitreason='none',
    last-rc-change='Wed Jun  1 12:00:59 2016', queued=0ms, exec=84ms
* nfs-grace_monitor_0 on dhcp46-247.lab.eng.blr.redhat.com 'unknown error' (1): call=16, status=complete, exitreason='none',
    last-rc-change='Wed Jun  1 11:53:59 2016', queued=0ms, exec=92ms

This doesn't effect any functionality, initial tests for failover and failback works fine.

Reporting this because these messages were not seen after ganesha setup with earlier glusterfs/ganesha builds.

--- Additional comment from Shashank Raj on 2016-09-29 09:33:52 EDT ---

Kaleb,

Can we confirm this behavior?

--- Additional comment from Niels de Vos on 2016-10-14 06:21:17 EDT ---

The error messages in comment #0 mention "errno=107". This is:

    ENOTCONN: Transport endpoint is not connected

We need to understand what side-effects this problem can have.

--- Additional comment from John Skeoch on 2016-11-07 22:54:01 EST ---

User sraj's account has been closed

--- Additional comment from Kaleb KEITHLEY on 2016-11-08 12:00:29 EST ---

I'm not sure what I was being asked to confirm. That these warnings were benign?

I installed the latest downstream: glusterfs*-3.8.4-3.rhgs and nfs-ganesha-2.4.1-1.rhgs on RHEL7.3.  Did not observe any nfs-grace-monitor timed out messages. From this I conclude that the fix provided in rhgs-3.1.3 has in fact resolved the issue.  If this is seen again in testing please open a new BZ.

--- Additional comment from Ambarish on 2016-11-09 13:33:52 EST ---

Hitting this issue on 3.8.4-3 gluster bits and 2.4.1-1 Ganesha.


I had restarted Ganesha,glusterd and rpcbind services post setting up the cluster.


[root@gqas013 tmp]# pcs status
Cluster name: G1474623742.03
Stack: corosync
Current DC: gqas013.sbu.lab.eng.bos.redhat.com (version 1.1.15-11.el7-e174ec8) - partition with quorum
Last updated: Wed Nov  9 13:28:59 2016		Last change: Wed Nov  9 12:12:41 2016 by root via crm_attribute on gqas011.sbu.lab.eng.bos.redhat.com

4 nodes and 16 resources configured

Online: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas011.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Started: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas011.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas011.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas011.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
 gqas013.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started gqas013.sbu.lab.eng.bos.redhat.com
 gqas005.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started gqas005.sbu.lab.eng.bos.redhat.com
 gqas006.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started gqas006.sbu.lab.eng.bos.redhat.com
 gqas011.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started gqas011.sbu.lab.eng.bos.redhat.com

Failed Actions:
* nfs-grace_monitor_0 on gqas013.sbu.lab.eng.bos.redhat.com 'unknown error' (1): call=18, status=complete, exitreason='none',
    last-rc-change='Wed Nov  9 12:08:14 2016', queued=0ms, exec=65ms
* nfs-grace_monitor_0 on gqas011.sbu.lab.eng.bos.redhat.com 'unknown error' (1): call=18, status=complete, exitreason='none',
    last-rc-change='Wed Nov  9 12:08:14 2016', queued=0ms, exec=64ms


Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/disabled
[root@gqas013 tmp]# 



Logs,core dump copied here :

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1393526

--- Additional comment from Ambarish on 2016-11-10 06:46:41 EST ---

One thing I'd like to add to my previous comment is that it is fairly reproducible this time,with the latest builds.
I hit this four times,out of the 5 times I tried.

Comment 1 Worker Ant 2016-11-11 13:15:30 UTC

REVIEW: http://review.gluster.org/15831 (common-ha: nfs-grace-monitor timed out unknown error messages) posted (#1) for review on master by Kaleb KEITHLEY (kkeithle)

Comment 2 Worker Ant 2016-11-16 10:26:44 UTC

COMMIT: http://review.gluster.org/15831 committed in master by Kaleb KEITHLEY (kkeithle) 
------
commit b5387f90f942bfc7cf0f7f969740f8bf1c911b80
Author: Kaleb S. KEITHLEY <kkeithle>
Date:   Fri Nov 11 08:06:03 2016 -0500

    common-ha: nfs-grace-monitor timed out unknown error messages
    
    grace_mon_monitor() occasionally returns OCF_ERR_GENERIC, but it
    ought to return OCF_NOT_RUNNING instead.
    
    Change-Id: I3d550e33cc3d0a8dce4333ced72db19b3b2f4f2e
    BUG: 1394224
    Signed-off-by: Kaleb S. KEITHLEY <kkeithle>
    Reviewed-on: http://review.gluster.org/15831
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: soumya k <skoduri>
    Reviewed-by: jiffin tony Thottan <jthottan>

Comment 3 Kaleb KEITHLEY 2017-01-24 11:50:23 UTC

bug is not relevant after 3.10 branch, fix on on release-3.9 branch

Comment 4 Shyamsundar 2017-03-06 17:33:37 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.0, please open a new bug report.

glusterfs-3.10.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-February/030119.html
[2] https://www.gluster.org/pipermail/gluster-users/

Note You need to log in before you can comment on or make changes to this bug.