1329887 – Unexpected behavior observed when nfs-ganesha enters grace period.

Bug 1329887 - Unexpected behavior observed when nfs-ganesha enters grace period.

Summary: Unexpected behavior observed when nfs-ganesha enters grace period.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	nfs-ganesha
Sub Component:
Version:	rhgs-3.1
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	RHGS 3.1.3
Assignee:	Soumya Koduri
QA Contact:	Shashank Raj
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1311817
TreeView+	depends on / blocked

Reported:	2016-04-24 11:33 UTC by Shashank Raj
Modified:	2016-11-08 03:52 UTC (History)
CC List:	8 users (show)
Fixed In Version:	nfs-ganesha-2.3.1-5
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-06-23 05:33:24 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1276683	1	None	None	None	2024-09-18 00:43:48 UTC
Red Hat Product Errata	RHEA-2016:1247	0	normal	SHIPPED_LIVE	nfs-ganesha update for Red Hat Gluster Storage 3.1 update 3	2016-06-23 09:12:43 UTC

Internal Links: 1276683

Description Shashank Raj 2016-04-24 11:33:27 UTC

Description of problem:

During failover/failback, nfs-ganesha enters grace period only for 60 seconds and IO's get stopped for somewhere around 70-75 seconds


Version-Release number of selected component (if applicable):

[root@dhcp42-121 exports]# rpm -qa|grep glusterfs
glusterfs-api-3.7.9-2.el6rhs.x86_64
glusterfs-geo-replication-3.7.9-2.el6rhs.x86_64
glusterfs-libs-3.7.9-2.el6rhs.x86_64
glusterfs-client-xlators-3.7.9-2.el6rhs.x86_64
glusterfs-fuse-3.7.9-2.el6rhs.x86_64
glusterfs-server-3.7.9-2.el6rhs.x86_64
glusterfs-ganesha-3.7.9-2.el6rhs.x86_64
glusterfs-3.7.9-2.el6rhs.x86_64
glusterfs-cli-3.7.9-2.el6rhs.x86_64

[root@dhcp42-121 exports]# rpm -qa|grep ganesha
nfs-ganesha-gluster-2.3.1-4.el6rhs.x86_64
glusterfs-ganesha-3.7.9-2.el6rhs.x86_64
nfs-ganesha-2.3.1-4.el6rhs.x86_64

How reproducible:

Always

Steps to Reproduce:

1. Configure nfs-ganesha on a 4 node cluster.
2. Create a dist rep 6x2 volume, start it and enable ganesha on it.
3. Mount the volume on client using VIP with version 4.
4. Start IO on the mount point.
5. Perform failover and failback multiple times and observe the below behavior.\

During failover/failback, nfs-ganesha enters grace period only for 60 seconds as can be seen in below logs and IO's get stopped for somewhere around 70-75 seconds

Failover Scenario:

node 1 ganesha service stopped

Got grace messages in logs for all other nodes, IO Stopped and started in ~75 seconds.

On node2:

first attempt:

24/04/2016 21:02:29 : epoch 571a5c1a : dhcp42-119.lab.eng.blr.redhat.com : ganesha.nfsd-10909[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
24/04/2016 21:03:29 : epoch 571a5c1a : dhcp42-119.lab.eng.blr.redhat.com : ganesha.nfsd-10909[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE

second attempt:

24/04/2016 21:11:49 : epoch 571a5c1a : dhcp42-119.lab.eng.blr.redhat.com : ganesha.nfsd-10909[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
24/04/2016 21:12:49 : epoch 571a5c1a : dhcp42-119.lab.eng.blr.redhat.com : ganesha.nfsd-10909[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE

third attempt:

24/04/2016 21:22:49 : epoch 571a5c1a : dhcp42-119.lab.eng.blr.redhat.com : ganesha.nfsd-10909[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
24/04/2016 21:23:49 : epoch 571a5c1a : dhcp42-119.lab.eng.blr.redhat.com : ganesha.nfsd-10909[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE

On node3:

first attempt:

24/04/2016 21:02:28 : epoch 571a5c1a : dhcp42-128.lab.eng.blr.redhat.com : ganesha.nfsd-10916[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
24/04/2016 21:03:28 : epoch 571a5c1a : dhcp42-128.lab.eng.blr.redhat.com : ganesha.nfsd-10916[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE

second attempt:

24/04/2016 21:11:48 : epoch 571a5c1a : dhcp42-128.lab.eng.blr.redhat.com : ganesha.nfsd-10916[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
24/04/2016 21:12:49 : epoch 571a5c1a : dhcp42-128.lab.eng.blr.redhat.com : ganesha.nfsd-10916[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE

third attempt:
24/04/2016 21:22:49 : epoch 571a5c1a : dhcp42-128.lab.eng.blr.redhat.com : ganesha.nfsd-10916[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
24/04/2016 21:23:49 : epoch 571a5c1a : dhcp42-128.lab.eng.blr.redhat.com : ganesha.nfsd-10916[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE

On node4:

first attempt:

24/04/2016 21:02:28 : epoch 571a5c1b : dhcp43-59.lab.eng.blr.redhat.com : ganesha.nfsd-10911[work-5] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
24/04/2016 21:03:24 : epoch 571a5c1b : dhcp43-59.lab.eng.blr.redhat.com : ganesha.nfsd-10911[work-6] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE

Second attempt:

24/04/2016 21:11:49 : epoch 571a5c1b : dhcp43-59.lab.eng.blr.redhat.com : ganesha.nfsd-10911[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
24/04/2016 21:12:49 : epoch 571a5c1b : dhcp43-59.lab.eng.blr.redhat.com : ganesha.nfsd-10911[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE

third attempt:

24/04/2016 21:22:49 : epoch 571a5c1b : dhcp43-59.lab.eng.blr.redhat.com : ganesha.nfsd-10911[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
24/04/2016 21:23:49 : epoch 571a5c1b : dhcp43-59.lab.eng.blr.redhat.com : ganesha.nfsd-10911[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE






Failback Scenario:

node1 service started. No grace messages in logs for other nodes except on node 1. IO stopped and started in around 70-75 seconds

First attempt:

24/04/2016 21:07:36 : epoch 571ce83f : dhcp42-121.lab.eng.blr.redhat.com : ganesha.nfsd-582[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
24/04/2016 21:07:36 : epoch 571ce83f : dhcp42-121.lab.eng.blr.redhat.com : ganesha.nfsd-582[main] nfs_Start_threads :THREAD :EVENT :General fridge was started successfully
24/04/2016 21:07:36 : epoch 571ce83f : dhcp42-121.lab.eng.blr.redhat.com : ganesha.nfsd-582[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
24/04/2016 21:07:36 : epoch 571ce83f : dhcp42-121.lab.eng.blr.redhat.com : ganesha.nfsd-582[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
24/04/2016 21:07:36 : epoch 571ce83f : dhcp42-121.lab.eng.blr.redhat.com : ganesha.nfsd-582[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
24/04/2016 21:08:36 : epoch 571ce83f : dhcp42-121.lab.eng.blr.redhat.com : ganesha.nfsd-582[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE

Second attempt:

24/04/2016 21:16:01 : epoch 571cea38 : dhcp42-121.lab.eng.blr.redhat.com : ganesha.nfsd-2522[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
24/04/2016 21:16:01 : epoch 571cea38 : dhcp42-121.lab.eng.blr.redhat.com : ganesha.nfsd-2522[main] nfs_Start_threads :THREAD :EVENT :General fridge was started successfully
24/04/2016 21:16:01 : epoch 571cea38 : dhcp42-121.lab.eng.blr.redhat.com : ganesha.nfsd-2522[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
24/04/2016 21:16:01 : epoch 571cea38 : dhcp42-121.lab.eng.blr.redhat.com : ganesha.nfsd-2522[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
24/04/2016 21:16:01 : epoch 571cea38 : dhcp42-121.lab.eng.blr.redhat.com : ganesha.nfsd-2522[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
24/04/2016 21:16:01 : epoch 571cea38 : dhcp42-121.lab.eng.blr.redhat.com : ganesha.nfsd-2522[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
24/04/2016 21:17:01 : epoch 571cea38 : dhcp42-121.lab.eng.blr.redhat.com : ganesha.nfsd-2522[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE

third attempt:

24/04/2016 21:26:50 : epoch 571cecc1 : dhcp42-121.lab.eng.blr.redhat.com : ganesha.nfsd-5221[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
24/04/2016 21:26:50 : epoch 571cecc1 : dhcp42-121.lab.eng.blr.redhat.com : ganesha.nfsd-5221[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
24/04/2016 21:26:50 : epoch 571cecc1 : dhcp42-121.lab.eng.blr.redhat.com : ganesha.nfsd-5221[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
24/04/2016 21:26:50 : epoch 571cecc1 : dhcp42-121.lab.eng.blr.redhat.com : ganesha.nfsd-5221[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
24/04/2016 21:27:50 : epoch 571cecc1 : dhcp42-121.lab.eng.blr.redhat.com : ganesha.nfsd-5221[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE


Actual results:


So we have 2 issues here,

1. nfs ganesha enters in grace period only for 60 seconds and IO's remains blocked for 70-75 seconds.
2. During failback, grace message is only seen on the failedback node and not on other nodes.

Expected results:

According to discussions with dev team, nfs-ganesha should enter into grace period for 90 seconds and IO's should be blocked during that time.

Proper grace messages should be seen during failback on all the nodes. 


Additional info:

Also, tried restarting nfs-ganesha service when the volume is mounted using physical IP. NO grace messages are seen in that scenario and IO's get blocked for around 80 seconds.

Comment 4 Shashank Raj 2016-05-23 12:53:34 UTC

Verified this bug with glusterfs-3.7.9-5 and nfs-ganesha-2.3.1-7 build and out of two problems mentioned, only one seems to be working as expected.


Working as expected:

1.nfs-ganesha goes into grace period for 90 seconds during failover and all the IO's remains blocked during this period.

node2:

23/05/2016 17:52:46 : epoch 4e610000 : dhcp43-175.lab.eng.blr.redhat.com : ganesha.nfsd-10958[dbus_heartbeat] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90
23/05/2016 17:52:46 : epoch 4e610000 : dhcp43-175.lab.eng.blr.redhat.com : ganesha.nfsd-10958[dbus_heartbeat] nfs4_start_grace :STATE :EVENT :NFS Server recovery event 5 nodeid -1 ip dhcp42-20.lab.eng.blr.redhat.com
23/05/2016 17:52:46 : epoch 4e610000 : dhcp43-175.lab.eng.blr.redhat.com : ganesha.nfsd-10958[dbus_heartbeat] nfs4_load_recov_clids_nolock :CLIENT ID :EVENT :Recovery for nodeid -1 dir (/var/lib/nfs/ganesha/dhcp42-20.lab.eng.blr.redhat.com/v4recov)
23/05/2016 17:52:54 : epoch 4e610000 : dhcp43-175.lab.eng.blr.redhat.com : ganesha.nfsd-10958[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
23/05/2016 17:54:24 : epoch 4e610000 : dhcp43-175.lab.eng.blr.redhat.com : ganesha.nfsd-10958[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE

node3:

23/05/2016 17:52:46 : epoch 4a320000 : dhcp42-239.lab.eng.blr.redhat.com : ganesha.nfsd-10607[dbus_heartbeat] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90
23/05/2016 17:52:46 : epoch 4a320000 : dhcp42-239.lab.eng.blr.redhat.com : ganesha.nfsd-10607[dbus_heartbeat] nfs4_start_grace :STATE :EVENT :NFS Server recovery event 5 nodeid -1 ip dhcp42-20.lab.eng.blr.redhat.com
23/05/2016 17:52:46 : epoch 4a320000 : dhcp42-239.lab.eng.blr.redhat.com : ganesha.nfsd-10607[dbus_heartbeat] nfs4_load_recov_clids_nolock :CLIENT ID :EVENT :Recovery for nodeid -1 dir (/var/lib/nfs/ganesha/dhcp42-20.lab.eng.blr.redhat.com/v4recov)
23/05/2016 17:52:56 : epoch 4a320000 : dhcp42-239.lab.eng.blr.redhat.com : ganesha.nfsd-10607[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
23/05/2016 17:54:16 : epoch 4a320000 : dhcp42-239.lab.eng.blr.redhat.com : ganesha.nfsd-10607[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE

node4:

23/05/2016 17:52:46 : epoch 86800000 : dhcp42-196.lab.eng.blr.redhat.com : ganesha.nfsd-32707[dbus_heartbeat] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90
23/05/2016 17:52:46 : epoch 86800000 : dhcp42-196.lab.eng.blr.redhat.com : ganesha.nfsd-32707[dbus_heartbeat] nfs4_start_grace :STATE :EVENT :NFS Server recovery event 5 nodeid -1 ip dhcp42-20.lab.eng.blr.redhat.com
23/05/2016 17:52:46 : epoch 86800000 : dhcp42-196.lab.eng.blr.redhat.com : ganesha.nfsd-32707[dbus_heartbeat] nfs4_load_recov_clids_nolock :CLIENT ID :EVENT :Recovery for nodeid -1 dir (/var/lib/nfs/ganesha/dhcp42-20.lab.eng.blr.redhat.com/v4recov)
23/05/2016 17:52:52 : epoch 86800000 : dhcp42-196.lab.eng.blr.redhat.com : ganesha.nfsd-32707[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
23/05/2016 17:54:17 : epoch 86800000 : dhcp42-196.lab.eng.blr.redhat.com : ganesha.nfsd-32707[work-15] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE




Not working as expected:

2.During failback, only the failed over node goes into grace period and all the other nodes doesn't go into grace period.

>>>>> Only on the failed back node, observed below messages which reflects its going into grace period and on all the other nodes, no such messages are seen.

23/05/2016 18:32:04 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
23/05/2016 18:32:04 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[main] nfs_Start_threads :THREAD :EVENT :General fridge was started successfully
23/05/2016 18:32:04 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
23/05/2016 18:32:04 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
23/05/2016 18:32:04 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
23/05/2016 18:33:34 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE

>>>>> Confirmed this by below scenario:

Assigned VIP's to nodes:

VIP_dhcp42-20.lab.eng.blr.redhat.com="10.70.40.205"
VIP_dhcp42-239.lab.eng.blr.redhat.com="10.70.40.206"
VIP_dhcp43-175.lab.eng.blr.redhat.com="10.70.40.207"
VIP_dhcp42-196.lab.eng.blr.redhat.com="10.70.40.208"

>> On client 1, mount the volume using VIP 10.70.40.205
>> On client 2, mount the volume using VIP 10.70.40.206
>> Start IO's from both the mount points
>> Stop ganesha service on node1 and observe that failover happens, nodes goes into grace period and the IO's from both the mount point remains blocked during that time frame.
>> wait for the IO's to resume
>> Start ganesha service on the node1 and observe that failback happens, however only node1 goes into grace period.
>> IO's from client1 remains blocked during that time period but from client 2, it keeps continuing.

Expected results:

All the nodes should be in grace period for 90 seconds during a failback.

Comment 5 Shashank Raj 2016-05-24 06:57:51 UTC

As discussed with Dev team, the issue mentioned in comment4 (not working as expected) needs to be tracked under a separate bug. So a new bug has been filed with bug ID >> (https://bugzilla.redhat.com/show_bug.cgi?id=1339090).

Based on the observations in comment 4 (working as expected), since during failover and failback, ganesha is entering into grace period for 90 seconds, marking this bug as Verified.

Comment 7 errata-xmlrpc 2016-06-23 05:33:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2016:1247

Note You need to log in before you can comment on or make changes to this bug.