Description of problem:
During failback, nodes other than the failed-back node do not enter the grace period.

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-5
nfs-ganesha-2.3.1-7

How reproducible:
Always

Steps to Reproduce:
1. Trigger a failback and observe that only the failed-back node goes into the grace period; all the other nodes do not.

>>>>> Only on the failed-back node are the messages below observed, showing that it is entering the grace period; no such messages are seen on any of the other nodes.

23/05/2016 18:32:04 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
23/05/2016 18:32:04 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[main] nfs_Start_threads :THREAD :EVENT :General fridge was started successfully
23/05/2016 18:32:04 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
23/05/2016 18:32:04 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
23/05/2016 18:32:04 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
23/05/2016 18:33:34 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE

>>>>> Confirmed this with the scenario below (a minimal shell sketch of the same steps is given after "Expected results").

Assigned VIPs to the nodes:
VIP_dhcp42-20.lab.eng.blr.redhat.com="10.70.40.205"
VIP_dhcp42-239.lab.eng.blr.redhat.com="10.70.40.206"
VIP_dhcp43-175.lab.eng.blr.redhat.com="10.70.40.207"
VIP_dhcp42-196.lab.eng.blr.redhat.com="10.70.40.208"

>> On client 1, mount the volume using VIP 10.70.40.205.
>> On client 2, mount the volume using VIP 10.70.40.206.
>> Start I/O from both mount points.
>> Stop the ganesha service on node1 and observe that failover happens, all the nodes go into the grace period, and I/O from both mount points remains blocked during that time frame.
>> Wait for the I/O to resume.
>> Start the ganesha service on node1 and observe that failback happens; however, only node1 goes into the grace period.
>> I/O from client 1 remains blocked during that period, but I/O from client 2 keeps running.

Expected results:
All the nodes should be in the grace period for 90 seconds during a failback.
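For reference, a minimal shell sketch of the reproduction steps above. The volume name "testvol", the mount points, the "nfs-ganesha" systemd unit name, and the /var/log/ganesha.log path are assumptions for illustration, not details taken from this report:

    # On client 1: mount the volume through the first VIP
    mount -t nfs -o vers=4 10.70.40.205:/testvol /mnt/nfs1
    # On client 2: mount the volume through the second VIP
    mount -t nfs -o vers=4 10.70.40.206:/testvol /mnt/nfs2

    # Start I/O from both mount points (dd is just a convenient generator)
    dd if=/dev/zero of=/mnt/nfs1/file1 bs=1M count=4096 &
    dd if=/dev/zero of=/mnt/nfs2/file2 bs=1M count=4096 &

    # On node1: stop ganesha to force a failover, then start it again
    # (after I/O resumes on the clients) to force a failback
    systemctl stop nfs-ganesha
    systemctl start nfs-ganesha

    # On every node: check whether that node logged a grace-period entry
    grep "NFS Server Now IN GRACE" /var/log/ganesha.log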
Since, during failback, not all of the nodes enter the grace period and I/O keeps happening during that time, this needs to be part of 3.1.3; hence, raising a blocker flag for this bug.
A fix (one-liner) has been posted upstream for review: http://review.gluster.org/14506
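For context only (this is not the patch itself; see the review link for that): nfs-ganesha is put into the grace period through its DBus admin interface, which is the kind of per-node call the HA scripts issue during failover/failback. A sketch of that call, where the string argument carrying the event data and IP is illustrative:

    # Ask the local ganesha.nfsd to enter the grace period; the string
    # argument carries event data such as the takeover/failback VIP
    # (the "2:10.70.40.205" format here is illustrative)
    dbus-send --print-reply --system --dest=org.ganesha.nfsd \
        /org/ganesha/nfsd/admin org.ganesha.nfsd.admin.grace \
        string:2:10.70.40.205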
Verified this bug on the RHEL 7 platform with the latest glusterfs-3.7.9-7 and nfs-ganesha-2.3.1-7 builds, and it is working as expected. During failback, all the nodes in the cluster go into the grace period, and any I/O started during that time frame is blocked and resumes once the grace period completes. Based on the above observation, marking this bug as Verified.
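A quick way to confirm the verified behavior is to check that every cluster node logged a grace-period entry around the failback time. A sketch, assuming passwordless ssh between the nodes and the same illustrative log path as above:

    # Run from any node: show the most recent grace entry on each node
    # (hostnames abbreviated; log path is an assumption)
    for host in dhcp42-20 dhcp42-239 dhcp43-175 dhcp42-196; do
        echo "== ${host} =="
        ssh "${host}" 'grep "NFS Server Now IN GRACE" /var/log/ganesha.log | tail -1'
    done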
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240