Bug 1339090 - During failback, nodes other than failed back node do not enter grace period
Summary: During failback, nodes other than failed back node do not enter grace period
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: nfs-ganesha
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.1.3
Assignee: Kaleb KEITHLEY
QA Contact: Shashank Raj
URL:
Whiteboard:
Depends On:
Blocks: 1311817
 
Reported: 2016-05-24 06:47 UTC by Shashank Raj
Modified: 2016-11-08 03:52 UTC
CC List: 7 users

Fixed In Version: glusterfs-3.7.9-7
Doc Type: No Doc Update
Doc Text:
undefined
Clone Of:
Environment:
Last Closed: 2016-06-23 05:23:59 UTC
Embargoed:




Links
System: Red Hat Product Errata
ID: RHBA-2016:1240
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Gluster Storage 3.1 Update 3
Last Updated: 2016-06-23 08:51:28 UTC

Description Shashank Raj 2016-05-24 06:47:49 UTC
Description of problem:
During failback, nodes other than failed back node do not enter grace period

Version-Release number of selected component (if applicable):

glusterfs-3.7.9-5 
nfs-ganesha-2.3.1-7

How reproducible:
Always

Steps to Reproduce:
1. During failback, only the failed-back node goes into grace period; none of the other nodes do (detailed scenario below).

>>>>> The messages below, which show the node going into grace period, are observed only on the failed-back node; no such messages are seen on any of the other nodes.

23/05/2016 18:32:04 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
23/05/2016 18:32:04 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[main] nfs_Start_threads :THREAD :EVENT :General fridge was started successfully
23/05/2016 18:32:04 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
23/05/2016 18:32:04 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
23/05/2016 18:32:04 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
23/05/2016 18:33:34 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE
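
To confirm which nodes entered grace, the grace transitions can be checked in the ganesha log on every node. A minimal sketch, assuming the log is at /var/log/ganesha.log and passwordless ssh to the cluster nodes (both are assumptions; adjust for the actual setup):

# Check each cluster node for recent grace transitions (log path and ssh access assumed)
for node in dhcp42-20 dhcp42-239 dhcp43-175 dhcp42-196; do
    echo "=== ${node}.lab.eng.blr.redhat.com ==="
    ssh "${node}.lab.eng.blr.redhat.com" \
        "grep -E 'NFS Server Now (IN|NOT IN) GRACE' /var/log/ganesha.log | tail -n 5"
done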

>>>>> Confirmed this with the scenario below:

Assigned VIP's to nodes:

VIP_dhcp42-20.lab.eng.blr.redhat.com="10.70.40.205"
VIP_dhcp42-239.lab.eng.blr.redhat.com="10.70.40.206"
VIP_dhcp43-175.lab.eng.blr.redhat.com="10.70.40.207"
VIP_dhcp42-196.lab.eng.blr.redhat.com="10.70.40.208"
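
For context, these VIP assignments normally come from the nfs-ganesha HA configuration file (typically /etc/ganesha/ganesha-ha.conf). A sketch of the surrounding file, where everything except the VIP_* lines above is illustrative:

# /etc/ganesha/ganesha-ha.conf (sketch; HA_NAME and HA_CLUSTER_NODES values are illustrative)
HA_NAME="ganesha-ha-cluster"
HA_CLUSTER_NODES="dhcp42-20.lab.eng.blr.redhat.com,dhcp42-239.lab.eng.blr.redhat.com,dhcp43-175.lab.eng.blr.redhat.com,dhcp42-196.lab.eng.blr.redhat.com"
# ...followed by the VIP_<node>="<ip>" entries listed above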

>> On client 1, mount the volume using VIP 10.70.40.205
>> On client 2, mount the volume using VIP 10.70.40.206
>> Start IO's from both the mount points
>> Stop the ganesha service on node1 and observe that failover happens: all the nodes go into grace period, and the IOs from both mount points remain blocked during that time frame.
>> Wait for the IOs to resume.
>> Start the ganesha service on node1 and observe that failback happens; however, only node1 goes into grace period.
>> IOs from client1 remain blocked during that time period, but IOs from client2 keep running. (A rough shell sketch of these steps follows.)
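
A rough shell sketch of the scenario above, assuming an NFSv4 mount of a volume named testvol, dd as a stand-in IO load, and a systemd-managed nfs-ganesha service (volume name, mount options, and IO tool are assumptions):

# On client 1 and client 2: mount the volume through the two VIPs ("testvol" is assumed)
mount -t nfs -o vers=4 10.70.40.205:/testvol /mnt/client1    # client 1
mount -t nfs -o vers=4 10.70.40.206:/testvol /mnt/client2    # client 2

# Start IO from both mount points (dd used here only as a stand-in workload)
dd if=/dev/zero of=/mnt/client1/io.1 bs=1M count=4096 &      # client 1
dd if=/dev/zero of=/mnt/client2/io.2 bs=1M count=4096 &      # client 2

# On node1: stop ganesha to trigger failover, wait for IO to resume, then start it to trigger failback
systemctl stop nfs-ganesha
# ... failover: all nodes should enter grace; wait for IO to resume ...
systemctl start nfs-ganesha
# ... failback: with this bug, only node1 enters grace ...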

Expected results:

All the nodes should be in grace period for 90 seconds during a failback.

Comment 2 Shashank Raj 2016-05-24 07:28:21 UTC
Since, during failback, the nodes other than the failed-back one do not enter grace period and IO keeps happening during that time, this fix needs to be part of 3.1.3; hence raising a blocker flag for this bug.

Comment 3 Soumya Koduri 2016-05-24 07:38:38 UTC
Fix (one-liner) has been posted upstream for review - http://review.gluster.org/14506
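
For context (this is the general mechanism, not the contents of the upstream patch): the ganesha HA scripts ask ganesha.nfsd instances to enter grace over nfs-ganesha's D-Bus admin interface, roughly as sketched below. The string argument format varies by version and is an assumption here; check ganesha-ha.sh / the ganesha_grace resource agent for the installed build.

# Sketch only: request that the local ganesha.nfsd enter grace via D-Bus
dbus-send --print-reply --system \
    --dest=org.ganesha.nfsd /org/ganesha/nfsd/admin \
    org.ganesha.nfsd.admin.grace string:"<failback-node-id-or-vip>"

During a failback this request should reach every node in the cluster, not only the failed-back one, which matches the expected behaviour described above.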

Comment 7 Shashank Raj 2016-06-01 09:36:47 UTC
Verified this bug on the RHEL 7 platform with the latest glusterfs-3.7.9-7 and nfs-ganesha-2.3.1-7 builds, and it is working as expected.

During failback, all the nodes in the cluster go into grace period, and any IOs started during that time frame are blocked and resume once the grace period completes.

Based on the above observation, marking this bug as Verified.

Comment 9 errata-xmlrpc 2016-06-23 05:23:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240

