Description of problem:
During failback, nodes other than the failed-back node do not enter the grace period.

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-5
nfs-ganesha-2.3.1-7

How reproducible:
Always

Steps to Reproduce:
1. Trigger a failback and observe that only the failed-back node goes into the grace period; all the other nodes do not.

>>>>> Only on the failed-back node are the messages below observed, showing that it is entering the grace period; no such messages are seen on any of the other nodes.

23/05/2016 18:32:04 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
23/05/2016 18:32:04 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[main] nfs_Start_threads :THREAD :EVENT :General fridge was started successfully
23/05/2016 18:32:04 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
23/05/2016 18:32:04 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
23/05/2016 18:32:04 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
23/05/2016 18:33:34 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-19654[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE

>>>>> Confirmed this with the scenario below (a minimal shell sketch of the same steps is given after "Expected results").

Assigned VIPs to the nodes:
VIP_dhcp42-20.lab.eng.blr.redhat.com="10.70.40.205"
VIP_dhcp42-239.lab.eng.blr.redhat.com="10.70.40.206"
VIP_dhcp43-175.lab.eng.blr.redhat.com="10.70.40.207"
VIP_dhcp42-196.lab.eng.blr.redhat.com="10.70.40.208"

>> On client 1, mount the volume using VIP 10.70.40.205.
>> On client 2, mount the volume using VIP 10.70.40.206.
>> Start I/O from both mount points.
>> Stop the ganesha service on node1 and observe that failover happens, all the nodes go into the grace period, and I/O from both mount points remains blocked during that time frame.
>> Wait for the I/O to resume.
>> Start the ganesha service on node1 and observe that failback happens; however, only node1 goes into the grace period.
>> I/O from client 1 remains blocked during that period, but I/O from client 2 keeps running.

Expected results:
All the nodes should be in the grace period for 90 seconds during a failback.
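For reference, a minimal shell sketch of the reproduction steps above. The volume name "testvol", the mount points, the "nfs-ganesha" systemd unit name, and the /var/log/ganesha.log path are assumptions for illustration, not details taken from this report:

    # On client 1: mount the volume through the first VIP
    mount -t nfs -o vers=4 10.70.40.205:/testvol /mnt/nfs1
    # On client 2: mount the volume through the second VIP
    mount -t nfs -o vers=4 10.70.40.206:/testvol /mnt/nfs2

    # Start I/O from both mount points (dd is just a convenient generator)
    dd if=/dev/zero of=/mnt/nfs1/file1 bs=1M count=4096 &
    dd if=/dev/zero of=/mnt/nfs2/file2 bs=1M count=4096 &

    # On node1: stop ganesha to force a failover, then start it again
    # (after I/O resumes on the clients) to force a failback
    systemctl stop nfs-ganesha
    systemctl start nfs-ganesha

    # On every node: check whether that node logged a grace-period entry
    grep "NFS Server Now IN GRACE" /var/log/ganesha.log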
Since, during failback, not all of the nodes enter the grace period and I/O keeps happening during that time, this needs to be part of 3.1.3; hence, raising a blocker flag for this bug.
A fix (one-liner) has been posted upstream for review: http://review.gluster.org/14506
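For context only (this is not the patch itself; see the review link for that): nfs-ganesha is put into the grace period through its DBus admin interface, which is the kind of per-node call the HA scripts issue during failover/failback. A sketch of that call, where the string argument carrying the event data and IP is illustrative:

    # Ask the local ganesha.nfsd to enter the grace period; the string
    # argument carries event data such as the takeover/failback VIP
    # (the "2:10.70.40.205" format here is illustrative)
    dbus-send --print-reply --system --dest=org.ganesha.nfsd \
        /org/ganesha/nfsd/admin org.ganesha.nfsd.admin.grace \
        string:2:10.70.40.205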
Verified this bug on the RHEL 7 platform with the latest glusterfs-3.7.9-7 and nfs-ganesha-2.3.1-7 builds, and it is working as expected. During failback, all the nodes in the cluster go into the grace period, and any I/O started during that time frame is blocked and resumes once the grace period completes. Based on the above observation, marking this bug as Verified.
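A quick way to confirm the verified behavior is to check that every cluster node logged a grace-period entry around the failback time. A sketch, assuming passwordless ssh between the nodes and the same illustrative log path as above:

    # Run from any node: show the most recent grace entry on each node
    # (hostnames abbreviated; log path is an assumption)
    for host in dhcp42-20 dhcp42-239 dhcp43-175 dhcp42-196; do
        echo "== ${host} =="
        ssh "${host}" 'grep "NFS Server Now IN GRACE" /var/log/ganesha.log | tail -1'
    done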
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240