Bug 1566569

Summary: nfs-ganesha not failing back post reboot on setup deployed by colonizer
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Nag Pavan Chilakam <nchilaka>
Component: gluster-colonizer
Assignee: Dustin Black <dblack>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Rahul Hinduja <rhinduja>
Severity: urgent
Priority: unspecified
Version: rhgs-3.3
CC: dblack, nchilaka, pasik, rhs-bugs, rreddy
Keywords: Triaged, ZStream
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2018-11-20 05:34:40 UTC
Bug Depends On: 1551186

Description Nag Pavan Chilakam 2018-04-12 14:23:42 UTC
Description of problem:
----------------------
I deployed an nfs-ganesha + GP-NAS setup successfully using the colonizer.
When a node is rebooted, the VIP fails over to one of the remaining nodes, as expected.
However, once the node is back up, the VIP should fail back to it.
This is not happening for a ganesha cluster deployed using the colonizer.

It looks like the problem is caused by multiple duplicate entries being written to the conf file below:


[root@g1-1 ~]# tail /run/gluster/shared_storage/nfs-ganesha/ganesha.conf 
%include "/var/run/gluster/shared_storage/nfs-ganesha/exports/export.gluster1.conf"
%include "/var/run/gluster/shared_storage/nfs-ganesha/exports/export.gluster1.conf"
%include "/var/run/gluster/shared_storage/nfs-ganesha/exports/export.gluster1.conf"
%include "/var/run/gluster/shared_storage/nfs-ganesha/exports/export.gluster1.conf"

Only one entry should be made.
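
For illustration only: if this entry were managed through an idempotent mechanism, for example an Ansible lineinfile task like the rough sketch below, repeated runs (or runs from several nodes) would not duplicate it. The task is hypothetical and not taken from the colonizer playbooks.

# Hypothetical sketch: add the export include only if it is not already
# present, so re-running the task never creates duplicate entries.
- name: Ensure a single export include in ganesha.conf
  ansible.builtin.lineinfile:
    path: /run/gluster/shared_storage/nfs-ganesha/ganesha.conf
    line: '%include "/var/run/gluster/shared_storage/nfs-ganesha/exports/export.gluster1.conf"'
    state: present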


Version-Release number of selected component (if applicable):
----------------
colonizer-1.1-2

How reproducible:
--------------
always

Steps to Reproduce:
1. Set up a GP-NAS + ganesha deployment on 4 nodes through the colonizer
2. After the setup succeeds, mount the arbiter volume over NFS using one of the VIPs (say node3's)
3. Now reboot node3

Actual results:
--------------
The VIP fails over to another node successfully, but does not fail back to node3 after it comes back online.

Comment 2 Dustin Black 2018-04-13 17:59:18 UTC
(In reply to nchilaka from comment #0)
> Seem like the mistake is due to the fact that we are making multiple
> duplicate entries in below conf file
> 
> 
> [root@g1-1 ~]# #tail /run/gluster/shared_storage/nfs-ganesha/ganesha.conf 
> %include
> "/var/run/gluster/shared_storage/nfs-ganesha/exports/export.gluster1.conf"
> %include
> "/var/run/gluster/shared_storage/nfs-ganesha/exports/export.gluster1.conf"
> %include
> "/var/run/gluster/shared_storage/nfs-ganesha/exports/export.gluster1.conf"
> %include
> "/var/run/gluster/shared_storage/nfs-ganesha/exports/export.gluster1.conf"

I can't reproduce this situation in my lab -- there is only a single include entry in the ganesha.conf file after an NFS deployment on the GP-NAS configuration.

I also don't see what in the gluster-colonizer could lead to this. Adding that include line is not an operation performed by the colonizer plays, but rather the result of the standard nfs-ganesha enablement. It is conceivable, though, that a play that should execute on only one node is instead executing on multiple nodes, leading to this result.
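
As a purely illustrative sketch of that hypothesis (the task shown is an assumption, not actual colonizer code): because ganesha.conf lives on shared storage, any task that appends to it would need to be pinned to a single host, e.g. with run_once:

# Hypothetical example of the suspected failure mode. A blind append to the
# shared ganesha.conf adds one copy of the line per host it runs on;
# run_once restricts the task to a single host per play run.
- name: Add export include to the shared ganesha.conf
  ansible.builtin.shell: >
    echo '%include "/var/run/gluster/shared_storage/nfs-ganesha/exports/export.gluster1.conf"'
    >> /run/gluster/shared_storage/nfs-ganesha/ganesha.conf
  run_once: true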

Can you provide the gluster-colonizer.log file from a run that included this effect of multiple entries in the ganesha.conf file?

Comment 3 Dustin Black 2018-04-13 18:07:45 UTC
Lab tests also show that fail-back works when the shut-down node comes back online. We'll need more information about the specifics of the problem to triage this further.

Comment 6 Dustin Black 2018-06-20 18:06:56 UTC
This has popped up in another physical lab. I've tried to step through the playbooks to watch for where the problem happens, but when I do, it doesn't happen at all. So the reproducer for the problem is elusive.

One option is to proactively mitigate the problem with an additional play or two, along the lines of the sketch below.
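
A rough sketch of what such a play might look like (the play name, host group, and exact approach are assumptions, not actual colonizer code). Since ganesha.conf sits on shared storage, a single deduplication pass on one node is sufficient.

# Hypothetical mitigation play: collapse any duplicated lines in the shared
# ganesha.conf after deployment. One pass on one node is enough because the
# file lives on the shared storage volume.
- name: Clean up duplicate entries in the shared ganesha.conf
  hosts: gluster_nodes
  tasks:
    - name: Remove duplicate lines in place
      ansible.builtin.shell: |
        conf=/run/gluster/shared_storage/nfs-ganesha/ganesha.conf
        awk '!seen[$0]++' "$conf" > "${conf}.tmp" && mv "${conf}.tmp" "$conf"
      args:
        executable: /bin/bash
      run_once: true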