Description of problem:
=======================
As part of the nfs-ganesha deployment using colonizer, the nfs-ganesha service is not getting started on the other nodes, apart from the node where the colonizer script is being executed. I was trying to deploy NAS+NFS.

Due to this, all the VIPs are running only on the node of execution.

However, when I tried on one of the other 3 nodes, I was able to start the nfs-ganesha service manually without any trouble.

Also note that the symlinks from /etc/ganesha/ganesha.conf to /var/run/gluster/shared_storage/nfs-ganesha/ are not created.

Version-Release number of selected component (if applicable):
=======================================================
See attached screenshot.

How reproducible:
=========
2/2

Steps to Reproduce:
1. Choose NAS deployment with NFS
2. Let the deployment complete successfully
3. Check "showmount -e" and "service nfs-ganesha status"

Actual results:
==============
nfs-ganesha status will show as dead
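For reference, a minimal shell sketch of the checks from the steps above, run against one of the nodes where the service failed (the node name is hypothetical):

    ssh node2 'systemctl status nfs-ganesha'      # reported as dead on all nodes except the execution node
    ssh node2 'showmount -e localhost'
    ssh node2 'ls -l /etc/ganesha/ganesha.conf'   # symlink to /var/run/gluster/shared_storage/nfs-ganesha/ is missing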
Please attach the gluster-colonizer.log file and a sosreport from one of the nodes where the nfs-ganesha service failed to start.
Created attachment 1397864 [details] version
Created attachment 1397866 [details] other screenshots
Created attachment 1397869 [details] sosreport of failed node and gluster colonizer log and o/p txt file
This issue is really baffling me. The problem is consistently not present in my virtual machine lab, but it is consistently present in the hardware labs.

The documentation[1] for NFS-Ganesha does say that the nfs-ganesha systemd service should get enabled, which seems to be missing in our plays and is the likely fix. However, previous versions of the documentation _did not_ call for enabling this service, which likely informed the existing architecture.

[1] https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html/administration_guide/nfs#nfs_ganesha
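For illustration only, a minimal sketch of what enabling the unit by hand on each node would look like, per the documentation (node names are hypothetical; the real change belongs in the plays):

    # Enable the nfs-ganesha systemd unit on every node in the cluster
    for node in node1 node2 node3 node4; do
        ssh "$node" 'systemctl enable nfs-ganesha'
    done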
Changes tested in VM lab, and things seem to be working as expected. Will test ASAP in hardware lab.
Tested the fix in the hardware lab, and it looks good. Will move towards upstream post.
Fix included in upstream merge commit 2ee0da594be5fd5f5259f15c06fb32a6cf8e3e63
Created attachment 1403024 [details] validation screenshots
Those symlinks are supposed to get created by the 'gluster nfs-ganesha enable' command. Per the docs:

Note: The enable command performs the following:
* create a symlink ganesha.conf in /etc/ganesha using ganesha.conf in shared storage
* start the nfs-ganesha process on nodes that are part of the ganesha cluster
* set up the ha cluster

and disable does the reversal of enable.

Also, if gluster nfs-ganesha [enable/disable] fails, please check the following logs:
* /var/log/glusterfs/glusterd.log
* /var/log/messages (and grep for pcs commands)
* /var/log/pcsd/pcsd.log

This action is included in the playbook correctly, AFAICT:

    - name: Enable nfs-ganesha
      delegate_to: 127.0.0.1
      run_once: true
      shell: gluster nfs-ganesha enable --mode=script
      register: result
      failed_when:
        - "'is already enabled' not in result.stderr"
        - "'success' not in result.stderr"
        - "'success' not in result.stdout"
      ignore_errors: yes

    - name: Pause for 30 seconds (takes a while to enable NFS Ganesha)
      pause: seconds=30

    - name: Check NFS Ganesha status
      delegate_to: 127.0.0.1
      run_once: true
      shell: /usr/libexec/ganesha/ganesha-ha.sh --status "{{ ha_base_dir }}" | grep 'Cluster HA Status'
      register: result
      ignore_errors: yes

    - name: Report NFS Ganesha status
      debug: msg={{ result.stdout }} verbosity=0
      when: result.stderr == ""

    - name: Report NFS Ganesha status (If any errors)
      debug: msg={{ result.stderr }} verbosity=0
      when: result.stderr != ""

That set of plays is pulled almost verbatim from the gdeploy plays. Whatever is going on here is probably the original root cause.
The symlinks are created by ganesha-ha.sh only on the node where ganesha-ha.sh is run.

At first blush it looks like we need to do a

    foreach node in cluster; do
        ssh-to-node-and-create-symlinks
    done

I need to investigate why this has not been an issue before now.
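A rough shell sketch of that loop, assuming the symlink path from the problem description and hypothetical node names (not a tested fix):

    # Create the missing ganesha.conf symlink on each node that did not get it
    for node in node2 node3 node4; do
        ssh "$node" 'ln -sf /var/run/gluster/shared_storage/nfs-ganesha/ganesha.conf /etc/ganesha/ganesha.conf'
    done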
I've just confirmed in my VM lab that the symlinks are getting created properly on all nodes there. So this is somehow isolated to our physical lab deployments.
Created attachment 1403268 [details] gluster log files
Manually running 'gluster nfs-ganesha disable --mode=script; gluster nfs-ganesha enable --mode=script' on a running system after deployment corrected the problem, with all nodes showing the symlinks. Kaleb suggested checking the glusterd logs for "creation of symlink ganesha.conf in /etc/ganesha failed" messages, but no such messages were found. Some interesting warnings regarding missing ganesha.so were seen. Log files attached.
(In reply to Dustin Black from comment #21)
> Manually running 'gluster nfs-ganesha disable --mode=script; gluster
> nfs-ganesha enable --mode=script' on a running system after deployment
> corrected the problem, with all nodes showing the symlinks.
>
> Kaleb suggested checking the glusterd logs for "creation of symlink
> ganesha.conf in /etc/ganesha failed" messages, but no such messages were
> found. Some interesting warnings regarding missing ganesha.so were seen.
>
> Log files attached.

Manually running 'gluster nfs-ganesha disable --mode=script; gluster nfs-ganesha enable --mode=script' does correct the problem.

Also, we need to re-enable the ganesha option on the volume in order to mount using the ganesha protocol, i.e. "gluster v set gluster1 ganesha.enable on".
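Consolidating the workaround from the comments above into a single shell sketch (the volume name 'gluster1' is taken from the comment above; substitute your own volume name):

    # Re-run the ganesha HA enable so that the symlinks get recreated on all nodes
    gluster nfs-ganesha disable --mode=script
    gluster nfs-ganesha enable --mode=script

    # Re-enable the ganesha export on the volume so it can be mounted over NFS again
    gluster volume set gluster1 ganesha.enable on

    # Verify the exports and the service state
    showmount -e localhost
    systemctl status nfs-ganesha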
Upstream merge commit attempts to work around this problem by inserting ansible plays to create the missing symlinks. https://github.com/gluster/gluster-colonizer/commit/78b42c6151adeec509928f4e62fff01cc03f02b4
Fix smoke tested in the physical lab where the problem is reproducible, and it looks good at a glance.
on_qa validation:

Retested on gluster-colonizer-1.0.4-2.el7rhgs.

Now the symlinks are getting created on all nodes, and ganesha is running on all nodes (see screenshot).

Hence moving to verified.
Created attachment 1405466 [details] softlink created validation
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:0477