Description of problem:
=======================
As part of the nfs-ganesha deployment using colonizer, the nfs-ganesha service is not getting started on the other nodes, apart from the node where the colonizer script is being executed. I was trying to deploy NAS+NFS.

Due to this, all the VIPs are running only on the node of execution.

However, when I tried on one of the other 3 nodes, I was able to start the nfs-ganesha service manually without any trouble.

Also note that the symlinks from /etc/ganesha/ganesha.conf to /var/run/gluster/shared_storage/nfs-ganesha/ are not created.

Version-Release number of selected component (if applicable):
=======================================================
See attached screenshot.

How reproducible:
=========
2/2

Steps to Reproduce:
1. Choose NAS deployment with NFS
2. Let the deployment complete successfully
3. Check "showmount -e" and "service nfs-ganesha status"

Actual results:
==============
nfs-ganesha status will show as dead
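For reference, a minimal shell sketch of the checks from the steps above, run against one of the nodes where the service failed (the node name is hypothetical):

    ssh node2 'systemctl status nfs-ganesha'      # reported as dead on all nodes except the execution node
    ssh node2 'showmount -e localhost'
    ssh node2 'ls -l /etc/ganesha/ganesha.conf'   # symlink to /var/run/gluster/shared_storage/nfs-ganesha/ is missing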
Please attach the gluster-colonizer.log file and a sosreport from one of the nodes where the nfs-ganesha service failed to start.
Created attachment 1397864 [details] version
Created attachment 1397866 [details] other screenshots
Created attachment 1397869 [details] sosreport of failed node and gluster colonizer log and o/p txt file
This issue is really baffling me. The problem is consistently not present in my virtual machine lab, but it is consistently present in the hardware labs.

The documentation[1] for NFS-Ganesha does say that the nfs-ganesha systemd service should get enabled, which seems to be missing in our plays and is the likely fix. However, previous versions of the documentation _did not_ call for enabling this service, which likely informed the existing architecture.

[1] https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html/administration_guide/nfs#nfs_ganesha
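For illustration only, a minimal sketch of what enabling the unit by hand on each node would look like, per the documentation (node names are hypothetical; the real change belongs in the plays):

    # Enable the nfs-ganesha systemd unit on every node in the cluster
    for node in node1 node2 node3 node4; do
        ssh "$node" 'systemctl enable nfs-ganesha'
    done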
Changes tested in VM lab, and things seem to be working as expected. Will test ASAP in hardware lab.
Tested the fix in the hardware lab, and it looks good. Will move towards upstream post.
Fix included in upstream merge commit 2ee0da594be5fd5f5259f15c06fb32a6cf8e3e63
Created attachment 1403024 [details] validation screenshots
Those symlinks are supposed to get created by the 'gluster nfs-ganesha enable' command. Per the docs:

Note: The enable command performs the following:
* create a symlink ganesha.conf in /etc/ganesha using ganesha.conf in shared storage
* start the nfs-ganesha process on nodes that are part of the ganesha cluster
* set up the ha cluster

and disable does the reversal of enable.

Also, if gluster nfs-ganesha [enable/disable] fails, please check the following logs:
* /var/log/glusterfs/glusterd.log
* /var/log/messages (and grep for pcs commands)
* /var/log/pcsd/pcsd.log

This action is included in the playbook correctly, AFAICT:

    - name: Enable nfs-ganesha
      delegate_to: 127.0.0.1
      run_once: true
      shell: gluster nfs-ganesha enable --mode=script
      register: result
      failed_when:
        - "'is already enabled' not in result.stderr"
        - "'success' not in result.stderr"
        - "'success' not in result.stdout"
      ignore_errors: yes

    - name: Pause for 30 seconds (takes a while to enable NFS Ganesha)
      pause: seconds=30

    - name: Check NFS Ganesha status
      delegate_to: 127.0.0.1
      run_once: true
      shell: /usr/libexec/ganesha/ganesha-ha.sh --status "{{ ha_base_dir }}" | grep 'Cluster HA Status'
      register: result
      ignore_errors: yes

    - name: Report NFS Ganesha status
      debug: msg={{ result.stdout }} verbosity=0
      when: result.stderr == ""

    - name: Report NFS Ganesha status (If any errors)
      debug: msg={{ result.stderr }} verbosity=0
      when: result.stderr != ""

That set of plays is pulled almost verbatim from the gdeploy plays. Whatever is going on here is probably the original root cause.
The symlinks are created by ganesha-ha.sh only on the node where ganesha-ha.sh is run.

At first blush it looks like we need to do a

    foreach node in cluster; do
        ssh-to-node-and-create-symlinks
    done

I need to investigate why this has not been an issue before now.
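A rough shell sketch of that loop, assuming the symlink path from the problem description and hypothetical node names (not a tested fix):

    # Create the missing ganesha.conf symlink on each node that did not get it
    for node in node2 node3 node4; do
        ssh "$node" 'ln -sf /var/run/gluster/shared_storage/nfs-ganesha/ganesha.conf /etc/ganesha/ganesha.conf'
    done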
I've just confirmed in my VM lab that the symlinks are getting created properly on all nodes there. So this is somehow isolated to our physical lab deployments.
Created attachment 1403268 [details] gluster log files
Manually running 'gluster nfs-ganesha disable --mode=script; gluster nfs-ganesha enable --mode=script' on a running system after deployment corrected the problem, with all nodes showing the symlinks. Kaleb suggested checking the glusterd logs for "creation of symlink ganesha.conf in /etc/ganesha failed" messages, but no such messages were found. Some interesting warnings regarding missing ganesha.so were seen. Log files attached.
(In reply to Dustin Black from comment #21)
> Manually running 'gluster nfs-ganesha disable --mode=script; gluster
> nfs-ganesha enable --mode=script' on a running system after deployment
> corrected the problem, with all nodes showing the symlinks.
>
> Kaleb suggested checking the glusterd logs for "creation of symlink
> ganesha.conf in /etc/ganesha failed" messages, but no such messages were
> found. Some interesting warnings regarding missing ganesha.so were seen.
>
> Log files attached.

Manually running 'gluster nfs-ganesha disable --mode=script; gluster nfs-ganesha enable --mode=script' does correct the problem.

Also, we need to re-enable the ganesha option on the volume in order to mount using the ganesha protocol, i.e. "gluster v set gluster1 ganesha.enable on".
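Consolidating the workaround from the comments above into a single shell sketch (the volume name 'gluster1' is taken from the comment above; substitute your own volume name):

    # Re-run the ganesha HA enable so that the symlinks get recreated on all nodes
    gluster nfs-ganesha disable --mode=script
    gluster nfs-ganesha enable --mode=script

    # Re-enable the ganesha export on the volume so it can be mounted over NFS again
    gluster volume set gluster1 ganesha.enable on

    # Verify the exports and the service state
    showmount -e localhost
    systemctl status nfs-ganesha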
Upstream merge commit attempts to work around this problem by inserting ansible plays to create the missing symlinks. https://github.com/gluster/gluster-colonizer/commit/78b42c6151adeec509928f4e62fff01cc03f02b4
Fix smoke tested in the physical lab where the problem is reproducible, and it looks good at a glance.
on_qa validation:

Retested on gluster-colonizer-1.0.4-2.el7rhgs.

Now the symlinks are getting created on all nodes, and ganesha is running on all nodes (see screenshot).

Hence moving to verified.
Created attachment 1405466 [details] softlink created validation
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:0477