Bug 1332047 - Ganesha service fails to restart after reboot with missing nfs folder under /var/lib
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: nfs-ganesha
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Kaleb KEITHLEY
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-05-01 18:29 UTC by Shashank Raj
Modified: 2016-11-08 03:53 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-06-20 12:36:34 UTC
Target Upstream Version:



Description Shashank Raj 2016-05-01 18:29:26 UTC
Description of problem:

Ganesha service fails to restart after reboot with missing nfs folder under /var/lib

Version-Release number of selected component (if applicable):

nfs-ganesha-2.3.1-4

How reproducible:
 
Once

Steps to Reproduce:

The issue was observed after performing the steps below; however, I am not sure these are the exact steps needed to reproduce it.

1. Create a 4-node cluster and configure ganesha on it.
2. Create a tiered volume, enable quota, attach the tier, and enable ganesha on it.
3. Mount the volume with vers=3 on a client.
4. Start I/O from the mount point.
5. Reboot the node through which the volume is mounted.
6. Once the node came back up, the following error messages were observed in /var/log/ganesha.log:

01/05/2016 12:57:16 : epoch 5725afd3 : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-4839[main] lower_my_caps :NFS STARTUP :EVENT :CAP_SYS_RESOURCE was successfully removed for proper quota management in FSAL
01/05/2016 12:57:16 : epoch 5725afd3 : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-4839[main] lower_my_caps :NFS STARTUP :EVENT :currenty set capabilities are: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap+ep
01/05/2016 12:57:16 : epoch 5725afd3 : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-4839[main] Bind_sockets_V6 :DISP :WARN :Cannot bind RQUOTA tcp6 socket, error 98 (Address already in use)
01/05/2016 12:57:16 : epoch 5725afd3 : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-4839[main] Bind_sockets :DISP :FATAL :Error binding to V6 interface. Cannot continue.
01/05/2016 12:57:16 : epoch 5725afd3 : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-4839[main] unregister_fsal :FSAL :CRIT :Unregister FSAL GLUSTER with non-zero refcount=1
01/05/2016 12:57:16 : epoch 5725afd3 : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-4839[main] glusterfs_unload :FSAL :CRIT :FSAL Gluster unable to unload.  Dying ...

7. Restarting the nfs-ganesha service on this node fails every time with the messages below in /var/log/messages. The nfs folder inside /var/lib (the symlink to the shared volume) is gone.

May  1 13:15:07 dhcp37-141 systemd: Starting Process NFS-Ganesha configuration...
May  1 13:15:07 dhcp37-141 systemd: Starting NFS status monitor for NFSv2/3 locking....
May  1 13:15:07 dhcp37-141 systemd: Started Process NFS-Ganesha configuration.
May  1 13:15:07 dhcp37-141 rpc.statd[11553]: Version 1.3.0 starting
May  1 13:15:07 dhcp37-141 rpc.statd[11553]: Flags: TI-RPC
May  1 13:15:07 dhcp37-141 rpc.statd[11553]: Failed to open directory sm: No such file or directory
May  1 13:15:07 dhcp37-141 rpc.statd[11553]: Initializing NSM state
May  1 13:15:07 dhcp37-141 rpc.statd[11553]: Failed to create /var/lib/nfs/statd/state.new: No such file or directory
May  1 13:15:07 dhcp37-141 systemd: nfs-ganesha-lock.service: control process exited, code=exited status=1
May  1 13:15:07 dhcp37-141 systemd: Failed to start NFS status monitor for NFSv2/3 locking..
May  1 13:15:07 dhcp37-141 systemd: Unit nfs-ganesha-lock.service entered failed state.
May  1 13:15:07 dhcp37-141 systemd: nfs-ganesha-lock.service failed.
May  1 13:15:07 dhcp37-141 systemd: Starting NFS-Ganesha file server...
May  1 13:15:07 dhcp37-141 systemd: Started NFS-Ganesha file server.
May  1 13:15:09 dhcp37-141 systemd: nfs-ganesha.service: main process exited, code=killed, status=6/ABRT
May  1 13:15:09 dhcp37-141 systemd: Unit nfs-ganesha.service entered failed state.
May  1 13:15:09 dhcp37-141 systemd: nfs-ganesha.service failed.

[root@dhcp37-141 ~]# ls -ltr /var/lib/ | grep nfs
drwxr-xr-x.  6 root      root     4096 Apr  1 03:14 nfs.backup


On the other nodes, it is still present:

[root@dhcp37-158 ~]# ls -ltr /var/lib/ | grep nfs
drwxr-xr-x.  6 root      root     4096 Apr  1 02:44 nfs.backup
lrwxrwxrwx.  1 root      root       81 May  1 09:20 nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp37-158.lab.eng.blr.redhat.com/nfs
[root@dhcp37-158 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 81 May  1 09:20 /var/lib/nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp37-158.lab.eng.blr.redhat.com/nfs

[root@dhcp37-127 ~]# ls -ltr /var/lib/ | grep nfs
drwxr-xr-x.  6 root      root     4096 Apr  1 03:14 nfs.backup
lrwxrwxrwx.  1 root      root       81 Apr 30 16:20 nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp37-127.lab.eng.blr.redhat.com/nfs
[root@dhcp37-127 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 81 Apr 30 16:20 /var/lib/nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp37-127.lab.eng.blr.redhat.com/nfs


[root@dhcp37-174 ~]# ls -ltr /var/lib/ | grep nfs
drwxr-xr-x.  6 root      root     4096 Apr  1 03:14 nfs.backup
lrwxrwxrwx.  1 root      root       81 May  1 09:20 nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp37-174.lab.eng.blr.redhat.com/nfs
[root@dhcp37-174 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 81 May  1 09:20 /var/lib/nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp37-174.lab.eng.blr.redhat.com/nfs
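
The per-node comparison above could be scripted. The following is a minimal Python sketch, assuming the layout seen on the healthy nodes (a symlink at /var/lib/nfs pointing into the shared storage mount). The function name and its return values are hypothetical and are not part of any nfs-ganesha tooling.

```python
import os


def nfs_state_dir_status(path="/var/lib/nfs",
                         shared_root="/var/run/gluster/shared_storage"):
    """Classify path as 'missing', 'local', 'dangling', 'linked' or 'foreign-link'."""
    if not os.path.lexists(path):
        return "missing"          # neither a directory nor a symlink is present
    if not os.path.islink(path):
        return "local"            # a plain directory, not a link into shared storage
    if not os.path.exists(path):
        return "dangling"         # the symlink exists but its target does not
    # Resolve a relative link target against the link's own directory,
    # without following further symlinks such as /var/run -> /run.
    target = os.path.normpath(
        os.path.join(os.path.dirname(path), os.readlink(path)))
    if target.startswith(shared_root.rstrip("/") + "/"):
        return "linked"
    return "foreign-link"
```

On the failing node shown above this would be expected to report "missing", and "linked" on the other three nodes as long as the shared storage volume is mounted.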


Actual results:

Ganesha service fails to restart after reboot with missing nfs folder under /var/lib

Expected results:

The ganesha service should start properly after reboot.

Additional info:

Comment 2 Shashank Raj 2016-05-01 18:34:20 UTC
sosreport of the node can be found under http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1332047

Comment 3 Soumya Koduri 2016-05-02 08:49:14 UTC
From ganesha.log,


02/05/2016 03:17:11 : epoch 5726795d : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-13892[main] Bind_sockets_V6 :DISP :WARN :Cannot bind RQUOTA tcp6 socket, error 98 (Address already in use)
02/05/2016 03:17:11 : epoch 5726795d : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-13892[main] Bind_sockets :DISP :FATAL :Error binding to V6 interface. Cannot continue.
02/05/2016 03:17:11 : epoch 5726795d : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-13892[main] unregister_fsal :FSAL :CRIT :Unregister FSAL GLUSTER with non-zero refcount=1
02/05/2016 03:17:11 : epoch 5726795d : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-13892[main] glusterfs_unload :FSAL :CRIT :FSAL Gluster unable to unload.  Dying ...



Rquota port (875) was already in use by some other process. 

[root@dhcp37-180 ~]#  netstat -ntaunlp | grep 875
tcp        0      0 10.70.37.180:875        10.70.37.127:24007      TIME_WAIT   - 

netstat does not list the PID of any process using this port. It looks like the port was being used by one of the processes that is or was connected to the glusterd port, but it seems very strange that its PID is not listed in the netstat output above.
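
A likely explanation for the missing PID is that the connection is in TIME_WAIT: such sockets are held by the kernel after the owning process has closed them, so netstat has no process to report. To check whether a port can be bound before starting the service, a small probe such as the Python sketch below could be used. The function name is hypothetical, and the probe only tries an IPv4 TCP bind; it does not reproduce ganesha's own tcp6 bind logic.

```python
import errno
import socket


def can_bind_tcp(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return False      # another socket already holds this address
            raise                 # e.g. EACCES for ports below 1024 without root
        return True
```

For example, can_bind_tcp(875) run as root would indicate whether the default RQuota port is free at that moment.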


However, after a different port was configured for RQuota in '/etc/ganesha/ganesha.conf', the nfs-ganesha process started.



NFS_Core_Param {
        #Use supplied name other tha IP In NSM operations
        NSM_Use_Caller_Name = true;
        #Copy lock states into "/var/lib/nfs/ganesha" dir
        Clustered = false;
        #By default port number '2049' is used for NFS service.
        #Configure ports for MNT, NLM, RQuota services.
        #The ports chosen here are from '/etc/sysconfig/nfs'
        MNT_Port = 20048;
        NLM_Port = 32803;
        Rquota_Port = 8750;
}

%include "/etc/ganesha/exports/export.tiervolume.conf"
[root@dhcp37-180 ~]# 


[root@dhcp37-180 ~]# showmount -e localhost
Export list for localhost:
/tiervolume (everyone)
[root@dhcp37-180 ~]# 


This seems to be a known issue. Since the ports we use for NLM/RQuota are not registered, we can occasionally run into this problem when another process is using them. We need to document that, in such cases, a different port should be configured and opened via firewalld before starting nfs-ganesha. Please open another bug to track this issue.
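
The port change described above could be applied with a small helper. This is a minimal Python sketch that only rewrites the Rquota_Port line in the text of a ganesha.conf-style file such as the one shown earlier; the function name is hypothetical, and opening the port in firewalld and restarting the service are left out.

```python
import re


def set_rquota_port(conf_text, port):
    """Return conf_text with the value of the single Rquota_Port setting replaced."""
    pattern = re.compile(r"^([ \t]*Rquota_Port[ \t]*=[ \t]*)\d+([ \t]*;)",
                         re.MULTILINE)
    # A function replacement avoids ambiguity between group references and digits.
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{port}{m.group(2)}", conf_text)
    if count != 1:
        raise ValueError(
            f"expected exactly one Rquota_Port setting, found {count}")
    return new_text
```

The helper refuses to guess when the setting is absent or duplicated, so a mistyped config is reported instead of being silently left unchanged.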


This bug can be used to track why '/var/lib/nfs' link has been missing. Thanks!

Comment 4 Soumya Koduri 2016-05-02 09:27:39 UTC
With respect to the missing /var/lib/nfs folder, I tried to re-create the issue but was not able to reproduce it.

One thing to note here is that while setting up ganesha, we move the existing /var/lib/nfs to /var/lib/nfs.backup and create a link from '/var/lib/nfs' to a folder in our shared_storage. While tearing down the ganesha setup, we restore /var/lib/nfs/backup to '/var/lib/nfs'. Since the current ganesha ocf scripts check for the presence of '/var/lib/nfs' before taking any action, if by any chance that link is removed by some other process, they will leave things as they are during both setup and teardown.
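
The behaviour described above can be illustrated with a short Python sketch. It is not the actual ganesha script code: the function names are hypothetical, the paths are passed in as parameters, and it assumes the backup path does not already exist when setup runs.

```python
import os
import shutil


def setup_state_link(state_dir, backup_dir, shared_dir):
    """Move state_dir aside and replace it with a symlink into shared storage."""
    if not os.path.lexists(state_dir):
        return False              # nothing present: leave everything as it is
    shutil.move(state_dir, backup_dir)
    os.makedirs(shared_dir, exist_ok=True)
    os.symlink(shared_dir, state_dir)
    return True


def teardown_state_link(state_dir, backup_dir):
    """Remove the symlink and restore the original directory from backup_dir."""
    if not os.path.islink(state_dir):
        return False              # link missing or not a link: backup is not restored
    os.remove(state_dir)
    shutil.move(backup_dir, state_dir)
    return True
```

In this sketch, once something else removes the link, both functions return without acting, which matches the state seen on the failing node: /var/lib/nfs.backup present and /var/lib/nfs absent.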

I request Shashank to keep monitoring the state of '/var/lib/nfs' and to provide definite steps for reproducing the issue.

Comment 5 Soumya Koduri 2016-05-04 11:40:30 UTC
(In reply to Soumya Koduri from comment #4)
> With respect to /var/lib/nfs folder missing, I tried to re-create the issue
> but not able to reproduce. 
> 
> One thing to note here is while setting up ganesha, we move existing
> /var/lib/nfs to /var/lib/nfs.backup folder and create a link to
> '/var/lib/nfs' to a folder in our shared_storage. While tearing down the
> ganesha setup, we restore back /var/lib/nfs/backup to '/var/lib/nfs'. Since

Sorry for the typo above. It's /var/lib/nfs.backup.

Comment 6 Kaleb KEITHLEY 2016-06-20 12:36:34 UTC
Not seen in 3.1.3 testing. Reopen if necessary for 3.2.

