Description of problem:
Ganesha service fails to restart after reboot with missing nfs folder under /var/lib

Version-Release number of selected component (if applicable):
nfs-ganesha-2.3.1-4

How reproducible:
Once

Steps to Reproduce:
Observed the issue after performing the steps below; however, I am not sure whether these are the exact steps to reproduce it.
1. Create a 4 node cluster and configure ganesha on it.
2. Create a tiered volume, enable quota, attach tier and enable ganesha on it.
3. Mount the volume with vers=3 on the client.
4. Start creating IO from the mount point.
5. Reboot the mounted node.
6. Once the node came back, observed the below error messages in /var/log/ganesha.log:

01/05/2016 12:57:16 : epoch 5725afd3 : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-4839[main] lower_my_caps :NFS STARTUP :EVENT :CAP_SYS_RESOURCE was successfully removed for proper quota management in FSAL
01/05/2016 12:57:16 : epoch 5725afd3 : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-4839[main] lower_my_caps :NFS STARTUP :EVENT :currenty set capabilities are: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap+ep
01/05/2016 12:57:16 : epoch 5725afd3 : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-4839[main] Bind_sockets_V6 :DISP :WARN :Cannot bind RQUOTA tcp6 socket, error 98 (Address already in use)
01/05/2016 12:57:16 : epoch 5725afd3 : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-4839[main] Bind_sockets :DISP :FATAL :Error binding to V6 interface. Cannot continue.
01/05/2016 12:57:16 : epoch 5725afd3 : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-4839[main] unregister_fsal :FSAL :CRIT :Unregister FSAL GLUSTER with non-zero refcount=1
01/05/2016 12:57:16 : epoch 5725afd3 : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-4839[main] glusterfs_unload :FSAL :CRIT :FSAL Gluster unable to unload. Dying ...

7. Restarting the nfs-ganesha service on this node fails every time with the below messages in /var/log/messages, and observed that the nfs folder inside /var/lib and the symlink to the shared volume are gone.

May 1 13:15:07 dhcp37-141 systemd: Starting Process NFS-Ganesha configuration...
May 1 13:15:07 dhcp37-141 systemd: Starting NFS status monitor for NFSv2/3 locking....
May 1 13:15:07 dhcp37-141 systemd: Started Process NFS-Ganesha configuration.
May 1 13:15:07 dhcp37-141 rpc.statd[11553]: Version 1.3.0 starting
May 1 13:15:07 dhcp37-141 rpc.statd[11553]: Flags: TI-RPC
May 1 13:15:07 dhcp37-141 rpc.statd[11553]: Failed to open directory sm: No such file or directory
May 1 13:15:07 dhcp37-141 rpc.statd[11553]: Initializing NSM state
May 1 13:15:07 dhcp37-141 rpc.statd[11553]: Failed to create /var/lib/nfs/statd/state.new: No such file or directory
May 1 13:15:07 dhcp37-141 systemd: nfs-ganesha-lock.service: control process exited, code=exited status=1
May 1 13:15:07 dhcp37-141 systemd: Failed to start NFS status monitor for NFSv2/3 locking..
May 1 13:15:07 dhcp37-141 systemd: Unit nfs-ganesha-lock.service entered failed state.
May 1 13:15:07 dhcp37-141 systemd: nfs-ganesha-lock.service failed.
May 1 13:15:07 dhcp37-141 systemd: Starting NFS-Ganesha file server...
May 1 13:15:07 dhcp37-141 systemd: Started NFS-Ganesha file server.
May 1 13:15:09 dhcp37-141 systemd: nfs-ganesha.service: main process exited, code=killed, status=6/ABRT
May 1 13:15:09 dhcp37-141 systemd: Unit nfs-ganesha.service entered failed state.
May 1 13:15:09 dhcp37-141 systemd: nfs-ganesha.service failed.

[root@dhcp37-141 ~]# ls -ltr /var/lib/ | grep nfs
drwxr-xr-x. 6 root root 4096 Apr 1 03:14 nfs.backup

while on other nodes, it's still present:

[root@dhcp37-158 ~]# ls -ltr /var/lib/ | grep nfs
drwxr-xr-x. 6 root root 4096 Apr 1 02:44 nfs.backup
lrwxrwxrwx. 1 root root 81 May 1 09:20 nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp37-158.lab.eng.blr.redhat.com/nfs
[root@dhcp37-158 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 81 May 1 09:20 /var/lib/nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp37-158.lab.eng.blr.redhat.com/nfs

[root@dhcp37-127 ~]# ls -ltr /var/lib/ | grep nfs
drwxr-xr-x. 6 root root 4096 Apr 1 03:14 nfs.backup
lrwxrwxrwx. 1 root root 81 Apr 30 16:20 nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp37-127.lab.eng.blr.redhat.com/nfs
[root@dhcp37-127 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 81 Apr 30 16:20 /var/lib/nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp37-127.lab.eng.blr.redhat.com/nfs

[root@dhcp37-174 ~]# ls -ltr /var/lib/ | grep nfs
drwxr-xr-x. 6 root root 4096 Apr 1 03:14 nfs.backup
lrwxrwxrwx. 1 root root 81 May 1 09:20 nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp37-174.lab.eng.blr.redhat.com/nfs
[root@dhcp37-174 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 81 May 1 09:20 /var/lib/nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp37-174.lab.eng.blr.redhat.com/nfs

Actual results:
Ganesha service fails to restart after reboot with missing nfs folder under /var/lib

Expected results:
Ganesha service should start properly after reboot.

Additional info:
sosreport of the node can be found under http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1332047
From ganesha.log,

02/05/2016 03:17:11 : epoch 5726795d : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-13892[main] Bind_sockets_V6 :DISP :WARN :Cannot bind RQUOTA tcp6 socket, error 98 (Address already in use)
02/05/2016 03:17:11 : epoch 5726795d : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-13892[main] Bind_sockets :DISP :FATAL :Error binding to V6 interface. Cannot continue.
02/05/2016 03:17:11 : epoch 5726795d : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-13892[main] unregister_fsal :FSAL :CRIT :Unregister FSAL GLUSTER with non-zero refcount=1
02/05/2016 03:17:11 : epoch 5726795d : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-13892[main] glusterfs_unload :FSAL :CRIT :FSAL Gluster unable to unload. Dying ...

Rquota port (875) was already in use by some other process.

[root@dhcp37-180 ~]# netstat -ntaunlp | grep 875
tcp 0 0 10.70.37.180:875 10.70.37.127:24007 TIME_WAIT -

It doesn't list the pid of any process using this port, but it looks like the port was being used by one of the processes which is/was connected to the glusterd port. It seems very strange that its pid is not listed in the above netstat output.

However, when a different port was configured for RQuota in '/etc/ganesha/ganesha.conf', the nfs-ganesha process started.

NFS_Core_Param {
        #Use supplied name other tha IP In NSM operations
        NSM_Use_Caller_Name = true;
        #Copy lock states into "/var/lib/nfs/ganesha" dir
        Clustered = false;
        #By default port number '2049' is used for NFS service.
        #Configure ports for MNT, NLM, RQuota services.
        #The ports chosen here are from '/etc/sysconfig/nfs'
        MNT_Port = 20048;
        NLM_Port = 32803;
        Rquota_Port = 8750;
}

%include "/etc/ganesha/exports/export.tiervolume.conf"

[root@dhcp37-180 ~]# showmount -e localhost
Export list for localhost:
/tiervolume (everyone)

This seems like a known issue.
Since the ports which we use for NLM/RQuota are not registered, we could occasionally run into these issues if any other process is using them. We need to document that, in such cases, a different port should be configured and opened via firewalld before starting nfs-ganesha. Please open another bug to track this issue. This bug can be used to track why the '/var/lib/nfs' link has gone missing. Thanks!
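To check up front whether a port is free, a minimal sketch follows. It is an illustration only, not part of nfs-ganesha: the helper name port_is_free is hypothetical, and it probes an IPv4 TCP socket, whereas the failing bind in the log above is on a tcp6 socket.

```python
import errno
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port) right now.

    Hypothetical helper for illustration. Binding to ports below 1024
    (such as the Rquota port 875) needs root; without it the bind raises
    PermissionError, which is deliberately not swallowed here.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            # errno 98 on Linux, the same error ganesha.nfsd reported
            if exc.errno == errno.EADDRINUSE:
                return False
            raise
    return True


# Example (run as root on the affected node):
#   for name, port in (("MNT", 20048), ("NLM", 32803), ("Rquota", 875)):
#       print(name, port, "free" if port_is_free(port) else "in use")
```

If a port is reported as in use, a different value can be set in NFS_Core_Param as shown in the earlier comment, and that port opened via firewalld.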
With respect to /var/lib/nfs folder missing, I tried to re-create the issue but not able to reproduce.

One thing to note here is while setting up ganesha, we move existing /var/lib/nfs to /var/lib/nfs.backup folder and create a link to '/var/lib/nfs' to a folder in our shared_storage. While tearing down the ganesha setup, we restore back /var/lib/nfs/backup to '/var/lib/nfs'. Since the current ganesha ocf scripts check for the presence of '/var/lib/nfs' before taking any action, if by any chance that link is removed by any other process, they will leave the folder as is during both setup and teardown.

I request Shashank to keep monitoring the state of '/var/lib/nfs' and provide definite steps for reproducing the issue.
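To help with monitoring the state of '/var/lib/nfs', here is a rough sketch that reports whether the path is a symlink, a dangling symlink, a plain directory, or missing. The function name nfs_dir_state and the state labels are hypothetical; nothing like this exists in the ganesha scripts.

```python
import os


def nfs_dir_state(path="/var/lib/nfs"):
    """Classify the path as (state, link_target).

    Hypothetical helper; state is one of "symlink", "dangling-symlink",
    "directory", "other" or "missing". link_target is None unless the
    path is a symlink.
    """
    if os.path.islink(path):
        target = os.readlink(path)
        # os.path.exists follows the link, so False means the target is gone
        if os.path.exists(path):
            return ("symlink", target)
        return ("dangling-symlink", target)
    if os.path.isdir(path):
        return ("directory", None)
    if os.path.exists(path):
        return ("other", None)
    return ("missing", None)
```

On a healthy node in this setup the result would be expected to be "symlink" with a target under /var/run/gluster/shared_storage/nfs-ganesha/, while the affected node would report "missing".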
(In reply to Soumya Koduri from comment #4)
> With respect to /var/lib/nfs folder missing, I tried to re-create the issue
> but not able to reproduce.
>
> One thing to note here is while setting up ganesha, we move existing
> /var/lib/nfs to /var/lib/nfs.backup folder and create a link to
> '/var/lib/nfs' to a folder in our shared_storage. While tearing down the
> ganesha setup, we restore back /var/lib/nfs/backup to '/var/lib/nfs'. Since

Sorry for the typo above. It's /var/lib/nfs.backup.
Not seen in 3.1.3 testing. Reopen if necessary for 3.2.