Description of problem:
While validating BZ 1492695, it was observed that while exporting/unexporting multiple volumes in a loop, the Ganesha process goes into a failed state and then restarts on its own.

Consequences:
When the Ganesha process restarts after entering the failed state, 1-2 volume exports fail on that particular node. Unexporting those volumes then fails with "Dynamic export addition/deletion failed" (as the export entries are no longer the same across nodes).

Snippet from ganesha.log:
-----
11/05/2018 22:10:41 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-15340[dbus_heartbeat] glusterfs_create_export :FSAL :EVENT :Volume ecvol2 exported at : '/'
11/05/2018 22:10:42 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-15340[cache_lru] lru_run :INODE LRU :CRIT :Futility count exceeded. The LRU thread is unable to make progress in reclaiming FDs. Disabling FD cache.
11/05/2018 22:10:42 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-15340[cache_lru] lru_run :INODE LRU :CRIT :Futility count exceeded. The LRU thread is unable to make progress in reclaiming FDs. Disabling FD cache.
11/05/2018 22:10:44 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-15340[cache_lru] lru_run_lane :INODE :FATAL :An entry (0x7f4584acec80) having an unmappable export_id (103) is unexpected
11/05/2018 22:10:44 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-15340[cache_lru] unregister_fsal :FSAL :CRIT :Unregister FSAL GLUSTER with non-zero refcount=4
11/05/2018 22:10:44 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-15340[cache_lru] glusterfs_unload :FSAL :CRIT :FSAL Gluster unable to unload. Dying ...
11/05/2018 22:10:47 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-7685[main] main :MAIN :EVENT :ganesha.nfsd Starting: Ganesha Version 2.5.5
11/05/2018 22:10:47 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-7686[main] nfs_set_param_from_conf :NFS STARTUP :EVENT :Configuration file successfully parsed
11/05/2018 22:10:47 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-7686[main] init_server_pkgs :NFS STARTUP :EVENT :Initializing ID Mapper.
11/05/2018 22:10:47 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-7686[main] init_server_pkgs :NFS STARTUP :EVENT :ID Mapper successfully initialized.
11/05/2018 22:10:47 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-7686[main] glusterfs_create_export :FSAL :EVENT :Volume Ganeshavol1 exported at : '/'
11/05/2018 22:10:49 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-7686[main] glusterfs_create_export :FSAL :EVENT :Volume Ganeshavol2 exported at : '/'
11/05/2018 22:10:51 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-7686[main] glusterfs_create_export :FSAL :EVENT :Volume ecvol1 exported at : '/'
11/05/2018 22:10:52 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-7686[main] glusterfs_create_export :FSAL :EVENT :Volume ecvol2 exported at : '/'
--------------------
/var/log/messages:
--------------
May 11 22:10:01 moonshine systemd: Starting Session 16760 of user root.
May 11 22:10:01 moonshine systemd: Started Session 16761 of user root.
May 11 22:10:01 moonshine systemd: Starting Session 16761 of user root.
May 11 22:10:44 moonshine systemd: nfs-ganesha.service: main process exited, code=exited, status=2/INVALIDARGUMENT
May 11 22:10:44 moonshine systemd: Unit nfs-ganesha.service entered failed state.
May 11 22:10:44 moonshine systemd: nfs-ganesha.service failed.
May 11 22:10:47 moonshine systemd: nfs-ganesha.service holdoff time over, scheduling restart.
May 11 22:10:47 moonshine systemd: Starting Process NFS-Ganesha configuration...
May 11 22:10:47 moonshine systemd: Started Process NFS-Ganesha configuration.
May 11 22:10:47 moonshine systemd: Starting NFS-Ganesha file server...
May 11 22:10:49 moonshine systemd: Started NFS-Ganesha file server.
May 11 22:10:53 moonshine rpc.statd[17012]: Received SM_UNMON_ALL request from moonshine.lab.eng.blr.redhat.com while not monitoring any hosts
May 11 22:11:02 moonshine systemd: Started Session 16762 of user root.
May 11 22:11:02 moonshine systemd: Starting Session 16762 of user root.
May 11 22:12:01 moonshine systemd: Started Session 16763 of user root.
May 11 22:12:01 moonshine systemd: Starting Session 16763 of user root.
May 11 22:13:01 moonshine systemd: Started Session 16764 of user root.
--------------

- When the Ganesha process entered the failed state, no core was dumped by the Ganesha process. Unable to capture a backtrace via gdb as well:
-----------
[Thread 0x7f44301ba700 (LWP 18266) exited]
[Thread 0x7f4430179700 (LWP 18267) exited]
[Thread 0x7f4430138700 (LWP 18268) exited]
[Thread 0x7f44300f7700 (LWP 18269) exited]
[Thread 0x7f465a441100 (LWP 15340) exited]
[Inferior 1 (process 15340) exited with code 02]
(gdb) bt
No stack.
(gdb) bt
No stack.
----------------
The export entry for ecvol3 is missing in showmount -e on that particular node, even though the export CLI shows success:
---------
# sh start_volume.sh
volume set: success
volume set: success
volume set: success
volume set: success
volume set: success
volume set: success
volume set: success
volume set: success
volume set: success
volume set: success
volume set: success
---------------
# showmount -e
Export list for moonshine.lab.eng.blr.redhat.com:
/Ganeshavol1 (everyone)
/Ganeshavol2 (everyone)
/ecvol1     (everyone)
/ecvol2     (everyone)
/ecvol4     (everyone)
/ecvol5     (everyone)
/ecvol6     (everyone)
/ecvol7     (everyone)
/ecvol8     (everyone)
----------------
Unexport then failed for ecvol3:
# sh start_volume.sh
volume set: success
volume set: success
volume set: failed: Dynamic export addition/deletion failed. Please see log file for details
volume set: success
volume set: success

Version-Release number of selected component (if applicable):
# rpm -qa | grep ganesha
nfs-ganesha-gluster-2.5.5-6.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.5-6.el7rhgs.x86_64
nfs-ganesha-2.5.5-6.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-9.el7rhgs.x86_64

How reproducible:
Managed to hit it 3 times.

Steps to Reproduce:
1. Create an 8-node Ganesha cluster.
2. Create 2 distributed-disperse volumes, 6 x (4 + 2) (Ganeshavol1 and Ganeshavol2), and export them via Ganesha.
3. Mount the volumes on 4 clients. Each volume is mounted on 2 clients via v3/v4 simultaneously, running bonnie and dbench.
4. While I/O is in progress, create 98 other EC volumes and start exporting/unexporting them in a loop.

Actual results:
The Ganesha service entered a failed state while export/unexport was going on.

Expected results:
The Ganesha process should not enter a failed state.

Additional info:
Attaching sosreport and ganesha logs shortly.
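The contents of start_volume.sh are not attached to this report, so the following is only a minimal sketch of the export/unexport toggling it presumably performs. The volume names and the `ganesha.enable` option come from the report itself; the function shape, argument names, and the dry-run hook are assumptions for illustration:

```shell
#!/bin/sh
# Hypothetical sketch of the loop believed to be in start_volume.sh.
# Pass "echo" as the first argument for a dry run that only prints the
# gluster commands instead of executing them on a live cluster.
toggle_exports() {
    runner="$1"    # "" to execute, "echo" to dry-run
    state="$2"     # "on" exports the volume via Ganesha, "off" unexports it
    count="$3"     # number of ecvol<N> volumes to toggle
    i=1
    while [ "$i" -le "$count" ]; do
        # Exporting/unexporting a Gluster volume through NFS-Ganesha is
        # driven by the ganesha.enable volume option.
        $runner gluster volume set "ecvol$i" ganesha.enable "$state"
        i=$((i + 1))
    done
}

# Dry-run preview of exporting three volumes:
toggle_exports echo on 3
```

Alternating `toggle_exports "" on 98` and `toggle_exports "" off 98` on a node with live mounts would approximate the reproduction loop described above.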
This should close the race: https://review.gerrithub.io/#/c/ffilz/nfs-ganesha/+/411912
Are there unexports happening here? Or only exports?
Okay, this is another race (and possibly a third race): https://review.gerrithub.io/#/c/ffilz/nfs-ganesha/+/419268
Verified this with:
# rpm -qa | grep ganesha
nfs-ganesha-2.5.5-9.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.5-9.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-15.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-9.el7rhgs.x86_64
Exported/unexported 98 volumes in a loop. No Ganesha service failure was observed. Moving this BZ to the verified state.
I would suggest something like this: "When exporting or unexporting on a live Ganesha, there were several races between the cleanup thread and the export ID cache. Close these races with a combination of refcounting, locking, and code re-organization." I'm not sure how detailed these need to be; if lots of detail is necessary, then the 2 commit messages are complete and detailed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2610