Description of problem:
The Ganesha service crashed while running refresh-config on a volume with I/O running in parallel.

Version-Release number of selected component (if applicable):
nfs-ganesha-gluster-2.4.1-6.el7rhgs.x86_64
nfs-ganesha-2.4.1-6.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-12.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a Ganesha cluster and create a volume.
2. Export the volume.
3. Run refresh-config on the volume while I/O is running on it:
   /usr/libexec/ganesha/ganesha-ha.sh --refresh-config /var/run/gluster/shared_storage/nfs-ganesha/ vol_ec

Actual results:
The Ganesha service crashed.

Expected results:
No crash should be observed.

Additional info:
The following two crashes were seen while running refresh-config with I/O in progress.

1st crash:
----------
[Thread 0x7f8887e25700 (LWP 21862) exited]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f878aefd700 (LWP 22287)]
inode_forget (inode=0x7f886e8908f4, nlookup=nlookup@entry=0) at inode.c:1132
1132	  table = inode->table;
(gdb) bt
#0 inode_forget (inode=0x7f886e8908f4, nlookup=nlookup@entry=0) at inode.c:1132
#1 0x00007f88943ad81e in pub_glfs_h_close (object=0x7f880c003a80) at glfs-handleops.c:1364
#2 0x00007f88947c7cb9 in handle_release (obj_hdl=0x7f880c025a18) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:71
#3 0x00007f8899053a23 in mdcache_lru_clean (entry=0x7f880c025ec0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:421
#4 mdcache_lru_unref (entry=entry@entry=0x7f880c025ec0, flags=flags@entry=0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1464
#5 0x00007f88990510a1 in mdcache_put (entry=0x7f880c025ec0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.h:186
#6 mdcache_unexport (exp_hdl=0x7f87846478c0, root_obj=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_export.c:158
#7 0x00007f8899033386 in clean_up_export (root_obj=0x7f8784648b78, export=0x7f878400a128) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/exports.c:2220
#8 release_export (export=0x7f878400a128) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/exports.c:2264
#9 unexport (export=export@entry=0x7f878400a128) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/exports.c:2287
#10 0x00007f88990431f8 in gsh_export_removeexport (args=<optimized out>, reply=<optimized out>, error=0x7f878aefc2e0) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/export_mgr.c:1092
#11 0x00007f8899065869 in dbus_message_entrypoint (conn=0x7f889a4a5c30, msg=0x7f889a4a5eb0, user_data=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/dbus/dbus_server.c:512
#12 0x00007f88988fec76 in _dbus_object_tree_dispatch_and_unlock () from /lib64/libdbus-1.so.3
#13 0x00007f88988f0e49 in dbus_connection_dispatch () from /lib64/libdbus-1.so.3
#14 0x00007f88988f10e2 in _dbus_connection_read_write_dispatch () from /lib64/libdbus-1.so.3
#15 0x00007f8899066931 in gsh_dbus_thread (arg=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/dbus/dbus_server.c:737
#16 0x00007f8897515dc5 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f8896be473d in clone () from /lib64/libc.so.6

/var/log/messages snippet:
--------------------------
Jan 18 16:01:02 dhcp46-111 kernel: ganesha.nfsd[21715]: segfault at 7f487c008288 ip 00007f49365796a0 sp 00007f493d03bf70 error 4 in libglusterfs.so.0.0.1[7f4936540000+ed000]
Jan 18 16:01:02 dhcp46-111 systemd: nfs-ganesha.service: main process exited, code=killed, status=11/SEGV

[root@dhcp46-111 ~]# service nfs-ganesha status -l
Redirecting to /bin/systemctl status -l nfs-ganesha.service
● nfs-ganesha.service - NFS-Ganesha file server
   Loaded: loaded (/usr/lib/systemd/system/nfs-ganesha.service; disabled; vendor preset: disabled)
   Active: failed (Result: signal) since Wed 2017-01-18 16:01:02 IST; 1min 13s ago
     Docs: http://github.com/nfs-ganesha/nfs-ganesha/wiki
  Process: 21408 ExecStop=/bin/dbus-send --system --dest=org.ganesha.nfsd --type=method_call /org/ganesha/nfsd/admin org.ganesha.nfsd.admin.shutdown (code=exited, status=0/SUCCESS)
 Main PID: 21446 (code=killed, signal=SEGV)

Jan 18 15:24:25 dhcp46-111.lab.eng.blr.redhat.com systemd[1]: Starting NFS-Ganesha file server...
Jan 18 15:24:25 dhcp46-111.lab.eng.blr.redhat.com systemd[1]: Started NFS-Ganesha file server.
Jan 18 16:01:02 dhcp46-111.lab.eng.blr.redhat.com systemd[1]: nfs-ganesha.service: main process exited, code=killed, status=11/SEGV
Jan 18 16:01:02 dhcp46-111.lab.eng.blr.redhat.com systemd[1]: Unit nfs-ganesha.service entered failed state.
Jan 18 16:01:02 dhcp46-111.lab.eng.blr.redhat.com systemd[1]: nfs-ganesha.service failed.

[root@dhcp46-111 ~]# service nfs-ganesha start
Redirecting to /bin/systemctl start nfs-ganesha.service

2nd crash:
----------
(gdb) bt
#0 0x00007f3e9f9d5210 in pthread_spin_lock () from /lib64/libpthread.so.0
#1 0x00007f3e9c592ebd in inode_ctx_get0 (inode=0x7f3e76889954, xlator=xlator@entry=0x7f3d8c62ee20, value1=value1@entry=0x7f3dbd757ee0) at inode.c:2145
#2 0x00007f3e9c592f45 in inode_needs_lookup (inode=0x7f3e76889954, this=0x7f3d8c62ee20) at inode.c:1924
#3 0x00007f3e9c865c86 in __glfs_resolve_inode (fs=fs@entry=0x7f3d8c0153e0, subvol=subvol@entry=0x7f3ea6eed120, object=object@entry=0x7f3e24011f30) at glfs-resolve.c:1025
#4 0x00007f3e9c865d8b in glfs_resolve_inode (fs=fs@entry=0x7f3d8c0153e0, subvol=subvol@entry=0x7f3ea6eed120, object=object@entry=0x7f3e24011f30) at glfs-resolve.c:1051
#5 0x00007f3e9c867262 in pub_glfs_h_open (fs=0x7f3d8c0153e0, object=0x7f3e24011f30, flags=flags@entry=513) at glfs-handleops.c:637
#6 0x00007f3e9cc83160 in glusterfs_open_my_fd (objhandle=objhandle@entry=0x7f3e24014ea0, openflags=openflags@entry=66, posix_flags=513, my_fd=my_fd@entry=0x7f3dbd758120) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:1029
#7 0x00007f3e9cc844ea in glusterfs_open2 (obj_hdl=0x7f3e24014ed8, state=0x7f3e2c8233f0, openflags=<optimized out>, createmode=FSAL_UNCHECKED, name=<optimized out>, attrib_set=<optimized out>, verifier=0x7f3dbd7586c0 "atime=11/01/2017 15:07:36 mtime=18/01/2017 17:42:15", new_obj=0x7f3dbd758340, attrs_out=0x7f3dbd758350, caller_perm_check=0x7f3dbd7584bf) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:1336
#8 0x00007f3ea15190ef in mdcache_open2 (obj_hdl=0x7f3e2400bb58, state=0x7f3e2c8233f0, openflags=<optimized out>, createmode=FSAL_UNCHECKED, name=0x0, attrs_in=0x7f3dbd7585e0, verifier=0x7f3dbd7586c0 "atime=11/01/2017 15:07:36 mtime=18/01/2017 17:42:15", new_obj=0x7f3dbd758580, attrs_out=0x0, caller_perm_check=0x7f3dbd7584bf) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:657
#9 0x00007f3ea144de9b in fsal_open2 (in_obj=0x7f3e2400bb58, state=0x7f3e2c8233f0, openflags=openflags@entry=66, createmode=createmode@entry=FSAL_UNCHECKED, name=<optimized out>, attr=attr@entry=0x7f3dbd7585e0, verifier=verifier@entry=0x7f3dbd7586c0 "atime=11/01/2017 15:07:36 mtime=18/01/2017 17:42:15", obj=obj@entry=0x7f3dbd758580, attrs_out=attrs_out@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/fsal_helper.c:1846
#10 0x00007f3ea1439486 in open4_ex (arg=arg@entry=0x7f3d88187728, data=data@entry=0x7f3dbd759180, res_OPEN4=res_OPEN4@entry=0x7f3e2c82e308, clientid=<optimized out>, owner=0x7f3e2c81c480, file_state=file_state@entry=0x7f3dbd758fa0, new_state=new_state@entry=0x7f3dbd758f8f) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_op_open.c:1441
#11 0x00007f3ea1481a49 in nfs4_op_open (op=0x7f3d88187720, data=0x7f3dbd759180, resp=0x7f3e2c82e300) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_op_open.c:1844
#12 0x00007f3ea1473f8d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7f3e2c81a560) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_Compound.c:734
#13 0x00007f3ea146513c in nfs_rpc_execute (reqdata=reqdata@entry=0x7f3d880008c0) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1281
#14 0x00007f3ea146679a in worker_run (ctx=0x7f3ea6eaac40) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#15 0x00007f3ea14f0409 in fridgethr_start_routine (arg=0x7f3ea6eaac40) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#16 0x00007f3e9f9d0dc5 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f3e9f09f73d in clone () from /lib64/libc.so.6

sosreports and Ganesha logs will be attached soon.
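Both backtraces end by dereferencing a glfs inode pointer that no longer looks valid: `inode_forget` faults on `inode->table` on the D-Bus unexport path, and `pthread_spin_lock` faults inside `inode_ctx_get0` on an NFSv4 OPEN worker thread. That pattern is consistent with an object being released by the unexport while concurrent I/O still uses it, although the report itself does not prove it. The C sketch below shows the generic reference-counting guard against that kind of use-after-free. `struct obj_handle`, `handle_new`, `handle_get`, `handle_put` and `handles_released` are hypothetical names, not nfs-ganesha or gfapi APIs, and setting `backend_open = false` merely stands in for a call such as `glfs_h_close()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cached object handle wrapping a gfapi
 * object; NOT the real MDCACHE or FSAL_GLUSTER structure. */
struct obj_handle {
    atomic_int refcnt;  /* one reference per user: export, in-flight I/O */
    bool backend_open;  /* models "the underlying glfs object is valid" */
};

/* Counts final releases so the behaviour can be observed; it has static
 * storage duration, so it starts at zero. */
static atomic_int handles_released;

/* Create a handle holding a single reference, owned by the export. */
static struct obj_handle *handle_new(void)
{
    struct obj_handle *h = malloc(sizeof(*h));
    if (h == NULL)
        return NULL;
    atomic_init(&h->refcnt, 1);
    h->backend_open = true;
    return h;
}

/* Take an extra reference before using the handle for I/O. The caller
 * must already reach the handle safely (e.g. via a lookup done under a
 * lock); this cannot rescue a pointer that has already been freed. */
static void handle_get(struct obj_handle *h)
{
    int old = atomic_fetch_add(&h->refcnt, 1);
    assert(old > 0); /* a handle whose count reached zero must not be reused */
    (void)old;
}

/* Drop a reference; only the last holder closes the backend object
 * and frees the handle. */
static void handle_put(struct obj_handle *h)
{
    int old = atomic_fetch_sub(&h->refcnt, 1);
    assert(old > 0);
    if (old == 1) {
        h->backend_open = false; /* stands in for glfs_h_close() */
        atomic_fetch_add(&handles_released, 1);
        free(h);
    }
}
```

For example, if the I/O path calls `handle_get()` before the unexport calls `handle_put()`, the backend object stays open after the unexport and is closed only when the I/O path drops its own reference. The guard helps only if the I/O path takes its reference while the handle is still reachable; taking it afterwards is itself a use-after-free.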
This looks similar to bug 1413350, but this time the crash is not on the root entry. MD-cache entries may have been reused after the volume was unexported and re-exported. Since this is not a recommended use case (performing refresh-config or a volume unexport while I/O is running on the same volume; I am not sure whether this is documented, and if not, it must be), could you try other valid scenarios and check whether you hit this issue? Thanks!
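The comment above suspects that MD-cache entries were reused after the volume was unexported and re-exported. One generic way to stop a cache from handing out entries that belong to an earlier incarnation of an export is a generation counter: bump it on every re-export and reject entries stamped with an older value. This is a swapped-in technique for illustration only, not a description of how nfs-ganesha's MDCACHE works; `struct export_desc`, `struct cache_entry`, `export_reexport`, `cache_entry_init` and `cache_entry_usable` are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical export descriptor; generation is bumped on every
 * re-export so entries from an earlier export can be recognised. */
struct export_desc {
    uint64_t generation;
};

/* Hypothetical cache entry stamped with the generation that created it. */
struct cache_entry {
    uint64_t generation;
    void *backend_obj; /* would be a glfs object in real code */
};

/* Called when the volume is unexported and exported again. */
static void export_reexport(struct export_desc *desc)
{
    desc->generation++;
}

/* Record which export generation the entry was created under. */
static void cache_entry_init(struct cache_entry *e,
                             const struct export_desc *desc, void *obj)
{
    assert(obj != NULL);
    e->generation = desc->generation;
    e->backend_obj = obj;
}

/* An entry may only be reused if it belongs to the current export
 * generation; otherwise its backend object may already be released. */
static bool cache_entry_usable(const struct cache_entry *e,
                               const struct export_desc *desc)
{
    return e->generation == desc->generation;
}
```

An entry created before `export_reexport()` is reported as unusable afterwards, so a caller would look the object up again instead of touching a backend object that the unexport may already have released.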
The sosreport and Ganesha logs are at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1414410/
Raised bug 1415669 to document in the admin guide that refresh-config should not be performed while I/O is running on any volume.
Checked the behavior multiple times with the following build:

[root@dhcp46-111 ~]# rpm -qa | grep ganesha
glusterfs-ganesha-3.8.4-23.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.4-4.el7rhgs.x86_64
nfs-ganesha-2.4.4-4.el7rhgs.x86_64

No crashes are seen when running refresh-config while I/O is in progress.
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/101283
Verified this bug on glusterfs-ganesha-3.8.4-24.el7rhgs.x86_64. Performing refresh-config while I/O is running no longer leads to a crash, and I/O continues to run after refresh-config thanks to the support for the dynamic export refresh-config option. Moving this bug to VERIFIED state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774