Description of problem:
========================
Hit this crash while verifying BZ https://bugzilla.redhat.com/show_bug.cgi?id=1383559. Ganesha crashed on the node whose VIP was used to mount the volume on all the clients.

--------------
Reading symbols from /usr/bin/ganesha.nfsd...Reading symbols from /usr/lib/debug/usr/bin/ganesha.nfsd.debug...done.
done.
Missing separate debuginfo for /lib64/libntirpc.so.1.7
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ba/f482e25a6a8dbfb3665ccc5c81f3bef51b5b30
Missing separate debuginfo for /lib64/libwbclient.so.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/b4/1852bdd635e26adba49a0e2f4e2f6e0165e27b
Missing separate debuginfo for
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ba/3372036854a6399a5260547cce7841b54ad536
Missing separate debuginfo for /lib64/libntirpc.so.1.7
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ba/f482e25a6a8dbfb3665ccc5c81f3bef51b5b30.debug
Missing separate debuginfo for /lib64/libwbclient.so.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/b4/1852bdd635e26adba49a0e2f4e2f6e0165e27b.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/ganesha.nfsd -L /var/log/ganesha/ganesha.log -f /etc/ganesha/ganesha.c'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f4585db3a83 in setglustercreds (glfs_export=glfs_export@entry=0xffffffffffffffe0, uid=uid@entry=0x7f44d03c18d8, gid=gid@entry=0x7f44d03c18dc, ngrps=1, groups=0x7f44d00365c0, file=file@entry=0x7f4585db6378 "/builddir/build/BUILD/nfs-ganesha-2.7.3/src/FSAL/FSAL_GLUSTER/handle.c", line=line@entry=1180, function=function@entry=0x7f4585db6ae0 <__func__.24208> "glusterfs_close_my_fd") at /usr/src/debug/nfs-ganesha-2.7.3/src/FSAL/FSAL_GLUSTER/gluster_internal.c:219
219             if (*uid != glfs_export->saveduid)
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 dbus-libs-1.10.24-13.el7_6.x86_64 elfutils-libelf-0.176-2.el7.x86_64 elfutils-libs-0.176-2.el7.x86_64 glibc-2.17-292.el7.x86_64 gssproxy-0.7.0-26.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_6.x86_64 libacl-2.2.51-14.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libblkid-2.23.2-61.el7.x86_64 libcap-2.22-10.el7.x86_64 libcom_err-1.42.9-16.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libgcrypt-1.5.3-14.el7.x86_64 libgpg-error-1.12-3.el7.x86_64 libnfsidmap-0.25-19.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 libuuid-2.23.2-61.el7.x86_64 libwbclient-4.9.1-6.el7.x86_64 lz4-1.7.5-3.el7.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64 samba-client-libs-4.9.1-6.el7.x86_64 systemd-libs-219-67.el7_7.1.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x00007f4585db3a83 in setglustercreds (glfs_export=glfs_export@entry=0xffffffffffffffe0, uid=uid@entry=0x7f44d03c18d8, gid=gid@entry=0x7f44d03c18dc, ngrps=1, groups=0x7f44d00365c0, file=file@entry=0x7f4585db6378 "/builddir/build/BUILD/nfs-ganesha-2.7.3/src/FSAL/FSAL_GLUSTER/handle.c", line=line@entry=1180, function=function@entry=0x7f4585db6ae0 <__func__.24208> "glusterfs_close_my_fd") at /usr/src/debug/nfs-ganesha-2.7.3/src/FSAL/FSAL_GLUSTER/gluster_internal.c:219
#1  0x00007f4585dae010 in glusterfs_close_my_fd (my_fd=my_fd@entry=0x7f44d03c1890) at /usr/src/debug/nfs-ganesha-2.7.3/src/FSAL/FSAL_GLUSTER/handle.c:1177
#2  0x00007f4585dae140 in glusterfs_close2 (obj_hdl=0x7f44201f9328, state=0x7f44d03c17b0) at /usr/src/debug/nfs-ganesha-2.7.3/src/FSAL/FSAL_GLUSTER/handle.c:2828
#3  0x000055ed16ac69a7 in mdcache_close2 (obj_hdl=0x7f4544417f98, state=<optimized out>) at /usr/src/debug/nfs-ganesha-2.7.3/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:830
#4  0x000055ed16a656c3 in _state_del_locked (state=0x7f44d03c17b0, func=func@entry=0x55ed16aff5d0 <__func__.20790> "state_nfs4_state_wipe", line=line@entry=640) at /usr/src/debug/nfs-ganesha-2.7.3/src/SAL/nfs4_state.c:398
#5  0x000055ed16a67078 in state_nfs4_state_wipe (ostate=0x7f4544418200) at /usr/src/debug/nfs-ganesha-2.7.3/src/SAL/nfs4_state.c:640
#6  0x000055ed16a5e14c in state_wipe_file (obj=0x7f4544417f98) at /usr/src/debug/nfs-ganesha-2.7.3/src/SAL/state_misc.c:1309
#7  0x000055ed16ab7aaf in _mdcache_lru_unref (entry=entry@entry=0x7f4544417f60, flags=flags@entry=0, func=func@entry=0x55ed16b0f780 <__func__.20988> "mdcache_put", line=line@entry=196) at /usr/src/debug/nfs-ganesha-2.7.3/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1965
#8  0x000055ed16ac5148 in mdcache_put (entry=0x7f4544417f60) at /usr/src/debug/nfs-ganesha-2.7.3/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.h:196
#9  mdc_read_cb (obj=<optimized out>, ret=..., obj_data=0x7f4485ee3d40, caller_data=0x7f4420124ea0) at /usr/src/debug/nfs-ganesha-2.7.3/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:555
#10 0x00007f4585db2739 in glusterfs_read2 (obj_hdl=0x7f44201f9328, bypass=<optimized out>, done_cb=0x55ed16ac50b0 <mdc_read_cb>, read_arg=0x7f4485ee3d40, caller_arg=0x7f4420124ea0) at /usr/src/debug/nfs-ganesha-2.7.3/src/FSAL/FSAL_GLUSTER/handle.c:2084
#11 0x000055ed16ac6654 in mdcache_read2 (obj_hdl=0x7f4544417f98, bypass=<optimized out>, done_cb=<optimized out>, read_arg=0x7f4485ee3d40, caller_arg=<optimized out>) at /usr/src/debug/nfs-ganesha-2.7.3/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:589
#12 0x000055ed16a15699 in nfs4_read (op=<optimized out>, data=<optimized out>, resp=0x7f4420505b90, io=<optimized out>, info=0x0) at /usr/src/debug/nfs-ganesha-2.7.3/src/Protocols/NFS/nfs4_op_read.c:562
#13 0x000055ed16a02703 in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7f4420469170) at /usr/src/debug/nfs-ganesha-2.7.3/src/Protocols/NFS/nfs4_Compound.c:942
#14 0x000055ed169f5b1f in nfs_rpc_process_request (reqdata=0x7f4420115030) at /usr/src/debug/nfs-ganesha-2.7.3/src/MainNFSD/nfs_worker_thread.c:1328
#15 0x000055ed169f4fca in nfs_rpc_decode_request (xprt=0x7f4544000f90, xdrs=0x7f442004f5d0) at /usr/src/debug/nfs-ganesha-2.7.3/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1345
#16 0x00007f458cbd062d in svc_rqst_xprt_task () from /lib64/libntirpc.so.1.7
#17 0x00007f458cbd0b6a in svc_rqst_run_task () from /lib64/libntirpc.so.1.7
#18 0x00007f458cbd8c0b in work_pool_thread () from /lib64/libntirpc.so.1.7
#19 0x00007f458af6eea5 in start_thread () from /lib64/libpthread.so.0
#20 0x00007f458a8798cd in clone () from /lib64/libc.so.6
----------------------

Version-Release number of selected component (if applicable):
=========================================================
# rpm -qa | grep ganesha
nfs-ganesha-2.7.3-7.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.7.3-7.el7rhgs.x86_64
glusterfs-ganesha-6.0-11.el7rhgs.x86_64
nfs-ganesha-gluster-2.7.3-7.el7rhgs.x86_64

How reproducible:
================
1/1

Steps to Reproduce:
====================
1. Create a 4-node ganesha cluster.
2. Create 1 Distributed-Disperse volume, 2 x (4 + 2) = 12.
3. Mount the volume on 3 clients via v4.1.
4. Run Linux untars from all 3 clients. Wait for the untars to complete.
5. Now mount the volume on 2 more clients.
6. Again run Linux untars in different directories from 5 clients.
7. Wait for around 1 hour and then run rm -rf * from another client while IO is still ongoing in parallel.

Actual results:
==============
After some time ganesha crashed on one of the nodes.

Expected results:
================
Ganesha should not crash.

Additional info:
==================
[root@gprfs040 abrt]# pcs status
Cluster name: ganesha-ha
Stack: corosync
Current DC: gprfs035.sbu.lab.eng.bos.redhat.com (version 1.1.20-5.el7-3c4c782f70) - partition with quorum
Last updated: Fri Aug 30 03:36:39 2019
Last change: Thu Aug 29 17:47:10 2019 by root via crm_attribute on gprfs033.sbu.lab.eng.bos.redhat.com

4 nodes configured
24 resources configured

Online: [ gprfs033.sbu.lab.eng.bos.redhat.com gprfs034.sbu.lab.eng.bos.redhat.com gprfs035.sbu.lab.eng.bos.redhat.com gprfs040.sbu.lab.eng.bos.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Started: [ gprfs033.sbu.lab.eng.bos.redhat.com gprfs034.sbu.lab.eng.bos.redhat.com gprfs035.sbu.lab.eng.bos.redhat.com gprfs040.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ gprfs033.sbu.lab.eng.bos.redhat.com gprfs034.sbu.lab.eng.bos.redhat.com gprfs035.sbu.lab.eng.bos.redhat.com gprfs040.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ gprfs034.sbu.lab.eng.bos.redhat.com gprfs035.sbu.lab.eng.bos.redhat.com gprfs040.sbu.lab.eng.bos.redhat.com ]
     Stopped: [ gprfs033.sbu.lab.eng.bos.redhat.com ]
 Resource Group: gprfs033.sbu.lab.eng.bos.redhat.com-group
     gprfs033.sbu.lab.eng.bos.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started gprfs040.sbu.lab.eng.bos.redhat.com
     gprfs033.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started gprfs040.sbu.lab.eng.bos.redhat.com
     gprfs033.sbu.lab.eng.bos.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started gprfs040.sbu.lab.eng.bos.redhat.com
 Resource Group: gprfs034.sbu.lab.eng.bos.redhat.com-group
     gprfs034.sbu.lab.eng.bos.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started gprfs034.sbu.lab.eng.bos.redhat.com
     gprfs034.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started gprfs034.sbu.lab.eng.bos.redhat.com
     gprfs034.sbu.lab.eng.bos.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started gprfs034.sbu.lab.eng.bos.redhat.com
 Resource Group: gprfs035.sbu.lab.eng.bos.redhat.com-group
     gprfs035.sbu.lab.eng.bos.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started gprfs035.sbu.lab.eng.bos.redhat.com
     gprfs035.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started gprfs035.sbu.lab.eng.bos.redhat.com
     gprfs035.sbu.lab.eng.bos.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started gprfs035.sbu.lab.eng.bos.redhat.com
 Resource Group: gprfs040.sbu.lab.eng.bos.redhat.com-group
     gprfs040.sbu.lab.eng.bos.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started gprfs040.sbu.lab.eng.bos.redhat.com
     gprfs040.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started gprfs040.sbu.lab.eng.bos.redhat.com
     gprfs040.sbu.lab.eng.bos.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started gprfs040.sbu.lab.eng.bos.redhat.com

Failed Resource Actions:
* gprfs035.sbu.lab.eng.bos.redhat.com-nfs_unblock_monitor_10000 on gprfs035.sbu.lab.eng.bos.redhat.com 'unknown error' (1): call=73, status=Timed Out, exitreason='', last-rc-change='Sun Aug 25 23:05:43 2019', queued=0ms, exec=0ms
* gprfs033.sbu.lab.eng.bos.redhat.com-nfs_unblock_monitor_10000 on gprfs040.sbu.lab.eng.bos.redhat.com 'unknown error' (1): call=161, status=Timed Out, exitreason='', last-rc-change='Sun Aug 25 23:05:47 2019', queued=0ms, exec=0ms
* gprfs040.sbu.lab.eng.bos.redhat.com-nfs_unblock_monitor_10000 on gprfs040.sbu.lab.eng.bos.redhat.com 'unknown error' (1): call=153, status=Timed Out, exitreason='', last-rc-change='Sun Aug 25 23:05:41 2019', queued=0ms, exec=0ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
Is it possible to try recreating this with Ganesha V2.7.6? It looks like op_ctx->fsal_export becomes NULL at some point in the call stack. I can't point to a specific fix since V2.7.3, but there are a few that might address this issue. If it still fails on V2.7.3, we should also try next (V2.9-dev) to determine whether it is an upstream problem.
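For reference, the bogus glfs_export value in frame #0 (0xffffffffffffffe0, i.e. -0x20) is consistent with the container_of() idiom being applied to a NULL op_ctx->fsal_export: subtracting the embedded member's offset from a NULL pointer yields a small negative address, and the first dereference at gluster_internal.c:219 (*uid != glfs_export->saveduid) then faults. Below is a minimal sketch of that arithmetic; struct fake_export and its 0x20 member offset are hypothetical stand-ins for struct glusterfs_export, not the actual layout.

#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for struct glusterfs_export: the FSAL-specific
 * export object embeds the generic export member at a non-zero offset. */
struct fake_export {
	char padding[0x20];     /* assumed offset of the embedded member */
	int  export;            /* stand-in for the embedded fsal_export */
	unsigned int saveduid;
};

/* The usual container_of() idiom for recovering the outer object. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

int main(void)
{
	int *fsal_export = NULL;  /* models op_ctx->fsal_export == NULL */

	/* With a NULL inner pointer, container_of() does not return NULL;
	 * it returns (char *)NULL - 0x20, i.e. 0xffffffffffffffe0. */
	struct fake_export *glfs_export =
		container_of(fsal_export, struct fake_export, export);

	printf("glfs_export = %p\n", (void *)glfs_export);

	/* Dereferencing it, as the crash site does with
	 * glfs_export->saveduid, would be the SIGSEGV seen in frame #0:
	 * return glfs_export->saveduid; */
	return 0;
}

If that reading is right, the interesting question is how op_ctx->fsal_export ended up NULL on the close path triggered from the read callback, rather than anything in setglustercreds() itself.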
Verified this BZ with:

# rpm -qa | grep ganesha
nfs-ganesha-2.7.3-9.el7rhgs.x86_64
glusterfs-ganesha-6.0-15.el7rhgs.x86_64
nfs-ganesha-gluster-2.7.3-9.el7rhgs.x86_64

Steps for verification:
===================
1. Create an 8-node ganesha cluster.
2. Create a Distributed-Disperse 2 x (4 + 2) volume. Export the volume via ganesha.
3. Mount the volume on 6 clients via the v4.1 protocol.
4. Start running Linux untars from 2 clients and readdir operations from the other 4 clients (du -sh, ls -laRt, find).
5. Wait for the Linux untars to complete.
6. Now mount the volume on 2 more clients and run Linux untars from 5 clients in different directories.
7. Run IO for 1 hour.
8. Now trigger rm -rf * from one of the clients while IO is running in parallel.

No crash was observed. Moving this BZ to the verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:3252