Description of problem:
-----------------------
2-node setup; 4 clients mount the gluster volume via v4 (2 clients : 1 server). Untarred the tarball, then tried to remove the files from the mount by triggering rm -rf * from multiple clients. Ganesha crashed on one of my nodes and dumped a core. This was the BT:

<BT>
(gdb) bt
#0  0x00007fc899b091f7 in raise () from /lib64/libc.so.6
#1  0x00007fc899b0a8e8 in abort () from /lib64/libc.so.6
#2  0x00007fc899b48f47 in __libc_message () from /lib64/libc.so.6
#3  0x00007fc899b50619 in _int_free () from /lib64/libc.so.6
#4  0x000055e71735fc3c in gsh_free (p=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.4/src/include/abstract_mem.h:271
#5  mdcache_key_delete (key=0x7fc5640230a0, key=0x7fc5640230a0) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_int.h:492
#6  mdcache_unlink (dir_hdl=0x7fc57801b148, obj_hdl=0x7fc564022b98, name=0x7fc544014a70 "mach-at91") at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1271
#7  0x000055e7172987d4 in fsal_remove (parent=parent@entry=0x7fc57801b148, name=0x7fc544014a70 "mach-at91") at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/fsal_helper.c:1589
#8  0x000055e7172d3ccc in nfs4_op_remove (op=<optimized out>, data=<optimized out>, resp=0x7fc544000a00) at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_remove.c:104
#9  0x000055e7172bf97d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fc54400f0d0) at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_Compound.c:734
#10 0x000055e7172b0b1c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fc7dc0008c0) at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1281
#11 0x000055e7172b218a in worker_run (ctx=0x55e717d012e0) at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1548
#12 0x000055e71733b889 in fridgethr_start_routine (arg=0x55e717d012e0) at /usr/src/debug/nfs-ganesha-2.4.4/src/support/fridgethr.c:550
#13 0x00007fc89a4fee25 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fc899bcc34d in clone () from /lib64/libc.so.6
(gdb)
</BT>

Version-Release number of selected component (if applicable):
------------------------------------------------------------
nfs-ganesha-2.4.4-10.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-31.el7rhgs.x86_64

How reproducible:
-----------------
2/2 on my setup.

Actual results:
---------------
Ganesha crashed.

Expected results:
-----------------
No crashes.

Additional info:
----------------
Volume Name: vol
Type: Distribute
Volume ID: 13009662-ffd4-43c0-bfc3-46e18cd33b7e
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
[root@gqas007 tmp]#
So, I think this may be a miscommunication between GFAPI and FSAL_GLUSTER (or possibly a bug in GFAPI).

MDCACHE, during unlink(), currently depends on the sub-FSAL returning an error from its unlink() call to avoid a race of unlinks. FSAL_GLUSTER calls glfs_h_unlink() and handles a returned failure by converting errno to a status and returning that. glfs_h_unlink() finds the file to delete by calling glfs_resolve_at(). If this returns an error, glfs_h_unlink() also returns an error; however, in this case it never sets errno. If errno was 0 beforehand (likely, since the last GFAPI call probably succeeded), then FSAL_GLUSTER's file_unlink() will try to convert 0 into a status, which will be treated as success (ERR_FSAL_NO_ERROR). This will potentially cause mdcache_unlink() to try to remove the file again, causing a double-free in this case.

I suspect that glfs_h_unlink() needs to set errno for all of its error cases. However, as a workaround, file_unlink() can make sure that it always returns an error status when glfs_h_unlink() returned an error.
But in file_unlink() if errno is not 0, we seem to be setting it to EINVAL:

static fsal_status_t file_unlink(struct fsal_obj_handle *dir_hdl,
				 struct fsal_obj_handle *obj_hdl,
				 const char *name)
{
	...
	...
	rc = glfs_h_unlink(glfs_export->gl_fs->fs, parenthandle->glhandle,
			   name);

	SET_GLUSTER_CREDS(glfs_export, NULL, NULL, 0, NULL);

	if (rc != 0)
		status = gluster2fsal_error(errno);

And in gluster2fsal_error():

fsal_status_t gluster2fsal_error(const int err)
{
	fsal_status_t status;
	int g_err = err;

	if (!g_err) {
		LogWarn(COMPONENT_FSAL, "appropriate errno not set");
		g_err = EINVAL;
	}
	status.minor = g_err;
	status.major = posix2fsal_error(g_err);

	return status;
}

Here, in case of failure, status is set to EINVAL if the backend hasn't set errno. So IMO FSAL_GLUSTER never returns success if unlink fails. Or have I missed something?
(In reply to Soumya Koduri from comment #3)
> But in file_unlink() if errno is not 0, we seem to be setting it to EINVAL
>                                  ^^^^
correction - here I meant if errno is not set, status is mapped to EINVAL.
Hmm... I missed that. Okay, I'll have to look at this further.
https://review.gerrithub.io/367729
This has been merged. It needs to be backported to 2.4.x and to downstream.
Failed QATP.

[root@gqas013 ~]# rpm -qa|grep ganesha-2
nfs-ganesha-2.4.4-15.el7rhgs.x86_64 (aka the one with non-root patches reverted)

Hit this crash again while removing files:

<BT>
Core was generated by `/usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N N'.
Program terminated with signal 6, Aborted.
#0  0x00007fe8f9b5c1f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install gssproxy-0.7.0-4.el7.x86_64 krb5-libs-1.15.1-8.el7.x86_64 libblkid-2.23.2-43.el7.x86_64 libcap-2.22-9.el7.x86_64 libcom_err-1.42.9-10.el7.x86_64 libgcc-4.8.5-16.el7.x86_64 libnfsidmap-0.25-17.el7.x86_64 libselinux-2.5-11.el7.x86_64 libuuid-2.23.2-43.el7.x86_64 openssl-libs-1.0.2k-8.el7.x86_64 pcre-8.32-17.el7.x86_64
(gdb) bt
#0  0x00007fe8f9b5c1f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007fe8f9b5d8e8 in __GI_abort () at abort.c:90
#2  0x00007fe8f9b9bf47 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7fe8f9ca8608 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3  0x00007fe8f9ba3619 in malloc_printerr (ar_ptr=0x7fe78c000020, ptr=<optimized out>, str=0x7fe8f9ca86c8 "double free or corruption (fasttop)", action=3) at malloc.c:5023
#4  _int_free (av=0x7fe78c000020, p=<optimized out>, have_lock=0) at malloc.c:3845
#5  0x000055b5cd45651f in gsh_free (p=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.4/src/include/abstract_mem.h:271
#6  mdcache_key_delete (key=0x7fe78c0a91f0, key=0x7fe78c0a91f0) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_int.h:492
#7  mdcache_unlink (dir_hdl=0x7fe74c03cb28, obj_hdl=0x7fe78c0a8ce8, name=0x7fe638134790 "wiznet") at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1274
#8  0x000055b5cd38e7d4 in fsal_remove (parent=parent@entry=0x7fe74c03cb28, name=0x7fe638134790 "wiznet") at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/fsal_helper.c:1589
#9  0x000055b5cd3c9ccc in nfs4_op_remove (op=<optimized out>, data=<optimized out>, resp=0x7fe63803afd0) at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_remove.c:104
#10 0x000055b5cd3b597d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fe638037410) at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_Compound.c:734
#11 0x000055b5cd3a6b1c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fe8400008c0) at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1281
#12 0x000055b5cd3a818a in worker_run (ctx=0x55b5ce8ac680) at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1548
#13 0x000055b5cd431889 in fridgethr_start_routine (arg=0x55b5ce8ac680) at /usr/src/debug/nfs-ganesha-2.4.4/src/support/fridgethr.c:550
#14 0x00007fe8fa551e25 in start_thread (arg=0x7fe858718700) at pthread_create.c:308
#15 0x00007fe8f9c1f34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
</BT>

Moving this back to Dev for a re-look.
I think that's this one: https://github.com/nfs-ganesha/nfs-ganesha/commit/0b169127b80259fd8e6fce08e2a62408d30524da
Can one of the Devs move this to ON_QA plz?
Verified on nfs-ganesha-2.4.4-16.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:2779