Bug 1466446 - [Ganesha] : Ganesha crashed while removing files from mount.
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: nfs-ganesha
Version: 3.3
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.3.0
Assigned To: Daniel Gryniewicz
QA Contact: Ambarish
Depends On:
Blocks: 1417151
Reported: 2017-06-29 11:46 EDT by Ambarish
Modified: 2017-09-21 00:47 EDT (History)

See Also:
Fixed In Version: nfs-ganesha-2.4.4-16
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-21 00:47:57 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Description Ambarish 2017-06-29 11:46:37 EDT
Description of problem:
-----------------------

2-node setup; 4 clients mount the Gluster volume via NFSv4 (2 clients per server).

Untarred the tarball.

Tried to remove the files from the mount by running rm -rf * from multiple clients.


Ganesha crashed on one of my nodes and dumped a core. This was the backtrace:

<BT>

(gdb) bt
#0  0x00007fc899b091f7 in raise () from /lib64/libc.so.6
#1  0x00007fc899b0a8e8 in abort () from /lib64/libc.so.6
#2  0x00007fc899b48f47 in __libc_message () from /lib64/libc.so.6
#3  0x00007fc899b50619 in _int_free () from /lib64/libc.so.6
#4  0x000055e71735fc3c in gsh_free (p=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/include/abstract_mem.h:271
#5  mdcache_key_delete (key=0x7fc5640230a0, key=0x7fc5640230a0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_int.h:492
#6  mdcache_unlink (dir_hdl=0x7fc57801b148, obj_hdl=0x7fc564022b98, name=0x7fc544014a70 "mach-at91")
    at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1271
#7  0x000055e7172987d4 in fsal_remove (parent=parent@entry=0x7fc57801b148, 
    name=0x7fc544014a70 "mach-at91") at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/fsal_helper.c:1589
#8  0x000055e7172d3ccc in nfs4_op_remove (op=<optimized out>, data=<optimized out>, resp=0x7fc544000a00)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_remove.c:104
#9  0x000055e7172bf97d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fc54400f0d0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_Compound.c:734
#10 0x000055e7172b0b1c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fc7dc0008c0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1281
#11 0x000055e7172b218a in worker_run (ctx=0x55e717d012e0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1548
#12 0x000055e71733b889 in fridgethr_start_routine (arg=0x55e717d012e0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/support/fridgethr.c:550
#13 0x00007fc89a4fee25 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fc899bcc34d in clone () from /lib64/libc.so.6
(gdb) 


</BT> 

Version-Release number of selected component (if applicable):
------------------------------------------------------------

nfs-ganesha-2.4.4-10.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-31.el7rhgs.x86_64


How reproducible:
-----------------

2/2 on my setup.


Actual results:
---------------

Ganesha crashed.

Expected results:
-----------------

No crashes.

Additional info:
----------------

 
Volume Name: vol
Type: Distribute
Volume ID: 13009662-ffd4-43c0-bfc3-46e18cd33b7e
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
[root@gqas007 tmp]#
Comment 2 Daniel Gryniewicz 2017-06-29 14:05:23 EDT
So, I think this may be a miscommunication between GFAPI and FSAL_GLUSTER (or possibly a bug in GFAPI).  MDCACHE, during unlink(), currently depends on the sub-FSAL returning an error from its unlink() call to avoid a race between unlinks.

FSAL_GLUSTER calls glfs_h_unlink(), and handles a returned failure by converting errno to a status, and returning that.

glfs_h_unlink() finds the file to delete by calling glfs_resolve_at().  If this returns an error, glfs_h_unlink() also returns an error.  However, in this case, it never sets errno.  If errno was 0 before (likely, since the last GFAPI call probably succeeded), then FSAL_GLUSTER's file_unlink() will try to convert 0 into a status, which will be converted as success (ERR_FSAL_NO_ERROR).  This will potentially cause mdcache_unlink() to try to remove the file again, causing a double-free in this case.

I suspect that glfs_h_unlink() needs to set errno for all of its error cases.  However, as a workaround, file_unlink() can make sure that it always returns an error status when glfs_h_unlink() returns an error.
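
For illustration only, the double-free scenario described above can be sketched generically. The struct and function names below (cache_key, key_init, key_delete) are hypothetical stand-ins, not the real mdcache_key_t or mdcache_key_delete(), and this is not claimed to be the fix that went into Ganesha. It only shows why a second delete of the same key is harmless once the pointer is cleared after the first free.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a cache key; not the real mdcache_key_t. */
struct cache_key {
	void *addr;
	size_t len;
};

/* Allocate a private copy of the key bytes. Returns 0 on success. */
static int key_init(struct cache_key *key, const void *data, size_t len)
{
	assert(key != NULL);
	key->addr = malloc(len);
	if (key->addr == NULL)
		return -1;
	memcpy(key->addr, data, len);
	key->len = len;
	return 0;
}

/*
 * Idempotent delete: free the buffer and clear the pointer, so that a
 * repeated call becomes free(NULL), which is a no-op.
 * Returns 1 if this call released the buffer, 0 if it was already gone.
 * This covers repeated calls from one thread only; concurrent callers
 * would still need a lock around the key.
 */
static int key_delete(struct cache_key *key)
{
	int released;

	assert(key != NULL);
	released = (key->addr != NULL);
	free(key->addr);
	key->addr = NULL;
	key->len = 0;
	return released;
}
```

Calling key_delete() twice on the same key returns 1 and then 0 in this sketch. Without the pointer reset, the second call would hand an already freed pointer to free() again, which is the kind of misuse glibc aborts on.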
Comment 3 Soumya Koduri 2017-06-30 05:38:08 EDT
But in file_unlink() if errno is not 0, we seem to be setting it to EINVAL

static fsal_status_t file_unlink(struct fsal_obj_handle *dir_hdl,
                                 struct fsal_obj_handle *obj_hdl,
                                 const char *name)
{
...
...
        rc = glfs_h_unlink(glfs_export->gl_fs->fs, parenthandle->glhandle,
                           name);

        SET_GLUSTER_CREDS(glfs_export, NULL, NULL, 0, NULL);

        if (rc != 0)
                status = gluster2fsal_error(errno);


In gluster2fsal_error():

fsal_status_t gluster2fsal_error(const int err)
{
        fsal_status_t status;
        int g_err = err;

        if (!g_err) {
                LogWarn(COMPONENT_FSAL, "appropriate errno not set");
                g_err = EINVAL;
        }
        status.minor = g_err;
        status.major = posix2fsal_error(g_err);

        return status;
}

Here, in case of failure, the status is set to EINVAL if the backend hasn't set errno. So IMO FSAL_GLUSTER never returns success if unlink fails. Or have I missed something?
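
The point can be checked with a reduced, self-contained sketch. Everything prefixed with sketch_, along with naive_error(), defensive_error() and failing_unlink_no_errno(), is a hypothetical stand-in written for this illustration. The real fsal_status_t and posix2fsal_error() are richer, and the mapping below covers only a few errno values. The sketch contrasts a naive conversion, which turns errno 0 into success, with a conversion that falls back to EINVAL as in the function quoted above.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, reduced stand-ins for the Ganesha status types. */
enum sketch_major {
	SKETCH_NO_ERROR = 0,
	SKETCH_NOENT,
	SKETCH_INVAL,
	SKETCH_OTHER
};

struct sketch_status {
	enum sketch_major major;
	int minor;
};

/* Simplified errno-to-major mapping; assumption for this sketch only. */
static enum sketch_major sketch_posix2major(int err)
{
	switch (err) {
	case 0:
		return SKETCH_NO_ERROR;
	case ENOENT:
		return SKETCH_NOENT;
	case EINVAL:
		return SKETCH_INVAL;
	default:
		return SKETCH_OTHER;
	}
}

/* Naive conversion: errno 0 becomes success even though the call failed. */
static struct sketch_status naive_error(int err)
{
	struct sketch_status st;

	st.minor = err;
	st.major = sketch_posix2major(err);
	return st;
}

/* Defensive conversion: an unset errno falls back to EINVAL. */
static struct sketch_status defensive_error(int err)
{
	struct sketch_status st;
	int g_err = err ? err : EINVAL;

	assert(g_err != 0);
	st.minor = g_err;
	st.major = sketch_posix2major(g_err);
	return st;
}

/* Hypothetical backend call that fails without touching errno. */
static int failing_unlink_no_errno(void)
{
	return -1;
}
```

With errno cleared before calling failing_unlink_no_errno(), naive_error(errno) reports SKETCH_NO_ERROR while defensive_error(errno) reports SKETCH_INVAL with minor set to EINVAL, so a caller using the defensive form never sees success after a failed unlink.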
Comment 4 Soumya Koduri 2017-06-30 05:49:31 EDT
(In reply to Soumya Koduri from comment #3)
> But in file_unlink() if errno is not 0, we seem to be setting it to EINVAL
> 
^^^^ correction - here I meant if errno is not set, status is mapped to EINVAL.
Comment 5 Daniel Gryniewicz 2017-06-30 11:13:49 EDT
Hmm... I missed that.  Okay, I'll have to look at this further.
Comment 6 Daniel Gryniewicz 2017-06-30 12:35:12 EDT
https://review.gerrithub.io/367729
Comment 9 Daniel Gryniewicz 2017-07-05 09:23:05 EDT
This has been merged.  It needs to be backported to 2.4.x and to downstream.
Comment 10 Ambarish 2017-07-14 04:37:59 EDT
Failed QATP.

[root@gqas013 ~]# rpm -qa|grep ganesha-2
nfs-ganesha-2.4.4-15.el7rhgs.x86_64  (aka the one with non-root patches reverted)


Hit this crash again while removing files:

<BT>

Core was generated by `/usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N N'.
Program terminated with signal 6, Aborted.
#0  0x00007fe8f9b5c1f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56	  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install gssproxy-0.7.0-4.el7.x86_64 krb5-libs-1.15.1-8.el7.x86_64 libblkid-2.23.2-43.el7.x86_64 libcap-2.22-9.el7.x86_64 libcom_err-1.42.9-10.el7.x86_64 libgcc-4.8.5-16.el7.x86_64 libnfsidmap-0.25-17.el7.x86_64 libselinux-2.5-11.el7.x86_64 libuuid-2.23.2-43.el7.x86_64 openssl-libs-1.0.2k-8.el7.x86_64 pcre-8.32-17.el7.x86_64
(gdb) bt
#0  0x00007fe8f9b5c1f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007fe8f9b5d8e8 in __GI_abort () at abort.c:90
#2  0x00007fe8f9b9bf47 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7fe8f9ca8608 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3  0x00007fe8f9ba3619 in malloc_printerr (ar_ptr=0x7fe78c000020, ptr=<optimized out>, str=0x7fe8f9ca86c8 "double free or corruption (fasttop)", action=3) at malloc.c:5023
#4  _int_free (av=0x7fe78c000020, p=<optimized out>, have_lock=0) at malloc.c:3845
#5  0x000055b5cd45651f in gsh_free (p=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.4/src/include/abstract_mem.h:271
#6  mdcache_key_delete (key=0x7fe78c0a91f0, key=0x7fe78c0a91f0) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_int.h:492
#7  mdcache_unlink (dir_hdl=0x7fe74c03cb28, obj_hdl=0x7fe78c0a8ce8, name=0x7fe638134790 "wiznet") at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1274
#8  0x000055b5cd38e7d4 in fsal_remove (parent=parent@entry=0x7fe74c03cb28, name=0x7fe638134790 "wiznet") at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/fsal_helper.c:1589
#9  0x000055b5cd3c9ccc in nfs4_op_remove (op=<optimized out>, data=<optimized out>, resp=0x7fe63803afd0) at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_remove.c:104
#10 0x000055b5cd3b597d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fe638037410) at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_Compound.c:734
#11 0x000055b5cd3a6b1c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fe8400008c0) at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1281
#12 0x000055b5cd3a818a in worker_run (ctx=0x55b5ce8ac680) at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1548
#13 0x000055b5cd431889 in fridgethr_start_routine (arg=0x55b5ce8ac680) at /usr/src/debug/nfs-ganesha-2.4.4/src/support/fridgethr.c:550
#14 0x00007fe8fa551e25 in start_thread (arg=0x7fe858718700) at pthread_create.c:308
#15 0x00007fe8f9c1f34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

</BT>


Moving this back to Dev for a re-look.
Comment 13 Daniel Gryniewicz 2017-07-14 10:41:24 EDT
I think that's this one:
https://github.com/nfs-ganesha/nfs-ganesha/commit/0b169127b80259fd8e6fce08e2a62408d30524da
Comment 15 Ambarish 2017-07-17 23:03:28 EDT
Can one of the devs move this to ON_QA, please?
Comment 16 Ambarish 2017-07-19 13:35:35 EDT
Verified on nfs-ganesha-2.4.4-16.
Comment 18 errata-xmlrpc 2017-09-21 00:47:57 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2779
