1466446 – [Ganesha] : Ganesha crashed while removing files from mount.

Summary: [Ganesha] : Ganesha crashed while removing files from mount.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	nfs-ganesha
Sub Component:
Version:	rhgs-3.3
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	RHGS 3.3.0
Assignee:	Daniel Gryniewicz
QA Contact:	Ambarish
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1417151

Reported:	2017-06-29 15:46 UTC by Ambarish
Modified:	2017-09-21 04:47 UTC
CC List:	11 users
Fixed In Version:	nfs-ganesha-2.4.4-16
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-09-21 04:47:57 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2017:2779	0	normal	SHIPPED_LIVE	nfs-ganesha bug fix and enhancement update	2017-09-21 08:17:17 UTC

Description Ambarish 2017-06-29 15:46:37 UTC

Description of problem:
-----------------------

2-node setup; 4 clients mount the Gluster volume via NFSv4 (2 clients per server).

Untar'd the tarball.

Tried to remove the files from the mount by running rm -rf * from multiple clients.


Ganesha crashed on one of my nodes and dumped a core. This was the backtrace:

<BT>

(gdb) bt
#0  0x00007fc899b091f7 in raise () from /lib64/libc.so.6
#1  0x00007fc899b0a8e8 in abort () from /lib64/libc.so.6
#2  0x00007fc899b48f47 in __libc_message () from /lib64/libc.so.6
#3  0x00007fc899b50619 in _int_free () from /lib64/libc.so.6
#4  0x000055e71735fc3c in gsh_free (p=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/include/abstract_mem.h:271
#5  mdcache_key_delete (key=0x7fc5640230a0, key=0x7fc5640230a0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_int.h:492
#6  mdcache_unlink (dir_hdl=0x7fc57801b148, obj_hdl=0x7fc564022b98, name=0x7fc544014a70 "mach-at91")
    at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1271
#7  0x000055e7172987d4 in fsal_remove (parent=parent@entry=0x7fc57801b148, 
    name=0x7fc544014a70 "mach-at91") at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/fsal_helper.c:1589
#8  0x000055e7172d3ccc in nfs4_op_remove (op=<optimized out>, data=<optimized out>, resp=0x7fc544000a00)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_remove.c:104
#9  0x000055e7172bf97d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fc54400f0d0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_Compound.c:734
#10 0x000055e7172b0b1c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fc7dc0008c0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1281
#11 0x000055e7172b218a in worker_run (ctx=0x55e717d012e0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1548
#12 0x000055e71733b889 in fridgethr_start_routine (arg=0x55e717d012e0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/support/fridgethr.c:550
#13 0x00007fc89a4fee25 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fc899bcc34d in clone () from /lib64/libc.so.6
(gdb) 


</BT> 

Version-Release number of selected component (if applicable):
------------------------------------------------------------

nfs-ganesha-2.4.4-10.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-31.el7rhgs.x86_64


How reproducible:
-----------------

2/2 on my setup.


Actual results:
---------------

Ganesha crashed.

Expected results:
-----------------

No crashes.

Additional info:
----------------

 
Volume Name: vol
Type: Distribute
Volume ID: 13009662-ffd4-43c0-bfc3-46e18cd33b7e
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
[root@gqas007 tmp]#

Comment 2 Daniel Gryniewicz 2017-06-29 18:05:23 UTC

So, I think this may be a miscommunication between GFAPI and FSAL_GLUSTER (or possibly a bug in GFAPI).  MDCACHE, during unlink(), currently depends on the sub-FSAL returning an error from its unlink() call to avoid a race between concurrent unlinks.

FSAL_GLUSTER calls glfs_h_unlink(), and handles a returned failure by converting errno to a status, and returning that.

glfs_h_unlink() finds the file to delete by calling glfs_resolve_at().  If this returns an error, glfs_h_unlink() also returns an error.  However, in this case, it never sets errno.  If errno was 0 before (likely, since the last GFAPI call probably succeeded), then FSAL_GLUSTER's file_unlink() will try to convert 0 into a status, which will be converted as success (ERR_FSAL_NO_ERROR).  This will potentially cause mdcache_unlink() to try to remove the file again, causing a double-free in this case.

I suspect that glfs_h_unlink() needs to set errno for all of its error cases.  However, as a workaround, file_unlink() can make sure that it always returns an error status when glfs_h_unlink() returns an error.
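
The hazard and the workaround described above can be sketched in isolation. This is a minimal illustration, not the FSAL_GLUSTER or GFAPI code: every name in it (sketch_status, backend_unlink_no_errno, naive_status_from_errno, unlink_naive, unlink_guarded) is hypothetical, and the choice of EINVAL as the fallback is an assumption made for the example.

```c
#include <errno.h>

/* Hypothetical status codes for illustration; not the real FSAL types. */
enum sketch_status {
	SKETCH_OK = 0,
	SKETCH_ERR_INVAL,
	SKETCH_ERR_NOENT,
	SKETCH_ERR_OTHER
};

/* Hypothetical backend call: reports failure (-1) but never sets errno. */
int backend_unlink_no_errno(const char *name)
{
	(void)name;
	return -1;
}

/* Naive errno-to-status mapping: an errno of 0 is reported as success. */
enum sketch_status naive_status_from_errno(int err)
{
	switch (err) {
	case 0:
		return SKETCH_OK;
	case EINVAL:
		return SKETCH_ERR_INVAL;
	case ENOENT:
		return SKETCH_ERR_NOENT;
	default:
		return SKETCH_ERR_OTHER;
	}
}

/* Hazard: the call failed, but the stale errno of 0 maps to success. */
enum sketch_status unlink_naive(const char *name)
{
	enum sketch_status status = SKETCH_OK;

	errno = 0; /* models an earlier call that succeeded */
	if (backend_unlink_no_errno(name) != 0)
		status = naive_status_from_errno(errno);
	return status;
}

/* Workaround: a failed call always yields an error status. */
enum sketch_status unlink_guarded(const char *name)
{
	enum sketch_status status = SKETCH_OK;

	errno = 0; /* models an earlier call that succeeded */
	if (backend_unlink_no_errno(name) != 0) {
		/* Fall back to EINVAL when the backend left errno unset. */
		int err = errno ? errno : EINVAL;

		status = naive_status_from_errno(err);
	}
	return status;
}
```

With these definitions, unlink_naive() reports success for a failed call, while unlink_guarded() reports an error, which is the property the caller relies on to avoid acting on the same entry twice.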

Comment 3 Soumya Koduri 2017-06-30 09:38:08 UTC

But in file_unlink() if errno is not 0, we seem to be setting it to EINVAL

static fsal_status_t file_unlink(struct fsal_obj_handle *dir_hdl,
                                 struct fsal_obj_handle *obj_hdl,
                                 const char *name)
{
...
...
        rc = glfs_h_unlink(glfs_export->gl_fs->fs, parenthandle->glhandle,
                           name);

        SET_GLUSTER_CREDS(glfs_export, NULL, NULL, 0, NULL);

        if (rc != 0)
                status = gluster2fsal_error(errno);


In gluster2fsal_error():

fsal_status_t gluster2fsal_error(const int err)
{
        fsal_status_t status;
        int g_err = err;

        if (!g_err) {
                LogWarn(COMPONENT_FSAL, "appropriate errno not set");
                g_err = EINVAL;
        }
        status.minor = g_err;
        status.major = posix2fsal_error(g_err);

        return status;
}

Here, in case of failure, the status is set to EINVAL if the backend hasn't set errno. So IMO FSAL_GLUSTER never returns success if unlink fails. Or have I missed something?

Comment 4 Soumya Koduri 2017-06-30 09:49:31 UTC

(In reply to Soumya Koduri from comment #3)
> But in file_unlink() if errno is not 0, we seem to be setting it to EINVAL
> 
^^^^ correction - here I meant if errno is not set, status is mapped to EINVAL.

Comment 5 Daniel Gryniewicz 2017-06-30 15:13:49 UTC

Hmm... I missed that.  Okay, I'll have to look at this further.

Comment 6 Daniel Gryniewicz 2017-06-30 16:35:12 UTC

 https://review.gerrithub.io/367729

Comment 9 Daniel Gryniewicz 2017-07-05 13:23:05 UTC

This has been merged.  It needs to be backported to 2.4.x and to downstream.

Comment 10 Ambarish 2017-07-14 08:37:59 UTC

Failed QATP.

[root@gqas013 ~]# rpm -qa|grep ganesha-2
nfs-ganesha-2.4.4-15.el7rhgs.x86_64  (aka the one with non-root patches reverted)


Hit this crash again while removing files:

<BT>

Core was generated by `/usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N N'.
Program terminated with signal 6, Aborted.
#0  0x00007fe8f9b5c1f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56	  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install gssproxy-0.7.0-4.el7.x86_64 krb5-libs-1.15.1-8.el7.x86_64 libblkid-2.23.2-43.el7.x86_64 libcap-2.22-9.el7.x86_64 libcom_err-1.42.9-10.el7.x86_64 libgcc-4.8.5-16.el7.x86_64 libnfsidmap-0.25-17.el7.x86_64 libselinux-2.5-11.el7.x86_64 libuuid-2.23.2-43.el7.x86_64 openssl-libs-1.0.2k-8.el7.x86_64 pcre-8.32-17.el7.x86_64
(gdb) bt
#0  0x00007fe8f9b5c1f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007fe8f9b5d8e8 in __GI_abort () at abort.c:90
#2  0x00007fe8f9b9bf47 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7fe8f9ca8608 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3  0x00007fe8f9ba3619 in malloc_printerr (ar_ptr=0x7fe78c000020, ptr=<optimized out>, str=0x7fe8f9ca86c8 "double free or corruption (fasttop)", action=3) at malloc.c:5023
#4  _int_free (av=0x7fe78c000020, p=<optimized out>, have_lock=0) at malloc.c:3845
#5  0x000055b5cd45651f in gsh_free (p=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.4/src/include/abstract_mem.h:271
#6  mdcache_key_delete (key=0x7fe78c0a91f0, key=0x7fe78c0a91f0) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_int.h:492
#7  mdcache_unlink (dir_hdl=0x7fe74c03cb28, obj_hdl=0x7fe78c0a8ce8, name=0x7fe638134790 "wiznet") at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1274
#8  0x000055b5cd38e7d4 in fsal_remove (parent=parent@entry=0x7fe74c03cb28, name=0x7fe638134790 "wiznet") at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/fsal_helper.c:1589
#9  0x000055b5cd3c9ccc in nfs4_op_remove (op=<optimized out>, data=<optimized out>, resp=0x7fe63803afd0) at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_remove.c:104
#10 0x000055b5cd3b597d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fe638037410) at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_Compound.c:734
#11 0x000055b5cd3a6b1c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fe8400008c0) at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1281
#12 0x000055b5cd3a818a in worker_run (ctx=0x55b5ce8ac680) at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1548
#13 0x000055b5cd431889 in fridgethr_start_routine (arg=0x55b5ce8ac680) at /usr/src/debug/nfs-ganesha-2.4.4/src/support/fridgethr.c:550
#14 0x00007fe8fa551e25 in start_thread (arg=0x7fe858718700) at pthread_create.c:308
#15 0x00007fe8f9c1f34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

</BT>


Moving this back to Dev for a re-look.

Comment 13 Daniel Gryniewicz 2017-07-14 14:41:24 UTC

I think that's this one:
https://github.com/nfs-ganesha/nfs-ganesha/commit/0b169127b80259fd8e6fce08e2a62408d30524da

Comment 15 Ambarish 2017-07-18 03:03:28 UTC

Can one of the devs move this to ON_QA, please?

Comment 16 Ambarish 2017-07-19 17:35:35 UTC

Verified on nfs-ganesha-2.4.4-16.

Comment 18 errata-xmlrpc 2017-09-21 04:47:57 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2779
