Bug 1471690

Summary:	[Ganesha] : Ganesha crashed (pub_glfs_fstat) when IO resumed post failover/failback.
Product:	[Retired] nfs-ganesha	Reporter:	Ambarish <asoman>
Component:	NFS	Assignee:	Frank Filz <ffilz>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	2.4	CC:	bturner, dang, ffilz, jthottan, kkeithle, mbenjamin, pasik, rhinduja, skoduri
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-11-22 15:48:43 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Ambarish 2017-07-17 09:19:33 UTC

Description of problem:
----------------------

4 node cluster,4 clients mounted the volume via v4 and were running kernel untar in separate directories.

I killed Ganesha on one of my nodes to simulate a failover,Ganesha crashed on another node as soon as the IO resumed post grace period and dumped the following core :


<BT>

(gdb) bt
#0  0x00007fcae63b47ca in pub_glfs_fstat (glfd=0x1000000000000000, stat=stat@entry=0x7fcab7ff51c0)
    at glfs-fops.c:377
#1  0x00007fcae67e117c in glusterfs_open2 (obj_hdl=0x7fc9ec010c20, state=0x7fc9ec00d8c0, 
    openflags=<optimized out>, createmode=FSAL_EXCLUSIVE, name=<optimized out>, attrib_set=<optimized out>, 
    verifier=0x7fcab7ff56c0 "\030\360,\001\070o", new_obj=0x7fcab7ff5340, attrs_out=0x7fcab7ff5350, 
    caller_perm_check=0x7fcab7ff54bf) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/FSAL_GLUSTER/handle.c:1398
#2  0x00005623f0f701ef in mdcache_open2 (obj_hdl=0x7fc9ec006ef8, state=0x7fc9ec00d8c0, openflags=<optimized out>, 
    createmode=FSAL_EXCLUSIVE, name=0x0, attrs_in=0x7fcab7ff55e0, verifier=0x7fcab7ff56c0 "\030\360,\001\070o", 
    new_obj=0x7fcab7ff5580, attrs_out=0x0, caller_perm_check=0x7fcab7ff54bf)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:657
#3  0x00005623f0ea5cbb in fsal_open2 (in_obj=0x7fc9ec006ef8, state=0x7fc9ec00d8c0, openflags=openflags@entry=2, 
    createmode=createmode@entry=FSAL_EXCLUSIVE, name=<optimized out>, attr=attr@entry=0x7fcab7ff55e0, 
    verifier=verifier@entry=0x7fcab7ff56c0 "\030\360,\001\070o", obj=obj@entry=0x7fcab7ff5580, 
    attrs_out=attrs_out@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/fsal_helper.c:1846
#4  0x00005623f0e91350 in open4_ex (arg=arg@entry=0x7fca30001878, data=data@entry=0x7fcab7ff6180, 
    res_OPEN4=res_OPEN4@entry=0x7fc9ec000ba8, clientid=<optimized out>, owner=0x7fca0c0116f0, 
    file_state=file_state@entry=0x7fcab7ff5fa0, new_state=new_state@entry=0x7fcab7ff5f8f)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_open.c:1441
#5  0x00005623f0ed9469 in nfs4_op_open (op=0x7fca30001870, data=0x7fcab7ff6180, resp=0x7fc9ec000ba0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_open.c:1845
#6  0x00005623f0ecb97d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fc9ec0009f0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_Compound.c:734
#7  0x00005623f0ebcb1c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fca300008c0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1281
#8  0x00005623f0ebe18a in worker_run (ctx=0x5623f1c636c0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1548
#9  0x00005623f0f47889 in fridgethr_start_routine (arg=0x5623f1c636c0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/support/fridgethr.c:550
#10 0x00007fcae9743e25 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fcae8e1134d in clone () from /lib64/libc.so.6
(gdb) 

</BT>


I triggered a failback and it crashed again on another one of my nodes with the same BT.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

glusterfs-ganesha-3.8.4-33.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.4-14.el7rhgs.x86_64


How reproducible:
------------------

2/2

Comment 2 Soumya Koduri 2017-07-25 11:37:02 UTC

@Frank/Dan

The core analysis for this issue is here - https://bugzilla.redhat.com/show_bug.cgi?id=1469474#c5 

From code inspection at-least I cant see an issue with passing my_fd directly as glusterfs_open_my_fd input argument..
Even on gdb, I observed that my_fd values are intact after we come out of this function. Do you see any race here or am I missing to look at some basic fundamental issue here?

Comment 3 Daniel Gryniewicz 2017-07-25 12:33:08 UTC

passing my_fd is fine, but I think the issue was passing &my_fd, which likely causes memory corruption.

Comment 4 Soumya Koduri 2017-07-25 13:15:28 UTC

Sorry. do you mean other way around? Here is the current code flow

glusterfs_open2() {

...
        struct glusterfs_fd *my_fd = NULL;
....

        if (state != NULL)
                my_fd = &container_of(state, struct glusterfs_state_fd,
                                      state)->glusterfs_fd;

....
....
                /* truncate is set in p_flags */
                status = glusterfs_open_my_fd(myself, openflags, p_flags,
                                              my_fd);

>>> we are passing my_fd itself here.

           |
           v

fsal_status_t glusterfs_open_my_fd(struct glusterfs_handle *objhandle,
                                   fsal_openflags_t openflags,
                                   int posix_flags,
                                   struct glusterfs_fd *my_fd)
{


...
        my_fd->glfd = glfd;
        my_fd->openflags = openflags;
        my_fd->creds.caller_uid = op_ctx->creds->caller_uid;
        my_fd->creds.caller_gid = op_ctx->creds->caller_gid;
        my_fd->creds.caller_glen = op_ctx->creds->caller_glen;
        garray_copy = &my_fd->creds.caller_garray;

...

        if ((*garray_copy) != NULL) {
                /* Replace old creds */
                gsh_free(*garray_copy);

}


Or do you think the issue mentioned in https://review.gerrithub.io/#/c/371135/ wrt using caller_garray post its free may have caused this memory_corruption in my_fd as well here? I am trying to reproduce this issue to check if the https://review.gerrithub.io/#/c/371135/ fixes this issue as well. Thanks!

Comment 5 Daniel Gryniewicz 2017-07-25 13:22:58 UTC

No, I think we're in agreement.  The code above is correct; the code that was in place when https://bugzilla.redhat.com/show_bug.cgi?id=1469474 hit (before the revert) looked like this:

		/* truncate is set in p_flags */
		status = glusterfs_open_my_fd(myself, openflags, p_flags,
					       &my_fd);

Note &my_fd there.  That code is wrong.  I think it's unrelated to the garray_copy bits.

Comment 6 Soumya Koduri 2017-07-25 13:47:47 UTC

Oh..Dan++ thanks.. I dint notice that change in the backport. 

So this bug just needs re-test with 2.5 sources and then can mark it CLOSED/VERIFIED.