Bug 1471690 - [Ganesha] : Ganesha crashed (pub_glfs_fstat) when IO resumed post failover/failback.
Status: NEW
Product: nfs-ganesha
Classification: Community
Component: NFS
Version: 2.4
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Assigned To: Frank Filz
Depends On:
Blocks:
Reported: 2017-07-17 05:19 EDT by Ambarish
Modified: 2017-07-25 09:47 EDT
CC: 8 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Ambarish 2017-07-17 05:19:33 EDT
Description of problem:
----------------------

4-node cluster; 4 clients mounted the volume via NFSv4 and were running kernel untars in separate directories.

I killed Ganesha on one of the nodes to simulate a failover. Ganesha crashed on another node as soon as IO resumed after the grace period, dumping the following core:


<BT>

(gdb) bt
#0  0x00007fcae63b47ca in pub_glfs_fstat (glfd=0x1000000000000000, stat=stat@entry=0x7fcab7ff51c0)
    at glfs-fops.c:377
#1  0x00007fcae67e117c in glusterfs_open2 (obj_hdl=0x7fc9ec010c20, state=0x7fc9ec00d8c0, 
    openflags=<optimized out>, createmode=FSAL_EXCLUSIVE, name=<optimized out>, attrib_set=<optimized out>, 
    verifier=0x7fcab7ff56c0 "\030\360,\001\070o", new_obj=0x7fcab7ff5340, attrs_out=0x7fcab7ff5350, 
    caller_perm_check=0x7fcab7ff54bf) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/FSAL_GLUSTER/handle.c:1398
#2  0x00005623f0f701ef in mdcache_open2 (obj_hdl=0x7fc9ec006ef8, state=0x7fc9ec00d8c0, openflags=<optimized out>, 
    createmode=FSAL_EXCLUSIVE, name=0x0, attrs_in=0x7fcab7ff55e0, verifier=0x7fcab7ff56c0 "\030\360,\001\070o", 
    new_obj=0x7fcab7ff5580, attrs_out=0x0, caller_perm_check=0x7fcab7ff54bf)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:657
#3  0x00005623f0ea5cbb in fsal_open2 (in_obj=0x7fc9ec006ef8, state=0x7fc9ec00d8c0, openflags=openflags@entry=2, 
    createmode=createmode@entry=FSAL_EXCLUSIVE, name=<optimized out>, attr=attr@entry=0x7fcab7ff55e0, 
    verifier=verifier@entry=0x7fcab7ff56c0 "\030\360,\001\070o", obj=obj@entry=0x7fcab7ff5580, 
    attrs_out=attrs_out@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/fsal_helper.c:1846
#4  0x00005623f0e91350 in open4_ex (arg=arg@entry=0x7fca30001878, data=data@entry=0x7fcab7ff6180, 
    res_OPEN4=res_OPEN4@entry=0x7fc9ec000ba8, clientid=<optimized out>, owner=0x7fca0c0116f0, 
    file_state=file_state@entry=0x7fcab7ff5fa0, new_state=new_state@entry=0x7fcab7ff5f8f)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_open.c:1441
#5  0x00005623f0ed9469 in nfs4_op_open (op=0x7fca30001870, data=0x7fcab7ff6180, resp=0x7fc9ec000ba0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_open.c:1845
#6  0x00005623f0ecb97d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fc9ec0009f0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_Compound.c:734
#7  0x00005623f0ebcb1c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fca300008c0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1281
#8  0x00005623f0ebe18a in worker_run (ctx=0x5623f1c636c0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1548
#9  0x00005623f0f47889 in fridgethr_start_routine (arg=0x5623f1c636c0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/support/fridgethr.c:550
#10 0x00007fcae9743e25 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fcae8e1134d in clone () from /lib64/libc.so.6
(gdb) 

</BT>


I triggered a failback and it crashed again on another one of my nodes with the same BT.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

glusterfs-ganesha-3.8.4-33.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.4-14.el7rhgs.x86_64


How reproducible:
------------------

2/2
Comment 2 Soumya Koduri 2017-07-25 07:37:02 EDT
@Frank/Dan

The core analysis for this issue is here - https://bugzilla.redhat.com/show_bug.cgi?id=1469474#c5 

From code inspection at least, I can't see an issue with passing my_fd directly as the glusterfs_open_my_fd input argument.
Even in gdb, I observed that the my_fd values are intact after we come out of this function. Do you see any race here, or am I missing some basic fundamental issue?
Comment 3 Daniel Gryniewicz 2017-07-25 08:33:08 EDT
Passing my_fd is fine, but I think the issue was passing &my_fd, which likely causes memory corruption.
Comment 4 Soumya Koduri 2017-07-25 09:15:28 EDT
Sorry, do you mean the other way around? Here is the current code flow:

glusterfs_open2() {

...
        struct glusterfs_fd *my_fd = NULL;
....

        if (state != NULL)
                my_fd = &container_of(state, struct glusterfs_state_fd,
                                      state)->glusterfs_fd;

....
....
                /* truncate is set in p_flags */
                status = glusterfs_open_my_fd(myself, openflags, p_flags,
                                              my_fd);

>>> we are passing my_fd itself here.

           |
           v

fsal_status_t glusterfs_open_my_fd(struct glusterfs_handle *objhandle,
                                   fsal_openflags_t openflags,
                                   int posix_flags,
                                   struct glusterfs_fd *my_fd)
{


...
        my_fd->glfd = glfd;
        my_fd->openflags = openflags;
        my_fd->creds.caller_uid = op_ctx->creds->caller_uid;
        my_fd->creds.caller_gid = op_ctx->creds->caller_gid;
        my_fd->creds.caller_glen = op_ctx->creds->caller_glen;
        garray_copy = &my_fd->creds.caller_garray;

...

        if ((*garray_copy) != NULL) {
                /* Replace old creds */
                gsh_free(*garray_copy);

}


Or do you think the issue mentioned in https://review.gerrithub.io/#/c/371135/ w.r.t. using caller_garray after it is freed may have caused this memory corruption in my_fd as well? I am trying to reproduce this issue to check whether https://review.gerrithub.io/#/c/371135/ fixes this issue as well. Thanks!
Comment 5 Daniel Gryniewicz 2017-07-25 09:22:58 EDT
No, I think we're in agreement.  The code above is correct; the code that was in place when https://bugzilla.redhat.com/show_bug.cgi?id=1469474 hit (before the revert) looked like this:

		/* truncate is set in p_flags */
		status = glusterfs_open_my_fd(myself, openflags, p_flags,
					       &my_fd);

Note &my_fd there.  That code is wrong.  I think it's unrelated to the garray_copy bits.
Comment 6 Soumya Koduri 2017-07-25 09:47:47 EDT
Oh, Dan++, thanks. I didn't notice that change in the backport.

So this bug just needs a re-test with the 2.5 sources, and then it can be marked CLOSED/VERIFIED.
