Bug 1469474 - [Ganesha] : Ganesha crashed (pub_glfs_fstat) when IO resumed post failover/failback.
Status: CLOSED NOTABUG
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: nfs-ganesha
Version: 3.3
Hardware: x86_64 Linux
Priority: unspecified
Severity: low
Assigned To: Soumya Koduri
QA Contact: Ambarish
Depends On:
Blocks:
Reported: 2017-07-11 06:49 EDT by Ambarish
Modified: 2017-07-17 05:21 EDT

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-07-17 05:21:54 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Ambarish 2017-07-11 06:49:29 EDT
Description of problem:
----------------------

4-node cluster; 4 clients mounted the volume via NFSv4 and were running kernel untars in separate directories.

I killed Ganesha on one of the nodes to simulate a failover. Ganesha crashed on another node as soon as I/O resumed after the grace period and dumped the following core:


<BT>

(gdb) bt
#0  0x00007fcae63b47ca in pub_glfs_fstat (glfd=0x1000000000000000, stat=stat@entry=0x7fcab7ff51c0)
    at glfs-fops.c:377
#1  0x00007fcae67e117c in glusterfs_open2 (obj_hdl=0x7fc9ec010c20, state=0x7fc9ec00d8c0, 
    openflags=<optimized out>, createmode=FSAL_EXCLUSIVE, name=<optimized out>, attrib_set=<optimized out>, 
    verifier=0x7fcab7ff56c0 "\030\360,\001\070o", new_obj=0x7fcab7ff5340, attrs_out=0x7fcab7ff5350, 
    caller_perm_check=0x7fcab7ff54bf) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/FSAL_GLUSTER/handle.c:1398
#2  0x00005623f0f701ef in mdcache_open2 (obj_hdl=0x7fc9ec006ef8, state=0x7fc9ec00d8c0, openflags=<optimized out>, 
    createmode=FSAL_EXCLUSIVE, name=0x0, attrs_in=0x7fcab7ff55e0, verifier=0x7fcab7ff56c0 "\030\360,\001\070o", 
    new_obj=0x7fcab7ff5580, attrs_out=0x0, caller_perm_check=0x7fcab7ff54bf)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:657
#3  0x00005623f0ea5cbb in fsal_open2 (in_obj=0x7fc9ec006ef8, state=0x7fc9ec00d8c0, openflags=openflags@entry=2, 
    createmode=createmode@entry=FSAL_EXCLUSIVE, name=<optimized out>, attr=attr@entry=0x7fcab7ff55e0, 
    verifier=verifier@entry=0x7fcab7ff56c0 "\030\360,\001\070o", obj=obj@entry=0x7fcab7ff5580, 
    attrs_out=attrs_out@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/fsal_helper.c:1846
#4  0x00005623f0e91350 in open4_ex (arg=arg@entry=0x7fca30001878, data=data@entry=0x7fcab7ff6180, 
    res_OPEN4=res_OPEN4@entry=0x7fc9ec000ba8, clientid=<optimized out>, owner=0x7fca0c0116f0, 
    file_state=file_state@entry=0x7fcab7ff5fa0, new_state=new_state@entry=0x7fcab7ff5f8f)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_open.c:1441
#5  0x00005623f0ed9469 in nfs4_op_open (op=0x7fca30001870, data=0x7fcab7ff6180, resp=0x7fc9ec000ba0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_open.c:1845
#6  0x00005623f0ecb97d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fc9ec0009f0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_Compound.c:734
#7  0x00005623f0ebcb1c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fca300008c0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1281
#8  0x00005623f0ebe18a in worker_run (ctx=0x5623f1c636c0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1548
#9  0x00005623f0f47889 in fridgethr_start_routine (arg=0x5623f1c636c0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/support/fridgethr.c:550
#10 0x00007fcae9743e25 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fcae8e1134d in clone () from /lib64/libc.so.6
(gdb) 

</BT>
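
For what it's worth, the glfd value in frame #0 (0x1000000000000000) is not a plausible heap pointer, so the fault is the first dereference of the handle inside pub_glfs_fstat rather than anything in the stat path itself. A minimal standalone sketch of that failure mode, using made-up stand-in types rather than the real libgfapi structures:

/*
 * Minimal sketch (made-up types, not the real libgfapi code) of why the
 * frame #0 handle value faults: the fstat path has to dereference the
 * handle before it can do anything, so a garbage, non-NULL pointer
 * crashes immediately regardless of the state of the underlying file.
 */
#include <stdint.h>
#include <sys/stat.h>

struct fake_glfd {                  /* stand-in for the opaque glfs_fd handle */
    int fd;
};

static int fake_fstat(struct fake_glfd *glfd, struct stat *st)
{
    if (glfd == NULL)               /* a NULL check does not catch this case */
        return -1;
    return fstat(glfd->fd, st);     /* invalid, non-NULL pointer -> SIGSEGV here */
}

int main(void)
{
    struct stat st;
    /* The bogus handle value seen in the backtrace. */
    struct fake_glfd *bad = (struct fake_glfd *)(uintptr_t)0x1000000000000000ULL;

    return fake_fstat(bad, &st);    /* crashes when run */
}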


I then triggered a failback, and Ganesha crashed again on another node with the same backtrace.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

glusterfs-ganesha-3.8.4-33.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.4-14.el7rhgs.x86_64


How reproducible:
------------------

2/2



Additional info:
----------------

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 8ff33add-f1ad-4ca0-a43c-3fbfce61cc4a
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas008.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
Comment 6 Daniel Gryniewicz 2017-07-11 09:21:39 EDT
This is caused by passing the address of my_fd to glusterfs_open_my_fd(), I think. It's not that way in 2.5, and my compiler complains.
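
To make the pointer mix-up concrete, here is a hypothetical sketch with simplified stand-in types, not the actual FSAL_GLUSTER structures or the real glusterfs_open_my_fd() signature: passing the address of a pointer where the callee expects the structure itself makes the callee write through the wrong memory, so the handle that later reaches glfs_fstat() is garbage.

/*
 * Hypothetical illustration of the indirection mistake described above,
 * using simplified stand-in types rather than the actual FSAL_GLUSTER
 * structures or the real glusterfs_open_my_fd() signature.
 */
#include <stdio.h>

struct glfd { int fd; };                 /* stand-in for glfs_fd      */
struct gfapi_fd { struct glfd *glfd; };  /* stand-in for glusterfs_fd */

/* The callee expects a pointer to the fd container and fills in ->glfd. */
static void open_my_fd(struct gfapi_fd *my_fd, struct glfd *handle)
{
    my_fd->glfd = handle;
}

int main(void)
{
    struct glfd handle = { .fd = 3 };
    struct gfapi_fd container = { 0 };
    struct gfapi_fd *my_fd = &container;

    open_my_fd(my_fd, &handle);           /* correct: pass the container */

    /*
     * Buggy variant: passing the address of the *pointer* my_fd. The callee
     * then writes through &my_fd reinterpreted as a container, so whatever
     * is later read back as the glfs handle is garbage, much like the
     * 0x1000000000000000 value in frame #0 of the backtrace.
     */
    open_my_fd((struct gfapi_fd *)&my_fd, &handle);

    printf("glfd after the buggy call: %p\n", (void *)my_fd->glfd);
    return 0;
}

(The cast is only there to make the broken call compile at all; without it, the compiler flags the incompatible pointer type, which matches the "my compiler complains" note above.)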
Comment 12 Ambarish 2017-07-17 05:21:54 EDT
The upstream bug has been filed: https://bugzilla.redhat.com/show_bug.cgi?id=1471690

Closing this as NOTABUG, since these crashes are no longer reproducible with the non-root patches reverted.
