Bug 1466700

Summary: [Ganesha] : Ganesha crashed while running dbench during handle_digest
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Ambarish <asoman>
Component: nfs-ganesha
Assignee: Kaleb KEITHLEY <kkeithle>
Status: CLOSED WONTFIX
QA Contact: Ambarish <asoman>
Severity: high
Priority: unspecified
Version: rhgs-3.3
CC: amukherj, asoman, bturner, dang, ffilz, jthottan, kkeithle, mbenjamin, rcyriac, rhinduja, rhs-bugs, skoduri, storage-qa-internal
Hardware: x86_64
OS: Linux
Last Closed: 2017-08-10 07:08:28 UTC
Type: Bug

Description Ambarish 2017-06-30 09:38:28 UTC
Description of problem:
------------------------

On a 2-node Ganesha HA cluster, 4 clients mounted a Gluster volume via NFSv4 and ran dbench in a loop.

Ganesha crashed on one of my nodes.

This is the backtrace from the core:

<BT>

(gdb) bt
#0  0x00007f37f6a7b4f4 in handle_digest (obj_hdl=0x7f36680021a8, output_type=FSAL_DIGEST_NFSV4, 
    fh_desc=0x7f3759728530) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/FSAL_GLUSTER/handle.c:2556
#1  0x00005612247ce151 in mdcache_handle_digest (obj_hdl=<optimized out>, out_type=<optimized out>, 
    fh_desc=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1352
#2  0x00005612247a0e99 in nfs4_FSALToFhandle (allocate=allocate@entry=false, fh4=fh4@entry=0x7f3759728a00, 
    fsalhandle=fsalhandle@entry=0x7f3668008028, exp=0x561224bdef48)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/support/nfs_filehandle_mgmt.c:143
#3  0x00005612247423a6 in nfs4_readdir_callback (opaque=0x7f3759728b90, obj=0x7f3668008028, attr=0x7f3759728d40, 
    mounted_on_fileid=11966847287241890925, cookie=<optimized out>, cb_state=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_readdir.c:325
#4  0x0000561224706829 in populate_dirent (name=<optimized out>, obj=0x7f3668008028, 
    attrs=attrs@entry=0x7f3759728d40, dir_state=dir_state@entry=0x7f3759728e90, cookie=18221028889009217444)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/fsal_helper.c:1321
#5  0x00005612247cefef in mdcache_readdir (dir_hdl=0x7f368800f908, whence=<optimized out>, dir_state=0x7f3759728e90, 
    cb=0x5612247067d0 <populate_dirent>, attrmask=122830, eod_met=0x7f3759728f5b)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:707
#6  0x00005612247085dd in fsal_readdir (directory=directory@entry=0x7f368800f908, cookie=cookie@entry=0, 
    nbfound=nbfound@entry=0x7f3759728f5c, eod_met=eod_met@entry=0x7f3759728f5b, attrmask=122830, 
    cb=cb@entry=0x561224741f40 <nfs4_readdir_callback>, opaque=opaque@entry=0x7f3759728f60)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/fsal_helper.c:1505
#7  0x0000561224742f0b in nfs4_op_readdir (op=0x7f37040011b0, data=0x7f3759729180, resp=0x7f34f8015ae0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_readdir.c:631
#8  0x000056122472f97d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7f34f8009840)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_Compound.c:734
#9  0x0000561224720b1c in nfs_rpc_execute (reqdata=reqdata@entry=0x7f37040008c0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1281
#10 0x000056122472218a in worker_run (ctx=0x561227c94a70)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1548
#11 0x00005612247ab889 in fridgethr_start_routine (arg=0x561227c94a70)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/support/fridgethr.c:550
#12 0x00007f37f99e0e25 in start_thread (arg=0x7f375972a700) at pthread_create.c:308
#13 0x00007f37f90ae34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) 

</BT>

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

nfs-ganesha-2.4.4-10.el7rhgs.x86_64

glusterfs-ganesha-3.8.4-31.el7rhgs.x86_64


How reproducible:
----------------

Fairly consistent.


Additional info:
---------------

[root@gqas014 tmp]# gluster v info
 
Volume Name: butcher
Type: Distributed-Disperse
Volume ID: 22c652d8-0754-438a-8131-373bad7c12ab
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (4 + 2) = 24
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick2: gqas007.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick3: gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick4: gqas007.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick5: gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick6: gqas007.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick7: gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick8: gqas007.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick9: gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick10: gqas007.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick11: gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick12: gqas007.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick13: gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick14: gqas007.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick15: gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick16: gqas007.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick17: gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick18: gqas007.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick19: gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Brick20: gqas007.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Brick21: gqas014.sbu.lab.eng.bos.redhat.com:/bricks11/A1
Brick22: gqas007.sbu.lab.eng.bos.redhat.com:/bricks11/A1
Brick23: gqas014.sbu.lab.eng.bos.redhat.com:/bricks12/A1
Brick24: gqas007.sbu.lab.eng.bos.redhat.com:/bricks12/A1
Options Reconfigured:
ganesha.enable: on
network.inode-lru-limit: 50000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Comment 3 Daniel Gryniewicz 2017-06-30 16:31:55 UTC
So, I've looked this over, and have a few comments.

1. The core is inconsistent in some way.  The line it supposedly crashed on is not a statement that can cause a SEGFAULT in C (the line is if (!fh_desc) ).  This means something's wrong.  The debuginfo matches the binaries, so the only thing I can think of is that the binary doesn't match the core.  But the binary is 8 days old.  I also checked the one other thread doing anything, and its source doesn't line up either (it's supposedly stopped on a brace).

2. The memory in both functions is all valid.  It can all be read.  This means, to me, that (1) is causing the location of the crash to be mis-reported, and it's actually crashing somewhere else.

Since this is reproducible, can it be reproduced again with a new set of core files?  Hopefully they'll match up better.
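
To illustrate point 1, here is a minimal sketch of why a line like if (!fh_desc) cannot fault by itself. It is not the nfs-ganesha source: sketch_digest, struct sketch_buf, struct sketch_handle and SKETCH_HANDLE_LEN are hypothetical names made up for this example. The NULL test only compares the pointer value; the first accesses that can actually hit bad memory come after it.

```c
#include <assert.h>   /* used by the usage checks */
#include <stddef.h>
#include <string.h>

#define SKETCH_HANDLE_LEN 16 /* hypothetical size, for illustration only */

/* Hypothetical stand-ins; these are not the nfs-ganesha types. */
struct sketch_buf {
    void *addr;
    size_t len;
};

struct sketch_handle {
    unsigned char key[SKETCH_HANDLE_LEN];
};

/* Returns 0 on success, -1 on a bad argument. */
int sketch_digest(const struct sketch_handle *hdl, struct sketch_buf *out)
{
    /* Compares the pointer value only. No memory behind the pointer is
     * read, so this line cannot fault even if the pointer is bad. */
    if (!out)
        return -1;

    /* First real memory accesses: out->addr, out->len and, below,
     * hdl->key. A stale or freed pointer would fault here, not above. */
    if (hdl == NULL || out->addr == NULL || out->len < SKETCH_HANDLE_LEN)
        return -1;

    memcpy(out->addr, hdl->key, SKETCH_HANDLE_LEN);
    out->len = SKETCH_HANDLE_LEN;
    return 0;
}
```

If a debugger attributes a SEGFAULT to the first check, the reported line is probably off, which is consistent with the suspicion that the core and the binary do not match.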

Comment 6 Daniel Gryniewicz 2017-07-05 12:47:59 UTC
For that core, I get this:

warning: exec file is newer than core file.

And there's no backtrace.  Presumably the change to 2.4.4-8 broke the backtrace.  Can you put the correct binaries back?  That might fix the core.  If not, you'll have to reproduce again.

Comment 9 Daniel Gryniewicz 2017-08-10 12:59:18 UTC
Note to self:  This looks like maybe the subhandle was released while the entry was valid.  Should not happen in >=2.5, due to chunked readdir.
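
The hypothesis above (a sub-handle released while the cache entry that wraps it is still valid) can be sketched as follows. This is only an illustration of the general pattern, not the MDCACHE or FSAL_GLUSTER code: every name here (sub_handle, cache_entry, SUB_KEY_LEN and the helper functions) is hypothetical, and the reference count is deliberately non-atomic, so the sketch is single-threaded only. The idea is that the wrapping entry takes its own reference, so the sub-handle cannot be freed underneath it.

```c
#include <assert.h>   /* used by the usage checks */
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SUB_KEY_LEN 16 /* hypothetical key size */

/* Hypothetical lower-layer handle, freed when refcnt drops to zero. */
struct sub_handle {
    int refcnt; /* non-atomic: single-threaded sketch only */
    unsigned char key[SUB_KEY_LEN];
};

/* Hypothetical cache entry that wraps a sub-handle. */
struct cache_entry {
    struct sub_handle *sub;
};

struct sub_handle *sub_handle_new(const unsigned char *key)
{
    struct sub_handle *sub = calloc(1, sizeof(*sub));

    if (sub == NULL)
        return NULL;
    sub->refcnt = 1; /* reference owned by the creator */
    memcpy(sub->key, key, SUB_KEY_LEN);
    return sub;
}

/* Drops one reference; returns true if the handle was freed. */
bool sub_handle_put(struct sub_handle *sub)
{
    if (--sub->refcnt > 0)
        return false;
    free(sub);
    return true;
}

/* The entry takes its own reference, so a release by another owner
 * cannot free the sub-handle while the entry is still valid. */
void cache_entry_init(struct cache_entry *entry, struct sub_handle *sub)
{
    sub->refcnt++;
    entry->sub = sub;
}

/* Drops the entry's reference and clears the pointer so later use is
 * detectable; returns true if the sub-handle was freed. */
bool cache_entry_release(struct cache_entry *entry)
{
    bool freed = sub_handle_put(entry->sub);

    entry->sub = NULL;
    return freed;
}

/* Copies the key out through the entry; -1 if the entry is released
 * or the output buffer is too small. */
int cache_entry_digest(const struct cache_entry *entry,
                       unsigned char *out, size_t len)
{
    if (entry->sub == NULL || len < SUB_KEY_LEN)
        return -1;
    memcpy(out, entry->sub->key, SUB_KEY_LEN);
    return 0;
}
```

Without the extra reference taken in cache_entry_init, the creator's sub_handle_put would free the sub-handle and a later cache_entry_digest would read freed memory, which is the kind of failure the note describes. A real multi-threaded server would also need atomic counters or locking.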