Bug 1466988

Summary: [Ganesha] : Ganesha crashed on writes during __inode_destroy
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Ambarish <asoman>
Component: nfs-ganesha
Assignee: Kaleb KEITHLEY <kkeithle>
Status: CLOSED WONTFIX
QA Contact: Ambarish <asoman>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.3
CC: bturner, dang, ffilz, jthottan, kkeithle, mbenjamin, rcyriac, rhinduja, rhs-bugs, skoduri, storage-qa-internal
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-10 07:09:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Ambarish 2017-07-01 06:13:51 UTC
Description of problem:
-----------------------

2-node cluster, 4 clients, each writing into its own subdirectory (using Bonnie, dbench, kernel untar).

Ganesha crashed multiple times with the same backtrace on one of my nodes:

(gdb) bt
#0  __inode_ctx_free (inode=inode@entry=0x7f761000d6b0) at inode.c:331
#1  0x00007f78e3e69212 in __inode_destroy (inode=0x7f761000d6b0) at inode.c:353
#2  inode_table_prune (table=table@entry=0x7f78cc0f3410) at inode.c:1543
#3  0x00007f78e3e694f4 in inode_unref (inode=0x7f761000d6b0) at inode.c:524
#4  0x00007f78e3e58092 in loc_wipe (loc=loc@entry=0x7f77281212a0) at xlator.c:748
#5  0x00007f78d2ebc2bd in ec_fop_data_release (fop=0x7f7728120fe0) at ec-data.c:302
#6  0x00007f78d2ebec18 in ec_resume (fop=<optimized out>, error=<optimized out>) at ec-common.c:337
#7  0x00007f78d2ec0acb in ec_lock_assign_owner (link=0x7f7634175940) at ec-common.c:1710
#8  ec_lock (fop=0x7f76341758c0) at ec-common.c:1788
#9  0x00007f78d2ecc84f in ec_manager_opendir (fop=0x7f76341758c0, state=<optimized out>) at ec-dir-read.c:144
#10 0x00007f78d2ebea2b in __ec_manager (fop=0x7f76341758c0, error=0) at ec-common.c:2381
#11 0x00007f78d2eb867d in ec_gf_opendir (frame=<optimized out>, this=<optimized out>, loc=<optimized out>, 
    fd=<optimized out>, xdata=<optimized out>) at ec.c:952
#12 0x00007f78d2c693d7 in dht_opendir (frame=frame@entry=0x7f763401b070, this=this@entry=0x7f78cc03bd90, 
    loc=loc@entry=0x7f774c002080, fd=fd@entry=0x7f774c000f90, xdata=xdata@entry=0x7f763412f5d0) at dht-common.c:4960
#13 0x00007f78e3ed564b in default_opendir (frame=frame@entry=0x7f763401b070, this=this@entry=0x7f78cc03d6c0, 
    loc=loc@entry=0x7f774c002080, fd=fd@entry=0x7f774c000f90, xdata=xdata@entry=0x7f763412f5d0) at defaults.c:2956
#14 0x00007f78e3ed564b in default_opendir (frame=0x7f763401b070, this=<optimized out>, loc=0x7f774c002080, 
    fd=0x7f774c000f90, xdata=0x7f763412f5d0) at defaults.c:2956
#15 0x00007f78d2603453 in rda_opendir (frame=frame@entry=0x7f763402c850, this=this@entry=0x7f78cc040670, 
    loc=loc@entry=0x7f774c002080, fd=fd@entry=0x7f774c000f90, xdata=xdata@entry=0x7f763412f5d0)
    at readdir-ahead.c:570
#16 0x00007f78e3ed564b in default_opendir (frame=frame@entry=0x7f763402c850, this=this@entry=0x7f78cc0421a0, 
    loc=loc@entry=0x7f774c002080, fd=fd@entry=0x7f774c000f90, xdata=xdata@entry=0x7f763412f5d0) at defaults.c:2956
#17 0x00007f78e3ed564b in default_opendir (frame=frame@entry=0x7f763402c850, this=this@entry=0x7f78cc043a40, 
    loc=loc@entry=0x7f774c002080, fd=fd@entry=0x7f774c000f90, xdata=xdata@entry=0x7f763412f5d0) at defaults.c:2956
#18 0x00007f78e3ed564b in default_opendir (frame=0x7f763402c850, this=<optimized out>, loc=0x7f774c002080, 
    fd=0x7f774c000f90, xdata=0x7f763412f5d0) at defaults.c:2956
#19 0x00007f78d1dc6bb7 in mdc_opendir (frame=0x7f763419a630, this=0x7f78cc0465e0, loc=0x7f774c002080, 
    fd=0x7f774c000f90, xdata=0x7f763412f5d0) at md-cache.c:2322
#20 0x00007f78e3ef0b8a in default_opendir_resume (frame=0x7f774c010c00, this=0x7f78cc047f20, loc=0x7f774c002080, 
    fd=0x7f774c000f90, xdata=0x0) at defaults.c:2181
#21 0x00007f78e3e7d125 in call_resume (stub=0x7f774c002030) at call-stub.c:2508
#22 0x00007f78d1bbd957 in iot_worker (data=0x7f78cc057ad0) at io-threads.c:220
#23 0x00007f78e74c1e25 in start_thread (arg=0x7f78c01b7700) at pthread_create.c:308
#24 0x00007f78e6b8f34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) 
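
For context, frames #0-#3 are glusterfs's inode LRU pruning path: loc_wipe() (frame #4) drops a reference, inode_unref() hands the inode table to inode_table_prune(), and the inode is torn down via __inode_destroy() and __inode_ctx_free(). The sketch below is a simplified, hypothetical model of that path (all types and names invented for illustration; the real code lives in libglusterfs/src/inode.c), showing why a stale context slot, or an inode destroyed once too often, faults exactly in the ctx-free step:

/* Simplified, hypothetical model of frames #0-#3. All types and names
 * here are illustrative; this is not the actual glusterfs source. */
#include <stdlib.h>

#define CTX_SLOTS 16

typedef struct xlator xlator_t;
struct xlator {
    void (*forget)(xlator_t *xl, void *value); /* per-xlator cleanup hook */
};

typedef struct {
    xlator_t *xl;     /* owning translator for this slot */
    void     *value;  /* translator-private data */
} ctx_slot_t;

typedef struct {
    int        ref;             /* reference count */
    ctx_slot_t ctx[CTX_SLOTS];  /* per-xlator context table */
} inode_t;

/* Analogue of __inode_ctx_free() (frame #0): walk every context slot and
 * invoke the owning xlator's cleanup. If a slot holds a dangling pointer,
 * or the inode itself was already freed once because of a refcount bug,
 * the dereferences here are the kind of access that faults at inode.c:331. */
static void inode_ctx_free(inode_t *inode)
{
    for (int i = 0; i < CTX_SLOTS; i++) {
        ctx_slot_t *slot = &inode->ctx[i];
        if (slot->xl && slot->xl->forget)
            slot->xl->forget(slot->xl, slot->value);
        slot->xl = NULL;
        slot->value = NULL;
    }
}

/* Analogue of __inode_destroy() (frame #1), reached from
 * inode_table_prune() (frame #2). */
void inode_destroy(inode_t *inode)
{
    inode_ctx_free(inode);
    free(inode);
}

/* Analogue of inode_unref() (frame #3), called from loc_wipe() (frame #4):
 * the last unref is what triggers pruning and destruction, so one extra
 * unref anywhere in a fop path destroys the inode under a live user. */
void inode_unref(inode_t *inode)
{
    if (--inode->ref == 0)
        inode_destroy(inode);
}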


Version-Release number of selected component (if applicable):
--------------------------------------------------------------

nfs-ganesha-2.4.4-10.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-31.el7rhgs.x86_64


How reproducible:
-----------------

Fairly often; the crash was hit multiple times with the same backtrace during the test run.


Additional info:
----------------

[root@gqas007 tmp]# gluster v info
 
Volume Name: butcher
Type: Distributed-Disperse
Volume ID: 22c652d8-0754-438a-8131-373bad7c12ab
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (4 + 2) = 24
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick2: gqas007.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick3: gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick4: gqas007.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick5: gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick6: gqas007.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick7: gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick8: gqas007.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick9: gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick10: gqas007.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick11: gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick12: gqas007.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick13: gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick14: gqas007.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick15: gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick16: gqas007.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick17: gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick18: gqas007.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick19: gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Brick20: gqas007.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Brick21: gqas014.sbu.lab.eng.bos.redhat.com:/bricks11/A1
Brick22: gqas007.sbu.lab.eng.bos.redhat.com:/bricks11/A1
Brick23: gqas014.sbu.lab.eng.bos.redhat.com:/bricks12/A1
Brick24: gqas007.sbu.lab.eng.bos.redhat.com:/bricks12/A1
Options Reconfigured:
ganesha.enable: on
network.inode-lru-limit: 50000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
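
Worth noting in connection with the backtrace: network.inode-lru-limit: 50000 caps how many unreferenced inodes an inode table keeps cached, and inode_table_prune() in frame #2 is the kind of routine that enforces such a limit, so heavy parallel writes keep this destroy path hot. A minimal sketch of that pruning policy follows (invented names; inode_destroy() here is a stub standing in for the teardown sketched earlier):

#include <stdlib.h>

typedef struct inode inode_t;
struct inode {
    inode_t *lru_next;  /* intrusive linkage on the table's LRU list */
};

typedef struct {
    inode_t *lru_head;  /* oldest unreferenced inode */
    size_t   lru_size;  /* unreferenced inodes currently cached */
    size_t   lru_limit; /* e.g. an LRU limit of 50000, as set here */
} inode_table_t;

/* Stub; the full teardown (ctx free + free) is sketched above. */
static void inode_destroy(inode_t *inode)
{
    free(inode);
}

/* Analogue of inode_table_prune() (frame #2): whenever an unref leaves the
 * table over its LRU limit, the oldest cached inodes are destroyed
 * immediately, on the unref caller's thread. */
void inode_table_prune(inode_table_t *table)
{
    while (table->lru_size > table->lru_limit && table->lru_head != NULL) {
        inode_t *victim = table->lru_head;
        table->lru_head = victim->lru_next;
        table->lru_size--;
        inode_destroy(victim);  /* frames #1 and #0 run from here */
    }
}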

Comment 6 Daniel Gryniewicz 2017-07-05 13:17:31 UTC
Could be memory corruption, could be a refcount error.  I'm not familiar enough with the Gluster codebase to know.

(Note: this entire backtrace is in Gluster code; no Ganesha code appears in it.)
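
To make the refcount hypothesis concrete, here is a minimal standalone pthread program (all names hypothetical; not glusterfs code) showing how a single extra unref produces exactly this failure shape: the losing thread tears the object down, including its context table, while another thread still holds what it believes is a valid reference.

/* Standalone illustration of the suspected failure mode: an extra unref
 * drops the count to zero while another thread still holds a reference,
 * so destruction (and ctx teardown) races with live use. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    pthread_mutex_t lock;
    int             ref;
    void           *ctx;   /* stands in for the per-xlator ctx table */
} obj_t;

static void obj_unref(obj_t *o)
{
    int dead;
    pthread_mutex_lock(&o->lock);
    dead = (--o->ref == 0);
    pthread_mutex_unlock(&o->lock);
    if (dead) {
        free(o->ctx);      /* analogue of __inode_ctx_free() */
        o->ctx = NULL;
        /* free(o) omitted so the demo prints instead of faulting */
    }
}

static void *writer(void *arg)
{
    obj_t *o = arg;
    obj_unref(o);          /* correct: releases this thread's reference */
    obj_unref(o);          /* BUG: one unref too many -- the refcount
                              error Comment 6 hypothesizes */
    return NULL;
}

int main(void)
{
    obj_t *o = calloc(1, sizeof(*o));
    pthread_mutex_init(&o->lock, NULL);
    o->ref = 2;            /* main() and writer() each hold one reference */
    o->ctx = malloc(64);

    pthread_t t;
    pthread_create(&t, NULL, writer, o);
    pthread_join(&t, NULL);

    /* main still believes its reference is valid, but ctx is gone --
     * touching it here mirrors the crash in __inode_ctx_free(). */
    printf("ctx after premature destroy: %p\n", o->ctx);
    obj_unref(o);          /* count goes negative: it was already exhausted */
    return 0;
}

In the real crash, the analogous premature __inode_destroy() would leave frame #0 dereferencing freed or corrupt context slots, which is consistent with faulting inside __inode_ctx_free() while writes are still in flight.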