Bug 1466994

Summary: [Ganesha]: Ganesha crashed during IO (in get_state_owner_ref)
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: nfs-ganesha
Reporter: Ambarish <asoman>
Assignee: Kaleb KEITHLEY <kkeithle>
QA Contact: Ambarish <asoman>
Status: CLOSED WONTFIX
Severity: high
Priority: unspecified
Version: rhgs-3.3
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
CC: bturner, dang, ffilz, jthottan, kkeithle, mbenjamin, rcyriac, rhinduja, rhs-bugs, skoduri, storage-qa-internal
Last Closed: 2017-08-10 07:09:28 UTC
Type: Bug

Description Ambarish 2017-07-01 11:53:16 UTC
Description of problem:
-----------------------

2-node Ganesha cluster.

Four clients mounted the volume via NFSv4 and ran Bonnie, dbench, iozone, and a kernel untar.

Ganesha crashed on one of my nodes with the following BT:

(gdb) bt
#0  0x00007fe1298651f7 in raise () from /lib64/libc.so.6
#1  0x00007fe1298668e8 in abort () from /lib64/libc.so.6
#2  0x00005611cd514127 in get_state_owner_ref (state=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/include/sal_functions.h:775
#3  nfs4_op_close (op=0x7fe068002760, data=0x7fe0b7785180, resp=0x7fde4c05a9c0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_close.c:217
#4  0x00005611cd51297d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fde4c0439b0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_Compound.c:734
#5  0x00005611cd503b1c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fe0680008c0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1281
#6  0x00005611cd50518a in worker_run (ctx=0x5611ce53bad0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1548
#7  0x00005611cd58e889 in fridgethr_start_routine (arg=0x5611ce53bad0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/support/fridgethr.c:550
#8  0x00007fe12a25ae25 in start_thread () from /lib64/libpthread.so.0
#9  0x00007fe12992834d in clone () from /lib64/libc.so.6
(gdb) 



Version-Release number of selected component (if applicable):
------------------------------------------------------------

nfs-ganesha-2.4.4-10.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-32.el7rhgs.x86_64


How reproducible:
-----------------

Reporting the first occurrence.


Additional info:
---------------

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 6ade5657-45e2-43c7-8098-774417789a5e
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas008.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
client.event-threads: 4
server.event-threads: 4
cluster.lookup-optimize: on
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Comment 4 Daniel Gryniewicz 2017-07-05 13:14:14 UTC
So, this is an error returned from pthread_mutex_lock(). Most likely it means the owner has already been freed, but I can't tell without access to the core to look at it.

There have been some lifecycle changes to SAL objects in 2.5 that were not backported, and those may cause this. Can't tell for sure.

About the workload: did you run those workloads in sequence, or was one client running each type of workload? If you can characterize the workload better, I can attempt to reproduce.
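The lifecycle rule such changes enforce can be sketched like this (hypothetical helpers, not the actual SAL code): an owner may only be freed when its refcount drops to zero, and every caller must hold a reference before touching the object. A CLOSE that dereferences an owner pointer without holding a reference, racing against the final put, is exactly the window in which the mutex can be destroyed underneath it.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical refcounted owner illustrating the lifecycle rule:
 * free only on the last put, never use a pointer without a ref. */
struct owner {
    pthread_mutex_t lock;
    int refcount;
};

struct owner *owner_alloc(void)
{
    struct owner *o = malloc(sizeof(*o));

    pthread_mutex_init(&o->lock, NULL);
    o->refcount = 1;        /* caller starts with one reference */
    return o;
}

void owner_get(struct owner *o)
{
    pthread_mutex_lock(&o->lock);
    o->refcount++;
    pthread_mutex_unlock(&o->lock);
}

/* Returns 1 if this put released the final reference and freed o;
 * after that, any cached pointer to o is dangling and locking its
 * (destroyed) mutex can fail, as in the reported crash. */
int owner_put(struct owner *o)
{
    pthread_mutex_lock(&o->lock);
    int remaining = --o->refcount;
    pthread_mutex_unlock(&o->lock);

    if (remaining == 0) {
        pthread_mutex_destroy(&o->lock);
        free(o);
        return 1;
    }
    return 0;
}
```

With this discipline the crash window closes: the CLOSE path would take its own reference before touching the owner, so the final put (and the mutex destroy) cannot happen underneath it.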