Bug 1466994 - [Ganesha]: Ganesha crashed during IO (in get_state_owner_ref)
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: nfs-ganesha
x86_64 Linux
unspecified Severity high
: ---
: ---
Assigned To: Kaleb KEITHLEY
Depends On:
Reported: 2017-07-01 07:53 EDT by Ambarish
Modified: 2017-08-10 03:09 EDT (History)
11 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2017-08-10 03:09:28 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments: None
Description Ambarish 2017-07-01 07:53:16 EDT
Description of problem:

2-node Ganesha cluster.

4 clients mounted the volume via NFSv4 and ran Bonnie, dbench, iozone, and kernel untar.

Ganesha crashed on one of my nodes with the following backtrace:

(gdb) bt
#0  0x00007fe1298651f7 in raise () from /lib64/libc.so.6
#1  0x00007fe1298668e8 in abort () from /lib64/libc.so.6
#2  0x00005611cd514127 in get_state_owner_ref (state=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/include/sal_functions.h:775
#3  nfs4_op_close (op=0x7fe068002760, data=0x7fe0b7785180, resp=0x7fde4c05a9c0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_close.c:217
#4  0x00005611cd51297d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fde4c0439b0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_Compound.c:734
#5  0x00005611cd503b1c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fe0680008c0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1281
#6  0x00005611cd50518a in worker_run (ctx=0x5611ce53bad0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1548
#7  0x00005611cd58e889 in fridgethr_start_routine (arg=0x5611ce53bad0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/support/fridgethr.c:550
#8  0x00007fe12a25ae25 in start_thread () from /lib64/libpthread.so.0
#9  0x00007fe12992834d in clone () from /lib64/libc.so.6

Version-Release number of selected component (if applicable):


How reproducible:

Reporting the first occurrence.

Additional info:

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 6ade5657-45e2-43c7-8098-774417789a5e
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas008.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
client.event-threads: 4
server.event-threads: 4
cluster.lookup-optimize: on
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
Comment 4 Daniel Gryniewicz 2017-07-05 09:14:14 EDT
So, this is an error returned from pthread_mutex_lock(). Most likely it means the owner has already been freed, but I can't tell without access to the core to look at it.

There have been some lifecycle changes to SAL objects in 2.5 that were not backported and that may cause this. Can't tell for sure.

About the workload: did you run those workloads in sequence, or was one client running each type of workload? If you can characterize the workload better, I can attempt to reproduce.
