Description of problem:
-----------------------
Two-node Ganesha cluster.
Four clients mounted the volume via NFSv4 and ran Bonnie, dbench, iozone, and a kernel untar.
Ganesha crashed on one of the nodes with the following backtrace:
(gdb) bt
#0 0x00007fe1298651f7 in raise () from /lib64/libc.so.6
#1 0x00007fe1298668e8 in abort () from /lib64/libc.so.6
#2 0x00005611cd514127 in get_state_owner_ref (state=<optimized out>)
at /usr/src/debug/nfs-ganesha-2.4.4/src/include/sal_functions.h:775
#3 nfs4_op_close (op=0x7fe068002760, data=0x7fe0b7785180, resp=0x7fde4c05a9c0)
at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_close.c:217
#4 0x00005611cd51297d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fde4c0439b0)
at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_Compound.c:734
#5 0x00005611cd503b1c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fe0680008c0)
at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1281
#6 0x00005611cd50518a in worker_run (ctx=0x5611ce53bad0)
at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1548
#7 0x00005611cd58e889 in fridgethr_start_routine (arg=0x5611ce53bad0)
at /usr/src/debug/nfs-ganesha-2.4.4/src/support/fridgethr.c:550
#8 0x00007fe12a25ae25 in start_thread () from /lib64/libpthread.so.0
#9 0x00007fe12992834d in clone () from /lib64/libc.so.6
Version-Release number of selected component (if applicable):
------------------------------------------------------------
nfs-ganesha-2.4.4-10.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-32.el7rhgs.x86_64
How reproducible:
-----------------
This is the first reported occurrence.
Additional info:
---------------
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 6ade5657-45e2-43c7-8098-774417789a5e
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas008.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
client.event-threads: 4
server.event-threads: 4
cluster.lookup-optimize: on
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
Comment 4 - Daniel Gryniewicz
2017-07-05 13:14:14 UTC
So, this is an abort on an error returned from pthread_mutex_lock(). Most likely it means the owner has already been freed, but I can't tell without access to the core to confirm.
There have been some lifecycle changes to SAL objects in 2.5 that were not backported, and those may be related. I can't say for sure.
About the workload: did you run those workloads in sequence, or was each client running one type of workload? If you can characterize the workload more precisely, I can attempt to reproduce it.