Description of problem:
------------------------
4-node Ganesha cluster. A 2x2 volume mounted on 4 clients via NFSv3 and NFSv4.
*Workload* - iozone reads from 4 clients, dd from 2 clients, and a Linux kernel untar from 2 clients in 2 different sub-directories.
About half an hour into the workload, Ganesha crashed on one of the nodes and dumped core.
(gdb) bt
#0 0x00007fbfa6ef1e60 in MDCACHE ()
#1 0x00007fbfa1b46708 in _gf_ref_put (ref=ref@entry=0x7fbe700396e8) at refcount.c:47
#2 0x00007fbf8f0b2132 in dht_inode_ctx_get_mig_info (this=this@entry=0x7fbf8800ea20, inode=0x7fbf7f2f3bac,
src_subvol=src_subvol@entry=0x0, dst_subvol=dst_subvol@entry=0x7fbf7fffe090) at dht-helper.c:243
#3 0x00007fbf8f10be9e in dht_flush_cbk (frame=0x7fbf9c8a5970, cookie=<optimized out>, this=0x7fbf8800ea20,
op_ret=0, op_errno=117, xdata=0x0) at dht-inode-read.c:715
#4 0x00007fbf8f380225 in afr_flush_cbk (frame=0x7fbf9c8486d0, cookie=<optimized out>, this=<optimized out>,
op_ret=<optimized out>, op_errno=<optimized out>, xdata=<optimized out>) at afr-common.c:2961
#5 0x00007fbf8f5bfb26 in client3_3_flush_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>,
myframe=0x7fbf9c883464) at client-rpc-fops.c:921
#6 0x00007fbfa18a2680 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fbf8809b5b0, pollin=pollin@entry=0x7fbf7a68ce30)
at rpc-clnt.c:791
#7 0x00007fbfa18a295f in rpc_clnt_notify (trans=<optimized out>, mydata=0x7fbf8809b5e0, event=<optimized out>,
data=0x7fbf7a68ce30) at rpc-clnt.c:962
#8 0x00007fbfa189e883 in rpc_transport_notify (this=this@entry=0x7fbf880ab2e0,
event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fbf7a68ce30) at rpc-transport.c:537
#9 0x00007fbf94421eb4 in socket_event_poll_in (this=this@entry=0x7fbf880ab2e0) at socket.c:2267
#10 0x00007fbf94424365 in socket_event_handler (fd=<optimized out>, idx=5, data=0x7fbf880ab2e0, poll_in=1,
poll_out=0, poll_err=0) at socket.c:2397
#11 0x00007fbfa1b323d0 in event_dispatch_epoll_handler (event=0x7fbf7fffe540, event_pool=0x7fbfa8dbb030)
at event-epoll.c:571
#12 event_dispatch_epoll_worker (data=0x7fbf8805db10) at event-epoll.c:674
#13 0x00007fbfa5139dc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fbfa480873d in clone () from /lib64/libc.so.6
(gdb)
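For context, frame #1 is GlusterFS's generic refcount helper. Below is a minimal sketch of that pattern; the field names (cnt, release, data) follow the gdb print in comment 8, but the locking/atomicity details are simplified assumptions, not the actual refcount.c source:

/* Simplified sketch of the GlusterFS refcount pattern behind frame #1.
 * The real implementation lives in libglusterfs/src/refcount.c; only
 * the struct layout is taken from the core dump, the rest is an
 * illustrative assumption. */

typedef void (*gf_ref_release_t)(void *data);

typedef struct {
    unsigned int     cnt;     /* outstanding references */
    gf_ref_release_t release; /* invoked when cnt drops to 0 */
    void            *data;    /* the refcounted object */
} gf_ref_t;

/* Drop one reference; the last put invokes the release callback.
 * If a caller puts a reference it no longer owns, cnt is already 0
 * on entry: the counter underflows and release can be invoked again
 * through a stale pointer, which is what frame #0 above (a jump into
 * MDCACHE text) looks like. */
unsigned int
_gf_ref_put(gf_ref_t *ref)
{
    unsigned int cnt = --ref->cnt; /* real code uses atomic/locked ops */

    if (cnt == 0 && ref->release)
        ref->release(ref->data);

    return cnt;
}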
Version-Release number of selected component (if applicable):
-------------------------------------------------------------
nfs-ganesha-gluster-2.4.1-1.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64
How reproducible:
----------------
Reporting the first occurrence.
Steps to Reproduce:
-------------------
1. Mount a 2x2 volume via NFSv3 and NFSv4 on different clients.
2. Run iozone reads alongside a mix of write workloads (dd, iozone, untar, etc.).
Actual results:
---------------
Ganesha crashes and dumps core.
Expected results:
-----------------
No crashes.
Additional info:
----------------
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: aeab0f8a-1e34-4681-bdf4-5b1416e46f27
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
Comment 8 Susant Kumar Palai 2016-11-29 11:32:19 UTC
From the core:
(gdb) p *ref
$28 = {cnt = 0, release = 0x7fbfa6ef1e60 <MDCACHE>, data = 0x7fbfa6c7ca20 <mdcache_get_ref>}
The ref count for the miginfo object is already zero, so this looks like a double-unref event. Will debug further from the code to figure out the RCA.
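Note that the faulting address in frame #0 (0x00007fbfa6ef1e60, inside MDCACHE) is the same value as the release pointer printed above, consistent with _gf_ref_put dropping cnt past zero and calling through a stale or reused release callback. A hypothetical illustration of that double-unref race, reusing the simplified gf_ref_t sketch above (the function names and miginfo usage are assumptions, not the confirmed RCA):

/* Hypothetical double-unref scenario: two racing code paths each drop
 * the reference they believe they own.  After the first put, cnt is 0
 * and release() has already run, so the object may be freed or reused;
 * the second put underflows cnt and can re-invoke release through
 * whatever now occupies that memory. */

void
owner_a_done(gf_ref_t *miginfo_ref)
{
    _gf_ref_put(miginfo_ref); /* legitimate last put: 1 -> 0, release runs */
}

void
owner_b_done(gf_ref_t *miginfo_ref)
{
    /* races with owner_a_done and puts the same reference again */
    _gf_ref_put(miginfo_ref); /* cnt already 0: underflow + stale release */
}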