Description of problem:
------------------------
4-node Ganesha cluster. A 2x2 volume mounted on 4 clients via NFSv3 and NFSv4.

Workload: iozone reads from 4 clients, dd from 2 clients, and a Linux kernel untar from 2 clients, in 2 different sub-directories.

About half an hour into the workload, Ganesha crashed on one of the nodes and dumped core:

(gdb) bt
#0  0x00007fbfa6ef1e60 in MDCACHE ()
#1  0x00007fbfa1b46708 in _gf_ref_put (ref=ref@entry=0x7fbe700396e8) at refcount.c:47
#2  0x00007fbf8f0b2132 in dht_inode_ctx_get_mig_info (this=this@entry=0x7fbf8800ea20, inode=0x7fbf7f2f3bac, src_subvol=src_subvol@entry=0x0, dst_subvol=dst_subvol@entry=0x7fbf7fffe090) at dht-helper.c:243
#3  0x00007fbf8f10be9e in dht_flush_cbk (frame=0x7fbf9c8a5970, cookie=<optimized out>, this=0x7fbf8800ea20, op_ret=0, op_errno=117, xdata=0x0) at dht-inode-read.c:715
#4  0x00007fbf8f380225 in afr_flush_cbk (frame=0x7fbf9c8486d0, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, xdata=<optimized out>) at afr-common.c:2961
#5  0x00007fbf8f5bfb26 in client3_3_flush_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7fbf9c883464) at client-rpc-fops.c:921
#6  0x00007fbfa18a2680 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fbf8809b5b0, pollin=pollin@entry=0x7fbf7a68ce30) at rpc-clnt.c:791
#7  0x00007fbfa18a295f in rpc_clnt_notify (trans=<optimized out>, mydata=0x7fbf8809b5e0, event=<optimized out>, data=0x7fbf7a68ce30) at rpc-clnt.c:962
#8  0x00007fbfa189e883 in rpc_transport_notify (this=this@entry=0x7fbf880ab2e0, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fbf7a68ce30) at rpc-transport.c:537
#9  0x00007fbf94421eb4 in socket_event_poll_in (this=this@entry=0x7fbf880ab2e0) at socket.c:2267
#10 0x00007fbf94424365 in socket_event_handler (fd=<optimized out>, idx=5, data=0x7fbf880ab2e0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2397
#11 0x00007fbfa1b323d0 in event_dispatch_epoll_handler (event=0x7fbf7fffe540, event_pool=0x7fbfa8dbb030) at event-epoll.c:571
#12 event_dispatch_epoll_worker (data=0x7fbf8805db10) at event-epoll.c:674
#13 0x00007fbfa5139dc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fbfa480873d in clone () from /lib64/libc.so.6
(gdb)

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
nfs-ganesha-gluster-2.4.1-1.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64

How reproducible:
-----------------
Reporting the first occurrence.

Steps to Reproduce:
-------------------
1. Mount a 2x2 volume via v3 and v4 on different clients.
2. Run iozone reads alongside a mixed write workload (dd, iozone, untar, etc.).

Actual results:
---------------
Ganesha crashes and dumps core.

Expected results:
-----------------
No crashes.

Additional info:
----------------
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: aeab0f8a-1e34-4681-bdf4-5b1416e46f27
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
Putting needinfo on Susant & Du as well.
From the core:

(gdb) p *ref
$28 = {cnt = 0, release = 0x7fbfa6ef1e60 <MDCACHE>, data = 0x7fbfa6c7ca20 <mdcache_get_ref>}

The ref count for the miginfo object is already zero, so this looks like a double-unref. Will debug further from the code to figure out the RCA.
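For context, a minimal sketch of why a double unref would end in the crash seen in frame #0. The three-field layout mirrors what gdb printed for *ref above, but this is a simplified illustration and not the actual refcount.c (which, among other things, manipulates the count atomically); the miginfo_release helper and its wiring are hypothetical.

/* Simplified illustration of the gf_ref_t put path; NOT the verbatim
 * glusterfs source. Compile with: gcc -o refdemo refdemo.c */
#include <stdio.h>
#include <stdlib.h>

typedef void (*gf_ref_release_t)(void *data);

typedef struct gf_ref {
    int cnt;                  /* current reference count           */
    gf_ref_release_t release; /* destructor, runs when cnt hits 0  */
    void *data;               /* refcounted payload (miginfo here) */
} gf_ref_t;

/* Drop one reference; invoke the destructor when the count reaches 0. */
static void ref_put(gf_ref_t *ref)
{
    if (--ref->cnt == 0)
        ref->release(ref->data); /* typically frees data (and the ref with it) */
}

/* Hypothetical destructor standing in for the real miginfo release. */
static void miginfo_release(void *data)
{
    printf("releasing miginfo %p\n", data);
    free(data); /* the embedded gf_ref_t goes away with the payload */
}

int main(void)
{
    /* A miginfo-like object with its ref embedded and a count of 1. */
    gf_ref_t *ref = calloc(1, sizeof(*ref));
    ref->cnt = 1;
    ref->release = miginfo_release;
    ref->data = ref;

    ref_put(ref); /* legitimate unref: cnt 1 -> 0, object is freed */

    /* Double unref: ref now points at freed memory. The allocator may
     * already have reused it, so ref->release can hold an arbitrary
     * value, and the call through it jumps to a garbage address. */
    ref_put(ref); /* undefined behavior; crashes in practice */
    return 0;
}

That matches what the core shows: by the time the second put runs, the freed slot apparently holds Ganesha MDCACHE data (release points at MDCACHE, data at mdcache_get_ref), so the indirect call through release lands inside MDCACHE, which is exactly frame #0 of the backtrace.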