Bug 1398921

Summary: [Ganesha]: Ganesha crashes on reads and writes from heterogeneous clients (v3 and v4 mounts).
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Ambarish <asoman>
Component: nfs-ganesha
Assignee: Kaleb KEITHLEY <kkeithle>
Status: CLOSED CURRENTRELEASE
QA Contact: Ambarish <asoman>
Severity: high
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, asoman, bturner, dang, ffilz, jthottan, mbenjamin, nbalacha, rgowdapp, rhinduja, rhs-bugs, skoduri, spalai, storage-qa-internal
Hardware: x86_64
OS: Linux
Fixed In Version: rhgs-3.2.0
Last Closed: 2017-08-23 12:28:42 UTC
Type: Bug

Description Ambarish 2016-11-27 10:45:07 UTC
Description of problem:
------------------------

4-node Ganesha cluster; a 2x2 volume mounted on 4 clients, some via NFSv3 and some via NFSv4.

*Workload* - iozone reads from 4 clients, dd from 2 clients, and a Linux untar from 2 clients in 2 different sub-directories.

Almost half an hour into the workload, Ganesha crashed on one of the nodes and dumped core.

(gdb) bt
#0  0x00007fbfa6ef1e60 in MDCACHE ()
#1  0x00007fbfa1b46708 in _gf_ref_put (ref=ref@entry=0x7fbe700396e8) at refcount.c:47
#2  0x00007fbf8f0b2132 in dht_inode_ctx_get_mig_info (this=this@entry=0x7fbf8800ea20, inode=0x7fbf7f2f3bac, 
    src_subvol=src_subvol@entry=0x0, dst_subvol=dst_subvol@entry=0x7fbf7fffe090) at dht-helper.c:243
#3  0x00007fbf8f10be9e in dht_flush_cbk (frame=0x7fbf9c8a5970, cookie=<optimized out>, this=0x7fbf8800ea20, 
    op_ret=0, op_errno=117, xdata=0x0) at dht-inode-read.c:715
#4  0x00007fbf8f380225 in afr_flush_cbk (frame=0x7fbf9c8486d0, cookie=<optimized out>, this=<optimized out>, 
    op_ret=<optimized out>, op_errno=<optimized out>, xdata=<optimized out>) at afr-common.c:2961
#5  0x00007fbf8f5bfb26 in client3_3_flush_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, 
    myframe=0x7fbf9c883464) at client-rpc-fops.c:921
#6  0x00007fbfa18a2680 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fbf8809b5b0, pollin=pollin@entry=0x7fbf7a68ce30)
    at rpc-clnt.c:791
#7  0x00007fbfa18a295f in rpc_clnt_notify (trans=<optimized out>, mydata=0x7fbf8809b5e0, event=<optimized out>, 
    data=0x7fbf7a68ce30) at rpc-clnt.c:962
#8  0x00007fbfa189e883 in rpc_transport_notify (this=this@entry=0x7fbf880ab2e0, 
    event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fbf7a68ce30) at rpc-transport.c:537
#9  0x00007fbf94421eb4 in socket_event_poll_in (this=this@entry=0x7fbf880ab2e0) at socket.c:2267
#10 0x00007fbf94424365 in socket_event_handler (fd=<optimized out>, idx=5, data=0x7fbf880ab2e0, poll_in=1, 
    poll_out=0, poll_err=0) at socket.c:2397
#11 0x00007fbfa1b323d0 in event_dispatch_epoll_handler (event=0x7fbf7fffe540, event_pool=0x7fbfa8dbb030)
    at event-epoll.c:571
#12 event_dispatch_epoll_worker (data=0x7fbf8805db10) at event-epoll.c:674
#13 0x00007fbfa5139dc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fbfa480873d in clone () from /lib64/libc.so.6
(gdb) 
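
For reference, frame #1 is _gf_ref_put() in libglusterfs. A minimal sketch of how a gf_ref_t-style counter behaves (an illustration of the assumed shape, not the exact refcount.c source) shows why a second put on an already-released object ends up calling through freed memory, which is consistent with frame #0 landing in the unrelated MDCACHE mapping:

typedef void (*gf_ref_release_t) (void *data);

typedef struct {
        unsigned int      cnt;      /* guarded by a lock/atomics in the real code */
        gf_ref_release_t  release;  /* invoked once, when cnt drops to 0 */
        void             *data;
} gf_ref_t;

unsigned int
gf_ref_put (gf_ref_t *ref)
{
        unsigned int cnt = --ref->cnt;      /* atomic decrement in the real code */

        if (cnt == 0 && ref->release)
                ref->release (ref->data);   /* the object is torn down here, so a
                                             * second put on the same ref reads and
                                             * calls through freed memory */
        return cnt;
}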


Version-Release number of selected component (if applicable):
-------------------------------------------------------------

nfs-ganesha-gluster-2.4.1-1.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64


How reproducible:
----------------

Reporting the first occurrence.

Steps to Reproduce:
-------------------

1. Mount a 2x2 volume via NFSv3 on some clients and NFSv4 on others.

2. Run iozone reads alongside a mix of writes (dd, iozone, untar, etc.).

Actual results:
---------------

Ganesha crashes and dumps core.

Expected results:
-----------------

No crashes.

Additional info:
----------------

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: aeab0f8a-1e34-4681-bdf4-5b1416e46f27
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Comment 5 Atin Mukherjee 2016-11-28 15:32:17 UTC
Putting needinfo on Susant & Du as well.

Comment 8 Susant Kumar Palai 2016-11-29 11:32:19 UTC
From the core:
(gdb) p *ref
$28 = {cnt = 0, release = 0x7fbfa6ef1e60 <MDCACHE>, data = 0x7fbfa6c7ca20 <mdcache_get_ref>}


The ref count for the miginfo object is zero, so this looks like a double unref. Will debug further from the code to figure out the RCA.
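
To illustrate the suspected sequence (hypothetical call sites; only the failure pattern is the point, the real paths are the dht-helper.c / dht-inode-read.c frames in the backtrace):

/* hypothetical: get_miginfo_from_inode_ctx is a made-up helper */
gf_ref_t *miginfo = get_miginfo_from_inode_ctx (inode);

/* path A finishes with the migration info and drops its reference */
gf_ref_put (miginfo);   /* cnt 1 -> 0, release() frees the object */

/* path B, racing or holding a stale pointer, puts it again */
gf_ref_put (miginfo);   /* cnt is already 0: use-after-free; release
                         * may now point into reused memory */

The $28 output above is consistent with this pattern: cnt is 0 and the release pointer aliases the MDCACHE text region, i.e. the gf_ref_t memory was likely freed and reused before the final unref that crashed.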