Bug 1467016 - [Ganesha] : Ganesha crashes on writes (AFR),possible memory corruption
Status: CLOSED WONTFIX
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: nfs-ganesha
Version: 3.3
Hardware: x86_64    OS: Linux
Priority: unspecified    Severity: high
Target Milestone: ---
Target Release: ---
Assigned To: Kaleb KEITHLEY
QA Contact: Ambarish
Depends On:
Blocks:
 
Reported: 2017-07-01 13:32 EDT by Ambarish
Modified: 2017-08-10 03:09 EDT (History)
CC List: 11 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-10 03:09:43 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Ambarish 2017-07-01 13:32:12 EDT
Description of problem:
-----------------------

This is different from https://bugzilla.redhat.com/show_bug.cgi?id=1466988.

Not just because the BTs are different (one is EC, one is AFR), but also because I think https://bugzilla.redhat.com/show_bug.cgi?id=1466988 happened in the opendir path.

Please feel free to close this as a DUP if that's not the case.

Use case: 2-node cluster, 4 clients writing into their own subdirectories (using Bonnie, dbench, kernel untar).

Ganesha crashed on one of my nodes with the following BT:

(gdb) bt
#0  __inode_ctx_free (inode=inode@entry=0x7f2f440025c0) at inode.c:331
#1  0x00007f32108651d2 in __inode_destroy (inode=0x7f2f440025c0) at inode.c:353
#2  inode_table_prune (table=table@entry=0x7f31f806afc0) at inode.c:1543
#3  0x00007f32108654b4 in inode_unref (inode=0x7f2f440025c0) at inode.c:524
#4  0x00007f31ff90d4a5 in afr_local_cleanup (local=0x7f31f81449c0, this=<optimized out>) at afr-common.c:1790
#5  0x00007f31ff8ea0fc in __afr_txn_write_done (frame=<optimized out>, this=<optimized out>) at afr-transaction.c:198
#6  0x00007f31ff8ef0eb in afr_unlock_common_cbk (frame=frame@entry=0x7f31f82704e0, this=this@entry=0x7f31f8011ce0, xdata=0x0, op_errno=<optimized out>, op_ret=<optimized out>, cookie=<optimized out>)
    at afr-lk-common.c:633
#7  0x00007f31ff8ef9e4 in afr_unlock_entrylk_cbk (frame=0x7f31f82704e0, cookie=<optimized out>, this=0x7f31f8011ce0, op_ret=<optimized out>, op_errno=<optimized out>, xdata=<optimized out>)
    at afr-lk-common.c:829
#8  0x00007f31ffb526eb in client3_3_entrylk_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f31ec0f0760) at client-rpc-fops.c:1657
#9  0x00007f321061f840 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f31f80570e0, pollin=pollin@entry=0x7f31e8002440) at rpc-clnt.c:794
#10 0x00007f321061fb27 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f31f8057110, event=<optimized out>, data=0x7f31e8002440) at rpc-clnt.c:987
#11 0x00007f321061b9e3 in rpc_transport_notify (this=this@entry=0x7f31f8057280, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f31e8002440) at rpc-transport.c:538
#12 0x00007f32040953d6 in socket_event_poll_in (this=this@entry=0x7f31f8057280, notify_handled=<optimized out>) at socket.c:2306
#13 0x00007f320409797c in socket_event_handler (fd=10, idx=5, gen=1, data=0x7f31f8057280, poll_in=1, poll_out=0, poll_err=0) at socket.c:2458
#14 0x00007f32108b0776 in event_dispatch_epoll_handler (event=0x7f31fdaa7540, event_pool=0x5601b972c010) at event-epoll.c:572
#15 event_dispatch_epoll_worker (data=0x7f31f80524c0) at event-epoll.c:648
#16 0x00007f3213ebde25 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f321358b34d in clone () from /lib64/libc.so.6
(gdb) 
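
For context on where this is dying: frames #2-#4 show AFR dropping its references in afr_local_cleanup after the write transaction's entrylk unlock, which lets inode_table_prune destroy the inode, and __inode_ctx_free (frame #0) then walks the per-xlator inode-context slots so each translator can release whatever it stashed on the inode. The snippet below is a simplified C sketch of that teardown pattern for illustration only; the type and field names (inode_ctx_slot, owner, forget, etc.) are approximations, not the actual GlusterFS definitions in inode.c.

/*
 * Simplified sketch of the per-xlator inode-context teardown behind frame #0
 * (__inode_ctx_free at inode.c:331).  Type and field names are illustrative
 * approximations only, NOT the actual GlusterFS definitions.
 */
#include <stdlib.h>

typedef struct xlator xlator_t;
typedef struct inode  inode_t;

struct xlator {
    /* lets each translator drop whatever state it stashed on the inode */
    int (*forget)(xlator_t *xl, inode_t *inode);
};

struct inode_ctx_slot {
    xlator_t *owner;   /* translator that owns this slot */
    void     *value;   /* opaque per-translator state */
};

struct inode {
    struct inode_ctx_slot *ctx;   /* one slot per translator in the graph */
    int                    ctxcount;
};

/* If a translator scribbled past its slot, or 'owner' points at freed memory,
 * the dereferences below are where the process dies: the crash shows up in
 * cleanup even though the corruption happened earlier, somewhere else. */
static void inode_ctx_free_sketch(inode_t *inode)
{
    for (int i = 0; i < inode->ctxcount; i++) {
        struct inode_ctx_slot *slot = &inode->ctx[i];
        if (slot->owner && slot->owner->forget)
            slot->owner->forget(slot->owner, inode);
    }
    free(inode->ctx);
    inode->ctx = NULL;
}

static int noop_forget(xlator_t *xl, inode_t *inode)
{
    (void)xl;
    (void)inode;
    return 0;
}

int main(void)
{
    xlator_t afr = { .forget = noop_forget };
    inode_t  ino = { .ctx = calloc(2, sizeof(struct inode_ctx_slot)),
                     .ctxcount = 2 };

    if (!ino.ctx)
        return 1;

    ino.ctx[0].owner = &afr;     /* healthy slot: forget() gets invoked */
    /* ino.ctx[1] left zeroed: skipped, like an unused slot */

    inode_ctx_free_sketch(&ino); /* a corrupted 'owner' would crash here */
    return 0;
}

If that slot array (or a translator's stored context) was corrupted earlier, the crash surfaces in this cleanup loop even though the real damage happened elsewhere, which is consistent with this being memory corruption rather than a bug in the unwinding code itself.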




Version-Release number of selected component (if applicable):
--------------------------------------------------------------

nfs-ganesha-gluster-2.4.4-10.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-32.el7rhgs.x86_64


How reproducible:
-----------------

Reporting the first occurrence.


Additional info:
----------------

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 6ade5657-45e2-43c7-8098-774417789a5e
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas008.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
client.event-threads: 4
server.event-threads: 4
cluster.lookup-optimize: on
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
[root@gqas005 tmp]#
Comment 4 Daniel Gryniewicz 2017-07-05 09:03:06 EDT
So, this backtrace does not include any Ganesha code at all; it's entirely in Gluster code. That said, if it's memory corruption, it's likely the same issue as the rest.
