Bug 1394717 - [Tracker] : Ganesha crashes during I/O from multiple clients.
Summary: [Tracker] : Ganesha crashes during I/O from multiple clients.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: nfs-ganesha
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Soumya Koduri
QA Contact: Ambarish
URL:
Whiteboard:
Depends On: 1399138 1403665 1403666 1403670 1403698
Blocks: 1351528
 
Reported: 2016-11-14 10:03 UTC by Ambarish
Modified: 2017-03-28 06:55 UTC (History)
14 users

Fixed In Version: nfs-ganesha-2.4.1-4
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1403665 1403666 1403670 1403698
Environment:
Last Closed: 2017-03-23 06:25:09 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1394702 0 unspecified CLOSED [Perf] : EBADFD , Errors while acquiring/destroying mutex , Assertion failures and Attribute errors while I/O is run fro... 2023-09-14 03:34:20 UTC
Red Hat Product Errata RHEA-2017:0493 0 normal SHIPPED_LIVE Red Hat Gluster Storage 3.2.0 nfs-ganesha bug fix and enhancement update 2017-03-23 09:19:13 UTC

Internal Links: 1394702

Description Ambarish 2016-11-14 10:03:47 UTC
Description of problem:
-----------------------

*Setup* - 4-node Ganesha cluster, 2x2 volume, mounted via NFSv3 and NFSv4 on 8 clients.

*Workload Description* : I ran writes on all 8 clients, described below:

*FIO sequential writes* - Client1, Client2, Client3, Client4
*Smallfile creates*     - Client5, Client6, Client7, Client8
*dd (2500 1 MB files)*  - Client2, Client3, Client4, Client6, Client7, Client8

*Observation* : 

> Almost half an hour into the workload, the Ganesha process crashed on all the nodes, one by one.

> I/O hung on the clients, since Pacemaker quorum was not met.

> EBADFD and other error messages in the servers' ganesha logs (tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1394702).

Version-Release number of selected component (if applicable):
--------------------------------------------------------------

nfs-ganesha-gluster-2.4.1-1.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-4.el7rhgs.x86_64


How reproducible:
-----------------

Reporting the first occurrence.

Steps to Reproduce:
------------------

1. Prepare a 4-node Ganesha cluster, mount the volume via v3/v4 on 8 clients.
2. Run fio, smallfile creates, dd, and kernel untar from various clients.
3. Check for crashes and error messages in the logs.


Actual results:
---------------

Ganesha crashes; EBADFD and other error messages.

Expected results:
-----------------

No crashes/error messages

Additional info:
----------------

OS : RHEL 7.3

*Vol Config* :

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: e84889ee-7bed-426f-b187-2b15fb244175
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Comment 3 Ambarish 2016-11-14 10:09:30 UTC
BT from gqas011 :

(gdb) 
#0  0x00007faf743e4172 in list_del_init (old=0x7fade80dcfe0) at ../../../../libglusterfs/src/list.h:88
#1  __wb_request_unref (req=req@entry=0x7fade80dcfd0) at write-behind.c:354
#2  0x00007faf743e762f in __wb_fulfill_request (req=req@entry=0x7fade80dcfd0) at write-behind.c:672
#3  0x00007faf743e7938 in wb_head_done (head=0x7fade80dcfd0) at write-behind.c:826
#4  0x00007faf743e90df in wb_fulfill_cbk (frame=frame@entry=0x7faf7dc970ac, cookie=<optimized out>, this=<optimized out>, 
    op_ret=op_ret@entry=1, op_errno=op_errno@entry=0, prebuf=prebuf@entry=0x7faf65bf8288, postbuf=postbuf@entry=0x7faf65bf82f8, 
    xdata=xdata@entry=0x7faf7d3c2774) at write-behind.c:995
#5  0x00007faf7464b980 in dht_writev_cbk (frame=0x7faf7dc1e7f8, cookie=<optimized out>, this=<optimized out>, op_ret=1, op_errno=0, 
    prebuf=0x7faf65bf8288, postbuf=0x7faf65bf82f8, xdata=0x7faf7d3c2774) at dht-inode-write.c:106
#6  0x00007faf74897d15 in afr_writev_unwind (frame=0x7faf7dc8571c, this=<optimized out>) at afr-inode-write.c:247
#7  0x00007faf748981a9 in afr_writev_wind_cbk (frame=0x7faf7dc4720c, cookie=<optimized out>, this=0x7faf6800c550, 
    op_ret=<optimized out>, op_errno=<optimized out>, prebuf=<optimized out>, postbuf=0x7faf658f0070, xdata=0x7faf7d400724)
    at afr-inode-write.c:395
#8  0x00007faf74b086c2 in client3_3_writev_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, 
    myframe=0x7faf7dbedb0c) at client-rpc-fops.c:860
#9  0x00007faf82afe680 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7faf680c42d0, pollin=pollin@entry=0x7faf59b8eab0) at rpc-clnt.c:791
#10 0x00007faf82afe93f in rpc_clnt_notify (trans=<optimized out>, mydata=0x7faf680c4300, event=<optimized out>, data=0x7faf59b8eab0)
    at rpc-clnt.c:963
#11 0x00007faf82afa883 in rpc_transport_notify (this=this@entry=0x7faf680d4000, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, 
    data=data@entry=0x7faf59b8eab0) at rpc-transport.c:537
#12 0x00007faf74fbceb4 in socket_event_poll_in (this=this@entry=0x7faf680d4000) at socket.c:2267
#13 0x00007faf74fbf365 in socket_event_handler (fd=<optimized out>, idx=4, data=0x7faf680d4000, poll_in=1, poll_out=0, poll_err=0)
    at socket.c:2397
#14 0x00007faf82d8e3d0 in event_dispatch_epoll_handler (event=0x7faf658f0540, event_pool=0x7faf84086710) at event-epoll.c:571
#15 event_dispatch_epoll_worker (data=0x7faf6805db20) at event-epoll.c:674
#16 0x00007fb012a96dc5 in start_thread () from /lib64/libpthread.so.0
#17 0x00007fb01216573d in clone () from /lib64/libc.so.6
(gdb)
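
For context, frame #0 is libglusterfs's generic doubly-linked-list helper. A rough sketch of its usual (kernel-style) form, not copied from the exact source revision in the backtrace, shows why unlinking a request whose list node has already been freed or overwritten faults on a write:

/* Rough sketch of the intrusive list helper behind frame #0; illustrative,
 * not the exact libglusterfs/src/list.h code. */
#include <stddef.h>

struct list_head {
        struct list_head *next;
        struct list_head *prev;
};

static inline void
list_del_init(struct list_head *old)
{
        old->prev->next = old->next;    /* writes through old->prev ... */
        old->next->prev = old->prev;    /* ... and through old->next   */
        old->next = old;                /* leave the node as an empty list */
        old->prev = old;
}

int
main(void)
{
        /* If the wb_request's list node has been freed or clobbered by the
         * time __wb_request_unref() unlinks it, old->prev/old->next point
         * into garbage and the first two stores above are invalid writes,
         * matching the "error 6" (write fault) segfault later reported for
         * write-behind.so. */
        struct list_head node = { &node, &node };
        list_del_init(&node);           /* safe only on a valid, linked node */
        return 0;
}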

Comment 4 Soumya Koduri 2016-11-14 11:34:29 UTC
Bug 1394702 and this bug could be related.

Out of the 4 cores (one observed on each node of the 4-node cluster), 3 of them look like they are due to a resource crunch (PTHREAD_MUTEX_LOCK/UNLOCK failed).

[root@gqas013 ~]# tailf /var/log/ganesha.log
13/11/2016 13:16:51 : epoch 16f70000 : gqas013.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-31855[work-188] nfs_dupreq_finish :RW LOCK :CRIT :Error 22, acquiring mutex 0x7f67c40d0980 (&dv->mtx) at /builddir/build/BUILD/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:1085


[root@gqas006 ~]# tailf /var/log/ganesha.log
13/11/2016 13:55:07 : epoch 0e630000 : gqas006.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-31410[work-200] nfs_dupreq_free_dupreq :RW LOCK :CRIT :Error 16, Destroy mutex 0x7f5bf80417b0 (&dv->mtx) at /builddir/build/BUILD/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:786
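
For reference, error 16 is EBUSY and error 22 is EINVAL. A minimal standalone illustration (not ganesha code) of how pthread_mutex_destroy() reports EBUSY while the mutex (here the duplicate-request entry's dv->mtx) is still locked or in use; EINVAL on an acquire generally means the mutex object was not a valid, initialised mutex at that point, e.g. already destroyed or its memory overwritten:

/* Illustrative only; shows where errno values 16 (EBUSY) and 22 (EINVAL)
 * in the ganesha.log entries above can come from. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
        pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

        pthread_mutex_lock(&m);
        int ret = pthread_mutex_destroy(&m);             /* mutex still held */
        printf("destroy while locked: %d (%s)\n", ret, strerror(ret)); /* 16 = EBUSY */

        pthread_mutex_unlock(&m);
        ret = pthread_mutex_destroy(&m);                 /* now legal */
        printf("destroy after unlock: %d\n", ret);       /* 0 */
        return 0;
}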


Also, on these 4 nodes I can see the below messages just before the crashes:


Nov 13 13:13:14 gqas013 kernel: perf: interrupt took too long (2515 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
Nov 13 13:16:29 gqas013 kernel: perf: interrupt took too long (3158 > 3143), lowering kernel.perf_event_max_sample_rate to 63000
Nov 13 13:16:56 gqas013 systemd: nfs-ganesha.service: main process exited, code=killed, status=6/ABRT
Nov 13 13:16:56 gqas013 systemd: Unit nfs-ganesha.service entered failed state.
Nov 13 13:16:56 gqas013 systemd: nfs-ganesha.service failed.
Nov 13 13:17:06 gqas013 crmd[32608]:  notice: Result of notif


Nov 13 13:48:38 gqas006 kernel: perf: interrupt took too long (4911 > 4907), lowering kernel.perf_event_max_sample_rate to 40000
Nov 13 13:55:14 gqas006 systemd: nfs-ganesha.service: main process exited, code=killed, status=6/ABRT
Nov 13 13:55:14 gqas006 systemd: Unit nfs-ganesha.service entered failed state.
Nov 13 13:55:14 gqas006 systemd: nfs-ganesha.service failed.
Nov 13 13:55:21 gqas006 crmd[31877]:  notice: Result of notify op


Nov 13 13:21:18 gqas011 kernel: perf: interrupt took too long (3937 > 3935), lowering kernel.perf_event_max_sample_rate to 50000
Nov 13 13:38:56 gqas011 kernel: perf: interrupt took too long (4929 > 4921), lowering kernel.perf_event_max_sample_rate to 40000
Nov 13 13:40:42 gqas011 kernel: ganesha.nfsd[32418]: segfault at 7fade80dd7 ip 00007faf743e4172 sp 00007faf658efd70 error 6 in write-behind.so[7faf743e2000+e000]
Nov 13 13:40:55 gqas011 systemd: nfs-ganesha.service: main process exited, code=killed, status=11/SEGV
Nov 13 13:40:57 gqas011 systemd: Unit nfs-ganesha.service entered failed state.
Nov 13 13:40:57 gqas011 systemd: nfs-ganesha.service failed.

The below messages seem to get triggered under high(er) system load or on a CPU that is scaling:
 >> perf: interrupt took too long (3937 > 3935), lowering kernel.perf_event_max_sample_rate to 50000

Considering the load/tests that were being run, I think we can ignore the mutex lock failures for now. Please report if these errors are seen even with limited load.

So the issue which remains is the crash on the 4th node -


(gdb) bt
#0  0x00007faf743e4172 in list_del_init (old=0x7fade80dcfe0) at ../../../../libglusterfs/src/list.h:88
#1  __wb_request_unref (req=req@entry=0x7fade80dcfd0) at write-behind.c:354
#2  0x00007faf743e762f in __wb_fulfill_request (req=req@entry=0x7fade80dcfd0) at write-behind.c:672
#3  0x00007faf743e7938 in wb_head_done (head=0x7fade80dcfd0) at write-behind.c:826
#4  0x00007faf743e90df in wb_fulfill_cbk (frame=frame@entry=0x7faf7dc970ac, cookie=<optimized out>, 
    this=<optimized out>, op_ret=op_ret@entry=1, op_errno=op_errno@entry=0, prebuf=prebuf@entry=0x7faf65bf8288, 
    postbuf=postbuf@entry=0x7faf65bf82f8, xdata=xdata@entry=0x7faf7d3c2774) at write-behind.c:995
#5  0x00007faf7464b980 in dht_writev_cbk (frame=0x7faf7dc1e7f8, cookie=<optimized out>, this=<optimized out>, 
    op_ret=1, op_errno=0, prebuf=0x7faf65bf8288, postbuf=0x7faf65bf82f8, xdata=0x7faf7d3c2774)
    at dht-inode-write.c:106
#6  0x00007faf74897d15 in afr_writev_unwind (frame=0x7faf7dc8571c, this=<optimized out>) at afr-inode-write.c:247
#7  0x00007faf748981a9 in afr_writev_wind_cbk (frame=0x7faf7dc4720c, cookie=<optimized out>, this=0x7faf6800c550, 
    op_ret=<optimized out>, op_errno=<optimized out>, prebuf=<optimized out>, postbuf=0x7faf658f0070, 
    xdata=0x7faf7d400724) at afr-inode-write.c:395
#8  0x00007faf74b086c2 in client3_3_writev_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, 
    myframe=0x7faf7dbedb0c) at client-rpc-fops.c:860
#9  0x00007faf82afe680 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7faf680c42d0, pollin=pollin@entry=0x7faf59b8eab0)
    at rpc-clnt.c:791
#10 0x00007faf82afe93f in rpc_clnt_notify (trans=<optimized out>, mydata=0x7faf680c4300, event=<optimized out>, 
    data=0x7faf59b8eab0) at rpc-clnt.c:963
#11 0x00007faf82afa883 in rpc_transport_notify (this=this@entry=0x7faf680d4000, 
    event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7faf59b8eab0) at rpc-transport.c:537
#12 0x00007faf74fbceb4 in socket_event_poll_in (this=this@entry=0x7faf680d4000) at socket.c:2267
#13 0x00007faf74fbf365 in socket_event_handler (fd=<optimized out>, idx=4, data=0x7faf680d4000, poll_in=1, 
    poll_out=0, poll_err=0) at socket.c:2397
#14 0x00007faf82d8e3d0 in event_dispatch_epoll_handler (event=0x7faf658f0540, event_pool=0x7faf84086710)
    at event-epoll.c:571
#15 event_dispatch_epoll_worker (data=0x7faf6805db20) at event-epoll.c:674
#16 0x00007fb012a96dc5 in start_thread () from /lib64/libpthread.so.0
#17 0x00007fb01216573d in clone () from /lib64/libc.so.6
(gdb) 


There seems to be memory corruption. From gfapi logs -


[2016-11-13 18:40:42.178843] E [mem-pool.c:315:__gf_free] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x2b06c) [0x7faf748a406c] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d648) [0x7faf748c6648] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2016-11-13 18:40:42.196639] E [mem-pool.c:315:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x50) [0x7faf82d2cf00] -->/lib64/libglusterfs.so.0(data_destroy+0x5d) [0x7faf82d2c51d] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
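
The failed assertion is GlusterFS's allocation-trailer check: the allocator plants a magic word just past the caller-visible region and __gf_free() verifies it, so these messages mean something wrote past the end of an allocation before it was freed. A rough sketch of the idea (constants and layout here are illustrative, not the actual mem-pool.c structures):

/* Trailer-canary sketch; not GlusterFS code. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TRAILER_MAGIC 0xBAADF00Du

struct header {
        size_t size;                     /* size of the user region */
};

static void *
xmalloc(size_t size)
{
        struct header *h = malloc(sizeof(*h) + size + sizeof(uint32_t));
        uint32_t magic = TRAILER_MAGIC;

        h->size = size;
        /* plant the canary just past the user region */
        memcpy((char *)(h + 1) + size, &magic, sizeof(magic));
        return h + 1;
}

static void
xfree(void *free_ptr)
{
        struct header *h = (struct header *)free_ptr - 1;
        uint32_t magic;

        memcpy(&magic, (char *)free_ptr + h->size, sizeof(magic));
        /* same shape as the failing check:
         * GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) */
        assert(magic == TRAILER_MAGIC);
        free(h);
}

int
main(void)
{
        char *buf = xmalloc(8);

        memset(buf, 0, 9);               /* 1-byte overflow clobbers the canary ... */
        xfree(buf);                      /* ... so the assertion fires here */
        return 0;
}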


Not sure if the issue is with replicate or write-behind at this point. Requesting AFR team (Pranith, Ravishankar) to comment on the same.

Comment 5 Ravishankar N 2016-11-15 13:34:54 UTC
Some observations from the log and the core:

1. All AFR-related memory corruption messages in the gfapi log occur in the afr_local_cleanup() path during STACK_DESTROY. The first occurrence of the error messages is at [2016-11-13 18:25:17.207978]:

[2016-11-13 18:25:17.207978] E [mem-pool.c:315:__gf_free] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x2b06c) [0x7faf748a406c] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d654) [0x7faf748c6654] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)

2. I also see similar messages from libglusterfs code path:
[2016-11-13 18:25:17.407671] E [mem-pool.c:315:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x50) [0x7faf82d2cf00] -->/lib64/libglusterfs.so.0(data_destroy+0x5d) [0x7faf82d2c51d] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)


3. The core file seems to indicate corruption in write-behind.

4. The timing of the generation of the core file is around 18:40 UTC (13:40 EST), which is later than the first corruption messages (18:25 UTC):

[root@gqas011 tmp]# ll core.dump.PID=31407UID=0
-rwx------ 1 root root 3663110324 Nov 13 13:40 core.dump.PID=31407UID=0

I'm not sure at this point if AFR has caused the corruption. It would be good to see if we can get a consistent reproducer. Running the tests with an address sanitizer would also help. Let me know if I can help with asan.
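
As a reference for the ASan suggestion: a tiny illustrative program (unrelated to the gluster sources) showing what an address-sanitizer build adds over the trailer check above. Built with something like "gcc -g -fsanitize=address", the out-of-bounds write is reported at the faulting line with a stack trace, instead of surfacing later as list or heap corruption:

/* Illustrative only. */
#include <stdlib.h>

int
main(void)
{
        char *p = malloc(8);

        p[8] = 'x';     /* heap-buffer-overflow: ASan aborts here with a report */
        free(p);
        return 0;
}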

Comment 7 Ambarish 2016-11-28 06:57:40 UTC
Soumya/Ravi,

Got a reproducer on the first try.
Ganesha crashed on 3 nodes, with different BTs:


*******
NODE 1
*******

(gdb) bt
#0  glusterfs_open_my_fd (objhandle=objhandle@entry=0x7fe3ac078f00, openflags=openflags@entry=2, posix_flags=1, 
    my_fd=my_fd@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:1029
#1  0x00007fe5c4548017 in glusterfs_open_func (obj_hdl=0x7fe3ac078f38, openflags=2, fd=0x0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:1126
#2  0x00007fe650c9c517 in fsal_reopen_obj (obj_hdl=obj_hdl@entry=0x7fe3ac078f38, 
    check_share=check_share@entry=false, bypass=bypass@entry=false, openflags=openflags@entry=2, 
    my_fd=my_fd@entry=0x7fe3ac078f28, share=share@entry=0x7fe3ac0791a8, 
    open_func=open_func@entry=0x7fe5c4547fd0 <glusterfs_open_func>, 
    close_func=close_func@entry=0x7fe5c4548100 <glusterfs_close_func>, out_fd=out_fd@entry=0x7fe64104bdc0, 
    has_lock=has_lock@entry=0x7fe64104bdbe, closefd=closefd@entry=0x7fe64104bdbf)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/commonlib.c:2513
#3  0x00007fe5c45483dd in glusterfs_commit2 (obj_hdl=0x7fe3ac078f38, offset=<optimized out>, len=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:1952
#4  0x00007fe650d72956 in mdcache_commit2 (obj_hdl=0x7fe3ac071528, offset=<optimized out>, len=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:963
#5  0x00007fe650ca819f in fsal_commit (obj=obj@entry=0x7fe3ac071528, offset=0, len=0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/fsal_helper.c:1951
#6  0x00007fe650cf783b in nfs3_commit (arg=0x7fe580278648, req=<optimized out>, res=0x7fe538002420)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs3_commit.c:95
#7  0x00007fe650cbf12c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fe580278460)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1281
#8  0x00007fe650cc078a in worker_run (ctx=0x7fe652022ce0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#9  0x00007fe650d4a189 in fridgethr_start_routine (arg=0x7fe652022ce0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#10 0x00007fe64f22adc5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fe64e8f973d in clone () from /lib64/libc.so.6
(gdb) 
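
The NODE 1 frames show fd=0x0 and my_fd=0x0 being handed down into the FSAL_GLUSTER open helper. A hypothetical illustration (names made up, not the actual handle.c code) of how writing the opened descriptor through a NULL out-parameter would produce a crash at this point in the open path:

/* Hypothetical; only illustrates the NULL out-parameter pattern. */
#include <stddef.h>

struct my_fd {
        int fd;
};

static int
open_my_fd(int openflags, struct my_fd *my_fd /* out */)
{
        /* ... open the file ... */
        my_fd->fd = 42;                  /* faults if the caller passed NULL */
        return 0;
}

int
main(void)
{
        return open_my_fd(2, NULL);      /* mirrors my_fd=0x0 in frame #0 */
}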


*******
NODE 2
*******

(gdb) bt
#0  0x00007f5fdfd5a372 in fd_destroy (bound=_gf_true, fd=0x7f5cf40b748c) at fd.c:515
#1  fd_unref (fd=0x7f5cf40b748c) at fd.c:568
#2  0x00007f5fd19a4d78 in client_local_wipe (local=local@entry=0x7f5fc40949a8) at client-helpers.c:131
#3  0x00007f5fd19ae5ae in client3_3_finodelk_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, 
    myframe=0x7f5fdaad1f0c) at client-rpc-fops.c:1606
#4  0x00007f5fdfb00680 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f5fc409b5d0, pollin=pollin@entry=0x7f5fc588bd90)
    at rpc-clnt.c:791
#5  0x00007f5fdfb0095f in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f5fc409b600, event=<optimized out>, 
    data=0x7f5fc588bd90) at rpc-clnt.c:962
#6  0x00007f5fdfafc883 in rpc_transport_notify (this=this@entry=0x7f5fc40ab300, 
    event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f5fc588bd90) at rpc-transport.c:537
#7  0x00007f5fd1e5eeb4 in socket_event_poll_in (this=this@entry=0x7f5fc40ab300) at socket.c:2267
#8  0x00007f5fd1e61365 in socket_event_handler (fd=<optimized out>, idx=5, data=0x7f5fc40ab300, poll_in=1, 
    poll_out=0, poll_err=0) at socket.c:2397
#9  0x00007f5fdfd903d0 in event_dispatch_epoll_handler (event=0x7f5fd286f540, event_pool=0x7f5fe0086710)
    at event-epoll.c:571
#10 event_dispatch_epoll_worker (data=0x7f5fcc000920) at event-epoll.c:674
#11 0x00007f606f932dc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f606f00173d in clone () from /lib64/libc.so.6
(gdb) 




*******
NODE 3
*******

(gdb) bt
#0  0x00007f14578cbf59 in _int_malloc () from /lib64/libc.so.6
#1  0x00007f14578cea14 in calloc () from /lib64/libc.so.6
#2  0x00007f13cc6b85e8 in __gf_calloc (nmemb=nmemb@entry=1, size=<optimized out>, type=type@entry=155, 
    typestr=typestr@entry=0x7f13ba0bffeb "gf_afr_mt_char") at mem-pool.c:117
#3  0x00007f13ba0b3083 in afr_inodelk_init (lk=lk@entry=0x7f13ab2094d8, dom=<optimized out>, 
    child_count=<optimized out>) at afr-common.c:4718
#4  0x00007f13ba0b3273 in afr_transaction_local_init (local=local@entry=0x7f13ab208e80, 
    this=this@entry=0x7f13ac00c550) at afr-common.c:4742
#5  0x00007f13ba08b8d4 in afr_transaction (frame=frame@entry=0x7f13c33ed77c, this=this@entry=0x7f13ac00c550, 
    type=type@entry=AFR_DATA_TRANSACTION) at afr-transaction.c:2574
#6  0x00007f13ba07b4cb in afr_do_writev (frame=frame@entry=0x7f13c33d1e18, this=this@entry=0x7f13ac00c550)
    at afr-inode-write.c:489
#7  0x00007f13ba07be44 in afr_writev (frame=0x7f13c33d1e18, this=0x7f13ac00c550, fd=0x7f13a846f1dc, 
    vector=0x7f109e1fddb0, count=1, offset=3656175615, flags=0, iobref=0x7f11ec08a720, xdata=0x0)
    at afr-inode-write.c:559
#8  0x00007f13b9e30639 in dht_writev (frame=<optimized out>, this=<optimized out>, fd=0x7f13a846f1dc, 
    vector=0x7f109e1fddb0, count=1, off=3656175615, flags=0, iobref=0x7f11ec08a720, xdata=0x0)
    at dht-inode-write.c:192
#9  0x00007f13b9bcb0d0 in wb_fulfill_head (wb_inode=wb_inode@entry=0x7f13a0cf4050, head=0x7f11f0100600)
    at write-behind.c:1049
#10 0x00007f13b9bcb2cb in wb_fulfill (wb_inode=wb_inode@entry=0x7f13a0cf4050, 
    liabilities=liabilities@entry=0x7f109e1fdf10) at write-behind.c:1130
#11 0x00007f13b9bcc046 in wb_process_queue (wb_inode=wb_inode@entry=0x7f13a0cf4050) at write-behind.c:1550
#12 0x00007f13b9bcc734 in wb_writev (frame=0x7f13c347650c, this=<optimized out>, fd=<optimized out>, 
    vector=0x7f121c0444e0, count=1, offset=3656167423, flags=0, iobref=0x7f10ec02d4d0, xdata=0x0)
    at write-behind.c:1657
#13 0x00007f13b99bceeb in ra_writev (frame=0x7f13c33f166c, this=0x7f13ac011970, fd=0x7f13a846f1dc, 
    vector=0x7f121c0444e0, count=1, offset=3656167423, flags=0, iobref=0x7f10ec02d4d0, xdata=0x0)
    at read-ahead.c:684
#14 0x00007f13cc70bcf5 in default_writev (frame=0x7f13c33f166c, this=0x7f13ac012dc0, fd=0x7f13a846f1dc, 
    vector=0x7f121c0444e0, count=1, off=3656167423, flags=0, iobref=0x7f10ec02d4d0, xdata=0x0) at defaults.c:2543
---Type <return> to continue, or q <return> to quit---
#15 0x00007f13b95a2764 in ioc_writev (frame=0x7f13c33cf308, this=0x7f13ac014320, fd=0x7f13a846f1dc, 
    vector=0x7f121c0444e0, count=1, offset=3656167423, flags=0, iobref=0x7f10ec02d4d0, xdata=0x0) at io-cache.c:1263
#16 0x00007f13b9397189 in qr_writev (frame=0x7f13c33f00e4, this=0x7f13ac015a40, fd=0x7f13a846f1dc, 
    iov=0x7f121c0444e0, count=1, offset=3656167423, flags=0, iobref=0x7f10ec02d4d0, xdata=0x0) at quick-read.c:636
#17 0x00007f13cc7257ec in default_writev_resume (frame=0x7f13c3417b3c, this=0x7f13ac016f00, fd=0x7f13a846f1dc, 
    vector=0x7f121c0444e0, count=1, off=3656167423, flags=0, iobref=0x7f10ec02d4d0, xdata=0x0) at defaults.c:1849
#18 0x00007f13cc6b50d8 in call_resume_wind (stub=0x7f13c2e29a94) at call-stub.c:2045
#19 0x00007f13cc6b556d in call_resume (stub=0x7f13c2e29a94) at call-stub.c:2508
#20 0x00007f13b918d2e8 in open_and_resume (this=this@entry=0x7f13ac016f00, fd=fd@entry=0x7f13a846f1dc, 
    stub=0x7f13c2e29a94) at open-behind.c:245
#21 0x00007f13b918d350 in ob_writev (frame=0x7f13c3417b3c, this=0x7f13ac016f00, fd=0x7f13a846f1dc, 
    iov=<optimized out>, count=<optimized out>, offset=<optimized out>, flags=0, iobref=0x7f10ec02d4d0, xdata=0x0)
    at open-behind.c:423
#22 0x00007f13cc7257ec in default_writev_resume (frame=0x7f13c345a28c, this=0x7f13ac0183c0, fd=0x7f13a846f1dc, 
    vector=0x7f10ec05a910, count=1, off=3656167423, flags=0, iobref=0x7f10ec02d4d0, xdata=0x0) at defaults.c:1849
#23 0x00007f13cc6b50d8 in call_resume_wind (stub=0x7f13c2e52de4) at call-stub.c:2045
#24 0x00007f13cc6b556d in call_resume (stub=0x7f13c2e52de4) at call-stub.c:2508
#25 0x00007f13b8f83857 in iot_worker (data=0x7f13ac027840) at io-threads.c:220
#26 0x00007f1458276dc5 in start_thread () from /lib64/libpthread.so.0
#27 0x00007f145794573d in clone () from /lib64/libc.so.6
(gdb) 


Setup kept in the same state for dev to take a look.

Comment 23 Ambarish 2017-01-20 07:55:05 UTC
The reported issue was not reproducible on Ganesha 2.4.1-6, Gluster 3.8.4-12 in two tries.

Will reopen if hit again during regressions.

Comment 25 errata-xmlrpc 2017-03-23 06:25:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2017-0493.html

