Bug 1434332

Summary: crash from write-behind
Product: [Community] GlusterFS
Reporter: Ravishankar N <ravishankar>
Component: write-behind
Assignee: Raghavendra G <rgowdapp>
Status: CLOSED DUPLICATE
Severity: high
Priority: high
Version: mainline
CC: bugs, jahernan, rgowdapp
Keywords: Triaged
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2019-11-20 10:32:02 UTC

Description Ravishankar N 2017-03-21 10:00:47 UTC
Description of problem:

Core file created while running tests/basic/afr-heal-quota.t

Version-Release number of selected component (if applicable):
master

How reproducible:
Only once. Not able to recreate.

127:root@vm4 regression$ gdb -ex 'set sysroot ./' -ex 'core-file build/install/cores/glusterfs-7968.core' ./build/install/sbin/glusterfsd

(gdb) bt
#0  0x00007fa6bebe9470 in pthread_mutex_lock () from ./lib64/libpthread.so.0
#1  0x00007fa6b1688b56 in wb_add_head_for_retry (head=0x7fa69c083700) at /home/jenkins/root/workspace/centos6-regression/xlators/performance/write-behind/src/write-behind.c:814
#2  0x00007fa6b16891fd in wb_fulfill_short_write (head=0x7fa69c083700, size=0) at /home/jenkins/root/workspace/centos6-regression/xlators/performance/write-behind/src/write-behind.c:1031
#3  0x00007fa6b1689296 in wb_fulfill_cbk (frame=0x7fa6a4014a30, cookie=0x7fa6a4005ea0, this=0x7fa6ac00e960, op_ret=106496, op_errno=0, prebuf=0x7fa6a401c3e0, postbuf=0x7fa6a401c450, xdata=0x7fa6a400f760)
    at /home/jenkins/root/workspace/centos6-regression/xlators/performance/write-behind/src/write-behind.c:1052
#4  0x00007fa6b1914100 in dht_writev_cbk (frame=0x7fa6a4005ea0, cookie=0x7fa6ac00b920, this=0x7fa6ac00d2e0, op_ret=106496, op_errno=0, prebuf=0x7fa6a401c3e0, postbuf=0x7fa6a401c450, xdata=0x7fa6a400f760)
    at /home/jenkins/root/workspace/centos6-regression/xlators/cluster/dht/src/dht-inode-write.c:106
#5  0x00007fa6b1b760d5 in afr_writev_unwind (frame=0x7fa6a400b410, this=0x7fa6ac00b920) at /home/jenkins/root/workspace/centos6-regression/xlators/cluster/afr/src/afr-inode-write.c:247
#6  0x00007fa6b1b765af in afr_writev_wind_cbk (frame=0x7fa6a4005260, cookie=0x1, this=0x7fa6ac00b920, op_ret=106496, op_errno=0, prebuf=0x7fa6ab6fe860, postbuf=0x7fa6ab6fe7f0, xdata=0x7fa6a400f760)
    at /home/jenkins/root/workspace/centos6-regression/xlators/cluster/afr/src/afr-inode-write.c:394
#7  0x00007fa6b1e14a33 in client3_3_writev_cbk (req=0x7fa6a4001820, iov=0x7fa6a4001860, count=1, myframe=0x7fa6a400bda0) at /home/jenkins/root/workspace/centos6-regression/xlators/protocol/client/src/client-rpc-fops.c:858
#8  0x00007fa6bf6d470e in rpc_clnt_handle_reply (clnt=0x7fa6ac049d90, pollin=0x7fa6a4016e10) at /home/jenkins/root/workspace/centos6-regression/rpc/rpc-lib/src/rpc-clnt.c:793
#9  0x00007fa6bf6d4cd8 in rpc_clnt_notify (trans=0x7fa6ac049f90, mydata=0x7fa6ac049dc0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7fa6a4016e10) at /home/jenkins/root/workspace/centos6-regression/rpc/rpc-lib/src/rpc-clnt.c:986
#10 0x00007fa6bf6d0d6c in rpc_transport_notify (this=0x7fa6ac049f90, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7fa6a4016e10) at /home/jenkins/root/workspace/centos6-regression/rpc/rpc-lib/src/rpc-transport.c:538
#11 0x00007fa6b42872d2 in socket_event_poll_in (this=0x7fa6ac049f90) at /home/jenkins/root/workspace/centos6-regression/rpc/rpc-transport/socket/src/socket.c:2273
#12 0x00007fa6b428787f in socket_event_handler (fd=10, idx=3, data=0x7fa6ac049f90, poll_in=1, poll_out=0, poll_err=0) at /home/jenkins/root/workspace/centos6-regression/rpc/rpc-transport/socket/src/socket.c:2409
#13 0x00007fa6bf97f13a in event_dispatch_epoll_handler (event_pool=0x202eef0, event=0x7fa6ab6fee70) at /home/jenkins/root/workspace/centos6-regression/libglusterfs/src/event-epoll.c:572
#14 0x00007fa6bf97f566 in event_dispatch_epoll_worker (data=0x7fa6ac049ad0) at /home/jenkins/root/workspace/centos6-regression/libglusterfs/src/event-epoll.c:675
#15 0x00007fa6bebe7aa1 in start_thread () from ./lib64/libpthread.so.0
#16 0x00007fa6be550aad in clone () from ./lib64/libc.so.6
(gdb) f 1
#1  0x00007fa6b1688b56 in wb_add_head_for_retry (head=0x7fa69c083700) at /home/jenkins/root/workspace/centos6-regression/xlators/performance/write-behind/src/write-behind.c:814
warning: Source file is more recent than executable.
814             LOCK (&head->wb_inode->lock);
(gdb) l
809     wb_add_head_for_retry (wb_request_t *head)
810     {
811             if (!head)
812                     goto out;
813
814             LOCK (&head->wb_inode->lock);
815             {
816                     __wb_add_head_for_retry (head);
817             }
818             UNLOCK (&head->wb_inode->lock);
(gdb) p head->wb_inode->lock
$1 = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}

Everything seems to be zero in the lock structure.
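For context, frame #1 takes the lock through two pointer hops. Below is a minimal sketch of that chain; the struct layouts are simplified guesses, not the real GlusterFS definitions, and LOCK() actually operates on the gf_lock_t spinlock/mutex union printed above rather than a bare pthread mutex:

#include <pthread.h>

/* Simplified stand-ins for wb_request_t and wb_inode_t; the real
 * structures have many more fields. */
typedef struct wb_inode {
        pthread_mutex_t lock;
} wb_inode_t;

typedef struct wb_request {
        wb_inode_t *wb_inode;
} wb_request_t;

/* If `head` points at freed memory, both head->wb_inode and the lock
 * it points to are read from freed memory, which matches the crash in
 * pthread_mutex_lock (frame #0). */
static void
wb_add_head_for_retry_sketch (wb_request_t *head)
{
        if (!head)
                return;

        pthread_mutex_lock (&head->wb_inode->lock);
        {
                /* __wb_add_head_for_retry (head); */
        }
        pthread_mutex_unlock (&head->wb_inode->lock);
}

Note that on glibc an all-zero pthread mutex is indistinguishable from a freshly initialized one, so the zeroed dump by itself is only consistent with, not proof of, freed memory; the stronger evidence that head itself was freed comes in comment 3 below.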

Comment 1 Ravishankar N 2017-03-21 10:08:28 UTC
Forgot to add the link to the run: https://build.gluster.org/job/centos6-regression/3707/consoleFull

Comment 2 Amar Tumballi 2019-05-10 13:06:11 UTC
This crash has not been seen in a long time, but it would be good to re-check it against the latest master and close it.

Comment 3 Raghavendra G 2019-05-10 14:56:01 UTC
This looks to be a duplicate of bz 1528558. In particular, the commit message of that patch describes exactly how a corrupted/freed request can end up in the todo list:

COMMIT: https://review.gluster.org/19064 committed in master by "Raghavendra G" <rgowdapp> with a commit message: performance/write-behind: fix bug while handling short writes

The variable "fulfilled" in wb_fulfill_short_write is not reset to 0
while handling each member of the list.

This has some interesting consequences:

* If we break from the loop while processing the last member of the
  list head->winds, req is reset to head, since the list is a circular
  one. However, head is already fulfilled and can potentially be
  freed, so we end up adding a freed request to the wb_inode->todo
  list. This is the RCA for the crash tracked by the bug associated
  with this patch (note that we saw "holder", which had been freed, in
  the todo list).

* If we break from the loop while processing any member of
  head->winds other than the last, req is set to the next member in
  the list, skipping the current request even though it is not
  entirely synced. This can lead to data corruption.

The fix is simple: change the code so that "fulfilled" reflects only
whether the current request is fulfilled and does not carry over
history from previous requests in the list.

Change-Id: Ia3d6988175a51c9e08efdb521a7b7938b01f93c8
BUG: 1528558
Signed-off-by: Raghavendra G <rgowdapp>
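
To make the failure mode concrete, here is a small, compilable model of the pattern the commit message describes. It is hypothetical throughout, not the actual write-behind source: struct req, want, acked, and the hand-rolled list_entry stand in for wb_request_t, the short-write accounting, and the kernel-style circular-list macros. It demonstrates that a "fulfilled" flag examined only at the top of the next iteration carries history, so on a circular list the cursor wraps back to the already-fulfilled (and possibly freed) head before the stale value is noticed:

#include <stddef.h>
#include <stdio.h>

struct list_head { struct list_head *next; };

#define list_entry(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

struct req {
        const char      *name;
        long             want;    /* bytes this member still needs synced */
        struct list_head winds;
};

int
main (void)
{
        struct req head = { "head (already fulfilled, possibly freed)", 0, { NULL } };
        struct req a    = { "a", 4096, { NULL } };
        struct req b    = { "b", 8192, { NULL } };

        head.winds.next = &a.winds;
        a.winds.next    = &b.winds;
        b.winds.next    = &head.winds;  /* circular, like head->winds */

        long acked     = 4096;          /* short write: covers only `a` */
        int  fulfilled = 1;             /* BUG: never reset per member  */

        struct req *req;
        for (req = list_entry (head.winds.next, struct req, winds);
             &req->winds != &head.winds;
             req = list_entry (req->winds.next, struct req, winds)) {
                if (!fulfilled)         /* stale: previous member's result */
                        break;
                fulfilled = (acked >= req->want);
                if (fulfilled)
                        acked -= req->want;
        }

        /* `b`, the last member, was not fulfilled, but the stale flag
         * is only checked after the cursor has already wrapped: req
         * now equals &head, the request that was fulfilled earlier and
         * may have been freed. Queuing this req for retry is the
         * use-after-free behind the backtrace above. */
        printf ("retry cursor now points at: %s\n", req->name);
        return 0;
}

With one more member queued after "b", the same stale flag would instead fire one member late and skip the unsynced request entirely, which is the data-corruption case in the second bullet above.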

Comment 4 Xavi Hernandez 2019-11-20 10:32:02 UTC
Given that the failure has not been seen for a long time and there is a patch that appears to match the problem, I'm closing this bug as a duplicate of bug #1528558.

*** This bug has been marked as a duplicate of bug 1528558 ***