Bug 1486538 - [geo-rep+qr]: Crashes observed at slave from qr_lookup_sbk during rename/hardlink/rebalance cases
Summary: [geo-rep+qr]: Crashes observed at slave from qr_lookup_sbk during rename/hard...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: 3.12
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Nithya Balachandran
QA Contact:
URL:
Whiteboard:
Depends On: 1484709
Blocks: 1484113 glusterfs-3.12.1
 
Reported: 2017-08-30 04:26 UTC by Nithya Balachandran
Modified: 2017-09-14 07:42 UTC
CC: 7 users

Fixed In Version: glusterfs-3.12.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1484709
Environment:
Last Closed: 2017-09-14 07:42:56 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Nithya Balachandran 2017-08-30 04:26:25 UTC
+++ This bug was initially created as a clone of Bug #1484709 +++

+++ This bug was initially created as a clone of Bug #1484113 +++

Description of problem:
=======================

While running rebalance as part of a remove-brick operation, with an IO pattern containing hardlinks and renames, multiple crashes were observed at the slave with the following backtrace:

(gdb) bt
#0  0x00007f2152cebaeb in __memcpy_sse2 () from /lib64/libc.so.6
#1  0x00007f2145b9b824 in memcpy (__len=<optimized out>, __src=<optimized out>, __dest=<optimized out>) at /usr/include/bits/string3.h:51
#2  qr_content_extract (xdata=xdata@entry=0x7f21340658c0) at quick-read.c:279
#3  0x00007f2145b9bdaa in qr_lookup_cbk (frame=0x7f212c05fff0, cookie=<optimized out>, this=0x7f214001e4b0, op_ret=0, op_errno=117, inode_ret=0x7f212c05bb60, 
    buf=0x7f212c002b58, xdata=0x7f21340658c0, postparent=0x7f212c002d88) at quick-read.c:423
#4  0x00007f2145da7de7 in ioc_lookup_cbk (frame=0x7f212c057f00, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, 
    inode=0x7f212c05bb60, stbuf=0x7f212c002b58, xdata=0x7f21340658c0, postparent=0x7f212c002d88) at io-cache.c:267
#5  0x00007f2146614627 in dht_discover_complete (this=this@entry=0x7f2140016f40, discover_frame=discover_frame@entry=0x7f212c061500) at dht-common.c:577
#6  0x00007f214661534a in dht_discover_cbk (frame=0x7f212c061500, cookie=0x7f21400135b0, this=0x7f2140016f40, op_ret=<optimized out>, op_errno=117, 
    inode=0x7f212c05bb60, stbuf=0x7f212c05e648, xattr=0x7f21340658c0, postparent=0x7f212c05e6b8) at dht-common.c:700
#7  0x00007f21468c4d10 in afr_discover_done (this=<optimized out>, frame=0x7f212c0546c0) at afr-common.c:2624
#8  afr_discover_cbk (frame=frame@entry=0x7f212c0546c0, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, 
    inode=inode@entry=0x7f212c05bb60, buf=buf@entry=0x7f2147537920, xdata=0x7f21400adca0, postparent=postparent@entry=0x7f2147537990) at afr-common.c:2669
#9  0x00007f2146b012c7 in client3_3_lookup_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f212c061380) at client-rpc-fops.c:2947
#10 0x00007f21543f2840 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f2140080400, pollin=pollin@entry=0x7f21400ee830) at rpc-clnt.c:794
#11 0x00007f21543f2b27 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f2140080430, event=<optimized out>, data=0x7f21400ee830) at rpc-clnt.c:987
#12 0x00007f21543ee9e3 in rpc_transport_notify (this=this@entry=0x7f21400805a0, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f21400ee830)
    at rpc-transport.c:538
#13 0x00007f2148fef3d6 in socket_event_poll_in (this=this@entry=0x7f21400805a0, notify_handled=<optimized out>) at socket.c:2306
#14 0x00007f2148ff197c in socket_event_handler (fd=13, idx=3, gen=1, data=0x7f21400805a0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2458
#15 0x00007f21546841e6 in event_dispatch_epoll_handler (event=0x7f2147537e80, event_pool=0x56504ff3cee0) at event-epoll.c:572
#16 event_dispatch_epoll_worker (data=0x56504ff95d60) at event-epoll.c:648
#17 0x00007f2153488e25 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f2152d5534d in clone () from /lib64/libc.so.6
(gdb) 

A similar bug was seen when the Master volume was tiered and the slave was DR, as mentioned in BZ 1291617. However, that bug was verified after three retries on the same setup.

Raising this as a new bug, without a tiered volume and with a heavy IO pattern.




How reproducible:
=================

2/2


Steps Carried:
==============
1. Create Master volume (5x2) and Slave volume (4x2)
2. Setup geo-rep session between master and slave
3. Mount the Master volume and start the following IO:

for i in {create,chmod,hardlink,chgrp,symlink,hardlink,truncate,hardlink,rename,hardlink,symlink,hardlink,rename,chown,rename,create,hardlink,hardlink,symlink,rename}; do crefi --multi -n 5 -b 10 -d 10 --max=10K --min=500 --random -T 10 -t text --fop=$i /mnt/master_nfs/b ; sleep 10 ; done

4. Wait a couple of minutes for geo-rep to start syncing these files to the slave.
5. While syncing is in progress, remove one subvolume from the slave volume (leaving 3x2) via rebalance start.
6. After about 10 minutes, remove one subvolume from the Master volume too (leaving 4x2).
7. Keep checking the rebalance status; once it has completed at the slave, commit the remove-brick.
8. Keep checking the rebalance status; once it has completed at the Master, stop geo-replication, commit the remove-brick, and start geo-replication again.


Actual results:
===============

Crashes observed at slave


Expected results:
=================

No crashes should be seen


--- Additional comment from Nithya Balachandran on 2017-08-23 06:09:56 EDT ---

Here is what I have so far:

gdb -c /core.4537 /usr/sbin/glusterfs

Core was generated by `/usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-re'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f2152cebaeb in __memcpy_sse2 () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-8.el7.x86_64 libcom_err-1.42.9-10.el7.x86_64 libgcc-4.8.5-16.el7.x86_64 libselinux-2.5-11.el7.x86_64 libuuid-2.23.2-43.el7.x86_64 openssl-libs-1.0.2k-8.el7.x86_64 pcre-8.32-17.el7.x86_64 sssd-client-1.15.2-50.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00007f2152cebaeb in __memcpy_sse2 () from /lib64/libc.so.6
#1  0x00007f2145b9b824 in memcpy (__len=<optimized out>, __src=<optimized out>, __dest=<optimized out>) at /usr/include/bits/string3.h:51
#2  qr_content_extract (xdata=xdata@entry=0x7f21340658c0) at quick-read.c:279
#3  0x00007f2145b9bdaa in qr_lookup_cbk (frame=0x7f212c05fff0, cookie=<optimized out>, this=0x7f214001e4b0, op_ret=0, op_errno=117, 
    inode_ret=0x7f212c05bb60, buf=0x7f212c002b58, xdata=0x7f21340658c0, postparent=0x7f212c002d88) at quick-read.c:423
#4  0x00007f2145da7de7 in ioc_lookup_cbk (frame=0x7f212c057f00, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, 
    op_errno=<optimized out>, inode=0x7f212c05bb60, stbuf=0x7f212c002b58, xdata=0x7f21340658c0, postparent=0x7f212c002d88) at io-cache.c:267
#5  0x00007f2146614627 in dht_discover_complete (this=this@entry=0x7f2140016f40, discover_frame=discover_frame@entry=0x7f212c061500) at dht-common.c:577
#6  0x00007f214661534a in dht_discover_cbk (frame=0x7f212c061500, cookie=0x7f21400135b0, this=0x7f2140016f40, op_ret=<optimized out>, op_errno=117, 
    inode=0x7f212c05bb60, stbuf=0x7f212c05e648, xattr=0x7f21340658c0, postparent=0x7f212c05e6b8) at dht-common.c:700
#7  0x00007f21468c4d10 in afr_discover_done (this=<optimized out>, frame=0x7f212c0546c0) at afr-common.c:2624
#8  afr_discover_cbk (frame=frame@entry=0x7f212c0546c0, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, 
    inode=inode@entry=0x7f212c05bb60, buf=buf@entry=0x7f2147537920, xdata=0x7f21400adca0, postparent=postparent@entry=0x7f2147537990) at afr-common.c:2669
#9  0x00007f2146b012c7 in client3_3_lookup_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f212c061380)
    at client-rpc-fops.c:2947
#10 0x00007f21543f2840 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f2140080400, pollin=pollin@entry=0x7f21400ee830) at rpc-clnt.c:794
#11 0x00007f21543f2b27 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f2140080430, event=<optimized out>, data=0x7f21400ee830) at rpc-clnt.c:987
#12 0x00007f21543ee9e3 in rpc_transport_notify (this=this@entry=0x7f21400805a0, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f21400ee830)
    at rpc-transport.c:538
#13 0x00007f2148fef3d6 in socket_event_poll_in (this=this@entry=0x7f21400805a0, notify_handled=<optimized out>) at socket.c:2306
#14 0x00007f2148ff197c in socket_event_handler (fd=13, idx=3, gen=1, data=0x7f21400805a0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2458
#15 0x00007f21546841e6 in event_dispatch_epoll_handler (event=0x7f2147537e80, event_pool=0x56504ff3cee0) at event-epoll.c:572
#16 event_dispatch_epoll_worker (data=0x56504ff95d60) at event-epoll.c:648
#17 0x00007f2153488e25 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f2152d5534d in clone () from /lib64/libc.so.6
(gdb) 

The process has crashed in __memcpy_sse2 which was called in qr_content_extract.

(gdb) f 2
#2  qr_content_extract (xdata=xdata@entry=0x7f21340658c0) at quick-read.c:279
279		memcpy (content, data->data, data->len);
(gdb) p *data
$1 = {is_static = 0 '\000', is_const = 0 '\000', len = -1162167622, data = 0x7f2134000cf0 "0", refcount = 0, lock = {spinlock = 0, mutex = {__data = {
        __lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = -1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 16 times>, "\377\377\377\377", '\000' <repeats 19 times>, __align = 0}}}
(gdb) p/x *data
$2 = {is_static = 0x0, is_const = 0x0, len = 0xbabababa, data = 0x7f2134000cf0, refcount = 0x0, lock = {spinlock = 0x0, mutex = {__data = {__lock = 0x0, 
        __count = 0x0, __owner = 0x0, __nusers = 0x0, __kind = 0xffffffff, __spins = 0x0, __elision = 0x0, __list = {__prev = 0x0, __next = 0x0}}, __size = {
        0x0 <repeats 16 times>, 0xff, 0xff, 0xff, 0xff, 0x0 <repeats 20 times>}, __align = 0x0}}}
(gdb)

This data struct looks like it has already been freed. The memcpy is using a garbage length (data->len = 0xbabababa), which causes the crash.

Analysing the core
-------------------

First, check if the client returns the correct xdata:

(gdb) f 9
#9  0x00007f2146b012c7 in client3_3_lookup_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f212c061380)
    at client-rpc-fops.c:2947
2947	        CLIENT_STACK_UNWIND (lookup, frame, rsp.op_ret, rsp.op_errno, inode,

(gdb) p rsp.op_ret
$3 = 0
(gdb) p *xdata
$4 = {is_static = 0 '\000', hash_size = 1, count = 5, refcount = 2, members = 0x7f21400adcf8, members_list = 0x7f21400ce980, extra_free = 0x0, 
  extra_stdfree = 0x0, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, 
        __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, members_internal = 0x7f21400ce980, free_pair = {
    hash_next = 0x0, prev = 0x7f2140003b70, next = 0x0, value = 0x7f21400da6c0, key = 0x7f21400ea0f0 "link-count"}, free_pair_in_use = _gf_true}

(gdb) p *xdata->members[0]
$5 = {hash_next = 0x7f2140097640, prev = 0x0, next = 0x7f2140097640, value = 0x7f2140070740, key = 0x7f21400f0400 "glusterfs.content"}
(gdb) p *xdata->members[0]->value
$6 = {is_static = 0 '\000', is_const = 0 '\000', len = 5223, 
  data = 0x7f21400f6aa0 "# /etc/services:\n# $Id: services,v 1.55 2013/04/14 ovasik Exp $\n#\n# Network services, Internet style\n# IANA services version: last updated 2013-04-10\n#\n# Note that it is presently the policy of IANA t"..., refcount = 1, lock = {spinlock = 0, mutex = {__data = {__lock = 0, 
        __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 39 times>, __align = 0}}}
(gdb) 


The value of "glusterfs.content" in the xdata returned by this client is correct. However afr chooses a different subvol as the read subvol and returns a different xdata to the higher layer as can be seen by the xdata returned to dht_discover_cbk in f6 (xdata=0x7f21340658c0) versus the xdata this client returned to afr in f8 (xdata=0x7f21400adca0). 

xdata=0x7f21340658c0 does not have the file contents.


(gdb) f 7
#7  0x00007f21468c4d10 in afr_discover_done (this=<optimized out>, frame=0x7f212c0546c0) at afr-common.c:2624
2624		AFR_STACK_UNWIND (lookup, frame, local->op_ret, local->op_errno,

(gdb) p local->replies[0]->xdata
$16 = (dict_t *) 0x7f21400adca0   <---- good one
(gdb) p local->replies[1]->xdata
$17 = (dict_t *) 0x7f21340658c0   <---- "bad" one


AFR returns local->replies[1]->xdata.

(gdb) p *local->replies[1]->xdata
$18 = {is_static = 0 '\000', hash_size = 1, count = 5, refcount = 2, members = 0x7f2134065918, members_list = 0x7f21340387f0, extra_free = 0x0, 
  extra_stdfree = 0x0, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, 
        __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, members_internal = 0x7f21340387f0, free_pair = {
    hash_next = 0x0, prev = 0x7f213406d630, next = 0x0, value = 0x7f21400f54e0, key = 0x7f213401b230 "link-count"}, free_pair_in_use = _gf_true}

(gdb) p *local->replies[1]->xdata->members[0]
$19 = {hash_next = 0x7f21340398d0, prev = 0x0, next = 0x7f21340398d0, value = 0x7f2140002b50, key = 0x7f213401af10 "glusterfs.content"}
(gdb) p *local->replies[1]->xdata->members[0]->value
$20 = {is_static = 0 '\000', is_const = 0 '\000', len = 0, data = 0x7f21400c35e0 <incomplete sequence \360\255\272>, refcount = 1, lock = {spinlock = 0, 
    mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 39 times>, __align = 0}}}

(gdb) p local->replies[1]->xdata->members[0]->value
$21 = (data_t *) 0x7f2140002b50

(gdb) p *local->replies[1]->xdata->members[0]->value
$24 = {is_static = 0 '\000', is_const = 0 '\000', len = 0, data = 0x7f21400c35e0 <incomplete sequence \360\255\272>, refcount = 1, lock = {spinlock = 0, 
    mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 39 times>, __align = 0}}}
(gdb) 



qr_content_extract should have got this data ( (data_t *) 0x7f2140002b50) when it called

data = dict_get (xdata, GF_CONTENT_KEY);


However the data it has here is a completely different value - one that does not exist in this xdata.


(gdb) f 2
#2  qr_content_extract (xdata=xdata@entry=0x7f21340658c0) at quick-read.c:279
279		memcpy (content, data->data, data->len);
(gdb) p data
$22 = (data_t *) 0x7f2134015300



Why it got this value for data is still not clear. Will continue to look into this.

--- Additional comment from Nithya Balachandran on 2017-08-23 06:22:16 EDT ---

I see a lot of such messages in the brick logs:
rhs-brick1-b2.log:[2017-08-19 12:36:12.886812] W [MSGID: 113096] [posix-handle.c:793:posix_handle_hard] 0-Slave-posix: mismatching ino/dev between file /rhs/brick1/b2/b/thread6/level00/level10/level20/level30/level40/level50/level60/level70/level80/5997ff7f%%3F00F9OEG1 (165686/64774) and handle /rhs/brick1/b2/.glusterfs/cc/de/ccde18b3-e913-40b8-9183-ae39457d2e62 (10535691/64774)


Are these a cause for concern?

--- Additional comment from Amar Tumballi on 2017-08-23 09:18:50 EDT ---

> I see a lot of such messages in the brick logs:
> rhs-brick1-b2.log:[2017-08-19 12:36:12.886812] W [MSGID: 113096] [posix-handle.c:793:posix_handle_hard] 0-Slave-posix: mismatching ino/dev between file /rhs/brick1/b2/b/thread6/level00/level10/level20/level30/level40/level50/level60/level70/level80/5997ff7f%%3F00F9OEG1 (165686/64774) and handle /rhs/brick1/b2/.glusterfs/cc/de/ccde18b3-e913-40b8-9183-ae39457d2e62 (10535691/64774)

Looks concerning. This happens when we have a separate mount point inside... but how did the hardlink succeed?

--- Additional comment from Nithya Balachandran on 2017-08-24 01:34:24 EDT ---

Ravi,

Any idea why AFR returned the xdata with the zero-length glusterfs.content? In other words, why is the file with no content considered the good copy? Unfortunately, the setup is no longer available, so we cannot check the files on the bricks.

Regards,
Nithya

--- Additional comment from Nithya Balachandran on 2017-08-24 03:01:13 EDT ---

I think the underlying cause of this is that qr_content_extract uses dict_get() to get the data value. As the data structure has not been ref-ed, it is possible that it was deleted elsewhere, causing qr to use a freed structure.

I see afr_lookup_done calls 

dict_del (replies[read_subvol].xdata, GF_CONTENT_KEY); 


I need to check with the AFR team if this can be called in this code path. In the meantime, use dict_get_with_ref in qr_content_extract.
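
For illustration, here is a minimal sketch of what a dict_get_with_ref based qr_content_extract could look like (error handling and the memory-type constant are simplified; this is not the exact patch):

/* Sketch of the proposed change: use dict_get_with_ref so the data_t stays
 * alive while we copy from it. Memory-type constant is a placeholder. */
void *
qr_content_extract (dict_t *xdata)
{
        data_t *data    = NULL;
        void   *content = NULL;
        int     ret     = 0;

        ret = dict_get_with_ref (xdata, GF_CONTENT_KEY, &data);
        if (ret < 0 || !data)
                return NULL;

        content = GF_CALLOC (1, data->len, gf_common_mt_char); /* placeholder mem-type */
        if (content)
                /* data->data and data->len cannot be freed under us while we
                 * hold the ref taken by dict_get_with_ref */
                memcpy (content, data->data, data->len);

        data_unref (data);      /* drop the ref once the copy is done */
        return content;
}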

--- Additional comment from Nithya Balachandran on 2017-08-24 03:43:05 EDT ---

It looks like there are some nasty races in the dict code:


static int
dict_get_with_ref (dict_t *this, char *key, data_t **data)
{
...

        hash = SuperFastHash (key, strlen (key));

        LOCK (&this->lock);
        {
                pair = dict_lookup_common (this, key, hash);
        }
        UNLOCK (&this->lock);


<-- at this point, pair might have been deleted as the dictionary is no longer locked. However, the function proceeds to use it anyway and we could end up accessing a different key-value pair if it has been reused in the meantime. 

        if (pair) {
                ret = 0;
                *data = data_ref (pair->value);
        }

err:
        return ret;
}
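
The obvious way to close this window is to take the ref while the dict lock is still held. A sketch of that change (argument validation omitted; not the exact patch):

static int
dict_get_with_ref (dict_t *this, char *key, data_t **data)
{
        data_pair_t *pair = NULL;
        int          ret  = -ENOENT;
        uint32_t     hash = 0;

        hash = SuperFastHash (key, strlen (key));

        LOCK (&this->lock);
        {
                pair = dict_lookup_common (this, key, hash);
                if (pair) {
                        ret = 0;
                        /* take the ref while still under the lock, so a
                         * concurrent dict_del/dict_set cannot destroy the
                         * data between the lookup and the ref */
                        *data = data_ref (pair->value);
                }
        }
        UNLOCK (&this->lock);

        return ret;
}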


For functions like 

int
dict_get_ptr (dict_t *this, char *key, void **ptr)
{
        data_t * data = NULL;
        int      ret  = 0;

        if (!this || !key || !ptr) {
                ret = -EINVAL;
                goto err;
        }

        ret = dict_get_with_ref (this, key, &data);
        if (ret != 0) {
                goto err;
        }

        ret = data_to_ptr_common (data, ptr);
        if (ret != 0) {
                goto err;
        }

err:
        if (data)
                data_unref (data);
<-- if this key is deleted here, the data can be unrefed and the pointers being returned are no longer valid.
 
        return ret;
}
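
For callers that need the extracted pointer to remain valid, a safer pattern is to hold the ref for as long as the pointer is used and unref only afterwards. A purely illustrative caller-side sketch (caller_use_value is a made-up function; dict_get_with_ref and data_unref are the existing APIs):

/* Illustrative only: keep the ref from dict_get_with_ref alive for as long
 * as the pointer into the data is in use. */
static int
caller_use_value (dict_t *this, char *key)
{
        data_t *data = NULL;
        void   *ptr  = NULL;
        int     ret  = 0;

        ret = dict_get_with_ref (this, key, &data);
        if (ret != 0 || !data)
                return ret;

        ptr = data->data;       /* valid while we hold the ref */

        /* ... use ptr ... */

        data_unref (data);      /* drop the ref only after we are done */
        return 0;
}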

--- Additional comment from Worker Ant on 2017-08-24 04:17:09 EDT ---

REVIEW: https://review.gluster.org/18115 (perf/qr: Use a ref-ed data to extract content) posted (#1) for review on master by N Balachandran (nbalacha)

--- Additional comment from Worker Ant on 2017-08-28 02:42:31 EDT ---

COMMIT: https://review.gluster.org/18115 committed in master by Raghavendra G (rgowdapp) 
------
commit 414d3e92fc56f08e320a3aa65b6b18e65b384551
Author: N Balachandran <nbalacha>
Date:   Thu Aug 24 13:39:07 2017 +0530

    perf/qr: Use a ref-ed data to extract content
    
    qr_content_extract used dict_get to get the value of
    the GF_CONTENT_KEY key. dict_get does not ref the data
    before returning it so QR could be acting on freed memory
    if another thread deletes the key before then.
    This patch also fixes a race in dict_get_with_ref.
    
    Fix: Use dict_get_with_ref to retrieve the file contents.
    
    Change-Id: Ib1a7a70bb92eed7e70747ec530e0b3edc53127ec
    BUG: 1484709
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: https://review.gluster.org/18115
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Amar Tumballi <amarts>
    Tested-by: Raghavendra G <rgowdapp>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra G <rgowdapp>

--- Additional comment from Raghavendra G on 2017-08-28 02:45:44 EDT ---

> I think the underlying cause of this is that qr_content_extract uses dict_get() to get the data value. As the data structure has not been ref-ed , it is possible that it was deleted elsewhere causing qr to use a freed structure.

> I see afr_lookup_done calls 

> dict_del (replies[read_subvol].xdata, GF_CONTENT_KEY); 


> I need to check with the AFR team if this can be called in this code path. In the meantime, use dict_get_with_ref in qr_content_extract.

Since the bug describes a distribute-replicate setup, I checked the code of afr_lookup_done. GF_CONTENT_KEY is deleted before afr calls STACK_UNWIND. So, this can't happen in parallel to qr_content_extract.

--- Additional comment from Nithya Balachandran on 2017-08-29 10:48:50 EDT ---

RCA:

(gdb) bt
#0  0x00007f2152cebaeb in __memcpy_sse2 () from /lib64/libc.so.6
#1  0x00007f2145b9b824 in memcpy (__len=<optimized out>, __src=<optimized out>, __dest=<optimized out>) at /usr/include/bits/string3.h:51
#2  qr_content_extract (xdata=xdata@entry=0x7f21340658c0) at quick-read.c:279
#3  0x00007f2145b9bdaa in qr_lookup_cbk (frame=0x7f212c05fff0, cookie=<optimized out>, this=0x7f214001e4b0, op_ret=0, op_errno=117, inode_ret=0x7f212c05bb60, 
    buf=0x7f212c002b58, xdata=0x7f21340658c0, postparent=0x7f212c002d88) at quick-read.c:423
#4  0x00007f2145da7de7 in ioc_lookup_cbk (frame=0x7f212c057f00, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, 
    inode=0x7f212c05bb60, stbuf=0x7f212c002b58, xdata=0x7f21340658c0, postparent=0x7f212c002d88) at io-cache.c:267
#5  0x00007f2146614627 in dht_discover_complete (this=this@entry=0x7f2140016f40, discover_frame=discover_frame@entry=0x7f212c061500) at dht-common.c:577
#6  0x00007f214661534a in dht_discover_cbk (frame=0x7f212c061500, cookie=0x7f21400135b0, this=0x7f2140016f40, op_ret=<optimized out>, op_errno=117, 
    inode=0x7f212c05bb60, stbuf=0x7f212c05e648, xattr=0x7f21340658c0, postparent=0x7f212c05e6b8) at dht-common.c:700
#7  0x00007f21468c4d10 in afr_discover_done (this=<optimized out>, frame=0x7f212c0546c0) at afr-common.c:2624
#8  afr_discover_cbk (frame=frame@entry=0x7f212c0546c0, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, 
    inode=inode@entry=0x7f212c05bb60, buf=buf@entry=0x7f2147537920, xdata=0x7f21400adca0, postparent=postparent@entry=0x7f2147537990) at afr-common.c:2669
#9  0x00007f2146b012c7 in client3_3_lookup_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f212c061380)
    at client-rpc-fops.c:2947
#10 0x00007f21543f2840 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f2140080400, pollin=pollin@entry=0x7f21400ee830) at rpc-clnt.c:794
#11 0x00007f21543f2b27 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f2140080430, event=<optimized out>, data=0x7f21400ee830) at rpc-clnt.c:987
#12 0x00007f21543ee9e3 in rpc_transport_notify (this=this@entry=0x7f21400805a0, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f21400ee830)
    at rpc-transport.c:538
#13 0x00007f2148fef3d6 in socket_event_poll_in (this=this@entry=0x7f21400805a0, notify_handled=<optimized out>) at socket.c:2306
#14 0x00007f2148ff197c in socket_event_handler (fd=13, idx=3, gen=1, data=0x7f21400805a0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2458
#15 0x00007f21546841e6 in event_dispatch_epoll_handler (event=0x7f2147537e80, event_pool=0x56504ff3cee0) at event-epoll.c:572
#16 event_dispatch_epoll_worker (data=0x56504ff95d60) at event-epoll.c:648
#17 0x00007f2153488e25 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f2152d5534d in clone () from /lib64/libc.so.6




(gdb) f 6
#6  0x00007f214661534a in dht_discover_cbk (frame=0x7f212c061500, cookie=0x7f21400135b0, this=0x7f2140016f40, op_ret=<optimized out>, op_errno=117, 
    inode=0x7f212c05bb60, stbuf=0x7f212c05e648, xattr=0x7f21340658c0, postparent=0x7f212c05e6b8) at dht-common.c:700
700	                dht_discover_complete (this, frame);
(gdb) l
695	         * which calls STACK_DESTROY (frame). In the case of "attempt_unwind",
696	         * this makes sure that the thread don't call dht_frame_return, till
697	         * call to dht_discover_complete is done.
698	         */
699	        if (attempt_unwind) {
700	                dht_discover_complete (this, frame);  <--- Here
701	        }
702	
703	        this_call_cnt = dht_frame_return (frame);
704


dht_discover_cbk has found a file, so it attempts to unwind before all subvols have responded; other subvols may therefore still be in the process of replying.




Checking the frame for dht_discover_cbk:

gdb) f 6
#6  0x00007f214661534a in dht_discover_cbk (frame=0x7f212c061500, cookie=0x7f21400135b0, this=0x7f2140016f40, op_ret=<optimized out>, op_errno=117, 
    inode=0x7f212c05bb60, stbuf=0x7f212c05e648, xattr=0x7f21340658c0, postparent=0x7f212c05e6b8) at dht-common.c:700
700	                dht_discover_complete (this, frame);
(gdb) p local->file_count
$6 = 3

Strangely, a file with this gfid was found on _all_ 3 children. As local->file_count does not distinguish between data files and linkto files, it seems likely there was a stale (?) linkto file on at least one subvol.

So at this point all 3 subvols have at least unwound to dht_discover_cbk. 

(gdb) p local->dir_count
$7 = 0

(gdb) p is_linkfile
$8 = <optimized out>
(gdb) p local->cached_subvol
$9 = (xlator_t *) 0x7f2140014d20
(gdb) p local->hashed_subvol
$10 = (xlator_t *) 0x0
(gdb) p *local->cached_subvol
$11 = {name = 0x7f21400158c0 "Slave-replicate-1", type = 0x7f2140015a40 "cluster/replicate", instance_name = 0x0, next = 0x7f21400135b0, prev = 0x7f2140015e30, 
  parents = 0x7f2140018630, children = 0x7f2140015d10, options = 0x7f2140015910, dlhandle = 0x7f2140014320, fops = 0x7f2146ad91a0 <fops>, 
  cbks = 0x7f2146ad90e0 <cbks>, dumpops = 0x7f2146ad9140 <dumpops>, volume_options = {next = 0x7f2140015b40, prev = 0x7f2140015b40}, 
  fini = 0x7f21468c6ac0 <fini>, init = 0x7f21468c5e10 <init>, reconfigure = 0x7f21468c5740 <reconfigure>, mem_acct_init = 0x7f21468c5670 <mem_acct_init>, 
  notify = 0x7f21468c5610 <notify>, loglevel = GF_LOG_NONE, latencies = {{min = 0, max = 0, total = 0, std = 0, mean = 0, count = 0} <repeats 55 times>}, 
  history = 0x0, ctx = 0x56504ff06010, graph = 0x7f2140003f30, itable = 0x7f214005e0a0, init_succeeded = 1 '\001', private = 0x7f214005db00, 
  mem_acct = 0x7f214005a0b0, winds = 0, switched = 0 '\000', local_pool = 0x7f21548f2ab8 <pools+280>, is_autoloaded = _gf_false, volfile_id = 0x0}
(gdb) p *prev
$12 = {name = 0x7f2140012740 "Slave-replicate-0", type = 0x7f2140014260 "cluster/replicate", instance_name = 0x0, next = 0x7f21400118a0, prev = 0x7f2140014d20, 
  parents = 0x7f2140018480, children = 0x7f2140014c00, options = 0x7f2140014130, dlhandle = 0x7f2140014320, fops = 0x7f2146ad91a0 <fops>, 
  cbks = 0x7f2146ad90e0 <cbks>, dumpops = 0x7f2146ad9140 <dumpops>, volume_options = {next = 0x7f2140014a30, prev = 0x7f2140014a30}, 
  fini = 0x7f21468c6ac0 <fini>, init = 0x7f21468c5e10 <init>, reconfigure = 0x7f21468c5740 <reconfigure>, mem_acct_init = 0x7f21468c5670 <mem_acct_init>, 
  notify = 0x7f21468c5610 <notify>, loglevel = GF_LOG_NONE, latencies = {{min = 0, max = 0, total = 0, std = 0, mean = 0, count = 0} <repeats 55 times>}, 
  history = 0x0, ctx = 0x56504ff06010, graph = 0x7f2140003f30, itable = 0x7f2140065080, init_succeeded = 1 '\001', private = 0x7f2140064ae0, 
  mem_acct = 0x7f2140061090, winds = 0, switched = 0 '\000', local_pool = 0x7f21548f2ab8 <pools+280>, is_autoloaded = _gf_false, volfile_id = 0x0}


As local->cached_subvol does not match prev, it is clear that local->cached_subvol was updated after this thread proceeded to unwind to the higher xlators. This means a data file was found on at least 2 subvols; this could be because a linkto file was in the process of being created when the lookup hit the brick.




In dht_discover_complete () :

        local = discover_frame->local;   
        layout = local->layout;
        conf = this->private;

        LOCK(&discover_frame->lock);
        {
                main_frame = local->main_frame;
                local->main_frame = NULL;
        }
        UNLOCK(&discover_frame->lock);

        if (!main_frame)
                return 0;    <--- So far so good - only one thread will unwind here.

However, discover_frame->local is valid and _still_ accessible to other threads unwinding to dht_discover_cbk.




In dht_discover_cbk:
...

        local = frame->local;

...
        LOCK (&frame->lock);
        {

...
                if (is_dir) {
                        local->dir_count ++;
                } else {
                        local->file_count ++;

                        if (!is_linkfile) {
                                /* real file */
                                local->cached_subvol = prev;
                                attempt_unwind = 1;
                        } else {
                                goto unlock;
                        }
                }


...

                if (local->xattr == NULL) {
                        local->xattr = dict_ref (xattr); <-- say thread 1 (the one that crashed )did this
                } else {
                        dht_aggregate_xattr (local->xattr, xattr);   <---  and thread 2 called this (this should _only_ be done for directories.)
                }


In dht_aggregate_xattr () -> dht_aggregate ():

        ret = dict_set (dst, key, value);
        if (ret) {
                gf_msg ("dht", GF_LOG_WARNING, 0,
                        DHT_MSG_DICT_SET_FAILED,
                        "Failed to set dictionary value: key = %s",
                        key);
        }



dict_set _replaces_ the value for the key, which unrefs the old data; its ref-count drops to 0 and it is destroyed.

So here is probably what happened:

Time:   subvol-0 (thread 1)                        subvol-1 (thread 2)

t0      dht_discover_cbk
        local->xattr = dict_ref (dict0);
        dict_set (dict0, GF_CONTENT_KEY, data0)
        attempt_unwind = 1

t1      dht_discover_complete

t2      qr_content_extract gets data0

t3                                                 dht_discover_cbk
                                                   dht_aggregate_xattr (local->xattr, xattr)
                                                     |-> dict_set (dict0, GF_CONTENT_KEY, data1)
                                                           |-> data_unref (data0)  <-- ref-count drops to 0, data0 is destroyed
                                                   attempt_unwind = 1

t4      qr_content_extract accesses data0 and crashes
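
Based on this RCA, the fix on the DHT side (see the review posted below) is to aggregate xattrs only for directories and to keep only the first data file found, so racing callbacks do not modify the dict that has already been handed up. A rough sketch of the relevant section of dht_discover_cbk under those assumptions (illustrative, not the committed hunk):

        LOCK (&frame->lock);
        {
                if (is_dir) {
                        local->dir_count++;
                } else {
                        local->file_count++;

                        if (!is_linkfile) {
                                /* real file */
                                if (local->cached_subvol) {
                                        /* a data file was already found on
                                         * another subvol; keep the first one
                                         * and do not touch local->xattr again */
                                        goto unlock;
                                }
                                local->cached_subvol = prev;
                                attempt_unwind = 1;
                        } else {
                                goto unlock;
                        }
                }

                if (local->xattr == NULL) {
                        local->xattr = dict_ref (xattr);
                } else if (is_dir) {
                        /* aggregate xattrs only for directories; for files
                         * the first reply's xattr is kept as-is */
                        dht_aggregate_xattr (local->xattr, xattr);
                }
        }
unlock:
        UNLOCK (&frame->lock);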

--- Additional comment from Worker Ant on 2017-08-29 11:51:49 EDT ---

REVIEW: https://review.gluster.org/18137 (cluster/dht: Aggregate xattrs only for dirs in dht_discover_cbk) posted (#1) for review on master by N Balachandran (nbalacha)

Comment 1 Worker Ant 2017-08-30 04:38:19 UTC
REVIEW: https://review.gluster.org/18145 (perf/qr: Use a ref-ed data to extract content) posted (#1) for review on release-3.12 by N Balachandran (nbalacha)

Comment 2 Worker Ant 2017-08-30 05:16:16 UTC
REVIEW: https://review.gluster.org/18146 (cluster/dht: Aggregate xattrs only for dirs in dht_discover_cbk) posted (#1) for review on release-3.12 by N Balachandran (nbalacha)

Comment 3 Worker Ant 2017-09-07 06:57:46 UTC
COMMIT: https://review.gluster.org/18145 committed in release-3.12 by jiffin tony Thottan (jthottan) 
------
commit a862ca969cfd4dc9309e8123e6e90c4e47f9b89b
Author: N Balachandran <nbalacha>
Date:   Wed Aug 30 10:01:54 2017 +0530

    perf/qr: Use a ref-ed data to extract content
    
    qr_content_extract used dict_get to get the value of
    the GF_CONTENT_KEY key. dict_get does not ref the data
    before returning it so QR could be acting on freed memory
    if another thread deletes the key before then.
    This patch also fixes a race in dict_get_with_ref.
    
    Fix: Use dict_get_with_ref to retrieve the file contents.
    
    > BUG: 1484709
    > Signed-off-by: N Balachandran <nbalacha>
    > Reviewed-on: https://review.gluster.org/18115
    > Smoke: Gluster Build System <jenkins.org>
    > Reviewed-by: Amar Tumballi <amarts>
    > Tested-by: Raghavendra G <rgowdapp>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Raghavendra G <rgowdapp>
    
    Change-Id: Ib1a7a70bb92eed7e70747ec530e0b3edc53127ec
    BUG: 1486538
    (cherry picked from commit 414d3e92fc56f08e320a3aa65b6b18e65b384551)
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: https://review.gluster.org/18145
    Reviewed-by: Raghavendra G <rgowdapp>
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: MOHIT AGRAWAL <moagrawa>

Comment 4 Nithya Balachandran 2017-09-07 07:02:42 UTC
Moving this back to Post as https://review.gluster.org/#/c/18146/ has not yet been merged.

Comment 5 Worker Ant 2017-09-07 07:07:59 UTC
COMMIT: https://review.gluster.org/18146 committed in release-3.12 by jiffin tony Thottan (jthottan) 
------
commit 268bb635aca729a24242412c6f9dd9d5ba2a2654
Author: N Balachandran <nbalacha>
Date:   Wed Aug 30 10:36:45 2017 +0530

    cluster/dht: Aggregate xattrs only for dirs in dht_discover_cbk
    
    If dht_discover finds data files on more than one subvol,
    racing calls to dht_discover_cbk could end up calling
    dht_aggregate_xattr which could delete dictionary data
    that is being accessed by higher layer translators.
    Fixed to call dht_aggregate_xattr only for directories and
    consider only the first file to be found.
    
    > BUG: 1484709
    > Signed-off-by: N Balachandran <nbalacha>
    > Reviewed-on: https://review.gluster.org/18137
    > Smoke: Gluster Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Raghavendra G <rgowdapp>
    
    Change-Id: I4f3d2a405ec735d4f1bb33a04b7255eb2d179f8a
    BUG: 1486538
    (cherry picked from commit 9420022df0962b1fa4f3ea3774249be81bc945cc)
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: https://review.gluster.org/18146
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra G <rgowdapp>
    Reviewed-by: Amar Tumballi <amarts>
    CentOS-regression: Gluster Build System <jenkins.org>

Comment 6 Jiffin 2017-09-14 07:42:56 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.1, please open a new bug report.

glusterfs-3.12.1 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-September/032441.html
[2] https://www.gluster.org/pipermail/gluster-users/

