Bug 1484113 - [geo-rep+qr]: Crashes observed at slave from qr_lookup_sbk during rename/hardlink/rebalance cases
Summary: [geo-rep+qr]: Crashes observed at slave from qr_lookup_sbk during rename/hard...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: rhgs-3.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Nithya Balachandran
QA Contact: Rochelle
URL:
Whiteboard: rebase
Depends On: 1484709 1486538 1569934
Blocks: 1503134
 
Reported: 2017-08-22 17:46 UTC by Rahul Hinduja
Modified: 2018-09-14 04:45 UTC
CC List: 9 users

Fixed In Version: glusterfs-3.12.2-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1484709 (view as bug list)
Environment:
Last Closed: 2018-09-04 06:35:11 UTC
Embargoed:




Links:
Red Hat Product Errata RHSA-2018:2607 (last updated 2018-09-04 06:37:32 UTC)

Description Rahul Hinduja 2017-08-22 17:46:10 UTC
Description of problem:
=======================

Running a rebalance as part of remove-brick, on a volume carrying a hardlink/rename I/O pattern, resulted in multiple crashes at the slave with the following backtrace:

(gdb) bt
#0  0x00007f2152cebaeb in __memcpy_sse2 () from /lib64/libc.so.6
#1  0x00007f2145b9b824 in memcpy (__len=<optimized out>, __src=<optimized out>, __dest=<optimized out>) at /usr/include/bits/string3.h:51
#2  qr_content_extract (xdata=xdata@entry=0x7f21340658c0) at quick-read.c:279
#3  0x00007f2145b9bdaa in qr_lookup_cbk (frame=0x7f212c05fff0, cookie=<optimized out>, this=0x7f214001e4b0, op_ret=0, op_errno=117, inode_ret=0x7f212c05bb60, 
    buf=0x7f212c002b58, xdata=0x7f21340658c0, postparent=0x7f212c002d88) at quick-read.c:423
#4  0x00007f2145da7de7 in ioc_lookup_cbk (frame=0x7f212c057f00, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, 
    inode=0x7f212c05bb60, stbuf=0x7f212c002b58, xdata=0x7f21340658c0, postparent=0x7f212c002d88) at io-cache.c:267
#5  0x00007f2146614627 in dht_discover_complete (this=this@entry=0x7f2140016f40, discover_frame=discover_frame@entry=0x7f212c061500) at dht-common.c:577
#6  0x00007f214661534a in dht_discover_cbk (frame=0x7f212c061500, cookie=0x7f21400135b0, this=0x7f2140016f40, op_ret=<optimized out>, op_errno=117, 
    inode=0x7f212c05bb60, stbuf=0x7f212c05e648, xattr=0x7f21340658c0, postparent=0x7f212c05e6b8) at dht-common.c:700
#7  0x00007f21468c4d10 in afr_discover_done (this=<optimized out>, frame=0x7f212c0546c0) at afr-common.c:2624
#8  afr_discover_cbk (frame=frame@entry=0x7f212c0546c0, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, 
    inode=inode@entry=0x7f212c05bb60, buf=buf@entry=0x7f2147537920, xdata=0x7f21400adca0, postparent=postparent@entry=0x7f2147537990) at afr-common.c:2669
#9  0x00007f2146b012c7 in client3_3_lookup_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f212c061380) at client-rpc-fops.c:2947
#10 0x00007f21543f2840 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f2140080400, pollin=pollin@entry=0x7f21400ee830) at rpc-clnt.c:794
#11 0x00007f21543f2b27 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f2140080430, event=<optimized out>, data=0x7f21400ee830) at rpc-clnt.c:987
#12 0x00007f21543ee9e3 in rpc_transport_notify (this=this@entry=0x7f21400805a0, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f21400ee830)
    at rpc-transport.c:538
#13 0x00007f2148fef3d6 in socket_event_poll_in (this=this@entry=0x7f21400805a0, notify_handled=<optimized out>) at socket.c:2306
#14 0x00007f2148ff197c in socket_event_handler (fd=13, idx=3, gen=1, data=0x7f21400805a0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2458
#15 0x00007f21546841e6 in event_dispatch_epoll_handler (event=0x7f2147537e80, event_pool=0x56504ff3cee0) at event-epoll.c:572
#16 event_dispatch_epoll_worker (data=0x56504ff95d60) at event-epoll.c:648
#17 0x00007f2153488e25 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f2152d5534d in clone () from /lib64/libc.so.6
(gdb) 

A similar bug was seen when the master volume was tiered and the slave was DR, as mentioned in BZ 1291617. However, that bug was verified after three retries on the same setup.

Raising this as a new bug without a tiered volume and with a heavy I/O pattern.


Version-Release number of selected component (if applicable):
=============================================================

glusterfs-geo-replication-3.8.4-41.el7rhgs.x86_64


How reproducible:
=================

2/2


Steps Carried:
==============
1. Create Master volume (5x2) and Slave volume (4x2)
2. Setup geo-rep session between master and slave
3. Mount the Master volume and start the following IO:

for i in {create,chmod,hardlink,chgrp,symlink,hardlink,truncate,hardlink,rename,hardlink,symlink,hardlink,rename,chown,rename,create,hardlink,hardlink,symlink,rename}; do crefi --multi -n 5 -b 10 -d 10 --max=10K --min=500 --random -T 10 -t text --fop=$i /mnt/master_nfs/b ; sleep 10 ; done

4. Wait a couple of minutes for geo-rep to start syncing these files to the slave.
5. While syncing is in progress, remove one subvolume from the slave volume (leaving 3x2) via remove-brick/rebalance start.
6. After about 10 minutes, remove one subvolume from the master volume too (leaving 4x2).
7. Keep checking the rebalance status; once it completes on the slave, commit the remove-brick.
8. Keep checking the rebalance status; once it completes on the master, stop geo-replication, commit the remove-brick, and start geo-replication again.


Actual results:
===============

Crashes observed at slave


Expected results:
=================

No crashes should be seen

Comment 4 Nithya Balachandran 2017-08-23 10:09:56 UTC
Here is what I have so far:

gdb -c /core.4537 /usr/sbin/glusterfs

Core was generated by `/usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-re'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f2152cebaeb in __memcpy_sse2 () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-8.el7.x86_64 libcom_err-1.42.9-10.el7.x86_64 libgcc-4.8.5-16.el7.x86_64 libselinux-2.5-11.el7.x86_64 libuuid-2.23.2-43.el7.x86_64 openssl-libs-1.0.2k-8.el7.x86_64 pcre-8.32-17.el7.x86_64 sssd-client-1.15.2-50.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00007f2152cebaeb in __memcpy_sse2 () from /lib64/libc.so.6
#1  0x00007f2145b9b824 in memcpy (__len=<optimized out>, __src=<optimized out>, __dest=<optimized out>) at /usr/include/bits/string3.h:51
#2  qr_content_extract (xdata=xdata@entry=0x7f21340658c0) at quick-read.c:279
#3  0x00007f2145b9bdaa in qr_lookup_cbk (frame=0x7f212c05fff0, cookie=<optimized out>, this=0x7f214001e4b0, op_ret=0, op_errno=117, 
    inode_ret=0x7f212c05bb60, buf=0x7f212c002b58, xdata=0x7f21340658c0, postparent=0x7f212c002d88) at quick-read.c:423
#4  0x00007f2145da7de7 in ioc_lookup_cbk (frame=0x7f212c057f00, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, 
    op_errno=<optimized out>, inode=0x7f212c05bb60, stbuf=0x7f212c002b58, xdata=0x7f21340658c0, postparent=0x7f212c002d88) at io-cache.c:267
#5  0x00007f2146614627 in dht_discover_complete (this=this@entry=0x7f2140016f40, discover_frame=discover_frame@entry=0x7f212c061500) at dht-common.c:577
#6  0x00007f214661534a in dht_discover_cbk (frame=0x7f212c061500, cookie=0x7f21400135b0, this=0x7f2140016f40, op_ret=<optimized out>, op_errno=117, 
    inode=0x7f212c05bb60, stbuf=0x7f212c05e648, xattr=0x7f21340658c0, postparent=0x7f212c05e6b8) at dht-common.c:700
#7  0x00007f21468c4d10 in afr_discover_done (this=<optimized out>, frame=0x7f212c0546c0) at afr-common.c:2624
#8  afr_discover_cbk (frame=frame@entry=0x7f212c0546c0, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, 
    inode=inode@entry=0x7f212c05bb60, buf=buf@entry=0x7f2147537920, xdata=0x7f21400adca0, postparent=postparent@entry=0x7f2147537990) at afr-common.c:2669
#9  0x00007f2146b012c7 in client3_3_lookup_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f212c061380)
    at client-rpc-fops.c:2947
#10 0x00007f21543f2840 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f2140080400, pollin=pollin@entry=0x7f21400ee830) at rpc-clnt.c:794
#11 0x00007f21543f2b27 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f2140080430, event=<optimized out>, data=0x7f21400ee830) at rpc-clnt.c:987
#12 0x00007f21543ee9e3 in rpc_transport_notify (this=this@entry=0x7f21400805a0, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f21400ee830)
    at rpc-transport.c:538
#13 0x00007f2148fef3d6 in socket_event_poll_in (this=this@entry=0x7f21400805a0, notify_handled=<optimized out>) at socket.c:2306
#14 0x00007f2148ff197c in socket_event_handler (fd=13, idx=3, gen=1, data=0x7f21400805a0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2458
#15 0x00007f21546841e6 in event_dispatch_epoll_handler (event=0x7f2147537e80, event_pool=0x56504ff3cee0) at event-epoll.c:572
#16 event_dispatch_epoll_worker (data=0x56504ff95d60) at event-epoll.c:648
#17 0x00007f2153488e25 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f2152d5534d in clone () from /lib64/libc.so.6
(gdb) 

The process crashed in __memcpy_sse2, which was called from qr_content_extract.

(gdb) f 2
#2  qr_content_extract (xdata=xdata@entry=0x7f21340658c0) at quick-read.c:279
279		memcpy (content, data->data, data->len);
(gdb) p *data
$1 = {is_static = 0 '\000', is_const = 0 '\000', len = -1162167622, data = 0x7f2134000cf0 "0", refcount = 0, lock = {spinlock = 0, mutex = {__data = {
        __lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = -1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 16 times>, "\377\377\377\377", '\000' <repeats 19 times>, __align = 0}}}
(gdb) p/x *data
$2 = {is_static = 0x0, is_const = 0x0, len = 0xbabababa, data = 0x7f2134000cf0, refcount = 0x0, lock = {spinlock = 0x0, mutex = {__data = {__lock = 0x0, 
        __count = 0x0, __owner = 0x0, __nusers = 0x0, __kind = 0xffffffff, __spins = 0x0, __elision = 0x0, __list = {__prev = 0x0, __next = 0x0}}, __size = {
        0x0 <repeats 16 times>, 0xff, 0xff, 0xff, 0xff, 0x0 <repeats 20 times>}, __align = 0x0}}}
(gdb)

This data struct looks like it has already been freed. The memcpy was passed a garbage length (data->len = 0xbabababa), which caused the crash.

Analysing the core
-------------------

First, check if the client returns the correct xdata:

(gdb) f 9
#9  0x00007f2146b012c7 in client3_3_lookup_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f212c061380)
    at client-rpc-fops.c:2947
2947	        CLIENT_STACK_UNWIND (lookup, frame, rsp.op_ret, rsp.op_errno, inode,

(gdb) p rsp.op_ret
$3 = 0
(gdb) p *xdata
$4 = {is_static = 0 '\000', hash_size = 1, count = 5, refcount = 2, members = 0x7f21400adcf8, members_list = 0x7f21400ce980, extra_free = 0x0, 
  extra_stdfree = 0x0, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, 
        __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, members_internal = 0x7f21400ce980, free_pair = {
    hash_next = 0x0, prev = 0x7f2140003b70, next = 0x0, value = 0x7f21400da6c0, key = 0x7f21400ea0f0 "link-count"}, free_pair_in_use = _gf_true}

(gdb) p *xdata->members[0]
$5 = {hash_next = 0x7f2140097640, prev = 0x0, next = 0x7f2140097640, value = 0x7f2140070740, key = 0x7f21400f0400 "glusterfs.content"}
(gdb) p *xdata->members[0]->value
$6 = {is_static = 0 '\000', is_const = 0 '\000', len = 5223, 
  data = 0x7f21400f6aa0 "# /etc/services:\n# $Id: services,v 1.55 2013/04/14 ovasik Exp $\n#\n# Network services, Internet style\n# IANA services version: last updated 2013-04-10\n#\n# Note that it is presently the policy of IANA t"..., refcount = 1, lock = {spinlock = 0, mutex = {__data = {__lock = 0, 
        __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 39 times>, __align = 0}}}
(gdb) 


The value of "glusterfs.content" in the xdata returned by this client is correct. However, AFR chooses a different subvol as the read subvol and returns a different xdata to the higher layer, as can be seen by comparing the xdata returned to dht_discover_cbk in frame 6 (xdata=0x7f21340658c0) with the xdata this client returned to AFR in frame 8 (xdata=0x7f21400adca0).

xdata=0x7f21340658c0 does not have the file contents.


(gdb) f 7
#7  0x00007f21468c4d10 in afr_discover_done (this=<optimized out>, frame=0x7f212c0546c0) at afr-common.c:2624
2624		AFR_STACK_UNWIND (lookup, frame, local->op_ret, local->op_errno,

(gdb) p local->replies[0]->xdata
$16 = (dict_t *) 0x7f21400adca0   <---- good one
(gdb) p local->replies[1]->xdata
$17 = (dict_t *) 0x7f21340658c0   <---- "bad" one


AFR returns local->replies[1]->xdata.

(gdb) p *local->replies[1]->xdata
$18 = {is_static = 0 '\000', hash_size = 1, count = 5, refcount = 2, members = 0x7f2134065918, members_list = 0x7f21340387f0, extra_free = 0x0, 
  extra_stdfree = 0x0, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, 
        __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, members_internal = 0x7f21340387f0, free_pair = {
    hash_next = 0x0, prev = 0x7f213406d630, next = 0x0, value = 0x7f21400f54e0, key = 0x7f213401b230 "link-count"}, free_pair_in_use = _gf_true}

(gdb) p *local->replies[1]->xdata->members[0]
$19 = {hash_next = 0x7f21340398d0, prev = 0x0, next = 0x7f21340398d0, value = 0x7f2140002b50, key = 0x7f213401af10 "glusterfs.content"}
(gdb) p *local->replies[1]->xdata->members[0]->value
$20 = {is_static = 0 '\000', is_const = 0 '\000', len = 0, data = 0x7f21400c35e0 <incomplete sequence \360\255\272>, refcount = 1, lock = {spinlock = 0, 
    mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 39 times>, __align = 0}}}

(gdb) p local->replies[1]->xdata->members[0]->value
$21 = (data_t *) 0x7f2140002b50

(gdb) p *local->replies[1]->xdata->members[0]->value
$24 = {is_static = 0 '\000', is_const = 0 '\000', len = 0, data = 0x7f21400c35e0 <incomplete sequence \360\255\272>, refcount = 1, lock = {spinlock = 0, 
    mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 39 times>, __align = 0}}}
(gdb) 



qr_content_extract should have got this data ((data_t *) 0x7f2140002b50) when it called

data = dict_get (xdata, GF_CONTENT_KEY);


However, the data it has here is a completely different value, one that does not exist in this xdata.


(gdb) f 2
#2  qr_content_extract (xdata=xdata@entry=0x7f21340658c0) at quick-read.c:279
279		memcpy (content, data->data, data->len);
(gdb) p data
$22 = (data_t *) 0x7f2134015300



Why it got this value for data is still not clear. Will continue to look into this.

Comment 5 Nithya Balachandran 2017-08-23 10:22:16 UTC
I see a lot of such messages in the brick logs:
rhs-brick1-b2.log:[2017-08-19 12:36:12.886812] W [MSGID: 113096] [posix-handle.c:793:posix_handle_hard] 0-Slave-posix: mismatching ino/dev between file /rhs/brick1/b2/b/thread6/level00/level10/level20/level30/level40/level50/level60/level70/level80/5997ff7f%%3F00F9OEG1 (165686/64774) and handle /rhs/brick1/b2/.glusterfs/cc/de/ccde18b3-e913-40b8-9183-ae39457d2e62 (10535691/64774)


Are these a cause for concern?

Comment 6 Amar Tumballi 2017-08-23 13:18:50 UTC
>> I see a lot of such messages in the brick logs:
rhs-brick1-b2.log:[2017-08-19 12:36:12.886812] W [MSGID: 113096] [posix-handle.c:793:posix_handle_hard] 0-Slave-posix: mismatching ino/dev between file /rhs/brick1/b2/b/thread6/level00/level10/level20/level30/level40/level50/level60/level70/level80/5997ff7f%%3F00F9OEG1 (165686/64774) and handle /rhs/brick1/b2/.glusterfs/cc/de/ccde18b3-e913-40b8-9183-ae39457d2e62 (10535691/64774)

Looks to be concerning. Happens when we have separate mount point inside... but how did hardlink succeed?

Comment 7 Nithya Balachandran 2017-08-24 05:34:24 UTC
Ravi,

Any idea why AFR returned the xdata with the 0 len glusterfs.content ? In other words, why is the file with no content considered the good copy? Unfortunately the setup is no longer available so we cannot check the files on the bricks.

Regards,
Nithya

Comment 8 Nithya Balachandran 2017-08-24 07:01:13 UTC
I think the underlying cause of this is that qr_content_extract uses dict_get() to get the data value. As the data structure has not been ref'd, it is possible that it was deleted elsewhere, causing quick-read to use a freed structure.

I see afr_lookup_done calls 

dict_del (replies[read_subvol].xdata, GF_CONTENT_KEY); 


I need to check with the AFR team if this can be called in this code path. In the meantime, use dict_get_with_ref in qr_content_extract.
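
As a rough illustration of that suggestion, here is a minimal sketch of qr_content_extract built on dict_get_with_ref. The function name qr_content_extract_sketch, the plain malloc(), and the omitted error logging are all illustrative; this is not the actual quick-read.c change, and it assumes the libglusterfs dict API (dict_get_with_ref, data_unref, data_t) quoted in comment 9 below, plus the GF_CONTENT_KEY ("glusterfs.content") key discussed above.

static void *
qr_content_extract_sketch (dict_t *xdata)
{
        data_t *data    = NULL;
        void   *content = NULL;

        /* take a reference on the data_t instead of using dict_get(),
         * so a concurrent dict_set()/dict_del() on xdata cannot drop
         * the last ref while we are still copying from it */
        if (dict_get_with_ref (xdata, GF_CONTENT_KEY, &data) != 0 || !data)
                return NULL;

        content = malloc (data->len);        /* plain malloc for illustration only */
        if (content)
                memcpy (content, data->data, data->len);

        data_unref (data);                   /* drop our reference */
        return content;
}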

Comment 9 Nithya Balachandran 2017-08-24 07:43:05 UTC
It looks like there are some nasty races in the dict code:


static int
dict_get_with_ref (dict_t *this, char *key, data_t **data)
{
...

        hash = SuperFastHash (key, strlen (key));

        LOCK (&this->lock);
        {
                pair = dict_lookup_common (this, key, hash);
        }
        UNLOCK (&this->lock);


<-- at this point, pair might have been deleted as the dictionary is no longer locked. However, the function proceeds to use it anyway and we could end up accessing a different key-value pair if it has been reused in the meantime. 

        if (pair) {
                ret = 0;
                *data = data_ref (pair->value);
        }

err:
        return ret;
}
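
One way to narrow that window, sketched here under the assumption that taking the reference while the dict lock is still held is acceptable (whether this matches the patch later posted at https://review.gluster.org/18115 is not confirmed here), is to ref the value before dropping the lock:

        LOCK (&this->lock);
        {
                pair = dict_lookup_common (this, key, hash);
                if (pair) {
                        ret   = 0;
                        /* ref the value while still holding the lock so a
                         * concurrent dict_del()/dict_set() cannot free it
                         * between the lookup and the ref */
                        *data = data_ref (pair->value);
                }
        }
        UNLOCK (&this->lock);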


For functions like 

int
dict_get_ptr (dict_t *this, char *key, void **ptr)
{
        data_t * data = NULL;
        int      ret  = 0;

        if (!this || !key || !ptr) {
                ret = -EINVAL;
                goto err;
        }

        ret = dict_get_with_ref (this, key, &data);
        if (ret != 0) {
                goto err;
        }

        ret = data_to_ptr_common (data, ptr);
        if (ret != 0) {
                goto err;
        }

err:
        if (data)
                data_unref (data);
<-- if this key is deleted here, the data can be unrefed and the pointers being returned are no longer valid.
 
        return ret;
}
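
A caller-side sketch of avoiding the second problem, assuming the caller can work with the data_t directly (the key name "some-key" and the helper do_something_with() are hypothetical): hold the reference for as long as the extracted value is in use, instead of relying on the bare pointer returned by dict_get_ptr:

        data_t *data = NULL;

        if (dict_get_with_ref (xdata, "some-key", &data) == 0 && data) {
                /* data->data stays valid while we hold this reference */
                do_something_with (data->data, data->len);
                data_unref (data);       /* only after the last use */
        }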

Comment 10 Nithya Balachandran 2017-08-25 12:59:02 UTC
A possible fix is posted at: https://review.gluster.org/18115

I still don't know who deleted the key.

Comment 11 Ravishankar N 2017-08-27 12:55:20 UTC
(In reply to Nithya Balachandran from comment #7)
> Ravi,
> 
> Any idea why AFR returned the xdata with the 0 len glusterfs.content ? In
> other words, why is the file with no content considered the good copy?
> Unfortunately the setup is no longer available so we cannot check the files
> on the bricks.
> 
> Regards,
> Nithya

Sorry for the delay. I checked the core; the dict does not contain any trusted.afr.client-x xattrs that would indicate pending heals. It does contain trusted.afr.dirty, but the values are zeroes:

(gdb) p local->replies[0].xdata->members->next->key
$9 = 0x7f214008f800 "trusted.afr.dirty"
(gdb) p /x *local->replies[0].xdata->members->next->value->data@12
$10 = {0x0 <repeats 12 times>}

(gdb)  p local->replies[1].xdata->members->key
$11 = 0x7f213401af10 "glusterfs.content"
(gdb) p /x *local->replies[1].xdata->members->next->value->data@12
$12 = {0x0 <repeats 12 times>}

So as far as AFR is concerned, both copies were good (op_ret was also zero from both its children in discover_cbk).

Regarding your query in comment #8: AFR deletes the GF_CONTENT_KEY value in the lookup code path (which is name-based) if there are pending heals. The discover code path (which is gfid-based) did not do this. I have now added it as part of https://review.gluster.org/#/c/17673, which fixes some other bugs in the lookup code path.

Comment 15 Nithya Balachandran 2017-08-29 14:47:25 UTC
RCA:

(gdb) bt
#0  0x00007f2152cebaeb in __memcpy_sse2 () from /lib64/libc.so.6
#1  0x00007f2145b9b824 in memcpy (__len=<optimized out>, __src=<optimized out>, __dest=<optimized out>) at /usr/include/bits/string3.h:51
#2  qr_content_extract (xdata=xdata@entry=0x7f21340658c0) at quick-read.c:279
#3  0x00007f2145b9bdaa in qr_lookup_cbk (frame=0x7f212c05fff0, cookie=<optimized out>, this=0x7f214001e4b0, op_ret=0, op_errno=117, inode_ret=0x7f212c05bb60, 
    buf=0x7f212c002b58, xdata=0x7f21340658c0, postparent=0x7f212c002d88) at quick-read.c:423
#4  0x00007f2145da7de7 in ioc_lookup_cbk (frame=0x7f212c057f00, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, 
    inode=0x7f212c05bb60, stbuf=0x7f212c002b58, xdata=0x7f21340658c0, postparent=0x7f212c002d88) at io-cache.c:267
#5  0x00007f2146614627 in dht_discover_complete (this=this@entry=0x7f2140016f40, discover_frame=discover_frame@entry=0x7f212c061500) at dht-common.c:577
#6  0x00007f214661534a in dht_discover_cbk (frame=0x7f212c061500, cookie=0x7f21400135b0, this=0x7f2140016f40, op_ret=<optimized out>, op_errno=117, 
    inode=0x7f212c05bb60, stbuf=0x7f212c05e648, xattr=0x7f21340658c0, postparent=0x7f212c05e6b8) at dht-common.c:700
#7  0x00007f21468c4d10 in afr_discover_done (this=<optimized out>, frame=0x7f212c0546c0) at afr-common.c:2624
#8  afr_discover_cbk (frame=frame@entry=0x7f212c0546c0, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, 
    inode=inode@entry=0x7f212c05bb60, buf=buf@entry=0x7f2147537920, xdata=0x7f21400adca0, postparent=postparent@entry=0x7f2147537990) at afr-common.c:2669
#9  0x00007f2146b012c7 in client3_3_lookup_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f212c061380)
    at client-rpc-fops.c:2947
#10 0x00007f21543f2840 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f2140080400, pollin=pollin@entry=0x7f21400ee830) at rpc-clnt.c:794
#11 0x00007f21543f2b27 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f2140080430, event=<optimized out>, data=0x7f21400ee830) at rpc-clnt.c:987
#12 0x00007f21543ee9e3 in rpc_transport_notify (this=this@entry=0x7f21400805a0, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f21400ee830)
    at rpc-transport.c:538
#13 0x00007f2148fef3d6 in socket_event_poll_in (this=this@entry=0x7f21400805a0, notify_handled=<optimized out>) at socket.c:2306
#14 0x00007f2148ff197c in socket_event_handler (fd=13, idx=3, gen=1, data=0x7f21400805a0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2458
#15 0x00007f21546841e6 in event_dispatch_epoll_handler (event=0x7f2147537e80, event_pool=0x56504ff3cee0) at event-epoll.c:572
#16 event_dispatch_epoll_worker (data=0x56504ff95d60) at event-epoll.c:648
#17 0x00007f2153488e25 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f2152d5534d in clone () from /lib64/libc.so.6




(gdb) f 6
#6  0x00007f214661534a in dht_discover_cbk (frame=0x7f212c061500, cookie=0x7f21400135b0, this=0x7f2140016f40, op_ret=<optimized out>, op_errno=117, 
    inode=0x7f212c05bb60, stbuf=0x7f212c05e648, xattr=0x7f21340658c0, postparent=0x7f212c05e6b8) at dht-common.c:700
700	                dht_discover_complete (this, frame);
(gdb) l
695	         * which calls STACK_DESTROY (frame). In the case of "attempt_unwind",
696	         * this makes sure that the thread don't call dht_frame_return, till
697	         * call to dht_discover_complete is done.
698	         */
699	        if (attempt_unwind) {
700	                dht_discover_complete (this, frame);  <--- Here
701	        }
702	
703	        this_call_cnt = dht_frame_return (frame);
704


dht_discover_cbk has found a file, so it attempts to unwind before all subvols have responded; it is therefore possible that some subvols have not yet replied.




Checking the frame for dht_discover_cbk:

gdb) f 6
#6  0x00007f214661534a in dht_discover_cbk (frame=0x7f212c061500, cookie=0x7f21400135b0, this=0x7f2140016f40, op_ret=<optimized out>, op_errno=117, 
    inode=0x7f212c05bb60, stbuf=0x7f212c05e648, xattr=0x7f21340658c0, postparent=0x7f212c05e6b8) at dht-common.c:700
700	                dht_discover_complete (this, frame);
(gdb) p local->file_count
$6 = 3

Strangely, a file with this gfid was found on _all_ 3 children. As local->file_count does not distinguish between data files and linkto files, it seems there was a stale (?) linkto file on at least one subvol.

So at this point all 3 subvols have at least unwound to dht_discover_cbk. 

(gdb) p local->dir_count
$7 = 0

(gdb) p is_linkfile
$8 = <optimized out>
(gdb) p local->cached_subvol
$9 = (xlator_t *) 0x7f2140014d20
(gdb) p local->hashed_subvol
$10 = (xlator_t *) 0x0
(gdb) p *local->cached_subvol
$11 = {name = 0x7f21400158c0 "Slave-replicate-1", type = 0x7f2140015a40 "cluster/replicate", instance_name = 0x0, next = 0x7f21400135b0, prev = 0x7f2140015e30, 
  parents = 0x7f2140018630, children = 0x7f2140015d10, options = 0x7f2140015910, dlhandle = 0x7f2140014320, fops = 0x7f2146ad91a0 <fops>, 
  cbks = 0x7f2146ad90e0 <cbks>, dumpops = 0x7f2146ad9140 <dumpops>, volume_options = {next = 0x7f2140015b40, prev = 0x7f2140015b40}, 
  fini = 0x7f21468c6ac0 <fini>, init = 0x7f21468c5e10 <init>, reconfigure = 0x7f21468c5740 <reconfigure>, mem_acct_init = 0x7f21468c5670 <mem_acct_init>, 
  notify = 0x7f21468c5610 <notify>, loglevel = GF_LOG_NONE, latencies = {{min = 0, max = 0, total = 0, std = 0, mean = 0, count = 0} <repeats 55 times>}, 
  history = 0x0, ctx = 0x56504ff06010, graph = 0x7f2140003f30, itable = 0x7f214005e0a0, init_succeeded = 1 '\001', private = 0x7f214005db00, 
  mem_acct = 0x7f214005a0b0, winds = 0, switched = 0 '\000', local_pool = 0x7f21548f2ab8 <pools+280>, is_autoloaded = _gf_false, volfile_id = 0x0}
(gdb) p *prev
$12 = {name = 0x7f2140012740 "Slave-replicate-0", type = 0x7f2140014260 "cluster/replicate", instance_name = 0x0, next = 0x7f21400118a0, prev = 0x7f2140014d20, 
  parents = 0x7f2140018480, children = 0x7f2140014c00, options = 0x7f2140014130, dlhandle = 0x7f2140014320, fops = 0x7f2146ad91a0 <fops>, 
  cbks = 0x7f2146ad90e0 <cbks>, dumpops = 0x7f2146ad9140 <dumpops>, volume_options = {next = 0x7f2140014a30, prev = 0x7f2140014a30}, 
  fini = 0x7f21468c6ac0 <fini>, init = 0x7f21468c5e10 <init>, reconfigure = 0x7f21468c5740 <reconfigure>, mem_acct_init = 0x7f21468c5670 <mem_acct_init>, 
  notify = 0x7f21468c5610 <notify>, loglevel = GF_LOG_NONE, latencies = {{min = 0, max = 0, total = 0, std = 0, mean = 0, count = 0} <repeats 55 times>}, 
  history = 0x0, ctx = 0x56504ff06010, graph = 0x7f2140003f30, itable = 0x7f2140065080, init_succeeded = 1 '\001', private = 0x7f2140064ae0, 
  mem_acct = 0x7f2140061090, winds = 0, switched = 0 '\000', local_pool = 0x7f21548f2ab8 <pools+280>, is_autoloaded = _gf_false, volfile_id = 0x0}


As local->cached_subvol does not match prev, it is clear that local->cached_subvol was updated after this thread proceeded to unwind to the higher xlators. This means a data file was found on at least 2 subvols; this could be because a linkto file was in the process of being created when the lookup hit the brick.




In dht_discover_complete () :

        local = discover_frame->local;   
        layout = local->layout;
        conf = this->private;

        LOCK(&discover_frame->lock);
        {
                main_frame = local->main_frame;
                local->main_frame = NULL;
        }
        UNLOCK(&discover_frame->lock);

        if (!main_frame)
                return 0;    <--- So far so good - only one thread will unwind here.

However, discover_frame->local is valid and _still_ accessible to other threads unwinding to dht_discover_cbk.




In dht_discover_cbk:
...

        local = frame->local;

...
        LOCK (&frame->lock);
        {

...
                if (is_dir) {
                        local->dir_count ++;
                } else {
                        local->file_count ++;

                        if (!is_linkfile) {
                                /* real file */
                                local->cached_subvol = prev;
                                attempt_unwind = 1;
                        } else {
                                goto unlock;
                        }
                }


...

                if (local->xattr == NULL) {
                        local->xattr = dict_ref (xattr); <-- say thread 1 (the one that crashed) did this
                } else {
                        dht_aggregate_xattr (local->xattr, xattr);   <---  and thread 2 called this (this should _only_ be done for directories.)
                }


In dht_aggregate_xattr () -> dht_aggregate ():

        ret = dict_set (dst, key, value);
        if (ret) {
                gf_msg ("dht", GF_LOG_WARNING, 0,
                        DHT_MSG_DICT_SET_FAILED,
                        "Failed to set dictionary value: key = %s",
                        key);
        }



dict_set _replaces_ the value for the key, which unrefs the old data, causing its ref-count to drop to 0 and destroying it. If quick-read had fetched the old data before it was deleted, it crashes when it later tries to access it.

So here is probably what happened:

Time    subvol-0                                        subvol-1

t0      dht_discover_cbk
        local->xattr = dict_ref (dict0);
        dict_set (dict0, GF_CONTENT_KEY, data0)
        attempt_unwind = 1

t1      dht_discover_complete

t2      qr_content_extract gets data0

t3                                                      dht_discover_cbk
                                                        dht_aggregate_xattr (local->xattr, xattr)
                                                          |-> dict_set (dict0, GF_CONTENT_KEY, data1)
                                                                |-> data_unref (data0)  <-- ref-count hits 0, data0 is destroyed
                                                        attempt_unwind = 1

t4      qr_content_extract accesses data0 and crashes
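
Given this timeline, here is a sketch of the guard implied by the inline note above that the aggregation "should _only_ be done for directories": aggregate xattrs in dht_discover_cbk only when the entry is a directory, so an xattr dict that was already unwound for a file is never modified (and its data_t values never unref'd) behind the reader's back. This is only an illustration of the idea, not necessarily the patch that was merged.

                if (local->xattr == NULL) {
                        local->xattr = dict_ref (xattr);
                } else if (is_dir) {
                        /* aggregate across subvols only for directories;
                         * for files, keep the xattr from the subvol that
                         * was already chosen so values handed to upper
                         * xlators stay referenced */
                        dht_aggregate_xattr (local->xattr, xattr);
                }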

Comment 20 errata-xmlrpc 2018-09-04 06:35:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

