Bug 1263532 - Tier: volume FUSE client crashes when running a find during attach tier
Product: GlusterFS
Classification: Community
Component: tiering
Assigned To: Nithya Balachandran
Depends On:
Blocks: 1278355
Reported: 2015-09-16 02:02 EDT by Nithya Balachandran
Modified: 2016-02-17 19:07 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1278355 (view as bug list)
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---

Attachments: None
Description Nithya Balachandran 2015-09-16 02:02:15 EDT
Description of problem:

The FUSE client process for a tiered volume crashes when an attach-tier and a find operation run concurrently.

Version-Release number of selected component (if applicable):

How reproducible:
Intermittent, but reproduces fairly often

Steps to Reproduce:
1. Create a distribute-replicate volume
2. Fuse mount the volume and untar a large tarball (I used a tar of the glusterfs source code)
3. While the untar is in progress, attach a distribute-replicate hot tier of 4 bricks
4. Run find <mnt-path>

Actual results:
The FUSE client process will crash

Expected results:
The client process should not crash

Additional info:
Comment 1 Nithya Balachandran 2015-09-16 02:32:30 EDT
Back trace:

#0  0x00000033aa609420 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007f0797aa6472 in gf_log_set_log_buf_size (buf_size=0) at logging.c:254
#2  0x00007f0797aa6989 in gf_log_disable_suppression_before_exit (ctx=0x21d4010) at logging.c:426
#3  0x00007f0797ac725f in gf_print_trace (signum=11, ctx=0x21d4010) at common-utils.c:579
#4  0x000000000040976e in glusterfsd_print_trace (signum=11) at glusterfsd.c:2021
#5  <signal handler called>
#6  0x0000000000000000 in ?? ()
#7  0x00007f078c32579c in dht_selfheal_dir_setattr (frame=0x7f0796921ca4, loc=0x7f07803e1fbc, stbuf=0x7f07803e204c, valid=-1, 
    layout=0x7f07883a3810) at dht-selfheal.c:1078
#8  0x00007f078c325e73 in dht_selfheal_dir_mkdir (frame=0x7f0796921ca4, loc=0x7f07803e1fbc, layout=0x7f07883a3810, force=0)
    at dht-selfheal.c:1209
#9  0x00007f078c327a1c in dht_selfheal_directory (frame=0x7f0796921ca4, dir_cbk=0x7f078c334b8c <dht_lookup_selfheal_cbk>, 
    loc=0x7f07803e1fbc, layout=0x7f07883a3810) at dht-selfheal.c:1823
#10 0x00007f078c337044 in dht_lookup_dir_cbk (frame=0x7f0796921ca4, cookie=0x7f0796923ee8, this=0x7f07803c5750, op_ret=-1, 
    op_errno=116, inode=0x0, stbuf=0x7f0780571564, xattr=0x0, postparent=0x7f0780571794) at dht-common.c:665
#11 0x00007f078c337310 in dht_lookup_dir_cbk (frame=0x7f0796923ee8, cookie=0x7f079692087c, this=0x7f07803c49b0, op_ret=-1, 
    op_errno=116, inode=0x0, stbuf=0x0, xattr=0x0, postparent=0x0) at dht-common.c:655
#12 0x00007f078c5ea4ce in afr_lookup_do (frame=0x7f079692087c, this=0x7f07803c3c20, err=-116) at afr-common.c:2315
#13 0x00007f078c5e538c in afr_inode_refresh_done (frame=0x7f079692087c, this=0x7f07803c3c20) at afr-common.c:839
#14 0x00007f078c5e55e5 in afr_inode_refresh_subvol_cbk (frame=0x7f079692087c, cookie=0x1, this=0x7f07803c3c20, op_ret=-1, 
    op_errno=116, inode=0x7f076abc7628, buf=0x7f0785e6c820, xdata=0x0, par=0x7f0785e6c7b0) at afr-common.c:869
#15 0x00007f078c8358d3 in client3_3_lookup_cbk (req=0x7f07814570ac, iov=0x7f07814570ec, count=1, myframe=0x7f0796920d30)
    at client-rpc-fops.c:2978
#16 0x00007f079786f5e3 in rpc_clnt_handle_reply (clnt=0x7f0781456e20, pollin=0x7f078002aab0) at rpc-clnt.c:766
#17 0x00007f079786fa81 in rpc_clnt_notify (trans=0x7f07814b0200, mydata=0x7f0781456e50, event=RPC_TRANSPORT_MSG_RECEIVED, 
    data=0x7f078002aab0) at rpc-clnt.c:907
#18 0x00007f079786bbaf in rpc_transport_notify (this=0x7f07814b0200, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f078002aab0)
    at rpc-transport.c:544
#19 0x00007f078d67edb5 in socket_event_poll_in (this=0x7f07814b0200) at socket.c:2236
#20 0x00007f078d67f30b in socket_event_handler (fd=16, idx=6, data=0x7f07814b0200, poll_in=1, poll_out=0, poll_err=0)

(gdb) f 7
#7  0x00007f078c32579c in dht_selfheal_dir_setattr (frame=0x7f0796921ca4, loc=0x7f07803e1fbc, stbuf=0x7f07803e204c, valid=-1, 
    layout=0x7f07883a3810) at dht-selfheal.c:1078
1078	                        STACK_WIND (frame, dht_selfheal_dir_setattr_cbk,
(gdb) p layout
$10 = (dht_layout_t *) 0x7f07883a3810
(gdb) p *layout
$11 = {spread_cnt = 1937076852, cnt = 778331508, preset = 1937075303, commit_hash = 1718773108, gen = 1751395955, 
  type = -267583372, ref = 47789, search_unhashed = _gf_false, list = 0x7f07883a3810}

The layout structure is corrupt. Why it is corrupt needs further analysis.
Comment 2 Nithya Balachandran 2015-10-07 07:33:41 EDT
The issue is consistently reproducible and appears to be a problem with the frames. Several cores have shown an incorrect frame for the tier-dht layer (frame->this != tier-dht-xlator) as well as local != frame->local. Why this happens requires further analysis.
Comment 3 Nithya Balachandran 2015-10-19 06:06:45 EDT
The issue is not reproducible with http://review.gluster.org/#/c/12184/. However, the earlier codepath exposed a real issue, and I am keeping this BZ open to track it.

This should not hold up tier QE.
