Bug 1278355 - Tier: volume FUSE client crashes when running a find during attach tier
Status: ASSIGNED
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: tier
Version: 3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assigned To: Nithya Balachandran
QA Contact: nchilaka
Whiteboard: tier-attach-detach
Keywords: ZStream
Depends On: 1263532
Blocks: 1260923
Reported: 2015-11-05 05:27 EST by nchilaka
Modified: 2017-03-25 12:25 EDT (History)
CC List: 6 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1263532
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description nchilaka 2015-11-05 05:27:47 EST
+++ This bug was initially created as a clone of Bug #1263532 +++

Description of problem:

The FUSE client process for a tiered volume crashes on concurrent attach-tier and find operations

Version-Release number of selected component (if applicable):


How reproducible:
Intermittent, but fairly common

Steps to Reproduce:
1. Create a distribute-replicate volume
2. FUSE-mount the volume and untar a large tarball (I used a tar of the glusterfs source code)
3. While the untar is in progress, attach a dist-rep hot tier of 4 bricks
4. Run find <mnt-path>


Actual results:
The FUSE client process will crash

Expected results:
The client process should not crash

Additional info:

--- Additional comment from Nithya Balachandran on 2015-09-16 02:32:30 EDT ---

Back trace:

#0  0x00000033aa609420 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007f0797aa6472 in gf_log_set_log_buf_size (buf_size=0) at logging.c:254
#2  0x00007f0797aa6989 in gf_log_disable_suppression_before_exit (ctx=0x21d4010) at logging.c:426
#3  0x00007f0797ac725f in gf_print_trace (signum=11, ctx=0x21d4010) at common-utils.c:579
#4  0x000000000040976e in glusterfsd_print_trace (signum=11) at glusterfsd.c:2021
#5  <signal handler called>
#6  0x0000000000000000 in ?? ()
#7  0x00007f078c32579c in dht_selfheal_dir_setattr (frame=0x7f0796921ca4, loc=0x7f07803e1fbc, stbuf=0x7f07803e204c, valid=-1, 
    layout=0x7f07883a3810) at dht-selfheal.c:1078
#8  0x00007f078c325e73 in dht_selfheal_dir_mkdir (frame=0x7f0796921ca4, loc=0x7f07803e1fbc, layout=0x7f07883a3810, force=0)
    at dht-selfheal.c:1209
#9  0x00007f078c327a1c in dht_selfheal_directory (frame=0x7f0796921ca4, dir_cbk=0x7f078c334b8c <dht_lookup_selfheal_cbk>, 
    loc=0x7f07803e1fbc, layout=0x7f07883a3810) at dht-selfheal.c:1823
#10 0x00007f078c337044 in dht_lookup_dir_cbk (frame=0x7f0796921ca4, cookie=0x7f0796923ee8, this=0x7f07803c5750, op_ret=-1, 
    op_errno=116, inode=0x0, stbuf=0x7f0780571564, xattr=0x0, postparent=0x7f0780571794) at dht-common.c:665
#11 0x00007f078c337310 in dht_lookup_dir_cbk (frame=0x7f0796923ee8, cookie=0x7f079692087c, this=0x7f07803c49b0, op_ret=-1, 
    op_errno=116, inode=0x0, stbuf=0x0, xattr=0x0, postparent=0x0) at dht-common.c:655
#12 0x00007f078c5ea4ce in afr_lookup_do (frame=0x7f079692087c, this=0x7f07803c3c20, err=-116) at afr-common.c:2315
#13 0x00007f078c5e538c in afr_inode_refresh_done (frame=0x7f079692087c, this=0x7f07803c3c20) at afr-common.c:839
#14 0x00007f078c5e55e5 in afr_inode_refresh_subvol_cbk (frame=0x7f079692087c, cookie=0x1, this=0x7f07803c3c20, op_ret=-1, 
    op_errno=116, inode=0x7f076abc7628, buf=0x7f0785e6c820, xdata=0x0, par=0x7f0785e6c7b0) at afr-common.c:869
#15 0x00007f078c8358d3 in client3_3_lookup_cbk (req=0x7f07814570ac, iov=0x7f07814570ec, count=1, myframe=0x7f0796920d30)
    at client-rpc-fops.c:2978
#16 0x00007f079786f5e3 in rpc_clnt_handle_reply (clnt=0x7f0781456e20, pollin=0x7f078002aab0) at rpc-clnt.c:766
#17 0x00007f079786fa81 in rpc_clnt_notify (trans=0x7f07814b0200, mydata=0x7f0781456e50, event=RPC_TRANSPORT_MSG_RECEIVED, 
    data=0x7f078002aab0) at rpc-clnt.c:907
#18 0x00007f079786bbaf in rpc_transport_notify (this=0x7f07814b0200, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f078002aab0)
    at rpc-transport.c:544
#19 0x00007f078d67edb5 in socket_event_poll_in (this=0x7f07814b0200) at socket.c:2236
#20 0x00007f078d67f30b in socket_event_handler (fd=16, idx=6, data=0x7f07814b0200, poll_in=1, poll_out=0, poll_err=0)


(gdb) f 7
#7  0x00007f078c32579c in dht_selfheal_dir_setattr (frame=0x7f0796921ca4, loc=0x7f07803e1fbc, stbuf=0x7f07803e204c, valid=-1, 
    layout=0x7f07883a3810) at dht-selfheal.c:1078
1078	                        STACK_WIND (frame, dht_selfheal_dir_setattr_cbk,
(gdb) p layout
$10 = (dht_layout_t *) 0x7f07883a3810
(gdb) p *layout
$11 = {spread_cnt = 1937076852, cnt = 778331508, preset = 1937075303, commit_hash = 1718773108, gen = 1751395955, 
  type = -267583372, ref = 47789, search_unhashed = _gf_false, list = 0x7f07883a3810}
(gdb) 


The layout structure is corrupt. Why it is corrupt needs further analysis.

--- Additional comment from Nithya Balachandran on 2015-10-07 07:33:41 EDT ---

The issue is consistently reproducible and appears to be a problem with the frames. Several cores have shown an incorrect frame for the tier-dht layer (frame->this != the tier-dht xlator), as well as local != frame->local. Further analysis is required to determine why.

--- Additional comment from Nithya Balachandran on 2015-10-19 06:06:45 EDT ---

The issue is not reproducible with http://review.gluster.org/#/c/12184/. However, an issue exposed by the earlier codepath remains, and I am keeping this BZ open to track it.

This should not hold up tier QE.
