Bug 1278355

Summary: Tier: volume FUSE client crashes when running a find during attach tier
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Nag Pavan Chilakam <nchilaka>
Component: tier
Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED WONTFIX
QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: medium
Docs Contact:
Priority: unspecified
Version: rhgs-3.1
CC: hgowtham, nbalacha, rcyriac, rhs-bugs, sankarshan, smohan
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: tier-attach-detach
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1263532
Environment:
Last Closed: 2018-11-08 18:57:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1263532    
Bug Blocks: 1260923    

Description Nag Pavan Chilakam 2015-11-05 10:27:47 UTC
+++ This bug was initially created as a clone of Bug #1263532 +++

Description of problem:

The FUSE client process for a tiered volume crashes on concurrent attach-tier and find operations

Version-Release number of selected component (if applicable):


How reproducible:
Random but pretty common

Steps to Reproduce:
1. Create a distribute-replicate volume
2. FUSE-mount the volume and untar a large archive (I used a tar of the glusterfs source code)
3. While the untar is in progress, attach a dist-rep hot tier of 4 bricks
4. Run find <mnt-path>


Actual results:
The FUSE client process will crash

Expected results:
The client process should not crash

Additional info:

--- Additional comment from Nithya Balachandran on 2015-09-16 02:32:30 EDT ---

Back trace:

#0  0x00000033aa609420 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007f0797aa6472 in gf_log_set_log_buf_size (buf_size=0) at logging.c:254
#2  0x00007f0797aa6989 in gf_log_disable_suppression_before_exit (ctx=0x21d4010) at logging.c:426
#3  0x00007f0797ac725f in gf_print_trace (signum=11, ctx=0x21d4010) at common-utils.c:579
#4  0x000000000040976e in glusterfsd_print_trace (signum=11) at glusterfsd.c:2021
#5  <signal handler called>
#6  0x0000000000000000 in ?? ()
#7  0x00007f078c32579c in dht_selfheal_dir_setattr (frame=0x7f0796921ca4, loc=0x7f07803e1fbc, stbuf=0x7f07803e204c, valid=-1, 
    layout=0x7f07883a3810) at dht-selfheal.c:1078
#8  0x00007f078c325e73 in dht_selfheal_dir_mkdir (frame=0x7f0796921ca4, loc=0x7f07803e1fbc, layout=0x7f07883a3810, force=0)
    at dht-selfheal.c:1209
#9  0x00007f078c327a1c in dht_selfheal_directory (frame=0x7f0796921ca4, dir_cbk=0x7f078c334b8c <dht_lookup_selfheal_cbk>, 
    loc=0x7f07803e1fbc, layout=0x7f07883a3810) at dht-selfheal.c:1823
#10 0x00007f078c337044 in dht_lookup_dir_cbk (frame=0x7f0796921ca4, cookie=0x7f0796923ee8, this=0x7f07803c5750, op_ret=-1, 
    op_errno=116, inode=0x0, stbuf=0x7f0780571564, xattr=0x0, postparent=0x7f0780571794) at dht-common.c:665
#11 0x00007f078c337310 in dht_lookup_dir_cbk (frame=0x7f0796923ee8, cookie=0x7f079692087c, this=0x7f07803c49b0, op_ret=-1, 
    op_errno=116, inode=0x0, stbuf=0x0, xattr=0x0, postparent=0x0) at dht-common.c:655
#12 0x00007f078c5ea4ce in afr_lookup_do (frame=0x7f079692087c, this=0x7f07803c3c20, err=-116) at afr-common.c:2315
#13 0x00007f078c5e538c in afr_inode_refresh_done (frame=0x7f079692087c, this=0x7f07803c3c20) at afr-common.c:839
#14 0x00007f078c5e55e5 in afr_inode_refresh_subvol_cbk (frame=0x7f079692087c, cookie=0x1, this=0x7f07803c3c20, op_ret=-1, 
    op_errno=116, inode=0x7f076abc7628, buf=0x7f0785e6c820, xdata=0x0, par=0x7f0785e6c7b0) at afr-common.c:869
#15 0x00007f078c8358d3 in client3_3_lookup_cbk (req=0x7f07814570ac, iov=0x7f07814570ec, count=1, myframe=0x7f0796920d30)
    at client-rpc-fops.c:2978
#16 0x00007f079786f5e3 in rpc_clnt_handle_reply (clnt=0x7f0781456e20, pollin=0x7f078002aab0) at rpc-clnt.c:766
#17 0x00007f079786fa81 in rpc_clnt_notify (trans=0x7f07814b0200, mydata=0x7f0781456e50, event=RPC_TRANSPORT_MSG_RECEIVED, 
    data=0x7f078002aab0) at rpc-clnt.c:907
#18 0x00007f079786bbaf in rpc_transport_notify (this=0x7f07814b0200, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f078002aab0)
    at rpc-transport.c:544
#19 0x00007f078d67edb5 in socket_event_poll_in (this=0x7f07814b0200) at socket.c:2236
#20 0x00007f078d67f30b in socket_event_handler (fd=16, idx=6, data=0x7f07814b0200, poll_in=1, poll_out=0, poll_err=0)


(gdb) f 7
#7  0x00007f078c32579c in dht_selfheal_dir_setattr (frame=0x7f0796921ca4, loc=0x7f07803e1fbc, stbuf=0x7f07803e204c, valid=-1, 
    layout=0x7f07883a3810) at dht-selfheal.c:1078
1078	                        STACK_WIND (frame, dht_selfheal_dir_setattr_cbk,
(gdb) p layout
$10 = (dht_layout_t *) 0x7f07883a3810
(gdb) p *layout
$11 = {spread_cnt = 1937076852, cnt = 778331508, preset = 1937075303, commit_hash = 1718773108, gen = 1751395955, 
  type = -267583372, ref = 47789, search_unhashed = _gf_false, list = 0x7f07883a3810}
(gdb) 


The layout structure is corrupt. Why it is corrupt needs further analysis.
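
For reference, the integer field values printed in the $11 dump above decode to consecutive ASCII bytes on this little-endian x86-64 host. Below is a minimal, standalone decode sketch; the values are copied verbatim from the dump, and it assumes the fields are adjacent 32-bit values with no padding, as in dht_layout_t:

#include <ctype.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int
main (void)
{
        /* spread_cnt, cnt, preset, commit_hash, gen, type from the $11 dump */
        int32_t fields[] = { 1937076852, 778331508, 1937075303,
                             1718773108, 1751395955, -267583372 };
        unsigned char bytes[sizeof (fields)];

        /* reinterpret the consecutive 32-bit fields as raw bytes */
        memcpy (bytes, fields, sizeof (fields));
        for (size_t i = 0; i < sizeof (bytes); i++)
                putchar (isprint (bytes[i]) ? bytes[i] : '.');
        putchar ('\n');
        return 0;
}

This prints "trusted.glusterfs.dht..." (the trailing bytes are a NUL terminator and two non-printable bytes), which looks like the DHT layout xattr name and suggests the layout memory may have been overwritten by string data.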

--- Additional comment from Nithya Balachandran on 2015-10-07 07:33:41 EDT ---

The issue is consistently reproducible and appears to be a problem with the frames. Several cores have shown an incorrect frame at the tier-dht layer (where frame->this != the tier-dht xlator), as well as local != frame->local. Why this happens requires further analysis.
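
Illustratively, the two invariants those cores violate are that frame->this should be the xlator (here tier-dht) that wound the call the callback is answering, and frame->local should be the local allocated for that frame. The following is a simplified, standalone sketch of those checks; the structs are stand-ins, not the actual glusterfs call_frame_t/xlator_t, and the names are hypothetical:

#include <stdio.h>

typedef struct xlator {
        const char *name;                /* e.g. "<volname>-tier-dht" */
} xlator_t;

typedef struct call_frame {
        xlator_t *this;                  /* xlator that wound/owns this frame */
        void     *local;                 /* per-frame state (dht_local_t in DHT) */
} call_frame_t;

/* Returns 0 when the frame matches what the callback expects. */
static int
frame_is_consistent (call_frame_t *frame, xlator_t *expected_this,
                     void *expected_local)
{
        if (frame->this != expected_this)
                return -1;
        if (frame->local != expected_local)
                return -1;
        return 0;
}

int
main (void)
{
        xlator_t     tier_dht = { "vol-tier-dht" };
        xlator_t     dht      = { "vol-dht" };
        int          local    = 0;
        call_frame_t frame    = { &dht, &local };  /* wrong owner, as seen in the cores */

        printf ("frame consistent: %s\n",
                frame_is_consistent (&frame, &tier_dht, &local) == 0 ? "yes" : "no");
        return 0;
}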

--- Additional comment from Nithya Balachandran on 2015-10-19 06:06:45 EDT ---

The issue is not reproducible with http://review.gluster.org/#/c/12184/. However, an issue was exposed with the earlier code path, and I am keeping this BZ open to track it.

This should not hold up tier QE.

Comment 7 hari gowtham 2018-11-08 18:57:35 UTC
As tier is not being actively developed, I'm closing this bug. Feel free to reopen it if necessary.