Bug 1399100

Summary: GlusterFS client crashes during remove-brick operation
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: distribute
Version: rhgs-3.2
Severity: high
Priority: unspecified
Status: CLOSED ERRATA
Type: Bug
Reporter: Prasad Desala <tdesala>
Assignee: Raghavendra G <rgowdapp>
QA Contact: Prasad Desala <tdesala>
CC: amukherj, nbalacha, rhinduja, rhs-bugs, storage-qa-internal
Target Release: RHGS 3.2.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-3.8.4-8
Doc Type: No Doc Update
Clone Of:
: 1399134 (view as bug list)
Bug Blocks: 1351528, 1399134, 1399422, 1399423, 1399424
Last Closed: 2017-03-23 05:51:36 UTC

Description Prasad Desala 2016-11-28 09:44:11 UTC
Description of problem:
=======================
The GlusterFS client crashed, generating the backtrace below, when the steps listed under "Steps to Reproduce" were performed.

(gdb) bt
#0  0x00007fcb2c58660b in transit_state_mb (pstate=<optimized out>, pstate=<optimized out>, mctx=0x7fcb20be5470) at regexec.c:2530
#1  transit_state (state=0x7fcb1ec9c770, mctx=0x7fcb20be5470, err=0x7fcb20be5420) at regexec.c:2285
#2  check_matching (p_match_first=0x7fcb20be5410, fl_longest_match=1, mctx=0x7fcb20be5470) at regexec.c:1171
#3  re_search_internal (preg=preg@entry=0x7fcb14071ba8, string=string@entry=0x7fcae6b9f138 "..", length=2, start=<optimized out>, start@entry=0, range=0, stop=<optimized out>, 
    nmatch=<optimized out>, pmatch=0x7fcb20be55d0, eflags=0) at regexec.c:842
#4  0x00007fcb2c58c1f5 in __regexec (preg=0x7fcb14071ba8, string=0x7fcae6b9f138 "..", nmatch=<optimized out>, pmatch=0x7fcb20be55d0, eflags=<optimized out>) at regexec.c:250
#5  0x00007fcb1bb288c9 in dht_munge_name (original=original@entry=0x7fcae6b9f138 "..", modified=modified@entry=0x7fcb20be5640 ".", len=len@entry=3, re=re@entry=0x7fcb14071ba8)
    at dht-hashfn.c:49
#6  0x00007fcb1bb28ace in dht_hash_compute (this=this@entry=0x7fcad57aec20, type=0, name=name@entry=0x7fcae6b9f138 "..", hash_p=hash_p@entry=0x7fcb20be56f4) at dht-hashfn.c:86
#7  0x00007fcb1bb08c56 in dht_layout_search (this=0x7fcad57aec20, layout=0x7fcb1c4ebe30, name=0x7fcae6b9f138 "..") at dht-layout.c:166
#8  0x00007fcb1bb311bb in dht_readdirp_cbk (frame=0x7fcb2b8e24e4, cookie=0x7fcb2b8e0714, this=0x7fcad57aec20, op_ret=2, op_errno=2, orig_entries=0x7fcb20be58f0, xdata=0x0)
    at dht-common.c:4780
#9  0x00007fcb1bd8f97c in afr_readdir_cbk (frame=<optimized out>, cookie=<optimized out>, this=<optimized out>, op_ret=2, op_errno=2, subvol_entries=<optimized out>, xdata=0x0)
    at afr-dir-read.c:234
#10 0x00007fcb201ae7a1 in client3_3_readdirp_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7fcb2b8de028) at client-rpc-fops.c:2650
#11 0x00007fcb2dbcd680 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fcb15a900d0, pollin=pollin@entry=0x7fcb1ee30a20) at rpc-clnt.c:791
#12 0x00007fcb2dbcd95f in rpc_clnt_notify (trans=<optimized out>, mydata=0x7fcb15a90100, event=<optimized out>, data=0x7fcb1ee30a20) at rpc-clnt.c:962
#13 0x00007fcb2dbc9883 in rpc_transport_notify (this=this@entry=0x7fcad6c09480, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fcb1ee30a20) at rpc-transport.c:537
#14 0x00007fcb2248eec4 in socket_event_poll_in (this=this@entry=0x7fcad6c09480) at socket.c:2267
#15 0x00007fcb22491375 in socket_event_handler (fd=<optimized out>, idx=46, data=0x7fcad6c09480, poll_in=1, poll_out=0, poll_err=0) at socket.c:2397
#16 0x00007fcb2de5d3b0 in event_dispatch_epoll_handler (event=0x7fcb20be5e80, event_pool=0x7fcb30174f00) at event-epoll.c:571
#17 event_dispatch_epoll_worker (data=0x7fcb301bf8a0) at event-epoll.c:674
#18 0x00007fcb2cc64dc5 in start_thread (arg=0x7fcb20be6700) at pthread_create.c:308
#19 0x00007fcb2c5a973d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
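(For reference, a backtrace like the one above can typically be pulled from the client core dump with gdb; the binary path and core file name here are assumptions, not taken from this report:)

	gdb /usr/sbin/glusterfs /path/to/core      # glusterfs is the FUSE client process
	(gdb) bt                                   # backtrace of the crashing thread
	(gdb) thread apply all bt                  # backtraces of all threads, if needed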

Version-Release number of selected component (if applicable):
3.8.4-5.el7rhgs.x86_64

Steps to Reproduce:
===================
1) Create a Distributed-Replicate volume and start it.
2) FUSE mount the volume on multiple clients.
3) From one client, start creating a big file, and from the other clients run continuous lookups (find, stat *, ls -lRt):
	 "dd if=/dev/urandom of=BIG bs=1024k count=10000"
4) With step 3 still running, identify the bricks on which the file is actually stored and remove those bricks (a command-level sketch follows this list).
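
A minimal command-level sketch of these steps, assuming a 3x2 volume named "distrep" (the name seen in the mount log below); hostnames, brick paths, and the mount point are hypothetical, not taken from this report:

	# On a server node: create and start a Distributed-Replicate (3x2) volume; brick paths are hypothetical
	gluster volume create distrep replica 2 server{1..6}:/bricks/brick0/distrep
	gluster volume start distrep

	# On each client: FUSE mount the volume
	mount -t glusterfs server1:/distrep /mnt/distrep

	# Client 1: create a big file
	cd /mnt/distrep && dd if=/dev/urandom of=BIG bs=1024k count=10000

	# Other clients: continuous lookups while the dd is still running
	cd /mnt/distrep && while true; do find . >/dev/null; stat * >/dev/null; ls -lRt >/dev/null; done

	# Identify the bricks holding the file (pathinfo xattr, queried on a client mount),
	# then decommission them; bricks must be removed a full replica set at a time
	getfattr -n trusted.glusterfs.pathinfo /mnt/distrep/BIG
	gluster volume remove-brick distrep <bricks-holding-BIG> start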

The client running the continuous "find" commands crashed.
The volume got unmounted, and "Transport endpoint is not connected" errors were seen as below:

find: failed to restore initial working directory: Transport endpoint is not connected
find: ‘.’: Transport endpoint is not connected
find: failed to restore initial working directory: Transport endpoint is not connected
find: ‘.’: Transport endpoint is not connected

FUSE mount logs:
================
[2016-11-28 07:14:47.325198] I [MSGID: 109086] [dht-shared.c:297:dht_parse_decommissioned_bricks] 25-distrep-dht: decommissioning subvolume distrep-replicate-2
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(READDIRP)
frame : type(1) op(OPENDIR)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2016-11-28 07:14:47
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7fcb2de03bd2]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7fcb2de0d654]
/lib64/libc.so.6(+0x35250)[0x7fcb2c4e7250]
/lib64/libc.so.6(+0xd460b)[0x7fcb2c58660b]
/lib64/libc.so.6(regexec+0xc5)[0x7fcb2c58c1f5]
/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x258c9)[0x7fcb1bb288c9]
/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x25ace)[0x7fcb1bb28ace]
/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x5c56)[0x7fcb1bb08c56]
/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x2e1bb)[0x7fcb1bb311bb]
/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x697c)[0x7fcb1bd8f97c]
/usr/lib64/glusterfs/3.8.4/xlator/protocol/client.so(+0x207a1)[0x7fcb201ae7a1]
/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7fcb2dbcd680]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1df)[0x7fcb2dbcd95f]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fcb2dbc9883]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x6ec4)[0x7fcb2248eec4]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x9375)[0x7fcb22491375]
/lib64/libglusterfs.so.0(+0x833b0)[0x7fcb2de5d3b0]
/lib64/libpthread.so.0(+0x7dc5)[0x7fcb2cc64dc5]
/lib64/libc.so.6(clone+0x6d)[0x7fcb2c5a973d]
---------

Actual results:
===============
Client crashed.

Expected results:
=================
There should not be any crashes.

Comment 5 Atin Mukherjee 2016-11-28 11:35:08 UTC
upstream patch http://review.gluster.org/15945 posted for review.

Comment 13 Prasad Desala 2016-12-29 12:28:13 UTC
Repeated the steps in the description three times with glusterfs version 3.8.4-10.el7rhgs.x86_64, and no client crashes were seen.

Hence, moving this BZ to verified.
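
As a quick sanity check that no crash occurred, something like the following can be run on a client after repeating the steps (log and core paths are assumptions based on default locations):

	rpm -q glusterfs-fuse                                            # confirm the client package version under test
	ls /core* 2>/dev/null                                            # no new core dumps from the glusterfs client
	grep -i "signal received" /var/log/glusterfs/mnt-distrep.log     # no new crash markers in the FUSE mount log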

Comment 15 errata-xmlrpc 2017-03-23 05:51:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html