Bug 1399100 - GlusterFS client crashes during remove-brick operation
Summary: GlusterFS client crashes during remove-brick operation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: rhgs-3.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Raghavendra G
QA Contact: Prasad Desala
URL:
Whiteboard:
Depends On:
Blocks: 1351528 1399134 1399422 1399423 1399424
 
Reported: 2016-11-28 09:44 UTC by Prasad Desala
Modified: 2017-03-23 05:51 UTC (History)
CC List: 5 users

Fixed In Version: glusterfs-3.8.4-8
Doc Type: No Doc Update
Doc Text:
Clone Of:
Cloned To: 1399134
Environment:
Last Closed: 2017-03-23 05:51:36 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:0486 0 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 09:18:45 UTC

Description Prasad Desala 2016-11-28 09:44:11 UTC
Description of problem:
=======================
The GlusterFS client crashed with the backtrace below when the steps listed under "Steps to Reproduce" were performed.

(gdb) bt
#0  0x00007fcb2c58660b in transit_state_mb (pstate=<optimized out>, pstate=<optimized out>, mctx=0x7fcb20be5470) at regexec.c:2530
#1  transit_state (state=0x7fcb1ec9c770, mctx=0x7fcb20be5470, err=0x7fcb20be5420) at regexec.c:2285
#2  check_matching (p_match_first=0x7fcb20be5410, fl_longest_match=1, mctx=0x7fcb20be5470) at regexec.c:1171
#3  re_search_internal (preg=preg@entry=0x7fcb14071ba8, string=string@entry=0x7fcae6b9f138 "..", length=2, start=<optimized out>, start@entry=0, range=0, stop=<optimized out>, 
    nmatch=<optimized out>, pmatch=0x7fcb20be55d0, eflags=0) at regexec.c:842
#4  0x00007fcb2c58c1f5 in __regexec (preg=0x7fcb14071ba8, string=0x7fcae6b9f138 "..", nmatch=<optimized out>, pmatch=0x7fcb20be55d0, eflags=<optimized out>) at regexec.c:250
#5  0x00007fcb1bb288c9 in dht_munge_name (original=original@entry=0x7fcae6b9f138 "..", modified=modified@entry=0x7fcb20be5640 ".", len=len@entry=3, re=re@entry=0x7fcb14071ba8)
    at dht-hashfn.c:49
#6  0x00007fcb1bb28ace in dht_hash_compute (this=this@entry=0x7fcad57aec20, type=0, name=name@entry=0x7fcae6b9f138 "..", hash_p=hash_p@entry=0x7fcb20be56f4) at dht-hashfn.c:86
#7  0x00007fcb1bb08c56 in dht_layout_search (this=0x7fcad57aec20, layout=0x7fcb1c4ebe30, name=0x7fcae6b9f138 "..") at dht-layout.c:166
#8  0x00007fcb1bb311bb in dht_readdirp_cbk (frame=0x7fcb2b8e24e4, cookie=0x7fcb2b8e0714, this=0x7fcad57aec20, op_ret=2, op_errno=2, orig_entries=0x7fcb20be58f0, xdata=0x0)
    at dht-common.c:4780
#9  0x00007fcb1bd8f97c in afr_readdir_cbk (frame=<optimized out>, cookie=<optimized out>, this=<optimized out>, op_ret=2, op_errno=2, subvol_entries=<optimized out>, xdata=0x0)
    at afr-dir-read.c:234
#10 0x00007fcb201ae7a1 in client3_3_readdirp_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7fcb2b8de028) at client-rpc-fops.c:2650
#11 0x00007fcb2dbcd680 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fcb15a900d0, pollin=pollin@entry=0x7fcb1ee30a20) at rpc-clnt.c:791
#12 0x00007fcb2dbcd95f in rpc_clnt_notify (trans=<optimized out>, mydata=0x7fcb15a90100, event=<optimized out>, data=0x7fcb1ee30a20) at rpc-clnt.c:962
#13 0x00007fcb2dbc9883 in rpc_transport_notify (this=this@entry=0x7fcad6c09480, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fcb1ee30a20) at rpc-transport.c:537
#14 0x00007fcb2248eec4 in socket_event_poll_in (this=this@entry=0x7fcad6c09480) at socket.c:2267
#15 0x00007fcb22491375 in socket_event_handler (fd=<optimized out>, idx=46, data=0x7fcad6c09480, poll_in=1, poll_out=0, poll_err=0) at socket.c:2397
#16 0x00007fcb2de5d3b0 in event_dispatch_epoll_handler (event=0x7fcb20be5e80, event_pool=0x7fcb30174f00) at event-epoll.c:571
#17 event_dispatch_epoll_worker (data=0x7fcb301bf8a0) at event-epoll.c:674
#18 0x00007fcb2cc64dc5 in start_thread (arg=0x7fcb20be6700) at pthread_create.c:308
#19 0x00007fcb2c5a973d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
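
For reference, a backtrace like the one above can be regenerated from the client core dump with gdb. The core path below is a placeholder (the actual location depends on the system's core_pattern/abrt setup), and the matching debuginfo packages need to be installed for the frames to resolve to source lines:

    debuginfo-install glusterfs-fuse glibc        # install matching debuginfo (RHEL 7, yum-utils)
    gdb /usr/sbin/glusterfs /var/spool/abrt/ccpp-*/coredump    # placeholder core path
    (gdb) bt                      # backtrace of the crashing thread
    (gdb) thread apply all bt     # backtraces of all threads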

Version-Release number of selected component (if applicable):
3.8.4-5.el7rhgs.x86_64

Steps to Reproduce:
===================
1) Create a Distributed-Replicate volume and start it.
2) FUSE mount the volume on multiple clients.
3) Start creating a big file from one client and run continuous lookups (find, stat *, ls -lRt) from the other clients:
	 "dd if=/dev/urandom of=BIG bs=1024k count=10000"
4) With step 3 still running, identify which bricks actually store the file and remove those bricks. A shell sketch of these steps is shown below.
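
A minimal shell sketch of steps 1-4, assuming placeholder volume, server, brick, and mount-point names (not the exact ones used in this report):

    # 1) Create and start a Distributed-Replicate volume (placeholder names).
    gluster volume create distrep replica 2 \
        server{1..4}:/bricks/brick0/distrep server{1..4}:/bricks/brick1/distrep
    gluster volume start distrep

    # 2) FUSE mount the volume on each client.
    mount -t glusterfs server1:/distrep /mnt/distrep

    # 3) From one client, write a big file; from the other clients, run continuous lookups.
    dd if=/dev/urandom of=/mnt/distrep/BIG bs=1024k count=10000 &
    while true; do find /mnt/distrep >/dev/null; stat /mnt/distrep/* >/dev/null; ls -lRt /mnt/distrep >/dev/null; done &

    # 4) While step 3 is running, find the replica pair that actually holds the file
    #    and remove those bricks (brick paths come from the pathinfo output).
    getfattr -n trusted.glusterfs.pathinfo /mnt/distrep/BIG
    gluster volume remove-brick distrep <brick1-from-pathinfo> <brick2-from-pathinfo> start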

The client running the continuous "find" commands crashed. The volume was unmounted, and "Transport endpoint is not connected" errors were seen as below:

find: failed to restore initial working directory: Transport endpoint is not connected
find: ‘.’: Transport endpoint is not connected
find: failed to restore initial working directory: Transport endpoint is not connected
find: ‘.’: Transport endpoint is not connected

FUSE mount logs:
================
[2016-11-28 07:14:47.325198] I [MSGID: 109086] [dht-shared.c:297:dht_parse_decommissioned_bricks] 25-distrep-dht: decommissioning subvolume distrep-replicate-2
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(READDIRP)
frame : type(1) op(OPENDIR)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2016-11-28 07:14:47
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7fcb2de03bd2]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7fcb2de0d654]
/lib64/libc.so.6(+0x35250)[0x7fcb2c4e7250]
/lib64/libc.so.6(+0xd460b)[0x7fcb2c58660b]
/lib64/libc.so.6(regexec+0xc5)[0x7fcb2c58c1f5]
/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x258c9)[0x7fcb1bb288c9]
/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x25ace)[0x7fcb1bb28ace]
/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x5c56)[0x7fcb1bb08c56]
/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x2e1bb)[0x7fcb1bb311bb]
/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x697c)[0x7fcb1bd8f97c]
/usr/lib64/glusterfs/3.8.4/xlator/protocol/client.so(+0x207a1)[0x7fcb201ae7a1]
/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7fcb2dbcd680]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1df)[0x7fcb2dbcd95f]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fcb2dbc9883]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x6ec4)[0x7fcb2248eec4]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x9375)[0x7fcb22491375]
/lib64/libglusterfs.so.0(+0x833b0)[0x7fcb2de5d3b0]
/lib64/libpthread.so.0(+0x7dc5)[0x7fcb2cc64dc5]
/lib64/libc.so.6(clone+0x6d)[0x7fcb2c5a973d]
---------
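
The "+offset" frames in this log backtrace can be resolved to function names and source lines with addr2line, assuming the matching glusterfs debuginfo package is installed. For example, the distribute.so offsets are expected to map to the same dht functions shown in the gdb backtrace in the description:

    addr2line -f -e /usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so \
        0x258c9 0x25ace 0x5c56 0x2e1bb
    # Expected: dht_munge_name, dht_hash_compute, dht_layout_search, dht_readdirp_cbk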

Actual results:
===============
Client crashed.

Expected results:
=================
There should not be any crashes.

Comment 5 Atin Mukherjee 2016-11-28 11:35:08 UTC
upstream patch http://review.gluster.org/15945 posted for review.

Comment 13 Prasad Desala 2016-12-29 12:28:13 UTC
Repeated the steps in the description three times with glusterfs version 3.8.4-10.el7rhgs.x86_64 and no client crashes were seen.

Hence, moving this BZ to Verified.

Comment 15 errata-xmlrpc 2017-03-23 05:51:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

