Bug 1095671

Summary: dist-geo-rep: while doing rm -rf on master mount, slave glusterfs crashed in io-cache.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Vijaykumar Koppad <vkoppad>
Component: geo-replication
Assignee: Raghavendra G <rgowdapp>
Status: CLOSED ERRATA
QA Contact: Bhaskar Bandari <bbandari>
Severity: high
Docs Contact:
Priority: urgent
Version: 2.1
CC: aavati, bbandari, csaba, david.macdonald, nlevinki, nsathyan, rmainz, shaines, ssamanta, vagarwal, vbellur, vumrao
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: 3.4.0.61rhs
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1095888 1126369 (view as bug list)
Environment:
Last Closed: 2014-09-22 19:36:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1095888, 1126369

Description Vijaykumar Koppad 2014-05-08 11:12:17 UTC
Description of problem: While doing rm -rf on the master mount, the slave glusterfs process crashed in io-cache. Once this crash happens, the slave glusterfs process crashes again every time the geo-replication worker respawns.

bt in log-file 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-05-08 10:48:21.638483] W [fuse-bridge.c:1628:fuse_err_cbk] 0-glusterfs-fuse: 6: MKDIR() /level00 => -1 (File exists)
[2014-05-08 10:48:21.641337] E [dht-helper.c:1144:dht_inode_ctx_get] (-->/usr/lib64/glusterfs/3.4.0.60rhs/xlator/cluster/distribute.so(dht_lookup_selfheal_cbk+0x1d6) [0x7f0c9ce7aab6] (-->/usr/lib64/glusterfs/3.4.0.60rhs/xlator/cluster/distribute.so(dht_layout_set+0x4e) [0x7f0c9ce6260e] (-->/usr/lib64/glusterfs/3.4.0.60rhs/xlator/cluster/distribute.so(dht_inode_ctx_layout_get+0x1b) [0x7f0c9ce73a8b]))) 0-slave-dht: invalid argument: inode
[2014-05-08 10:48:21.641404] E [dht-helper.c:1144:dht_inode_ctx_get] (-->/usr/lib64/glusterfs/3.4.0.60rhs/xlator/cluster/distribute.so(dht_lookup_selfheal_cbk+0x1d6) [0x7f0c9ce7aab6] (-->/usr/lib64/glusterfs/3.4.0.60rhs/xlator/cluster/distribute.so(dht_layout_set+0x63) [0x7f0c9ce62623] (-->/usr/lib64/glusterfs/3.4.0.60rhs/xlator/cluster/distribute.so(dht_inode_ctx_layout_set+0x34) [0x7f0c9ce62de4]))) 0-slave-dht: invalid argument: inode
[2014-05-08 10:48:21.641452] E [dht-helper.c:1163:dht_inode_ctx_set] (-->/usr/lib64/glusterfs/3.4.0.60rhs/xlator/cluster/distribute.so(dht_lookup_selfheal_cbk+0x1d6) [0x7f0c9ce7aab6] (-->/usr/lib64/glusterfs/3.4.0.60rhs/xlator/cluster/distribute.so(dht_layout_set+0x63) [0x7f0c9ce62623] (-->/usr/lib64/glusterfs/3.4.0.60rhs/xlator/cluster/distribute.so(dht_inode_ctx_layout_set+0x52) [0x7f0c9ce62e02]))) 0-slave-dht: invalid argument: inode
pending frames:
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
frame : type(0) op(0)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2014-05-08 10:48:21
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.60rhs
/lib64/libc.so.6[0x3584a329a0]
/lib64/libpthread.so.0(pthread_spin_lock+0x0)[0x358520c380]
/usr/lib64/glusterfs/3.4.0.60rhs/xlator/performance/io-cache.so(ioc_lookup_cbk+0x87)[0x7f0c9c837e07]
/usr/lib64/glusterfs/3.4.0.60rhs/xlator/cluster/distribute.so(dht_lookup_selfheal_cbk+0x17b)[0x7f0c9ce7aa5b]
/usr/lib64/glusterfs/3.4.0.60rhs/xlator/cluster/distribute.so(dht_selfheal_dir_finish+0x20)[0x7f0c9ce6ae60]
/usr/lib64/glusterfs/3.4.0.60rhs/xlator/cluster/distribute.so(dht_selfheal_directory_for_nameless_lookup+0x3ff)[0x7f0c9ce6c71f]
/usr/lib64/glusterfs/3.4.0.60rhs/xlator/cluster/distribute.so(dht_discover_cbk+0x273)[0x7f0c9ce896a3]
/usr/lib64/glusterfs/3.4.0.60rhs/xlator/cluster/replicate.so(afr_lookup_cbk+0x558)[0x7f0c9d0ffb58]
/usr/lib64/glusterfs/3.4.0.60rhs/xlator/protocol/client.so(client3_3_lookup_cbk+0x633)[0x7f0c9d33ca33]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)[0x7f0ca1b8bf45]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x147)[0x7f0ca1b8d507]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7f0ca1b88d88]
/usr/lib64/glusterfs/3.4.0.60rhs/rpc-transport/socket.so(+0x8dc6)[0x7f0c9e178dc6]
/usr/lib64/glusterfs/3.4.0.60rhs/rpc-transport/socket.so(+0xa6dd)[0x7f0c9e17a6dd]
/usr/lib64/libglusterfs.so.0(+0x62457)[0x7f0ca1df8457]
/usr/sbin/glusterfs(main+0x6c7)[0x4069d7]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3584a1ed1d]
/usr/sbin/glusterfs[0x404619]

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
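For illustration only, here is a minimal standalone C sketch of the failure mode visible in the trace: the io-cache lookup callback takes a spinlock embedded in the inode object, so being invoked with a NULL inode (hinted at by the "invalid argument: inode" DHT errors above) faults inside pthread_spin_lock, which is exactly where the backtrace ends. The names demo_inode_t and demo_lookup_cbk are hypothetical; this is not GlusterFS source.

/* Minimal standalone sketch of the failure mode; compile with
 * "gcc demo.c -o demo -lpthread". Hypothetical names, not GlusterFS code. */
#include <pthread.h>
#include <stddef.h>

typedef struct {
        pthread_spinlock_t lock;    /* analogous to inode->lock used by io-cache */
        /* ... cached attributes would live here ... */
} demo_inode_t;

static void
demo_lookup_cbk (demo_inode_t *inode, int op_ret)
{
        /* io-cache's lookup callback takes LOCK (&inode->lock) at this point.
         * If the lookup is unwound with a NULL inode, &inode->lock is computed
         * from a NULL pointer and pthread_spin_lock faults with SIGSEGV. */
        pthread_spin_lock (&inode->lock);
        /* ... cache update would happen under the lock ... */
        pthread_spin_unlock (&inode->lock);
}

int
main (void)
{
        demo_lookup_cbk (NULL, 0);  /* crashes, mirroring the reported backtrace */
        return 0;
}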



Version-Release number of selected component (if applicable): glusterfs-3.4.0.60rhs-1.el6rhs.x86_64


How reproducible: Happened once. 


Steps to Reproduce:
1. Create and start a geo-rep session between the master and the slave.
2. Create around 100K files on the master spread over a 10x10 directory tree, for example with crefi: "crefi T 10 -n 100 --multi -b 10 -d 10 --random --min=1K --max=10K /mnt/master"
3. Let the files sync to the slave.
4. Run rm -rf on the master mount point.

Actual results: Some of the files failed to be removed from the slave, and the slave glusterfs process crashed.


Expected results: File removal on the slave should not fail, and the slave glusterfs process should not crash.


Additional info:

bt from core
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Core was generated by `/usr/sbin/glusterfs --aux-gfid-mount --log-file=/var/log/glusterfs/geo-replicat'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000358520c380 in pthread_spin_lock () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-15.el6_5.1.x86_64 libcom_err-1.41.12-18.el6.x86_64 libgcc-4.4.7-4.el6.x86_64 libselinux-2.0.94-5.3.el6_4.1.x86_64 openssl-1.0.1e-16.el6_5.4.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x000000358520c380 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00007f71a302de07 in ioc_lookup_cbk (frame=0x7f71a73ea4e0, cookie=<value optimized out>, this=0x1a9afd0, op_ret=0, op_errno=2, inode=0x0, stbuf=0x7f71a28fc0c4, xdata=0x0, postparent=0x7f71a28fc2f4) at io-cache.c:207
#2  0x00007f71a3670a5b in dht_lookup_selfheal_cbk (frame=0x7f71a73ea02c, cookie=<value optimized out>, this=<value optimized out>, op_ret=<value optimized out>, op_errno=<value optimized out>, xdata=<value optimized out>)
    at dht-common.c:141
#3  0x00007f71a3660e60 in dht_selfheal_dir_finish (frame=<value optimized out>, this=<value optimized out>, ret=<value optimized out>) at dht-selfheal.c:72
#4  0x00007f71a366271f in dht_selfheal_dir_xattr_for_nameless_lookup (frame=0x7f71a73ea02c, dir_cbk=<value optimized out>, loc=<value optimized out>, layout=0x7f7194002240) at dht-selfheal.c:416
#5  dht_selfheal_directory_for_nameless_lookup (frame=0x7f71a73ea02c, dir_cbk=<value optimized out>, loc=<value optimized out>, layout=0x7f7194002240) at dht-selfheal.c:1174
#6  0x00007f71a367f6a3 in dht_discover_cbk (frame=0x7f71a721fc18, cookie=0x7f71a73ea6e4, this=0x1a97c60, op_ret=<value optimized out>, op_errno=2, inode=0x0, stbuf=0x7f71a22e885c, xattr=0x0, postparent=0x7f71a22e88cc) at dht-common.c:341
#7  0x00007f71a38f5b58 in afr_lookup_done (frame=0x7f71a73ea6e4, cookie=0x1, this=0x1a971d0, op_ret=<value optimized out>, op_errno=2, inode=0x7f71a1391164, buf=0x7fff787383e0, xattr=0x0, postparent=0x7fff78738370) at afr-common.c:2220
#8  afr_lookup_cbk (frame=0x7f71a73ea6e4, cookie=0x1, this=0x1a971d0, op_ret=<value optimized out>, op_errno=2, inode=0x7f71a1391164, buf=0x7fff787383e0, xattr=0x0, postparent=0x7fff78738370) at afr-common.c:2451
#9  0x00007f71a3b32a33 in client3_3_lookup_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, myframe=0x7f71a73ea83c) at client-rpc-fops.c:2610
#10 0x00007f71a8381f45 in rpc_clnt_handle_reply (clnt=0x1ac1ee0, pollin=0x1a8c600) at rpc-clnt.c:773
#11 0x00007f71a8383507 in rpc_clnt_notify (trans=<value optimized out>, mydata=0x1ac1f10, event=<value optimized out>, data=<value optimized out>) at rpc-clnt.c:906
#12 0x00007f71a837ed88 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:512
#13 0x00007f71a496edc6 in socket_event_poll_in (this=0x1ad1970) at socket.c:2119
#14 0x00007f71a49706dd in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x1ad1970, poll_in=1, poll_out=0, poll_err=0) at socket.c:2229
#15 0x00007f71a85ee457 in event_dispatch_epoll_handler (event_pool=0x1a52ee0) at event-epoll.c:384
#16 event_dispatch_epoll (event_pool=0x1a52ee0) at event-epoll.c:445
#17 0x00000000004069d7 in main (argc=7, argv=0x7fff7873a088) at glusterfsd.c:2050
(gdb) f o
No symbol "o" in current context.
(gdb) f 0
#0  0x000000358520c380 in pthread_spin_lock () from /lib64/libpthread.so.0
(gdb) f 1
#1  0x00007f71a302de07 in ioc_lookup_cbk (frame=0x7f71a73ea4e0, cookie=<value optimized out>, this=0x1a9afd0, op_ret=0, op_errno=2, inode=0x0, stbuf=0x7f71a28fc0c4, xdata=0x0, postparent=0x7f71a28fc2f4) at io-cache.c:207
207             LOCK (&inode->lock);
(gdb) f 2
#2  0x00007f71a3670a5b in dht_lookup_selfheal_cbk (frame=0x7f71a73ea02c, cookie=<value optimized out>, this=<value optimized out>, op_ret=<value optimized out>, op_errno=<value optimized out>, xdata=<value optimized out>)
    at dht-common.c:141
141             DHT_STACK_UNWIND (lookup, frame, ret, local->op_errno, local->inode,
(gdb) f 3
#3  0x00007f71a3660e60 in dht_selfheal_dir_finish (frame=<value optimized out>, this=<value optimized out>, ret=<value optimized out>) at dht-selfheal.c:72
72              local->selfheal.dir_cbk (frame, NULL, frame->this, ret,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
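Frames #1-#3 show the nameless-lookup self-heal path (dht_selfheal_dir_finish -> dht_lookup_selfheal_cbk) unwinding with op_ret=0 while local->inode is NULL, and ioc_lookup_cbk then taking LOCK (&inode->lock) on that NULL inode. As an illustration only, a guarded variant of the hypothetical demo_lookup_cbk from the earlier sketch shows the shape of check that would avoid the dereference; the actual fix shipped in 3.4.0.61rhs may instead correct the DHT unwind path so a NULL inode is never reported as success.

/* Hypothetical guarded variant of demo_lookup_cbk from the sketch above;
 * not the actual fix, which may differ (e.g. by making the DHT self-heal
 * path unwind with an error instead of a NULL inode). */
static void
demo_lookup_cbk_guarded (demo_inode_t *inode, int op_ret)
{
        if (op_ret != 0 || inode == NULL)
                return;             /* nothing usable returned; never touch inode->lock */

        pthread_spin_lock (&inode->lock);
        /* ... cache update under the lock ... */
        pthread_spin_unlock (&inode->lock);
}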

Comment 4 Vijay Bellur 2014-06-16 05:44:29 UTC
Removing blocker+ and rhs-3.0.0+ as this bug is not relevant to RHS 3.0.

Comment 6 Vijaykumar Koppad 2014-08-15 18:21:15 UTC
Verified on build 3.4.0.65rhs.

Comment 10 errata-xmlrpc 2014-09-22 19:36:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html