Bug 1408621

Summary: rebalance operation triggered by remove-brick failed on one of the cluster nodes
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Byreddy <bsrirama>
Component: disperse
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED CURRENTRELEASE
QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: high
Docs Contact:
Priority: medium
Version: rhgs-3.2
CC: amukherj, aspandey, pkarampu, rhs-bugs, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1417535 (view as bug list)
Environment:
Last Closed: 2018-11-09 10:58:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1417535    
Bug Blocks:    

Description Byreddy 2016-12-26 04:30:28 UTC
Description of problem:
=======================
Rebalance operation triggered by remove-brick failed on one of the cluster nodes.

rebalance warning and error messages:
-------------------------------------
[2016-12-23 07:05:52.568409] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 0-Disperse1-client-12: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-12-23 07:05:52.569163] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 0-Disperse1-client-12: Connected to Disperse1-client-12, attached to remote volume '/bricks/brick2/a0'.
[2016-12-23 07:05:52.569189] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 0-Disperse1-client-12: Server and Client lk-version numbers are not same, reopening the fds
[2016-12-23 07:05:52.570742] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-Disperse1-client-12: Server lk version = 1
[2016-12-23 07:05:55.203018] W [MSGID: 114010] [client-callback.c:28:client_cbk_fetchspec] 0-Disperse1-client-4: this function should not be called
[2016-12-23 07:06:01.767154] W [MSGID: 114010] [client-callback.c:28:client_cbk_fetchspec] 0-Disperse1-client-4: this function should not be called
[2016-12-23 07:06:01.992148] W [MSGID: 109073] [dht-common.c:8753:dht_notify] 0-Disperse1-dht: Received CHILD_DOWN. Exiting
The message "W [MSGID: 109073] [dht-common.c:8753:dht_notify] 0-Disperse1-dht: Received CHILD_DOWN. Exiting" repeated 2 times between [2016-12-23 07:06:01.992148] and [2016-12-23 07:06:02.992415]
[2016-12-23 07:06:02.997440] E [MSGID: 109027] [dht-rebalance.c:3696:gf_defrag_start_crawl] 0-Disperse1-dht: Failed to start rebalance: look up on / failed
[2016-12-23 07:06:02.997723] I [MSGID: 109028] [dht-rebalance.c:4126:gf_defrag_status_get] 0-Disperse1-dht: Rebalance is failed. Time taken is 0.00 secs
[2016-12-23 07:06:02.997747] I [MSGID: 109028] [dht-rebalance.c:4130:gf_defrag_status_get] 0-Disperse1-dht: Files migrated: 0, size: 0, lookups: 0, failures: 0, skipped: 0
[2016-12-23 07:06:02.997986] W [glusterfsd.c:1288:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x3c14607aa1) [0x7fd0233aaaa1] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7fd0247bc3f5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x76) [0x7fd0247bbee6] ) 0-: received signum (15), shutting down




Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.8.4-9.el6rhs.x86_64


How reproducible:
=================
2/3


Steps to Reproduce:
====================
1. Have a 6-node cluster.
2. Create a 2 * (4+2) volume and fuse mount it.
3. Keep writing data at the mount point.  // untar the Linux kernel source
4. Add one more subvolume to make it 3 * (4+2).
5. Once the untar is over, remove the last added subvolume.  // during this step, rebalance failed on one of the nodes; see the command sketch below
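
A rough command sketch of these steps (hostnames node1..node6, brick paths, the mount point, and the tarball name are illustrative assumptions, not taken from the report; only the volume name Disperse1 and the /bricks/brick*/a0 path pattern appear in the logs above):

# Step 2: create a 2 x (4+2) distributed-dispersed volume across 6 nodes and fuse mount it
gluster volume create Disperse1 disperse 6 redundancy 2 \
    node{1..6}:/bricks/brick1/a0 node{1..6}:/bricks/brick2/a0
gluster volume start Disperse1
mount -t glusterfs node1:/Disperse1 /mnt/disperse1

# Step 3: keep writing data on the mount, e.g. untar a kernel source tree
tar -xf linux-kernel.tar.xz -C /mnt/disperse1 &

# Step 4: add one more (4+2) subvolume to make it 3 x (4+2)
gluster volume add-brick Disperse1 node{1..6}:/bricks/brick3/a0

# Step 5: once the untar is over, remove the subvolume added in step 4;
# this starts the remove-brick rebalance that failed on one node
gluster volume remove-brick Disperse1 node{1..6}:/bricks/brick3/a0 start
gluster volume remove-brick Disperse1 node{1..6}:/bricks/brick3/a0 status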


Actual results:
===============
Rebalance operation triggered by remove-brick failed on one of the cluster nodes.
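
One way to confirm which node the rebalance failed on (a sketch; the brick list matches the example above, and /var/log/glusterfs/<volname>-rebalance.log is the usual location of the rebalance process log, not a path quoted from this report):

gluster volume remove-brick Disperse1 node{1..6}:/bricks/brick3/a0 status
# on the node whose status line shows "failed", inspect the rebalance process log
less /var/log/glusterfs/Disperse1-rebalance.log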

Expected results:
=================
Rebalance should start without issue when bricks containing data are removed.

Additional info:
================
This issue is not always reproducible. The live setup was shown to one of the DHT team members to get some idea about the issue.

and 

This issue was found while testing the -9 build of 3.2.0.

Comment 15 Nithya Balachandran 2017-01-24 08:41:01 UTC
This needs information from the EC team. Ashish/Pranith, could one of you take a look at why the CHILD_DOWN events were sent?

Thanks,
Nithya