Bug 1417535

Summary:	rebalance operation triggered by remove-brick failed on one of the cluster nodes
Product:	[Community] GlusterFS	Reporter:	Ashish Pandey <aspandey>
Component:	disperse	Assignee:	bugs <bugs>
Status:	CLOSED WORKSFORME	QA Contact:
Severity:	high	Docs Contact:
Priority:	medium
Version:	mainline	CC:	aspandey, atumball, bugs, Carlos.Hung, nbalacha, pkarampu, rhs-bugs, tdesala
Target Milestone:	---	Keywords:	Triaged
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	glusterfs-6.x	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1408621	Environment:
Last Closed:	2019-05-11 09:54:59 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1408621

Comment 1 Ashish Pandey 2017-01-30 07:03:56 UTC

Description of problem:
=======================
The rebalance operation triggered by remove-brick failed on one of the cluster nodes.

rebalance warning and error messages:
-------------------------------------
[2016-12-23 07:05:52.568409] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 0-Disperse1-client-12: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-12-23 07:05:52.569163] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 0-Disperse1-client-12: Connected to Disperse1-client-12, attached to remote volume '/bricks/brick2/a0'.
[2016-12-23 07:05:52.569189] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 0-Disperse1-client-12: Server and Client lk-version numbers are not same, reopening the fds
[2016-12-23 07:05:52.570742] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-Disperse1-client-12: Server lk version = 1
[2016-12-23 07:05:55.203018] W [MSGID: 114010] [client-callback.c:28:client_cbk_fetchspec] 0-Disperse1-client-4: this function should not be called
[2016-12-23 07:06:01.767154] W [MSGID: 114010] [client-callback.c:28:client_cbk_fetchspec] 0-Disperse1-client-4: this function should not be called
[2016-12-23 07:06:01.992148] W [MSGID: 109073] [dht-common.c:8753:dht_notify] 0-Disperse1-dht: Received CHILD_DOWN. Exiting
The message "W [MSGID: 109073] [dht-common.c:8753:dht_notify] 0-Disperse1-dht: Received CHILD_DOWN. Exiting" repeated 2 times between [2016-12-23 07:06:01.992148] and [2016-12-23 07:06:02.992415]
[2016-12-23 07:06:02.997440] E [MSGID: 109027] [dht-rebalance.c:3696:gf_defrag_start_crawl] 0-Disperse1-dht: Failed to start rebalance: look up on / failed
[2016-12-23 07:06:02.997723] I [MSGID: 109028] [dht-rebalance.c:4126:gf_defrag_status_get] 0-Disperse1-dht: Rebalance is failed. Time taken is 0.00 secs
[2016-12-23 07:06:02.997747] I [MSGID: 109028] [dht-rebalance.c:4130:gf_defrag_status_get] 0-Disperse1-dht: Files migrated: 0, size: 0, lookups: 0, failures: 0, skipped: 0
[2016-12-23 07:06:02.997986] W [glusterfsd.c:1288:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x3c14607aa1) [0x7fd0233aaaa1] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7fd0247bc3f5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x76) [0x7fd0247bbee6] ) 0-: received signum (15), shutting down
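
When triaging a rebalance log like the excerpt above, it can help to filter out the informational lines and keep only warnings and errors. The following is a minimal sketch, assuming the line layout seen in this excerpt (timestamp, one-letter level, optional MSGID, source location, message). The names LOG_LINE, parse_line and warnings_and_errors are hypothetical helpers and not part of GlusterFS.

```python
import re

# Assumed layout, taken from the excerpt in this report:
# [timestamp] LEVEL [MSGID: n] [file:line:function] message
# The MSGID field is optional (the cleanup_and_exit line has none).
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"(?:\[MSGID: (?P<msgid>\d+)\] )?"
    r"\[(?P<src>[^\]]+)\] (?P<rest>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def warnings_and_errors(lines):
    """Keep only the records whose level is W (warning) or E (error)."""
    records = (parse_line(line) for line in lines)
    return [rec for rec in records if rec and rec["level"] in ("W", "E")]


if __name__ == "__main__":
    sample = [
        "[2016-12-23 07:05:52.570742] I [MSGID: 114035] "
        "[client-handshake.c:201:client_set_lk_version_cbk] "
        "0-Disperse1-client-12: Server lk version = 1",
        "[2016-12-23 07:06:01.992148] W [MSGID: 109073] "
        "[dht-common.c:8753:dht_notify] 0-Disperse1-dht: "
        "Received CHILD_DOWN. Exiting",
        "[2016-12-23 07:06:02.997440] E [MSGID: 109027] "
        "[dht-rebalance.c:3696:gf_defrag_start_crawl] 0-Disperse1-dht: "
        "Failed to start rebalance: look up on / failed",
    ]
    for rec in warnings_and_errors(sample):
        print(rec["ts"], rec["level"], rec["msgid"], rec["rest"])
```

Lines that do not follow this layout, such as the 'The message "..." repeated 2 times' summary line, are skipped rather than parsed.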




Version-Release number of selected component (if applicable):
=============================================================



How reproducible:
=================
2/3


Steps to Reproduce:
====================
1. Have a 6-node cluster.
2. Create a 2 * (4+2) volume and FUSE-mount it.
3. Keep writing data at the mount point. // untar the Linux kernel
4. Add one more subvolume to make it 3 * (4+2).
5. Once the untar is over, remove the last added subvolume. // during this step, rebalance failed on one of the nodes
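
The volume-level steps above can be sketched as a list of gluster CLI invocations. This is only an illustration under stated assumptions: the host names, the one-brick-per-node layout, and the brick paths under /bricks/brick2 are hypothetical (only the volume name Disperse1 and the path '/bricks/brick2/a0' appear in the log). The helper names brick_list and reproduce_commands are also made up for this sketch. The mount and untar steps are omitted, and the commands are only built as strings, not executed.

```python
def brick_list(hosts, subvol_index, brick_root="/bricks/brick2"):
    """One brick per host for a given subvolume (hypothetical path layout)."""
    return [f"{host}:{brick_root}/a{subvol_index}" for host in hosts]


def reproduce_commands(hosts, volname="Disperse1", data=4, redundancy=2):
    """Build the create / add-brick / remove-brick commands for a (4+2) setup."""
    width = data + redundancy  # bricks per disperse subvolume: 4 + 2 = 6
    if len(hosts) != width:
        raise ValueError(f"expected {width} hosts, got {len(hosts)}")

    # 2 * (4+2): two subvolumes of six bricks each, 12 bricks in total.
    initial = brick_list(hosts, 0) + brick_list(hosts, 1)
    # Third subvolume, added later to reach 3 * (4+2) and then removed again.
    added = brick_list(hosts, 2)

    return [
        f"gluster volume create {volname} disperse {width} "
        f"redundancy {redundancy} " + " ".join(initial),
        f"gluster volume start {volname}",
        f"gluster volume add-brick {volname} " + " ".join(added),
        # 'start' begins migrating data off the bricks; this is the step
        # whose rebalance process failed in this report.
        f"gluster volume remove-brick {volname} " + " ".join(added) + " start",
    ]


if __name__ == "__main__":
    nodes = [f"node{i}" for i in range(1, 7)]  # hypothetical host names
    for command in reproduce_commands(nodes):
        print(command)
```

Printing the commands instead of running them keeps the sketch safe to try on a machine without a cluster; on a real setup the progress of the last step would be checked with the remove-brick status subcommand.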


Actual results:
===============
The rebalance operation triggered by remove-brick failed on one of the cluster nodes.

Expected results:
=================
Rebalance should start without issue when bricks that hold data are removed.

Additional info:
================
This issue is not always reproducible. The live setup was shown to a member of the DHT team to get some idea about the issue.

Comment 2 Nithya Balachandran 2017-08-29 05:17:37 UTC

Please see the comment:
> The rebalance process received a CHILD_DOWN event so it will terminate. This
> is the expected behaviour.
> 
> The EC team needs to look into why the EC subvol returned a CHILD_DOWN event.
> 
> 


Moving this to the EC team to take a look.

Comment 3 Amar Tumballi 2019-05-10 12:39:18 UTC

Ashish, did we finally fix this? What's the latest on this?

Comment 4 Ashish Pandey 2019-05-11 09:03:39 UTC

Yes, we have tested the last few releases and did not see this issue.
I think this issue has been fixed and we can close this.

Comment 5 Carlos 2019-11-29 02:27:24 UTC

Hi Ashish,
(In reply to Ashish Pandey from comment #4)
> Yes, We have tested  last few releases and did not see this issue
> I think this issue has been fixed and we can close this.


May I know which commit solves this issue?