Bug 1417535

Summary: rebalance operation because of remove-brick failed on one of the cluster node
Product: [Community] GlusterFS Reporter: Ashish Pandey <aspandey>
Component: disperseAssignee: bugs <bugs>
Status: CLOSED WORKSFORME QA Contact:
Severity: high Docs Contact:
Priority: medium    
Version: mainlineCC: aspandey, atumball, bugs, Carlos.Hung, nbalacha, pkarampu, rhs-bugs, tdesala
Target Milestone: ---Keywords: Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-6.x Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1408621 Environment:
Last Closed: 2019-05-11 09:54:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1408621    

Comment 1 Ashish Pandey 2017-01-30 07:03:56 UTC
Description of problem:
=======================
rebalance operation because of remove-brick  failed on one of the cluster node

rebalance warning and error messages:
-------------------------------------
[2016-12-23 07:05:52.568409] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 0-Disperse1-client-12: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-12-23 07:05:52.569163] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 0-Disperse1-client-12: Connected to Disperse1-client-12, attached to remote volume '/bricks/bric
k2/a0'.
[2016-12-23 07:05:52.569189] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 0-Disperse1-client-12: Server and Client lk-version numbers are not same, reopening the fds
[2016-12-23 07:05:52.570742] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-Disperse1-client-12: Server lk version = 1
[2016-12-23 07:05:55.203018] W [MSGID: 114010] [client-callback.c:28:client_cbk_fetchspec] 0-Disperse1-client-4: this function should not be called
[2016-12-23 07:06:01.767154] W [MSGID: 114010] [client-callback.c:28:client_cbk_fetchspec] 0-Disperse1-client-4: this function should not be called
[2016-12-23 07:06:01.992148] W [MSGID: 109073] [dht-common.c:8753:dht_notify] 0-Disperse1-dht: Received CHILD_DOWN. Exiting
The message "W [MSGID: 109073] [dht-common.c:8753:dht_notify] 0-Disperse1-dht: Received CHILD_DOWN. Exiting" repeated 2 times between [2016-12-23 07:06:01.992148] and [2016-12-23 07:06:02.992415]
[2016-12-23 07:06:02.997440] E [MSGID: 109027] [dht-rebalance.c:3696:gf_defrag_start_crawl] 0-Disperse1-dht: Failed to start rebalance: look up on / failed
[2016-12-23 07:06:02.997723] I [MSGID: 109028] [dht-rebalance.c:4126:gf_defrag_status_get] 0-Disperse1-dht: Rebalance is failed. Time taken is 0.00 secs
[2016-12-23 07:06:02.997747] I [MSGID: 109028] [dht-rebalance.c:4130:gf_defrag_status_get] 0-Disperse1-dht: Files migrated: 0, size: 0, lookups: 0, failures: 0, skipped: 0
[2016-12-23 07:06:02.997986] W [glusterfsd.c:1288:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x3c14607aa1) [0x7fd0233aaaa1] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7fd0247bc3f5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x76) [0x7fd0247bbee6] ) 0-: received signum (15), shutting down




Version-Release number of selected component (if applicable):
=============================================================



How reproducible:
=================
2/3


Steps to Reproduce:
====================
1. Have 6 node cluster
2. Create a 2 * (4+2) volume and fuse mount it.
3. Keep writing the data at the mount point //untar linux kernel
4. Add one more sub volume to make 3 * (4+2)
5. Once untar is over, remove the last added sub volume.  //during this step, rebalance failed in one of node 


Actual results:
===============
rebalance operation because of remove-brick  failed on one of the cluster node

Expected results:
=================
Rebalance should start wihtout issue when volume bricks having data are removed.

Additional info:
================
This issue not reproducible always and Live setup was showed to one of the DHT team member to get some idea about the issue.

Comment 2 Nithya Balachandran 2017-08-29 05:17:37 UTC
Please see the comment:
> The rebalance process received a CHILD_DOWN event so it will terminate. This
> is the expected behaviour.
> 
> The EC team needs to look into why the EC subvol returned a CHILD_DOWN event.
> 
> 


Moving this to the EC team to take a look.

Comment 3 Amar Tumballi 2019-05-10 12:39:18 UTC
Ashish, did we finally fix this? Whats the latest on this?

Comment 4 Ashish Pandey 2019-05-11 09:03:39 UTC
Yes, We have tested  last few releases and did not see this issue
I think this issue has been fixed and we can close this.

Comment 5 Carlos 2019-11-29 02:27:24 UTC
Hi Ashish(In reply to Ashish Pandey from comment #4)
> Yes, We have tested  last few releases and did not see this issue
> I think this issue has been fixed and we can close this.


May I know which commit solves this issue?