Description of problem:
=======================
The rebalance operation triggered by remove-brick failed on one of the cluster nodes.

Rebalance warning and error messages:
-------------------------------------
[2016-12-23 07:05:52.568409] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 0-Disperse1-client-12: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-12-23 07:05:52.569163] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 0-Disperse1-client-12: Connected to Disperse1-client-12, attached to remote volume '/bricks/brick2/a0'.
[2016-12-23 07:05:52.569189] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 0-Disperse1-client-12: Server and Client lk-version numbers are not same, reopening the fds
[2016-12-23 07:05:52.570742] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-Disperse1-client-12: Server lk version = 1
[2016-12-23 07:05:55.203018] W [MSGID: 114010] [client-callback.c:28:client_cbk_fetchspec] 0-Disperse1-client-4: this function should not be called
[2016-12-23 07:06:01.767154] W [MSGID: 114010] [client-callback.c:28:client_cbk_fetchspec] 0-Disperse1-client-4: this function should not be called
[2016-12-23 07:06:01.992148] W [MSGID: 109073] [dht-common.c:8753:dht_notify] 0-Disperse1-dht: Received CHILD_DOWN. Exiting
The message "W [MSGID: 109073] [dht-common.c:8753:dht_notify] 0-Disperse1-dht: Received CHILD_DOWN. Exiting" repeated 2 times between [2016-12-23 07:06:01.992148] and [2016-12-23 07:06:02.992415]
[2016-12-23 07:06:02.997440] E [MSGID: 109027] [dht-rebalance.c:3696:gf_defrag_start_crawl] 0-Disperse1-dht: Failed to start rebalance: look up on / failed
[2016-12-23 07:06:02.997723] I [MSGID: 109028] [dht-rebalance.c:4126:gf_defrag_status_get] 0-Disperse1-dht: Rebalance is failed. Time taken is 0.00 secs
[2016-12-23 07:06:02.997747] I [MSGID: 109028] [dht-rebalance.c:4130:gf_defrag_status_get] 0-Disperse1-dht: Files migrated: 0, size: 0, lookups: 0, failures: 0, skipped: 0
[2016-12-23 07:06:02.997986] W [glusterfsd.c:1288:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x3c14607aa1) [0x7fd0233aaaa1] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7fd0247bc3f5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x76) [0x7fd0247bbee6] ) 0-: received signum (15), shutting down

Version-Release number of selected component (if applicable):
=============================================================

How reproducible:
=================
2/3

Steps to Reproduce:
====================
1. Have a 6-node cluster.
2. Create a 2 x (4+2) disperse volume and fuse mount it.
3. Keep writing data to the mount point (untar the Linux kernel).
4. Add one more subvolume to make it 3 x (4+2).
5. Once the untar is over, remove the last added subvolume.
   // During this step, rebalance failed on one of the nodes.
   (A command-level sketch of these steps is included at the end of this report.)

Actual results:
===============
The rebalance operation triggered by remove-brick failed on one of the cluster nodes.

Expected results:
=================
Rebalance should start without issue when volume bricks holding data are removed.

Additional info:
================
This issue is not always reproducible. The live setup was shown to a member of the DHT team to get some insight into the issue.
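For reference, here is a minimal command-level sketch of the reproduction steps above. The host names (server1..server6), brick paths (/bricks/brickN/a0), mount point, and kernel tarball path are assumptions for illustration only; the exact layout of the original setup is not recorded in this report.

  # A 6-node trusted storage pool is assumed to already exist (gluster peer probe ...).

  # Step 2: create a 2 x (4+2) distributed-dispersed volume and fuse mount it.
  gluster volume create Disperse1 disperse 6 redundancy 2 \
      server{1..6}:/bricks/brick0/a0 \
      server{1..6}:/bricks/brick1/a0
  gluster volume start Disperse1
  mount -t glusterfs server1:/Disperse1 /mnt/disperse1

  # Step 3: generate data on the mount (kernel untar); tarball path is hypothetical.
  cd /mnt/disperse1 && tar xf /tmp/linux-4.9.tar.xz &

  # Step 4: add one more (4+2) subvolume, making the volume 3 x (4+2).
  gluster volume add-brick Disperse1 server{1..6}:/bricks/brick2/a0

  # Step 5: after the untar completes, remove the subvolume added in step 4.
  # This starts the remove-brick rebalance that failed on one node.
  gluster volume remove-brick Disperse1 server{1..6}:/bricks/brick2/a0 start
  gluster volume remove-brick Disperse1 server{1..6}:/bricks/brick2/a0 status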
Please see the comment:

> The rebalance process received a CHILD_DOWN event so it will terminate. This
> is the expected behaviour.
>
> The EC team needs to look into why the EC subvol returned a CHILD_DOWN event.
>
> Moving this to the EC team to take a look.
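Not part of the original report, but for anyone confirming the same symptom, a minimal diagnostic sketch (reusing the hypothetical volume name Disperse1 and brick paths from the sketch above):

  # Check whether the remove-brick rebalance failed on any node.
  gluster volume remove-brick Disperse1 server{1..6}:/bricks/brick2/a0 status

  # Confirm that all brick processes of the volume are online; a brick that is
  # down (or not yet connected) is what surfaces as CHILD_DOWN to the rebalance
  # process on that node.
  gluster volume status Disperse1

  # Inspect the rebalance log on the node where the operation failed
  # (typically /var/log/glusterfs/<volname>-rebalance.log).
  less /var/log/glusterfs/Disperse1-rebalance.log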
Ashish, did we finally fix this? What's the latest on this?
Yes, we have tested the last few releases and did not see this issue.
I think this issue has been fixed and we can close it.
Hi Ashish,

(In reply to Ashish Pandey from comment #4)
> Yes, we have tested the last few releases and did not see this issue.
> I think this issue has been fixed and we can close it.

May I know which commit solves this issue?