Bug 1408621 - rebalance operation triggered by remove-brick failed on one of the cluster nodes
Summary: rebalance operation triggered by remove-brick failed on one of the cluster nodes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: disperse
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Pranith Kumar K
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard:
Depends On: 1417535
Blocks:
 
Reported: 2016-12-26 04:30 UTC by Byreddy
Modified: 2018-11-12 03:40 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1417535
Environment:
Last Closed: 2018-11-09 10:58:37 UTC
Embargoed:




Links
System: Red Hat Bugzilla
ID: 1485616
Private: No
Priority: low
Status: CLOSED
Summary: Remove brick fails on a distribute volume - "Failed to start rebalance: look up on / failed"
Last Updated: 2021-02-22 00:41:40 UTC

Internal Links: 1485616

Description Byreddy 2016-12-26 04:30:28 UTC
Description of problem:
=======================
The rebalance operation triggered by remove-brick failed on one of the cluster nodes.

rebalance warning and error messages:
-------------------------------------
[2016-12-23 07:05:52.568409] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 0-Disperse1-client-12: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-12-23 07:05:52.569163] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 0-Disperse1-client-12: Connected to Disperse1-client-12, attached to remote volume '/bricks/brick2/a0'.
[2016-12-23 07:05:52.569189] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 0-Disperse1-client-12: Server and Client lk-version numbers are not same, reopening the fds
[2016-12-23 07:05:52.570742] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-Disperse1-client-12: Server lk version = 1
[2016-12-23 07:05:55.203018] W [MSGID: 114010] [client-callback.c:28:client_cbk_fetchspec] 0-Disperse1-client-4: this function should not be called
[2016-12-23 07:06:01.767154] W [MSGID: 114010] [client-callback.c:28:client_cbk_fetchspec] 0-Disperse1-client-4: this function should not be called
[2016-12-23 07:06:01.992148] W [MSGID: 109073] [dht-common.c:8753:dht_notify] 0-Disperse1-dht: Received CHILD_DOWN. Exiting
The message "W [MSGID: 109073] [dht-common.c:8753:dht_notify] 0-Disperse1-dht: Received CHILD_DOWN. Exiting" repeated 2 times between [2016-12-23 07:06:01.992148] and [2016-12-23 07:06:02.992415]
[2016-12-23 07:06:02.997440] E [MSGID: 109027] [dht-rebalance.c:3696:gf_defrag_start_crawl] 0-Disperse1-dht: Failed to start rebalance: look up on / failed
[2016-12-23 07:06:02.997723] I [MSGID: 109028] [dht-rebalance.c:4126:gf_defrag_status_get] 0-Disperse1-dht: Rebalance is failed. Time taken is 0.00 secs
[2016-12-23 07:06:02.997747] I [MSGID: 109028] [dht-rebalance.c:4130:gf_defrag_status_get] 0-Disperse1-dht: Files migrated: 0, size: 0, lookups: 0, failures: 0, skipped: 0
[2016-12-23 07:06:02.997986] W [glusterfsd.c:1288:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x3c14607aa1) [0x7fd0233aaaa1] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7fd0247bc3f5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x76) [0x7fd0247bbee6] ) 0-: received signum (15), shutting down




Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.8.4-9.el6rhs.x86_64


How reproducible:
=================
2/3


Steps to Reproduce:
====================
1. Have a 6-node cluster.
2. Create a 2 * (4+2) disperse volume and fuse mount it.
3. Keep writing data at the mount point.  //untar the linux kernel source
4. Add one more subvolume to make it 3 * (4+2).
5. Once the untar is over, remove the last added subvolume.  //during this step, rebalance failed on one of the nodes; see the CLI sketch after this list
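
For reference, a minimal shell sketch of these steps. The host names (node1..node6), brick paths under /bricks, mount point, and tarball path are hypothetical; the volume name Disperse1 and the added-brick path /bricks/brick2/a0 are taken from the logs above:

# Steps 1-2: create a 2 * (4+2) disperse volume across 6 nodes (12 bricks) and fuse mount it
gluster volume create Disperse1 disperse-data 4 redundancy 2 \
    node{1..6}:/bricks/brick0/a0 node{1..6}:/bricks/brick1/a0
gluster volume start Disperse1
mount -t glusterfs node1:/Disperse1 /mnt/disperse1

# Step 3: keep writing data while the volume is reshaped
cd /mnt/disperse1 && tar xf /root/linux-kernel.tar.xz &

# Step 4: add a third (4+2) subvolume, making the volume 3 * (4+2)
gluster volume add-brick Disperse1 node{1..6}:/bricks/brick2/a0

# Step 5: after the untar finishes, remove the subvolume added in step 4;
# this starts the rebalance that failed on one node
gluster volume remove-brick Disperse1 node{1..6}:/bricks/brick2/a0 start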


Actual results:
===============
The rebalance operation triggered by remove-brick failed on one of the cluster nodes.

Expected results:
=================
Rebalance should start without issue when volume bricks containing data are removed.
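
One way to confirm on which node the rebalance failed, using standard gluster CLI and log locations (the brick list mirrors the hypothetical sketch above):

# Per-node status of the remove-brick rebalance; the failing node
# shows status "failed" with 0 files scanned
gluster volume remove-brick Disperse1 node{1..6}:/bricks/brick2/a0 status

# The rebalance log on the failing node contains the
# "Failed to start rebalance: look up on / failed" error
less /var/log/glusterfs/Disperse1-rebalance.log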

Additional info:
================
This issue is not always reproducible. The live setup was shown to a member of the DHT team to get some insight into the issue.

This issue was found while testing the -9 build of 3.2.0.

Comment 15 Nithya Balachandran 2017-01-24 08:41:01 UTC
This needs information from the EC team. Ashish/Pranith, could one of you take a look at why the CHILD_DOWN events were sent?

Thanks,
Nithya

