1408621 – rebalance operation because of remove-brick failed on one of the cluster node

Summary: rebalance operation because of remove-brick failed on one of the cluster node

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	disperse
Sub Component:
Version:	rhgs-3.2
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Pranith Kumar K
QA Contact:	Nag Pavan Chilakam
Docs Contact:
URL:
Whiteboard:
Depends On:	1417535
Blocks:

Reported:	2016-12-26 04:30 UTC by Byreddy
Modified:	2018-11-12 03:40 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1417535
Environment:
Last Closed:	2018-11-09 10:58:37 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1485616	0	low	CLOSED	Remove brick fails on a distribute volume - "Failed to start rebalance: look up on / failed"	2021-02-22 00:41:40 UTC

Internal Links: 1485616

Description Byreddy 2016-12-26 04:30:28 UTC

Description of problem:
=======================
The rebalance operation triggered by remove-brick failed on one of the cluster nodes.

rebalance warning and error messages:
-------------------------------------
[2016-12-23 07:05:52.568409] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 0-Disperse1-client-12: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-12-23 07:05:52.569163] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 0-Disperse1-client-12: Connected to Disperse1-client-12, attached to remote volume '/bricks/brick2/a0'.
[2016-12-23 07:05:52.569189] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 0-Disperse1-client-12: Server and Client lk-version numbers are not same, reopening the fds
[2016-12-23 07:05:52.570742] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-Disperse1-client-12: Server lk version = 1
[2016-12-23 07:05:55.203018] W [MSGID: 114010] [client-callback.c:28:client_cbk_fetchspec] 0-Disperse1-client-4: this function should not be called
[2016-12-23 07:06:01.767154] W [MSGID: 114010] [client-callback.c:28:client_cbk_fetchspec] 0-Disperse1-client-4: this function should not be called
[2016-12-23 07:06:01.992148] W [MSGID: 109073] [dht-common.c:8753:dht_notify] 0-Disperse1-dht: Received CHILD_DOWN. Exiting
The message "W [MSGID: 109073] [dht-common.c:8753:dht_notify] 0-Disperse1-dht: Received CHILD_DOWN. Exiting" repeated 2 times between [2016-12-23 07:06:01.992148] and [2016-12-23 07:06:02.992415]
[2016-12-23 07:06:02.997440] E [MSGID: 109027] [dht-rebalance.c:3696:gf_defrag_start_crawl] 0-Disperse1-dht: Failed to start rebalance: look up on / failed
[2016-12-23 07:06:02.997723] I [MSGID: 109028] [dht-rebalance.c:4126:gf_defrag_status_get] 0-Disperse1-dht: Rebalance is failed. Time taken is 0.00 secs
[2016-12-23 07:06:02.997747] I [MSGID: 109028] [dht-rebalance.c:4130:gf_defrag_status_get] 0-Disperse1-dht: Files migrated: 0, size: 0, lookups: 0, failures: 0, skipped: 0
[2016-12-23 07:06:02.997986] W [glusterfsd.c:1288:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x3c14607aa1) [0x7fd0233aaaa1] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7fd0247bc3f5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x76) [0x7fd0247bbee6] ) 0-: received signum (15), shutting down
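
To make the sequence above easier to follow, here is a minimal Python sketch that filters a rebalance log down to its warning (W) and error (E) entries. The regular expression is an assumption inferred only from the lines quoted in this report (timestamp, level letter, optional MSGID, source location, message); it is not an official GlusterFS log grammar, and the helper names are hypothetical.

```python
import re

# Pattern inferred from the log excerpt in this report (an assumption,
# not an official format definition):
# [timestamp] LEVEL [MSGID: n] [file:line:function] message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+"
    r"(?:\[MSGID:\s*(?P<msgid>\d+)\]\s+)?"
    r"\[(?P<src>[^\]]+)\]\s+(?P<rest>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def warnings_and_errors(lines):
    """Keep only the entries logged at level W or E."""
    entries = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry["level"] in ("W", "E"):
            entries.append(entry)
    return entries


# Three lines copied from the excerpt above.
SAMPLE = [
    "[2016-12-23 07:05:52.570742] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-Disperse1-client-12: Server lk version = 1",
    "[2016-12-23 07:06:01.992148] W [MSGID: 109073] [dht-common.c:8753:dht_notify] 0-Disperse1-dht: Received CHILD_DOWN. Exiting",
    "[2016-12-23 07:06:02.997440] E [MSGID: 109027] [dht-rebalance.c:3696:gf_defrag_start_crawl] 0-Disperse1-dht: Failed to start rebalance: look up on / failed",
]

if __name__ == "__main__":
    for entry in warnings_and_errors(SAMPLE):
        print(entry["ts"], entry["level"], entry["msgid"], entry["rest"])
```

Filtered this way, the CHILD_DOWN warning from dht_notify appears immediately before the "Failed to start rebalance" error, which is the ordering the report describes.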




Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.8.4-9.el6rhs.x86_64


How reproducible:
=================
2/3


Steps to Reproduce:
====================
1. Set up a 6-node cluster.
2. Create a 2 * (4+2) volume and FUSE-mount it.
3. Keep writing data at the mount point (untar a Linux kernel tarball).
4. Add one more subvolume to make the volume 3 * (4+2).
5. Once the untar is over, remove the last-added subvolume. During this step, rebalance failed on one of the nodes.
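
The steps above can be sketched as a sequence of gluster CLI commands. The following Python snippet only builds the command strings and does not run anything. The volume name Disperse1 comes from the logs; the host names (node1 to node6), the brick paths other than the pattern seen in the log, and the mount point are hypothetical placeholders, and the exact commands used by the reporter are not stated in this report.

```python
def bricks(hosts, brick_dir):
    """One brick per host, so each (4+2) subvolume spans all six nodes."""
    return [f"{host}:{brick_dir}" for host in hosts]


def reproduction_commands(volume="Disperse1", hosts=None):
    """Build the command strings for the reproduction steps (not executed)."""
    # Hypothetical host names; the report does not name the nodes.
    hosts = hosts or [f"node{i}" for i in range(1, 7)]

    # Hypothetical brick directories, modelled on '/bricks/brick2/a0' in the log.
    sub1 = bricks(hosts, "/bricks/brick0/a0")
    sub2 = bricks(hosts, "/bricks/brick1/a0")
    sub3 = bricks(hosts, "/bricks/brick2/a0")

    return [
        # Step 2: 12 bricks with disperse 6 / redundancy 2 gives 2 * (4+2).
        f"gluster volume create {volume} disperse 6 redundancy 2 "
        + " ".join(sub1 + sub2),
        f"gluster volume start {volume}",
        # Hypothetical mount point.
        f"mount -t glusterfs {hosts[0]}:/{volume} /mnt/{volume}",
        # Step 4: six more bricks make the volume 3 * (4+2).
        f"gluster volume add-brick {volume} " + " ".join(sub3),
        # Step 5: removing the last-added subvolume starts a rebalance.
        f"gluster volume remove-brick {volume} " + " ".join(sub3) + " start",
        f"gluster volume remove-brick {volume} " + " ".join(sub3) + " status",
    ]


if __name__ == "__main__":
    for command in reproduction_commands():
        print(command)
```

The write workload from step 3 (the kernel untar) would run against the mount point between the mount and add-brick commands.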


Actual results:
===============
The rebalance operation triggered by remove-brick failed on one of the cluster nodes.

Expected results:
=================
Rebalance should start without issue when volume bricks containing data are removed.

Additional info:
================
This issue is not always reproducible. The live setup was shown to a member of the DHT team to get some idea about the issue.

This issue was found while testing the -9 build of 3.2.0.

Comment 15 Nithya Balachandran 2017-01-24 08:41:01 UTC

This needs information from the EC team. Ashish/Pranith, could one of you take a look at why the CHILD_DOWN events were sent?

Thanks,
Nithya
