
Bug 765441 (GLUSTER-3709)

Summary: volume replace-brick unstable
Product: [Community] GlusterFS
Reporter: Matt Harris <matthaeus.harris>
Component: glusterd
Assignee: krishnan parthasarathi <kparthas>
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Version: 3.1.7
CC: amarts, gluster-bugs, nsathyan
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix

Description Matt Harris 2011-10-10 15:43:40 EDT
tl;dr:  When trying to run a replace-brick, the operation fails after a period of time and leaves the cluster in an inconsistent state, requiring a complete restart (and possibly a redefinition of the volume) to try again.

My setup:

Four peers:

cloud0:
Hostname: 153.90.178.112
Uuid: ac089196-dfe1-4743-96d9-fe349dae8387

cloud1:
Hostname: 153.90.178.253
Uuid: 6feec985-cc5f-407a-98d8-45daa7438fee

cloud2:
Hostname: 153.90.203.10
Uuid: 2996900d-4a53-4dd9-b17a-afdcd9ef6c93

cloud3:
Hostname: 153.90.203.11
Uuid: dc5fb858-dc14-409c-9633-dde01891b49f

Volume:
Volume Name: store
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 153.90.178.112:/mnt/live
Brick2: 153.90.178.253:/mnt/live


Commands used:
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live start
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live status
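
For reference, the full replace-brick sequence being attempted here is roughly the following. This is a sketch of the 3.1-era CLI usage, not output from the cluster; commit is the step the migration never reached, and abort is the step used later to back out:

# begin migrating data from the source brick to the destination brick
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live start

# poll migration progress until it reports completion
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live status

# after migration completes, switch the volume over to the new brick
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live commit

# or, if the migration is stuck, roll the whole operation back
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live abort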

This was started on 10/09/2011 around 15:00 (give or take an hour).  The status and the disk activity indicated that it was working.  However, at some point before 20:30 the same day, the transfer stopped.  At this point, cloud0 and cloud2 could not tell me the status of the replace-brick.

The next morning (around 11:00) I tried to restart the replace-brick.  I was able to abort the previous operation and start a new one.  The new start reported success, but the operation failed immediately.  After several attempts, the cluster entered an inconsistent state in which cloud0 was trying to initiate a replace-brick operation that cloud2 believed was already in progress.  Restarting all gluster processes on cloud2 did not resolve the problem, and I was unable to restart the gluster processes on cloud0 because it is a production machine.

Following advice in #gluster, I checked the contents of the rbstate file:
root@cloud2:/var/log/glusterfs# cat /etc/glusterd/vols/store/rbstate
rb_status=1
rb_src=153.90.178.112:/mnt/live
rb_dst=153.90.203.10:/mnt/live

root@cloud0:/etc/glusterd/vols/store# cat rbstate
rb_status=1
rb_src=153.90.178.112:/mnt/live
rb_dst=153.90.203.10:/mnt/live
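
Both peers show the same rb_status here, but since the complaint is that cloud0 and cloud2 disagree about the operation, a quick way to compare the persisted state across peers is a loop like the one below. This is only a convenience sketch, not an official tool; it assumes root ssh access and uses the short hostnames (the report itself uses the IPs above):

# compare the on-disk replace-brick state on the two disagreeing peers
for h in cloud0 cloud2; do
    echo "== $h =="
    ssh root@$h cat /etc/glusterd/vols/store/rbstate
done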

Attached to this bug are the complete log directories for both cloud0 and cloud2.  Please keep these files confidential, as they have not been anonymized.

Thank you!
Comment 1 Anand Avati 2011-10-19 03:19:59 EDT
CHANGE: http://review.gluster.com/609 (Change-Id: Ie14492451cab821e7ed60e68dbaff22d7d78fba9) merged in release-3.2 by Vijay Bellur (vijay@gluster.com)
Comment 2 Anand Avati 2012-01-27 07:23:16 EST
CHANGE: http://review.gluster.com/2689 (glusterd: Refactored rb subcmds code and fixed some minor issues.) merged in master by Vijay Bellur (vijay@gluster.com)