
Bug 765441 (GLUSTER-3709)

Summary: volume replace-brick unstable
Product: [Community] GlusterFS
Reporter: Matt Harris <matthaeus.harris>
Component: glusterd
Assignee: krishnan parthasarathi <kparthas>
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Version: 3.1.7
CC: amarts, gluster-bugs, nsathyan
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix

Description Matt Harris 2011-10-10 15:43:40 EDT
tl;dr:  When trying to run a replace-brick, the operation fails after a period of time and leaves the cluster in an inconsistent state, requiring a complete restart (and possibly a redefinition of the volume) to try again.

My setup:

Four peers:

cloud0:
Hostname: 153.90.178.112
Uuid: ac089196-dfe1-4743-96d9-fe349dae8387

cloud1:
Hostname: 153.90.178.253
Uuid: 6feec985-cc5f-407a-98d8-45daa7438fee

cloud2:
Hostname: 153.90.203.10
Uuid: 2996900d-4a53-4dd9-b17a-afdcd9ef6c93

cloud3:
Hostname: 153.90.203.11
Uuid: dc5fb858-dc14-409c-9633-dde01891b49f

Volume:
Volume Name: store
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 153.90.178.112:/mnt/live
Brick2: 153.90.178.253:/mnt/live


Commands used:
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live start
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live status
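
For reference, the full replace-brick sequence being attempted here is roughly the following. This is a sketch of the 3.1-era CLI usage, not output from the cluster; commit is the step the migration never reached, and abort is the step used later to back out:

# begin migrating data from the source brick to the destination brick
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live start

# poll migration progress until it reports completion
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live status

# after migration completes, switch the volume over to the new brick
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live commit

# or, if the migration is stuck, roll the whole operation back
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live abort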

This was started on 10/09/2011 around 15:00 (give or take an hour).  The status and the disk activity indicated that it was working.  However, at some point before 20:30 the same day, the transfer stopped.  At this point, cloud0 and cloud2 could not tell me the status of the replace-brick.

The next morning (around 11:00) I tried to restart the replace-brick.  I was able to abort the previous operation and start a new one.  The new start reported success, but the operation failed immediately.  After several attempts, the cluster entered an inconsistent state in which cloud0 was trying to initiate a replace-brick operation that cloud2 believed was already in progress.  Restarting all gluster processes on cloud2 did not resolve the problem, and I was unable to restart the gluster processes on cloud0 because it is a production machine.

Following advice in #gluster, I checked the contents of the rbstate file:
root@cloud2:/var/log/glusterfs# cat /etc/glusterd/vols/store/rbstate
rb_status=1
rb_src=153.90.178.112:/mnt/live
rb_dst=153.90.203.10:/mnt/live

root@cloud0:/etc/glusterd/vols/store# cat rbstate
rb_status=1
rb_src=153.90.178.112:/mnt/live
rb_dst=153.90.203.10:/mnt/live
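
Both peers show the same rb_status here, but since the complaint is that cloud0 and cloud2 disagree about the operation, a quick way to compare the persisted state across peers is a loop like the one below. This is only a convenience sketch, not an official tool; it assumes root ssh access and uses the short hostnames (the report itself uses the IPs above):

# compare the on-disk replace-brick state on the two disagreeing peers
for h in cloud0 cloud2; do
    echo "== $h =="
    ssh root@$h cat /etc/glusterd/vols/store/rbstate
done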

Attached to this bug are the complete log directories for both cloud0 and cloud2.  Please keep these files confidential, as they have not been anonymized.

Thank you!
Comment 1 Anand Avati 2011-10-19 03:19:59 EDT
CHANGE: http://review.gluster.com/609 (Change-Id: Ie14492451cab821e7ed60e68dbaff22d7d78fba9) merged in release-3.2 by Vijay Bellur (vijay@gluster.com)
Comment 2 Anand Avati 2012-01-27 07:23:16 EST
CHANGE: http://review.gluster.com/2689 (glusterd: Refactored rb subcmds code and fixed some minor issues.) merged in master by Vijay Bellur (vijay@gluster.com)