Bug 765441 (GLUSTER-3709) - volume replace-brick unstable
Summary: volume replace-brick unstable
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-3709
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.1.7
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: krishnan parthasarathi
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-10-10 19:43 UTC by Matt Harris
Modified: 2015-11-03 23:03 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments

Description Matt Harris 2011-10-10 19:43:40 UTC
tl;dr:  When trying to run a replace-brick, the operation fails after a period of time and leaves the cluster in an inconsistent state, requiring a complete restart (and possibly a redefinition of the volume) to try again.

My setup:

Four peers:

cloud0:
Hostname: 153.90.178.112
Uuid: ac089196-dfe1-4743-96d9-fe349dae8387

cloud1:
Hostname: 153.90.178.253
Uuid: 6feec985-cc5f-407a-98d8-45daa7438fee

cloud2:
Hostname: 153.90.203.10
Uuid: 2996900d-4a53-4dd9-b17a-afdcd9ef6c93

cloud3:
Hostname: 153.90.203.11
Uuid: dc5fb858-dc14-409c-9633-dde01891b49f

Volume:
Volume Name: store
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 153.90.178.112:/mnt/live
Brick2: 153.90.178.253:/mnt/live


Commands used:
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live start
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live status

This was started on 10/09/2011 around 15:00 (give or take an hour).  The status and the disk activity indicated that it was working.  However, at some point before 20:30 the same day, the transfer stopped.  At this point, cloud0 and cloud2 could not tell me the status of the replace-brick.

The next morning (around 11:00) I started trying to restart the replace-brick.  I was able to abort the previous one, and start a new one.  This new one reported success, but failed immediately.  After trying this several times, the cluster entered an inconsistent state where cloud0 was trying to initiate a replace-brick operation that cloud2 thought was already in progress.  Restarting all gluster processes on cloud2 did not alleviate this problem.  I was unable to restart gluster processes on cloud0 because it is a production machine. 
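
For reference, the abort and retry used the replace-brick subcommands in the same form as the commands above; I did not save the exact shell history, so this is a reconstruction:

# Abort the stuck operation, then start a fresh one on the same brick pair:
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live abort
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live start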

Following advice in #gluster, I checked the contents of the rbstate file:
root@cloud2:/var/log/glusterfs# cat /etc/glusterd/vols/store/rbstate
rb_status=1
rb_src=153.90.178.112:/mnt/live
rb_dst=153.90.203.10:/mnt/live

root@cloud0:/etc/glusterd/vols/store# cat rbstate
rb_status=1
rb_src=153.90.178.112:/mnt/live
rb_dst=153.90.203.10:/mnt/live
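
If I understand correctly, rb_status=1 is glusterd's on-disk record that a replace-brick operation is still in progress, so the two nodes agree on disk even though the CLI on cloud0 and cloud2 disagree about the operation. A quick way to compare the file across peers would be (hostnames and path are from my setup above):

# Compare the on-disk replace-brick state across the peers involved:
for h in cloud0 cloud2; do echo "== $h =="; ssh $h cat /etc/glusterd/vols/store/rbstate; done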

Attached to this bug are the complete log directories for both cloud0 and cloud2.  Please keep these files confidential, as they have not been anonymized.

Thank you!

Comment 1 Anand Avati 2011-10-19 07:19:59 UTC
CHANGE: http://review.gluster.com/609 (Change-Id: Ie14492451cab821e7ed60e68dbaff22d7d78fba9) merged in release-3.2 by Vijay Bellur (vijay)

Comment 2 Anand Avati 2012-01-27 12:23:16 UTC
CHANGE: http://review.gluster.com/2689 (glusterd: Refactored rb subcmds code and fixed some minor issues.) merged in master by Vijay Bellur (vijay)

