Bug 765441 - (GLUSTER-3709) volume replace-brick unstable
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.1.7
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Assigned To: krishnan parthasarathi
Reported: 2011-10-10 15:43 EDT by Matt Harris
Modified: 2015-11-03 18:03 EST
CC: 3 users

Doc Type: Bug Fix


Attachments: None
Description Matt Harris 2011-10-10 15:43:40 EDT
tl;dr:  A replace-brick operation fails after a period of time and leaves the cluster in an inconsistent state, requiring a complete restart (and possibly a redefinition of the volume) before it can be retried.

My setup:

Four peers:

cloud0:
Hostname: 153.90.178.112
Uuid: ac089196-dfe1-4743-96d9-fe349dae8387

cloud1:
Hostname: 153.90.178.253
Uuid: 6feec985-cc5f-407a-98d8-45daa7438fee

cloud2:
Hostname: 153.90.203.10
Uuid: 2996900d-4a53-4dd9-b17a-afdcd9ef6c93

cloud3:
Hostname: 153.90.203.11
Uuid: dc5fb858-dc14-409c-9633-dde01891b49f

Volume:
Volume Name: store
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 153.90.178.112:/mnt/live
Brick2: 153.90.178.253:/mnt/live
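
(For context: reconstructed from the volume info above, not taken from my shell history, the volume would have been created with something along these lines.)

# create a two-way replicated volume across the two original peers
gluster volume create store replica 2 transport tcp 153.90.178.112:/mnt/live 153.90.178.253:/mnt/live
gluster volume start store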


Commands used:
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live start
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live status
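
For completeness, the full replace-brick lifecycle as I understand it from the CLI help in this release (abort and commit come up below):

# begin migrating data from the source brick to the destination brick
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live start
# poll the progress of the migration
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live status
# cancel an in-progress migration (used below after the transfer stalled)
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live abort
# once status reports completion, make the destination brick permanent
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live commit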

The replace-brick was started on 2011-10-09 around 15:00 (give or take an hour).  The status output and the disk activity indicated that it was working.  However, at some point before 20:30 the same day, the transfer stopped, and from then on neither cloud0 nor cloud2 could report the status of the replace-brick.

The next morning (around 11:00) I started trying to restart the replace-brick.  I was able to abort the previous operation and start a new one.  The start command reported success, but the operation failed immediately.  After several such attempts, the cluster entered an inconsistent state in which cloud0 was trying to initiate a replace-brick operation that cloud2 believed was already in progress.  Restarting all gluster processes on cloud2 did not alleviate the problem, and I was unable to restart the gluster processes on cloud0 because it is a production machine.

Following advice in #gluster, I checked the contents of the rbstate file:
root@cloud2:/var/log/glusterfs# cat /etc/glusterd/vols/store/rbstate
rb_status=1
rb_src=153.90.178.112:/mnt/live
rb_dst=153.90.203.10:/mnt/live

root@cloud0:/etc/glusterd/vols/store# cat rbstate
rb_status=1
rb_src=153.90.178.112:/mnt/live
rb_dst=153.90.203.10:/mnt/live
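
rb_status=1 on both peers presumably means each glusterd still believes a replace-brick is in progress.  If it is ever safe to restart glusterd on all of the affected peers, I assume the stale state could be cleared with something like the following (untested sketch; rb_status=0 as the "no operation in progress" value is my guess, not something I have verified against the source):

# UNTESTED: run on each peer whose rbstate shows rb_status=1
service glusterd stop
# overwrite the stale record, dropping the rb_src/rb_dst lines
printf 'rb_status=0\n' > /etc/glusterd/vols/store/rbstate
service glusterd start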

Attached to this bug are the complete log directories for both cloud0 and cloud2.  Please keep these files confidential, as they have not been anonymized.

Thank you!
Comment 1 Anand Avati 2011-10-19 03:19:59 EDT
CHANGE: http://review.gluster.com/609 (Change-Id: Ie14492451cab821e7ed60e68dbaff22d7d78fba9) merged in release-3.2 by Vijay Bellur (vijay@gluster.com)
Comment 2 Anand Avati 2012-01-27 07:23:16 EST
CHANGE: http://review.gluster.com/2689 (glusterd: Refactored rb subcmds code and fixed some minor issues.) merged in master by Vijay Bellur (vijay@gluster.com)
