tl;dr: When trying to run a replace-brick, the operation fails after a period of time and leaves the cluster in an inconsistent state, requiring a complete restart (and possibly a redefinition of the volume) to try again.

My setup:

Four peers:

cloud0:
    Hostname: 153.90.178.112
    Uuid: ac089196-dfe1-4743-96d9-fe349dae8387
cloud1:
    Hostname: 153.90.178.253
    Uuid: 6feec985-cc5f-407a-98d8-45daa7438fee
cloud2:
    Hostname: 153.90.203.10
    Uuid: 2996900d-4a53-4dd9-b17a-afdcd9ef6c93
cloud3:
    Hostname: 153.90.203.11
    Uuid: dc5fb858-dc14-409c-9633-dde01891b49f

Volume:

Volume Name: store
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 153.90.178.112:/mnt/live
Brick2: 153.90.178.253:/mnt/live

Commands used:

gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live start
gluster volume replace-brick store 153.90.178.112:/mnt/live 153.90.203.10:/mnt/live status

This was started on 10/09/2011 around 15:00 (give or take an hour). The status output and the disk activity indicated that it was working. However, at some point before 20:30 the same day, the transfer stopped. At this point, cloud0 and cloud2 could not tell me the status of the replace-brick.

The next morning (around 11:00) I tried to restart the replace-brick. I was able to abort the previous one and start a new one. The new one reported success, but failed immediately. After trying this several times, the cluster entered an inconsistent state in which cloud0 was trying to initiate a replace-brick operation that cloud2 thought was already in progress. Restarting all gluster processes on cloud2 did not alleviate this problem. I was unable to restart gluster processes on cloud0 because it is a production machine.
Following advice in #gluster, I checked the contents of the rbstate file:

root@cloud2:/var/log/glusterfs# cat /etc/glusterd/vols/store/rbstate
rb_status=1
rb_src=153.90.178.112:/mnt/live
rb_dst=153.90.203.10:/mnt/live

root@cloud0:/etc/glusterd/vols/store# cat rbstate
rb_status=1
rb_src=153.90.178.112:/mnt/live
rb_dst=153.90.203.10:/mnt/live

Attached to this bug are the complete log directories for both cloud0 and cloud2. Please keep these files confidential, as they have not been anonymized. Thank you!
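The rbstate comparison above can also be scripted when more than two peers need checking. The following is a minimal sketch, not part of gluster: the helper names `parse_rbstate` and `diff_rbstate` are hypothetical, and it assumes only that the file consists of whitespace-separated `key=value` entries, as in the output shown in this report. It does not interpret what any value means.

```python
def parse_rbstate(text):
    """Parse whitespace-separated key=value entries into a dict.

    Assumption: the rbstate file looks like the output quoted in this
    report (rb_status, rb_src, rb_dst). Tokens without '=' are skipped.
    """
    state = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:
            state[key] = value
    return state


def diff_rbstate(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Contents copied from the two cat outputs in this report.
cloud0 = parse_rbstate(
    "rb_status=1 rb_src=153.90.178.112:/mnt/live rb_dst=153.90.203.10:/mnt/live"
)
cloud2 = parse_rbstate(
    "rb_status=1\nrb_src=153.90.178.112:/mnt/live\nrb_dst=153.90.203.10:/mnt/live\n"
)

# An empty dict means both peers recorded the same replace-brick state.
print(diff_rbstate(cloud0, cloud2))
```

For the two files quoted here the difference is empty, so both peers hold the same on-disk state even though they disagree about whether an operation is in progress.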
CHANGE: http://review.gluster.com/609 (Change-Id: Ie14492451cab821e7ed60e68dbaff22d7d78fba9) merged in release-3.2 by Vijay Bellur (vijay)
CHANGE: http://review.gluster.com/2689 (glusterd: Refactored rb subcmds code and fixed some minor issues.) merged in master by Vijay Bellur (vijay)