Bug 1000779

Summary: running add-brick then remove-brick, then restarting gluster leads to broken volume brick counts
Product: [Community] GlusterFS
Component: cli
Version: 3.3.2
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Reporter: Justin Randell <justin.randell>
Assignee: Kaushal <kaushal>
CC: gluster-bugs
Type: Bug
Doc Type: Bug Fix
Last Closed: 2013-08-29 12:53:35 UTC

Description Justin Randell 2013-08-25 07:26:19 UTC
Description of problem:

simultaneous remove-brick commands corrupt volumes.

Steps to Reproduce:

1. set up a simple replicated volume with two nodes

{code}
root@gluster1:~# gluster volume info
 
Volume Name: hosting-test
Type: Replicate
Volume ID: 0dcadde0-b981-472d-851a-08fbfff40ae3
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster2.justindev:/export/brick1/sdb1
Brick2: gluster1.justindev:/export/brick1/sdb1
{code}
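
for reference, the volume was created roughly like this (brick paths and hostnames match the output above; the exact peer probe / create / start commands are a reconstruction, not copied from the original session):

{code}
# join the two nodes into one trusted pool (run on gluster1)
gluster peer probe gluster2.justindev

# create a 1 x 2 replicated volume with one brick on each node
gluster volume create hosting-test replica 2 \
    gluster2.justindev:/export/brick1/sdb1 \
    gluster1.justindev:/export/brick1/sdb1

# start the volume
gluster volume start hosting-test
{code}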

2. add a third brick to the replica

{code}
root@gluster2:~# gluster volume add-brick hosting-test replica 3 gluster1.justindev:/export/brick2/sdc1
Add Brick successful
root@gluster2:~# gluster volume info
 
Volume Name: hosting-test
Type: Replicate
Volume ID: 0dcadde0-b981-472d-851a-08fbfff40ae3
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gluster2.justindev:/export/brick1/sdb1
Brick2: gluster1.justindev:/export/brick1/sdb1
Brick3: gluster1.justindev:/export/brick2/sdc1
{code}

3. aaaand now for the fun bit: remove the brick at the same time from both nodes. one command will fail, yet both nodes will report a healthy-looking volume.
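
one way to fire the command from both nodes at (nearly) the same time, driving them over passwordless ssh (the remove-brick invocation is exactly the one shown below; the ssh wrapper is just a test harness):

{code}
# kick off the same remove-brick on both nodes concurrently
ssh root@gluster1 'echo y | gluster volume remove-brick hosting-test replica 2 gluster1.justindev:/export/brick2/sdc1' &
ssh root@gluster2 'echo y | gluster volume remove-brick hosting-test replica 2 gluster1.justindev:/export/brick2/sdc1' &
wait
{code}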

here's the node that wins:

{code}
root@gluster1:~# echo y | gluster volume remove-brick hosting-test replica 2 gluster1.justindev:/export/brick2/sdc1
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) Remove Brick commit force successful
root@gluster1:~# gluster volume info
 
Volume Name: hosting-test
Type: Replicate
Volume ID: 0dcadde0-b981-472d-851a-08fbfff40ae3
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster2.justindev:/export/brick1/sdb1
Brick2: gluster1.justindev:/export/brick1/sdb1
{code}

and the node that fails:

{code}
root@gluster2:~# echo y | gluster volume remove-brick hosting-test replica 2 gluster1.justindev:/export/brick2/sdc1
Operation failed
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) root@gluster2:~#
root@gluster2:~# gluster volume info
 
Volume Name: hosting-test
Type: Replicate
Volume ID: 0dcadde0-b981-472d-851a-08fbfff40ae3
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster2.justindev:/export/brick1/sdb1
Brick2: gluster1.justindev:/export/brick1/sdb1
{code}

4. stop and start gluster on either node, and we get funky maths:

{code}
root@gluster2:~# service glusterfs-server stop
glusterfs-server stop/waiting
root@gluster2:~# service glusterfs-server start
glusterfs-server start/running, process 11739
root@gluster2:~# gluster volume info
 
Volume Name: hosting-test
Type: Replicate
Volume ID: f8d7132b-6bb1-40d4-8414-b2168cdf2cd7
Status: Started
Number of Bricks: 0 x 3 = 2
Transport-type: tcp
Bricks:
Brick1: gluster2.justindev:/export/brick1/sdb1
Brick2: gluster1.justindev:/export/brick1/sdb1
{code}

Actual results:

the volume ends up with an impossible brick count (Number of Bricks: 0 x 3 = 2) after the restart.

Expected results:

volumes continue operating normally.

Additional info:

Ubuntu 13.04, using the 3.3 packages from http://download.gluster.org/pub/gluster/glusterfs/3.3/3.3.2/Ubuntu.README

Comment 1 Justin Randell 2013-08-29 12:34:44 UTC
this bug is worse than my initial description suggests.

I can reproduce it, on both 3.3 and 3.4, with just these steps (a condensed command sequence follows the list):

1. create a simple replicated volume across two nodes, one brick on each node

2. add a third brick to the volume from one of the existing nodes

3. remove the brick

4. restart gluster
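
condensed into a single command sequence (hostnames and brick paths reused from the original report; the create/start and service commands are assumed to match the setup shown above):

{code}
# 1. simple 1 x 2 replicated volume, one brick per node
gluster volume create hosting-test replica 2 \
    gluster2.justindev:/export/brick1/sdb1 \
    gluster1.justindev:/export/brick1/sdb1
gluster volume start hosting-test

# 2. grow the replica to 3 with a second brick on gluster1
gluster volume add-brick hosting-test replica 3 gluster1.justindev:/export/brick2/sdc1

# 3. shrink it back to 2 by removing that brick
echo y | gluster volume remove-brick hosting-test replica 2 gluster1.justindev:/export/brick2/sdc1

# 4. restart gluster and check the brick count
service glusterfs-server stop
service glusterfs-server start
gluster volume info hosting-test
{code}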

Comment 2 Justin Randell 2013-08-29 12:53:35 UTC

*** This bug has been marked as a duplicate of bug 1002556 ***