Bug 1000779 - running add-brick then remove-brick, then restarting gluster leads to broken volume brick counts
Summary: running add-brick then remove-brick, then restarting gluster leads to broken volume brick counts
Keywords:
Status: CLOSED DUPLICATE of bug 1002556
Alias: None
Product: GlusterFS
Classification: Community
Component: cli
Version: 3.3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kaushal
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-08-25 07:26 UTC by Justin Randell
Modified: 2013-08-29 12:53 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-08-29 12:53:35 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Justin Randell 2013-08-25 07:26:19 UTC
Description of problem:

running the same remove-brick command simultaneously from two nodes leaves the volume in a state that breaks after a glusterd restart.

Steps to Reproduce:

1. set up a simple replicated volume with two nodes

{code}
root@gluster1:~# gluster volume info
 
Volume Name: hosting-test
Type: Replicate
Volume ID: 0dcadde0-b981-472d-851a-08fbfff40ae3
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster2.justindev:/export/brick1/sdb1
Brick2: gluster1.justindev:/export/brick1/sdb1
{code}
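
The create commands aren't shown in the report; here's a minimal sketch of how a volume like this might have been set up, assuming the two nodes are already peer-probed and the brick directories exist:

{code}
# run on either node once peers are probed (gluster peer probe <host>)
gluster volume create hosting-test replica 2 transport tcp \
    gluster2.justindev:/export/brick1/sdb1 gluster1.justindev:/export/brick1/sdb1
gluster volume start hosting-test
{code}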

2. add a third brick to the replica

{code}
root@gluster2:~# gluster volume add-brick hosting-test replica 3 gluster1.justindev:/export/brick2/sdc1
Add Brick successful
root@gluster2:~# gluster volume info
 
Volume Name: hosting-test
Type: Replicate
Volume ID: 0dcadde0-b981-472d-851a-08fbfff40ae3
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gluster2.justindev:/export/brick1/sdb1
Brick2: gluster1.justindev:/export/brick1/sdb1
Brick3: gluster1.justindev:/export/brick2/sdc1
{code}

3. and now for the fun bit: remove the brick from both nodes at the same time. One command fails, but both nodes report a healthy-looking volume.

here's the node that wins:

{code}
root@gluster1:~# echo y | gluster volume remove-brick hosting-test replica 2 gluster1.justindev:/export/brick2/sdc1
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) Remove Brick commit force successful
root@gluster1:~# gluster volume info
 
Volume Name: hosting-test
Type: Replicate
Volume ID: 0dcadde0-b981-472d-851a-08fbfff40ae3
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster2.justindev:/export/brick1/sdb1
Brick2: gluster1.justindev:/export/brick1/sdb1
{code}

and the node that fails:

{code}
root@gluster2:~# echo y | gluster volume remove-brick hosting-test replica 2 gluster1.justindev:/export/brick2/sdc1
Operation failed
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) root@gluster2:~#
root@gluster2:~# gluster volume info
 
Volume Name: hosting-test
Type: Replicate
Volume ID: 0dcadde0-b981-472d-851a-08fbfff40ae3
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster2.justindev:/export/brick1/sdb1
Brick2: gluster1.justindev:/export/brick1/sdb1
{code}

4. stop and start gluster on either node, and the reported brick count no longer adds up:

{code}
root@gluster2:~# service glusterfs-server stop
glusterfs-server stop/waiting
root@gluster2:~# service glusterfs-server start
glusterfs-server start/running, process 11739
root@gluster2:~# gluster volume info
 
Volume Name: hosting-test
Type: Replicate
Volume ID: f8d7132b-6bb1-40d4-8414-b2168cdf2cd7
Status: Started
Number of Bricks: 0 x 3 = 2
Transport-type: tcp
Bricks:
Brick1: gluster2.justindev:/export/brick1/sdb1
Brick2: gluster1.justindev:/export/brick1/sdb1
{code}
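
The broken count survives the restart because it comes from the volume metadata glusterd persists on disk, not from the running daemons. A hedged way to look at the stored values (the path is glusterd's standard working directory; the exact field names vary by version and are an assumption here):

{code}
# glusterd keeps volume metadata in /var/lib/glusterd/vols/<volname>/info;
# the count / sub_count / replica_count fields there feed the
# "Number of Bricks: X x Y = Z" line printed by `gluster volume info`
root@gluster2:~# grep -i count /var/lib/glusterd/vols/hosting-test/info
{code}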

Actual results:

the volume ends up with a broken brick count (Number of Bricks: 0 x 3 = 2).

Expected results:

the volume continues operating normally, with a correct brick count, after the restart.

Additional info:

Ubuntu 13.04, using the 3.3 packages from http://download.gluster.org/pub/gluster/glusterfs/3.3/3.3.2/Ubuntu.README

Comment 1 Justin Randell 2013-08-29 12:34:44 UTC
this bug is worse than my initial description suggests.

I can reproduce it, on both 3.3 and 3.4, with just these steps (a consolidated sketch follows the list):

1. create a simple replicated volume across two nodes, one brick on each node

2. add a third brick to the volume from one of the existing nodes

3. remove the brick

4. restart gluster
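
A consolidated sketch of the shorter reproduction above, reusing the hostnames and brick paths from the original description (the volume create command and the restart step are assumptions, not copied from the report):

{code}
# step 1: simple 1x2 replicated volume, one brick per node
gluster volume create hosting-test replica 2 \
    gluster1.justindev:/export/brick1/sdb1 gluster2.justindev:/export/brick1/sdb1
gluster volume start hosting-test

# step 2: add a third brick from one of the existing nodes
gluster volume add-brick hosting-test replica 3 gluster1.justindev:/export/brick2/sdc1

# step 3: remove the same brick again
echo y | gluster volume remove-brick hosting-test replica 2 gluster1.justindev:/export/brick2/sdc1

# step 4: restart gluster and check the reported brick count
service glusterfs-server stop
service glusterfs-server start
gluster volume info hosting-test
{code}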

Comment 2 Justin Randell 2013-08-29 12:53:35 UTC

*** This bug has been marked as a duplicate of bug 1002556 ***

