Bug 1002556 - running add-brick then remove-brick, then restarting gluster leads to broken volume brick counts
Summary: running add-brick then remove-brick, then restarting gluster leads to broken volume brick counts
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.4.0
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: krishnan parthasarathi
QA Contact:
URL:
Whiteboard:
Duplicates: 1000779 (view as bug list)
Depends On:
Blocks: 1019683
 
Reported: 2013-08-29 12:52 UTC by Justin Randell
Modified: 2015-11-03 23:05 UTC (History)
CC List: 4 users

Fixed In Version: glusterfs-3.4.3
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1019683 (view as bug list)
Environment:
Last Closed: 2014-04-17 13:14:06 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Justin Randell 2013-08-29 12:52:40 UTC
Description of problem:

Running add-brick then remove-brick, then restarting gluster, leads to broken volume brick counts.

Steps to Reproduce:

1. set up a simple replicated volume with two nodes

{code}
root@gluster1:~# gluster volume info
 
Volume Name: hosting-test
Type: Replicate
Volume ID: 0dcadde0-b981-472d-851a-08fbfff40ae3
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster2.justindev:/export/brick1/sdb1
Brick2: gluster1.justindev:/export/brick1/sdb1
{code}

2. add a third brick to the replica

{code}
root@gluster2:~# gluster volume add-brick hosting-test replica 3 gluster1.justindev:/export/brick2/sdc1
Add Brick successful
root@gluster2:~# gluster volume info
 
Volume Name: hosting-test
Type: Replicate
Volume ID: 0dcadde0-b981-472d-851a-08fbfff40ae3
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gluster2.justindev:/export/brick1/sdb1
Brick2: gluster1.justindev:/export/brick1/sdb1
Brick3: gluster1.justindev:/export/brick2/sdc1
{code}

3. remove the brick

{code}
root@gluster1:~# echo y | gluster volume remove-brick hosting-test replica 2 gluster1.justindev:/export/brick2/sdc1
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) Remove Brick commit force successful
root@gluster1:~# gluster volume info
 
Volume Name: hosting-test
Type: Replicate
Volume ID: 0dcadde0-b981-472d-851a-08fbfff40ae3
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster2.justindev:/export/brick1/sdb1
Brick2: gluster1.justindev:/export/brick1/sdb1
{code}

4. stop and start gluster on either node, and the reported brick count becomes inconsistent:

{code}
root@gluster2:~# service glusterfs-server stop
glusterfs-server stop/waiting
root@gluster2:~# service glusterfs-server start
glusterfs-server start/running, process 11739
root@gluster2:~# gluster volume info
 
Volume Name: hosting-test
Type: Replicate
Volume ID: f8d7132b-6bb1-40d4-8414-b2168cdf2cd7
Status: Started
Number of Bricks: 0 x 3 = 2
Transport-type: tcp
Bricks:
Brick1: gluster2.justindev:/export/brick1/sdb1
Brick2: gluster1.justindev:/export/brick1/sdb1
{code}
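
For reference, the "Number of Bricks" line for a replicated volume is printed as "distribute count x replica count = total bricks", where the distribute count comes from dividing the total brick count by the number of bricks per subvolume (glusterd's sub_count). The sketch below (plain C, not the actual glusterd/CLI code; the names are only modeled on glusterd's volume-info fields) shows how a stale per-subvolume count of 3 combined with the remaining 2 bricks yields the "0 x 3 = 2" seen above:

{code}
/* Illustrative sketch only, NOT glusterd source: model how the
 * "Number of Bricks: A x B = C" line goes wrong when the per-subvolume
 * brick count (sub_count) is not updated by remove-brick. */
#include <stdio.h>

static void print_brick_line(int brick_count, int sub_count)
{
    int dist_count = sub_count ? brick_count / sub_count : 0;
    printf("Number of Bricks: %d x %d = %d\n",
           dist_count, sub_count, brick_count);
}

int main(void)
{
    /* stale sub_count left over from the "replica 3" configuration */
    print_brick_line(2, 3);   /* -> 0 x 3 = 2 (the buggy output) */

    /* sub_count correctly updated to the new replica count */
    print_brick_line(2, 2);   /* -> 1 x 2 = 2 (the expected output) */
    return 0;
}
{code}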

Actual results:

volume ends up with an inconsistent brick count ("Number of Bricks: 0 x 3 = 2").

Expected results:

volume reports 1 x 2 = 2 for bricks.

Additional info:

Ubuntu 13.04, using the 3.3 or 3.4 packages from http://download.gluster.org/pub/gluster/glusterfs/*/Ubuntu.README
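
Note on why the restart is the trigger: glusterd persists each volume's definition under /var/lib/glusterd/vols/<volname>/ and rebuilds its in-memory state from that store when the daemon starts, so a per-subvolume count that was never updated at remove-brick time only becomes visible once glusterd re-reads it. A rough sketch of that restore step follows (plain C; the key=value layout and the count/sub_count key names are assumptions for illustration, not verified against the 3.4 store format):

{code}
/* Rough sketch, NOT glusterd source: at startup the daemon re-reads a
 * persisted key=value volume definition. If remove-brick left a stale
 * sub_count=3 on disk while only two bricks remain listed, the restored
 * volume reports "0 x 3 = 2" even though it showed "1 x 2 = 2" before
 * the restart. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* hypothetical excerpt of a persisted volume definition */
    const char *store =
        "type=2\n"
        "count=2\n"        /* bricks actually listed in the store */
        "sub_count=3\n";   /* stale: never updated by remove-brick */

    int brick_count = 0, sub_count = 0;
    char line[64];
    const char *p = store;

    while (sscanf(p, "%63[^\n]", line) == 1) {
        sscanf(line, "count=%d", &brick_count);
        sscanf(line, "sub_count=%d", &sub_count);
        p += strlen(line) + 1;   /* skip the line and its newline */
    }

    printf("Number of Bricks: %d x %d = %d\n",
           sub_count ? brick_count / sub_count : 0, sub_count, brick_count);
    return 0;
}
{code}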

Comment 1 Justin Randell 2013-08-29 12:53:35 UTC
*** Bug 1000779 has been marked as a duplicate of this bug. ***

Comment 2 Marc Seeger 2013-09-04 13:18:57 UTC
Additional info:

Re-adding a brick results in an "Operation failed" message, but the operation does indeed succeed, and it seems to fix the brick count.


    [13:12:53] root:~# gluster volume info
 
    Volume Name: test-fs-cluster-1
    Type: Replicate
    Volume ID: f3117deb-f5f5-40ff-94b5-98b2095239b2
    Status: Started
    Number of Bricks: 0 x 3 = 2
    Transport-type: tcp
    Bricks:
    Brick1: fs-15.mseeger.example.dev:/mnt/brick22
    Brick2: fs-14.mseeger.example.dev:/mnt/brick23
    
    
    [13:12:55] root:~# rm -rf /mnt/bla/
    [13:13:00] root:~# mkdir /mnt/bla
    [13:13:02] root:~# gluster volume add-brick test-fs-cluster-1  replica 3 fs-15:/mnt/bla/
    Operation failed on fs-14.mseeger.example.dev
    
    [13:13:08] root:~# gluster volume info
     
    Volume Name: test-fs-cluster-1
    Type: Replicate
    Volume ID: f3117deb-f5f5-40ff-94b5-98b2095239b2
    Status: Started
    Number of Bricks: 1 x 3 = 3
    Transport-type: tcp
    Bricks:
    Brick1: fs-15.mseeger.example.dev:/mnt/brick22
    Brick2: fs-14.mseeger.example.dev:/mnt/brick23
    Brick3: fs-15:/mnt/bla


Adding it a second time will, for some reason, remove that brick:

    [13:15:03] root:~# gluster volume add-brick test-fs-cluster-1  replica 3 fs-15:/mnt/bla/
    Operation failed
    [13:15:04] root:~# gluster volume info
     
    Volume Name: test-fs-cluster-1
    Type: Replicate
    Volume ID: f3117deb-f5f5-40ff-94b5-98b2095239b2
    Status: Started
    Number of Bricks: 1 x 2 = 2
    Transport-type: tcp
    Bricks:
    Brick1: fs-15.mseeger.example.dev:/mnt/brick22
    Brick2: fs-14.mseeger.example.dev:/mnt/brick23

I'm not quite sure what's up with the volume geometry, but it's certainly corrupted.

Comment 3 Anand Avati 2013-09-10 19:59:15 UTC
REVIEW: http://review.gluster.org/5893 (mgmt/glusterd: Update sub_count on remove brick) posted (#1) for review on master by Vijay Bellur (vbellur)

Comment 4 Anand Avati 2013-09-11 04:24:49 UTC
REVIEW: http://review.gluster.org/5893 (mgmt/glusterd: Update sub_count on remove brick) posted (#2) for review on master by Vijay Bellur (vbellur)

Comment 5 Marc Seeger 2013-09-11 13:17:09 UTC
This seems to have fixed it.
Will this be backported to 3.3 / 3.4?

Comment 6 Marc Seeger 2013-09-11 13:24:40 UTC
This is what it looks like after the fix:

[13:14:20] root:~# gluster volume info
 
Volume Name: test-fs-cluster-1
Type: Replicate
Volume ID: a25ac752-57c9-4496-92ca-bfdcb964edd4
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: fs-21.dev:/mnt/brick37
Brick2: fs-22.dev:/mnt/brick36
[13:14:47] root:~# mkdir /mnt/bla
[13:15:08] root:~# gluster volume add-brick test-fs-cluster-1 replica 3 fs-21:/mnt/bla/
Add Brick successful
[13:15:42] root:~# gluster volume info
 
Volume Name: test-fs-cluster-1
Type: Replicate
Volume ID: a25ac752-57c9-4496-92ca-bfdcb964edd4
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: fs-21.dev:/mnt/brick37
Brick2: fs-22.dev:/mnt/brick36
Brick3: fs-21:/mnt/bla
[13:15:49] root:~# echo y | gluster volume remove-brick test-fs-cluster-1 replica 2 fs-21:/mnt/bla/
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) Remove Brick commit force successful
[13:16:17] root:~# gluster volume info
 
Volume Name: test-fs-cluster-1
Type: Replicate
Volume ID: a25ac752-57c9-4496-92ca-bfdcb964edd4
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: fs-21.dev:/mnt/brick37
Brick2: fs-22.dev:/mnt/brick36
[13:16:23] root:~# service glusterfs-server stop
glusterfs-server stop/waiting
[13:16:34] root:~# service glusterfs-server start
glusterfs-server start/running, process 29760
[13:16:37] root:~# gluster volume info
 
Volume Name: test-fs-cluster-1
Type: Replicate
Volume ID: a25ac752-57c9-4496-92ca-bfdcb964edd4
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: fs-21.dev:/mnt/brick37
Brick2: fs-22.dev:/mnt/brick36

Comment 7 Anand Avati 2013-09-12 05:52:19 UTC
COMMIT: http://review.gluster.org/5893 committed in master by Anand Avati (avati) 
------
commit 643533c77fd49316b7d16015fa1a008391d14bb2
Author: Vijay Bellur <vbellur>
Date:   Wed Sep 11 01:26:13 2013 +0530

    mgmt/glusterd: Update sub_count on remove brick
    
    Change-Id: I7c17de39da03c6b2764790581e097936da406695
    BUG: 1002556
    Signed-off-by: Vijay Bellur <vbellur>
    Reviewed-on: http://review.gluster.org/5893
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Krishnan Parthasarathi <kparthas>
    Reviewed-by: Anand Avati <avati>
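
For readers skimming the thread, here is a rough idea of what "Update sub_count on remove brick" implies (a simplified model, not the actual patch; see the Gerrit change above for the real diff, and the field names below are only loosely modeled on glusterd's volume-info structure): when a remove-brick lowers the replica count, the per-subvolume brick count has to be updated before the volume definition is written back to disk, otherwise a later glusterd restart restores the stale value.

{code}
/* Simplified model of the fix's intent, NOT the actual glusterd change. */
#include <stdio.h>

/* loosely modeled on glusterd's volume info; field names are illustrative */
struct volinfo {
    int brick_count;
    int replica_count;
    int sub_count;      /* bricks per subvolume */
};

static void remove_brick_commit(struct volinfo *v, int new_replica_count)
{
    v->brick_count--;
    if (new_replica_count && new_replica_count != v->replica_count) {
        v->replica_count = new_replica_count;
        v->sub_count     = new_replica_count;   /* the kind of update the fix adds */
    }
    /* ...the real code then persists the volume definition to disk... */
}

int main(void)
{
    /* state right after "add-brick ... replica 3" */
    struct volinfo v = { .brick_count = 3, .replica_count = 3, .sub_count = 3 };

    /* "remove-brick ... replica 2 <brick>" */
    remove_brick_commit(&v, 2);

    printf("Number of Bricks: %d x %d = %d\n",
           v.brick_count / v.sub_count, v.replica_count, v.brick_count);
    /* -> 1 x 2 = 2, and the value survives a glusterd restart */
    return 0;
}
{code}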

Comment 8 Anand Avati 2013-09-12 15:43:15 UTC
REVIEW: http://review.gluster.org/5902 (mgmt/glusterd: Update sub_count on remove brick) posted (#1) for review on release-3.4 by Vijay Bellur (vbellur)

Comment 9 Anand Avati 2013-09-13 16:51:48 UTC
COMMIT: http://review.gluster.org/5902 committed in release-3.4 by Vijay Bellur (vbellur) 
------
commit d9dde294cfd7bb83bccbe777dfd58b925a6f2f7b
Author: Vijay Bellur <vbellur>
Date:   Wed Sep 11 01:26:13 2013 +0530

    mgmt/glusterd: Update sub_count on remove brick
    
    Change-Id: I7c17de39da03c6b2764790581e097936da406695
    BUG: 1002556
    Signed-off-by: Vijay Bellur <vbellur>
    Reviewed-on: http://review.gluster.org/5902
    Tested-by: Gluster Build System <jenkins.com>

Comment 10 Marc Seeger 2013-09-13 17:44:17 UTC
This is also failing in 3.3.
Will there be a backport?
(I tested the fix on 3.3, worked fine)

Comment 11 Marc Seeger 2013-09-13 17:44:35 UTC
This is also failing in 3.3
Will there be a backport?
(I tested the fix on 3.3, worked fine)

Comment 12 Niels de Vos 2014-04-17 13:14:06 UTC
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.4.3, please reopen this bug report.

glusterfs-3.4.3 has been announced on the Gluster Developers mailing list [1]; packages for several distributions should already be available or will become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

The fix for this bug is likely to be included in all future GlusterFS releases, i.e. releases > 3.4.3. Along the same lines, the recent glusterfs-3.5.0 release [3] is likely to include the fix. You can verify this by reading the comments in this bug report and checking for comments mentioning "committed in release-3.5".

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/5978
[2] http://news.gmane.org/gmane.comp.file-systems.gluster.user
[3] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/6137

