Bug 1002556 - running add-brick then remove-brick, then restarting gluster leads to broken volume brick counts
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.4.0
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Assigned To: Krishnan Parthasarathi
Duplicates: 1000779
Blocks: 1019683
Reported: 2013-08-29 08:52 EDT by Justin Randell
Modified: 2015-11-03 18:05 EST (History)
CC List: 4 users

Fixed In Version: glusterfs-3.4.3
Doc Type: Bug Fix
Cloned as: 1019683
Last Closed: 2014-04-17 09:14:06 EDT
Type: Bug


Attachments: None
Description Justin Randell 2013-08-29 08:52:40 EDT
Description of problem:

running add-brick then remove-brick, then restarting gluster leads to broken volume brick counts

Steps to Reproduce:

1. set up a simple replicated volume with two nodes

{code}
root@gluster1:~# gluster volume info
 
Volume Name: hosting-test
Type: Replicate
Volume ID: 0dcadde0-b981-472d-851a-08fbfff40ae3
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster2.justindev:/export/brick1/sdb1
Brick2: gluster1.justindev:/export/brick1/sdb1
{code}

2. add a third brick to the replica

{code}
root@gluster2:~# gluster volume add-brick hosting-test replica 3 gluster1.justindev:/export/brick2/sdc1
Add Brick successful
root@gluster2:~# gluster volume info
 
Volume Name: hosting-test
Type: Replicate
Volume ID: 0dcadde0-b981-472d-851a-08fbfff40ae3
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gluster2.justindev:/export/brick1/sdb1
Brick2: gluster1.justindev:/export/brick1/sdb1
Brick3: gluster1.justindev:/export/brick2/sdc1
{code}

3. remove the brick

{code}
root@gluster1:~# echo y | gluster volume remove-brick hosting-test replica 2 gluster1.justindev:/export/brick2/sdc1
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) Remove Brick commit force successful
root@gluster1:~# gluster volume info
 
Volume Name: hosting-test
Type: Replicate
Volume ID: 0dcadde0-b981-472d-851a-08fbfff40ae3
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster2.justindev:/export/brick1/sdb1
Brick2: gluster1.justindev:/export/brick1/sdb1
{code}

4. stop and start gluster on either node, and we get funky maths:

{code}
root@gluster2:~# service glusterfs-server stop
glusterfs-server stop/waiting
root@gluster2:~# service glusterfs-server start
glusterfs-server start/running, process 11739
root@gluster2:~# gluster volume info
 
Volume Name: hosting-test
Type: Replicate
Volume ID: f8d7132b-6bb1-40d4-8414-b2168cdf2cd7
Status: Started
Number of Bricks: 0 x 3 = 2
Transport-type: tcp
Bricks:
Brick1: gluster2.justindev:/export/brick1/sdb1
Brick2: gluster1.justindev:/export/brick1/sdb1
{code}

Actual results:

The volume ends up with broken brick arithmetic (Number of Bricks: 0 x 3 = 2).

Expected results:

The volume reports Number of Bricks: 1 x 2 = 2.
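The broken line is consistent with `gluster volume info` deriving the first factor as brick_count / sub_count from glusterd's persisted state, with sub_count left stale at 3 after the remove-brick. A minimal shell sketch of that arithmetic (the file and field names here are stand-ins, not the real glusterd on-disk format, which lives under /var/lib/glusterd/vols/<volname>/):

```shell
# Assumed, simplified stand-in for glusterd's persisted volume info.
# remove-brick dropped brick_count to 2 but left sub_count at 3;
# a restarted glusterd reloads both values as-is.
info=$(mktemp)
cat > "$info" <<'EOF'
brick_count=2
sub_count=3
EOF
. "$info"

# Reconstruct the "N x M = T" line: N = brick_count / sub_count
echo "Number of Bricks: $((brick_count / sub_count)) x ${sub_count} = ${brick_count}"
# -> Number of Bricks: 0 x 3 = 2  (the broken arithmetic above)
rm -f "$info"
```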

Additional info:

Ubuntu 13.04, using the 3.3 or 3.4 packages from http://download.gluster.org/pub/gluster/glusterfs/*/Ubuntu.README
Comment 1 Justin Randell 2013-08-29 08:53:35 EDT
*** Bug 1000779 has been marked as a duplicate of this bug. ***
Comment 2 Marc Seeger 2013-09-04 09:18:57 EDT
Additional info:

Re-adding a brick reports "Operation failed", but the operation does in fact succeed, and it seems to fix the volume.


    [13:12:53] root@fs-15.mseeger:~# gluster volume info
 
    Volume Name: test-fs-cluster-1
    Type: Replicate
    Volume ID: f3117deb-f5f5-40ff-94b5-98b2095239b2
    Status: Started
    Number of Bricks: 0 x 3 = 2
    Transport-type: tcp
    Bricks:
    Brick1: fs-15.mseeger.example.dev:/mnt/brick22
    Brick2: fs-14.mseeger.example.dev:/mnt/brick23
    
    
    [13:12:55] root@fs-15.mseeger:~# rm -rf /mnt/bla/
    [13:13:00] root@fs-15.mseeger:~# mkdir /mnt/bla
    [13:13:02] root@fs-15.mseeger:~# gluster volume add-brick test-fs-cluster-1  replica 3 fs-15:/mnt/bla/
    Operation failed on fs-14.mseeger.example.dev
    
    [13:13:08] root@fs-15.mseeger:~# gluster volume info
     
    Volume Name: test-fs-cluster-1
    Type: Replicate
    Volume ID: f3117deb-f5f5-40ff-94b5-98b2095239b2
    Status: Started
    Number of Bricks: 1 x 3 = 3
    Transport-type: tcp
    Bricks:
    Brick1: fs-15.mseeger.example.dev:/mnt/brick22
    Brick2: fs-14.mseeger.example.dev:/mnt/brick23
    Brick3: fs-15:/mnt/bla


Adding it a second time, for some reason, removes that brick:

    [13:15:03] root@fs-15.mseeger:~# gluster volume add-brick test-fs-cluster-1  replica 3 fs-15:/mnt/bla/
    Operation failed
    [13:15:04] root@fs-15.mseeger:~# gluster volume info
     
    Volume Name: test-fs-cluster-1
    Type: Replicate
    Volume ID: f3117deb-f5f5-40ff-94b5-98b2095239b2
    Status: Started
    Number of Bricks: 1 x 2 = 2
    Transport-type: tcp
    Bricks:
    Brick1: fs-15.mseeger.example.dev:/mnt/brick22
    Brick2: fs-14.mseeger.example.dev:/mnt/brick23


I'm not quite sure what's up with the volume geometry, but it's certainly corrupted.
Comment 3 Anand Avati 2013-09-10 15:59:15 EDT
REVIEW: http://review.gluster.org/5893 (mgmt/glusterd: Update sub_count on remove brick) posted (#1) for review on master by Vijay Bellur (vbellur@redhat.com)
Comment 4 Anand Avati 2013-09-11 00:24:49 EDT
REVIEW: http://review.gluster.org/5893 (mgmt/glusterd: Update sub_count on remove brick) posted (#2) for review on master by Vijay Bellur (vbellur@redhat.com)
Comment 5 Marc Seeger 2013-09-11 09:17:09 EDT
This seems to have fixed it.
Will this be backported to 3.3 / 3.4?
Comment 6 Marc Seeger 2013-09-11 09:24:40 EDT
This is what it looks like after the fix:







[13:14:20] root@fs-21.mseeger:~# gluster volume info
 
Volume Name: test-fs-cluster-1
Type: Replicate
Volume ID: a25ac752-57c9-4496-92ca-bfdcb964edd4
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: fs-21.dev:/mnt/brick37
Brick2: fs-22.dev:/mnt/brick36
[13:14:47] root@fs-21.mseeger:~# mkdir /mnt/bla
[13:15:08] root@fs-21.mseeger:~# gluster volume add-brick test-fs-cluster-1 replica 3 fs-21:/mnt/bla/
Add Brick successful
[13:15:42] root@fs-21.mseeger:~# gluster volume info
 
Volume Name: test-fs-cluster-1
Type: Replicate
Volume ID: a25ac752-57c9-4496-92ca-bfdcb964edd4
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: fs-21.dev:/mnt/brick37
Brick2: fs-22.dev:/mnt/brick36
Brick3: fs-21:/mnt/bla
[13:15:49] root@fs-21.mseeger:~# echo y | gluster volume remove-brick test-fs-cluster-1 replica 2 fs-21:/mnt/bla/
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) Remove Brick commit force successful
[13:16:17] root@fs-21.mseeger:~# gluster volume info
 
Volume Name: test-fs-cluster-1
Type: Replicate
Volume ID: a25ac752-57c9-4496-92ca-bfdcb964edd4
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: fs-21.dev:/mnt/brick37
Brick2: fs-22.dev:/mnt/brick36
[13:16:23] root@fs-21.mseeger:~# service glusterfs-server stop
glusterfs-server stop/waiting
[13:16:34] root@fs-21.mseeger:~# service glusterfs-server start
glusterfs-server start/running, process 29760
[13:16:37] root@fs-21.mseeger:~# gluster volume info
 
Volume Name: test-fs-cluster-1
Type: Replicate
Volume ID: a25ac752-57c9-4496-92ca-bfdcb964edd4
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: fs-21.dev:/mnt/brick37
Brick2: fs-22.dev:/mnt/brick36
Comment 7 Anand Avati 2013-09-12 01:52:19 EDT
COMMIT: http://review.gluster.org/5893 committed in master by Anand Avati (avati@redhat.com) 
------
commit 643533c77fd49316b7d16015fa1a008391d14bb2
Author: Vijay Bellur <vbellur@redhat.com>
Date:   Wed Sep 11 01:26:13 2013 +0530

    mgmt/glusterd: Update sub_count on remove brick
    
    Change-Id: I7c17de39da03c6b2764790581e097936da406695
    BUG: 1002556
    Signed-off-by: Vijay Bellur <vbellur@redhat.com>
    Reviewed-on: http://review.gluster.org/5893
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
    Reviewed-by: Anand Avati <avati@redhat.com>
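The commit title points at the root cause: remove-brick updated the brick list but not the persisted sub_count, so a restarted glusterd reloaded stale geometry. A hedged shell sketch of the fix's effect (an illustration of the idea only, not the actual C change in glusterd):

```shell
# Illustration only -- the real fix is a C change in glusterd.
# Before the fix: remove-brick decremented brick_count but never
# touched sub_count, leaving 2 bricks with sub_count=3 ("0 x 3 = 2").
brick_count=3; sub_count=3   # replica-3 volume, as in step 2 above
new_replica=2                # "remove-brick ... replica 2 ..."

brick_count=$((brick_count - 1))
sub_count=$new_replica       # the missing update the commit adds

echo "Number of Bricks: $((brick_count / sub_count)) x ${sub_count} = ${brick_count}"
# -> Number of Bricks: 1 x 2 = 2, stable across a glusterd restart
```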
Comment 8 Anand Avati 2013-09-12 11:43:15 EDT
REVIEW: http://review.gluster.org/5902 (mgmt/glusterd: Update sub_count on remove brick) posted (#1) for review on release-3.4 by Vijay Bellur (vbellur@redhat.com)
Comment 9 Anand Avati 2013-09-13 12:51:48 EDT
COMMIT: http://review.gluster.org/5902 committed in release-3.4 by Vijay Bellur (vbellur@redhat.com) 
------
commit d9dde294cfd7bb83bccbe777dfd58b925a6f2f7b
Author: Vijay Bellur <vbellur@redhat.com>
Date:   Wed Sep 11 01:26:13 2013 +0530

    mgmt/glusterd: Update sub_count on remove brick
    
    Change-Id: I7c17de39da03c6b2764790581e097936da406695
    BUG: 1002556
    Signed-off-by: Vijay Bellur <vbellur@redhat.com>
    Reviewed-on: http://review.gluster.org/5902
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
Comment 10 Marc Seeger 2013-09-13 13:44:17 EDT
This is also failing in 3.3.
Will there be a backport?
(I tested the fix on 3.3, worked fine)
Comment 12 Niels de Vos 2014-04-17 09:14:06 EDT
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.4.3, please reopen this bug report.

glusterfs-3.4.3 has been announced on the Gluster Developers mailing list [1]; packages for several distributions should already be available or will become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

The fix for this bug is likely to be included in all future GlusterFS releases, i.e. releases after 3.4.3. Likewise, the recent glusterfs-3.5.0 release [3] is likely to contain the fix; you can verify this by reading the comments in this bug report and checking for a comment mentioning "committed in release-3.5".

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/5978
[2] http://news.gmane.org/gmane.comp.file-systems.gluster.user
[3] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/6137
