Bug 1346854 - Disk failed, can't (cleanly) remove brick.
Summary: Disk failed, can't (cleanly) remove brick.
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: 3.7.11
Hardware: x86_64
OS: Linux
Target Milestone: ---
Assignee: Sandeep Bansal
QA Contact:
Whiteboard: dht-remove-brick
Depends On:
Reported: 2016-06-15 13:12 UTC by Phil Dumont
Modified: 2017-03-08 10:59 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2017-03-08 10:59:59 UTC
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:

Attachments
compressed (bzip2) tarball of /var/log/glusterfs on server with failed disk (18.49 MB, application/x-bzip)
2016-06-15 13:12 UTC, Phil Dumont

Description Phil Dumont 2016-06-15 13:12:47 UTC
Created attachment 1168372 [details]
compressed (bzip2) tarball of /var/log/glusterfs on server with failed disk

Description of problem:

The following is cut-and-pasted from an email to gluster-users@gluster.org.  I received a reply saying it looked like a bug and asking me to file a bug report.  So here it is.

---- vvvv ---- Begin email cut-and-paste ---- vvvv ----

Just started trying gluster, to decide if we want to put it into production.

Running version 3.7.11-1

Replicated, distributed volume, two servers, 20 bricks per server:

[root@storinator1 ~]# gluster volume status gv0
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
Brick storinator1:/export/brick1/gv0        49153     0          Y       2554 
Brick storinator2:/export/brick1/gv0        49153     0          Y       9686 
Brick storinator1:/export/brick2/gv0        49154     0          Y       2562 
Brick storinator2:/export/brick2/gv0        49154     0          Y       9708 
Brick storinator1:/export/brick3/gv0        49155     0          Y       2568 
Brick storinator2:/export/brick3/gv0        49155     0          Y       9692 
Brick storinator1:/export/brick4/gv0        49156     0          Y       2574 
Brick storinator2:/export/brick4/gv0        49156     0          Y       9765 
Brick storinator1:/export/brick5/gv0        49173     0          Y       16901
Brick storinator2:/export/brick5/gv0        49173     0          Y       9727 
Brick storinator1:/export/brick6/gv0        49174     0          Y       16920
Brick storinator2:/export/brick6/gv0        49174     0          Y       9733 
Brick storinator1:/export/brick7/gv0        49175     0          Y       16939
Brick storinator2:/export/brick7/gv0        49175     0          Y       9739 
Brick storinator1:/export/brick8/gv0        49176     0          Y       16958
Brick storinator2:/export/brick8/gv0        49176     0          Y       9703 
Brick storinator1:/export/brick9/gv0        49177     0          Y       16977
Brick storinator2:/export/brick9/gv0        49177     0          Y       9713 
Brick storinator1:/export/brick10/gv0       49178     0          Y       16996
Brick storinator2:/export/brick10/gv0       49178     0          Y       9718 
Brick storinator1:/export/brick11/gv0       49179     0          Y       17015
Brick storinator2:/export/brick11/gv0       49179     0          Y       9746 
Brick storinator1:/export/brick12/gv0       49180     0          Y       17034
Brick storinator2:/export/brick12/gv0       49180     0          Y       9792 
Brick storinator1:/export/brick13/gv0       49181     0          Y       17053
Brick storinator2:/export/brick13/gv0       49181     0          Y       9755 
Brick storinator1:/export/brick14/gv0       49182     0          Y       17072
Brick storinator2:/export/brick14/gv0       49182     0          Y       9767 
Brick storinator1:/export/brick15/gv0       49183     0          Y       17091
Brick storinator2:/export/brick15/gv0       N/A       N/A        N       N/A  
Brick storinator1:/export/brick16/gv0       49184     0          Y       17110
Brick storinator2:/export/brick16/gv0       49184     0          Y       9791 
Brick storinator1:/export/brick17/gv0       49185     0          Y       17129
Brick storinator2:/export/brick17/gv0       49185     0          Y       9756 
Brick storinator1:/export/brick18/gv0       49186     0          Y       17148
Brick storinator2:/export/brick18/gv0       49186     0          Y       9766 
Brick storinator1:/export/brick19/gv0       49187     0          Y       17167
Brick storinator2:/export/brick19/gv0       49187     0          Y       9745 
Brick storinator1:/export/brick20/gv0       49188     0          Y       17186
Brick storinator2:/export/brick20/gv0       49188     0          Y       9783 
NFS Server on localhost                     2049      0          Y       17206
Self-heal Daemon on localhost               N/A       N/A        Y       17214
NFS Server on storinator2                   2049      0          Y       9657 
Self-heal Daemon on storinator2             N/A       N/A        Y       9677 
Task Status of Volume gv0
Task                 : Rebalance           
ID                   : 28c733e9-d618-44fc-873f-405d3b29a609
Status               : completed           

Wouldn't you know it, within a week or two of pulling the hardware together and getting gluster installed and configured, a disk died.  Note the dead process for brick15 on server storinator2.

I would like to remove (not replace) the failed brick (and its replica).  (I don't have a spare disk handy, and there's plenty of room on the other bricks.)  But gluster doesn't seem to want to remove a brick if the brick is dead:

[root@storinator1 ~]# gluster volume remove-brick gv0 storinator{1..2}:/export/brick15/gv0 start
volume remove-brick start: failed: Staging failed on storinator2. Error: Found stopped brick storinator2:/export/brick15/gv0

So what do I do?  I can't remove the brick while the brick is bad, but I want to remove the brick *because* the brick is bad.  Bit of a Catch-22.

Thanks in advance for any help you can give.

---- ^^^^ ---- End email cut-and-paste ---- ^^^^ ----

Version-Release number of selected component (if applicable):


How reproducible:
I haven't a clue how readily reproducible this is.

Steps to Reproduce:
1. Create a volume like the one described above.
2. Wait for a disk to fail.  (You will, of course, want to force/fake a disk failure, if there's a way to do so.)
3. Attempt to remove the failed brick and its replica

Actual results:
Can't (cleanly) remove failed brick and its replica

Expected results:
Can (cleanly) remove failed brick and its replica

Additional info:

No idea if I got the component right.  A guess based on my very limited understanding of gluster architecture.

Got the job done in a roundabout way.  I tried a remove-brick force.  This worked, but of course the data on the removed bricks was dropped from the volume rather than migrated.  Since the replica brick was still sound, I was able to copy that surviving replica's contents back in through the gluster volume mount point.  This cumbersome but effective workaround is the reason I did not select a higher Severity for this bug report.
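For anyone hitting the same wall, the workaround above can be sketched as a short shell plan.  The volume and brick names come from this report; the mount path, and the rsync invocation (in particular excluding the brick-internal .glusterfs directory) are assumptions, so adapt them before running anything:

```shell
# Sketch of the reporter's workaround (gluster 3.7-era CLI).
VOL=gv0
FAILED=storinator2:/export/brick15/gv0     # brick on the dead disk
SURVIVOR_DIR=/export/brick15/gv0           # healthy replica, on storinator1
MOUNT=/mnt/gv0                             # hypothetical FUSE mount of the volume

# Step 1: force-remove the whole replica pair.  Unlike "start", this does
# not migrate data off the bricks first -- their contents leave the volume:
#
#   gluster volume remove-brick "$VOL" storinator{1..2}:/export/brick15/gv0 force
#
# Step 2: re-inject the surviving replica's files through the mount point
# (never write into a brick directory directly), skipping gluster's
# internal metadata tree:
#
#   rsync -a --exclude='.glusterfs' "$SURVIVOR_DIR"/ "$MOUNT"/
echo "plan: force-remove $FAILED, then copy $SURVIVOR_DIR into $MOUNT"
```

The copy must go through the FUSE mount so that gluster re-hashes each file onto the remaining bricks and recreates its replicas; copying between brick directories behind gluster's back would leave the volume's metadata inconsistent.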

Comment 1 Phil Dumont 2016-06-15 13:21:14 UTC
For what it's worth, S.M.A.R.T. said the disk failure was due to too many bad sectors.

Comment 2 Kaushal 2017-03-08 10:59:59 UTC
This bug is being closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.
