Created attachment 1168372 [details]
compressed (bzip2) tarball of /var/log/glusterfs on server with failed disk

Description of problem:

The following is cut-n-paste from an email to gluster-users. I received a reply saying it looked like a bug and would I please submit a BZ. So here it is.

---- vvvv ---- Begin email cut-n-paste ---- vvvv ----

Just started trying gluster, to decide if we want to put it into production.
Running version 3.7.11-1

Replicated, distributed volume, two servers, 20 bricks per server:

[root@storinator1 ~]# gluster volume status gv0
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick storinator1:/export/brick1/gv0        49153     0          Y       2554
Brick storinator2:/export/brick1/gv0        49153     0          Y       9686
Brick storinator1:/export/brick2/gv0        49154     0          Y       2562
Brick storinator2:/export/brick2/gv0        49154     0          Y       9708
Brick storinator1:/export/brick3/gv0        49155     0          Y       2568
Brick storinator2:/export/brick3/gv0        49155     0          Y       9692
Brick storinator1:/export/brick4/gv0        49156     0          Y       2574
Brick storinator2:/export/brick4/gv0        49156     0          Y       9765
Brick storinator1:/export/brick5/gv0        49173     0          Y       16901
Brick storinator2:/export/brick5/gv0        49173     0          Y       9727
Brick storinator1:/export/brick6/gv0        49174     0          Y       16920
Brick storinator2:/export/brick6/gv0        49174     0          Y       9733
Brick storinator1:/export/brick7/gv0        49175     0          Y       16939
Brick storinator2:/export/brick7/gv0        49175     0          Y       9739
Brick storinator1:/export/brick8/gv0        49176     0          Y       16958
Brick storinator2:/export/brick8/gv0        49176     0          Y       9703
Brick storinator1:/export/brick9/gv0        49177     0          Y       16977
Brick storinator2:/export/brick9/gv0        49177     0          Y       9713
Brick storinator1:/export/brick10/gv0       49178     0          Y       16996
Brick storinator2:/export/brick10/gv0       49178     0          Y       9718
Brick storinator1:/export/brick11/gv0       49179     0          Y       17015
Brick storinator2:/export/brick11/gv0       49179     0          Y       9746
Brick storinator1:/export/brick12/gv0       49180     0          Y       17034
Brick storinator2:/export/brick12/gv0       49180     0          Y       9792
Brick storinator1:/export/brick13/gv0       49181     0          Y       17053
Brick storinator2:/export/brick13/gv0       49181     0          Y       9755
Brick storinator1:/export/brick14/gv0       49182     0          Y       17072
Brick storinator2:/export/brick14/gv0       49182     0          Y       9767
Brick storinator1:/export/brick15/gv0       49183     0          Y       17091
Brick storinator2:/export/brick15/gv0       N/A       N/A        N       N/A
Brick storinator1:/export/brick16/gv0       49184     0          Y       17110
Brick storinator2:/export/brick16/gv0       49184     0          Y       9791
Brick storinator1:/export/brick17/gv0       49185     0          Y       17129
Brick storinator2:/export/brick17/gv0       49185     0          Y       9756
Brick storinator1:/export/brick18/gv0       49186     0          Y       17148
Brick storinator2:/export/brick18/gv0       49186     0          Y       9766
Brick storinator1:/export/brick19/gv0       49187     0          Y       17167
Brick storinator2:/export/brick19/gv0       49187     0          Y       9745
Brick storinator1:/export/brick20/gv0       49188     0          Y       17186
Brick storinator2:/export/brick20/gv0       49188     0          Y       9783
NFS Server on localhost                     2049      0          Y       17206
Self-heal Daemon on localhost               N/A       N/A        Y       17214
NFS Server on storinator2                   2049      0          Y       9657
Self-heal Daemon on storinator2             N/A       N/A        Y       9677

Task Status of Volume gv0
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 28c733e9-d618-44fc-873f-405d3b29a609
Status               : completed

Wouldn't you know it, within a week or two of pulling the hardware together and getting gluster installed and configured, a disk dies. Note the dead process for brick15 on server storinator2.

I would like to remove (not replace) the failed brick (and its replica). (I don't have a spare disk handy, and there's plenty of room on the other bricks.) But gluster doesn't seem to want to remove a brick if the brick is dead:

[root@storinator1 ~]# gluster volume remove-brick gv0 storinator{1..2}:/export/brick15/gv0 start
volume remove-brick start: failed: Staging failed on storinator2. Error: Found stopped brick storinator2:/export/brick15/gv0

So what do I do? I can't remove the brick while the brick is bad, but I want to remove the brick *because* the brick is bad. Bit of a Catch-22.
Thanks in advance for any help you can give.

---- ^^^^ ---- End email cut-n-paste ---- ^^^^ ----

Version-Release number of selected component (if applicable):
glusterfs 3.7.11-1 (per the email above)

How reproducible:
I haven't a clue how readily reproducible this is.

Steps to Reproduce:
1. Create a volume like the one described above.
2. Wait for a disk to fail. (You will, of course, want to force/fake a disk failure, if there's a way to do so.)
3. Attempt to remove the failed brick and its replica.

Actual results:
Can't (cleanly) remove the failed brick and its replica.

Expected results:
Can (cleanly) remove the failed brick and its replica.

Additional info:
No idea if I got the component right; it's a guess based on my very limited understanding of gluster architecture.

Got the job done in a roundabout way: I tried a remove-brick force. This worked, but of course it resulted in the data on the removed brick being gone from the volume. Since the replica brick was still sound, I was able to copy the removed replica's contents to the gluster volume mount point. This cumbersome but effective workaround is the reason I did not select a higher Severity for this bug report.
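For the record, the workaround would look something like the following session. Treat this as a sketch, not verified commands: the client mount point /mnt/gv0 and the use of rsync to skip the brick's internal .glusterfs metadata directory are my assumptions; the volume name, hosts, and brick paths come from the report.

[root@storinator1 ~]# gluster volume remove-brick gv0 storinator{1..2}:/export/brick15/gv0 force
[root@storinator1 ~]# rsync -a --exclude=.glusterfs /export/brick15/gv0/ /mnt/gv0/

The "force" variant skips the data-migration staging step that fails on the stopped brick, so the files held by this replica pair drop out of the volume; the rsync then re-adds them through a client mount so gluster redistributes them across the remaining bricks.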
For what it's worth, S.M.A.R.T. said the disk failure was due to too many bad sectors.
This bug is being closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check whether it still exists on newer releases of GlusterFS. If this bug still exists in a newer GlusterFS release, please reopen it against that release.