Problem:
========
On a pure distribute volume spread across multiple nodes, I killed the glusterd process on one of the nodes hosting bricks. After this, I did a remove-brick of a brick on that node. The remove-brick start command succeeds, but the volume status shows the remove-brick task as not yet started (which makes complete sense, as there is no glusterd to communicate with and drive the rebalance). Now when I commit the brick removal, it passes successfully and the brick is removed. This can be a serious data loss issue.

Version:
========
[root@tettnang glusterfs]# rpm -qa | grep gluster
glusterfs-api-3.7.1-5.el7rhgs.x86_64
glusterfs-libs-3.7.1-5.el7rhgs.x86_64
glusterfs-rdma-3.7.1-5.el7rhgs.x86_64
glusterfs-3.7.1-5.el7rhgs.x86_64
glusterfs-cli-3.7.1-5.el7rhgs.x86_64
glusterfs-debuginfo-3.7.1-5.el7rhgs.x86_64
glusterfs-client-xlators-3.7.1-5.el7rhgs.x86_64
glusterfs-server-3.7.1-5.el7rhgs.x86_64
glusterfs-geo-replication-3.7.1-5.el7rhgs.x86_64
glusterfs-fuse-3.7.1-5.el7rhgs.x86_64

[root@tettnang glusterfs]# gluster --version
glusterfs 3.7.1 built on Jun 23 2015 22:08:15
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

Steps to reproduce:
===================
1) Have two nodes, A and B.

2) Create a distribute volume with bricks on both nodes:

gluster v create dist2 tettnang:/rhs/brick2/dist2 tettnang:/rhs/brick1/dist2 yarrow:/rhs/brick2/dist2
volume create: dist2: success: please start the volume to access data

3) Start the volume and, if you want, populate some data (I haven't populated any).

[root@tettnang glusterfs]# gluster v info dist2

Volume Name: dist2
Type: Distribute
Volume ID: 833cc008-b234-48d6-a64c-bbc4f18f3d84
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: tettnang:/rhs/brick2/dist2
Brick2: tettnang:/rhs/brick1/dist2
Brick3: yarrow:/rhs/brick2/dist2
Options Reconfigured:
performance.readdir-ahead: on

[root@tettnang glusterfs]# gluster v status dist2
Status of volume: dist2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick tettnang:/rhs/brick2/dist2            49165     0          Y       24799
Brick tettnang:/rhs/brick1/dist2            49166     0          Y       24821
Brick yarrow:/rhs/brick2/dist2              49163     0          Y       8994
NFS Server on localhost                     N/A       N/A        N       N/A
NFS Server on zod                           N/A       N/A        N       N/A
NFS Server on yarrow                        N/A       N/A        N       N/A

Task Status of Volume dist2
------------------------------------------------------------------------------
There are no active volume tasks

4) Now kill the glusterd process on node B.

5) Issue vol info (it will show all bricks) and vol status (it will show only the bricks whose nodes are alive):

[root@tettnang glusterfs]# gluster v status dist2
Status of volume: dist2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick tettnang:/rhs/brick2/dist2            49165     0          Y       24799
Brick tettnang:/rhs/brick1/dist2            49166     0          Y       24821
NFS Server on localhost                     N/A       N/A        N       N/A
NFS Server on zod                           N/A       N/A        N       N/A

Task Status of Volume dist2
------------------------------------------------------------------------------
There are no active volume tasks

6) Now issue "gluster v remove-brick <volname> <node B brick> start":

[root@tettnang glusterfs]# gluster v remove-brick dist2 yarrow:/rhs/brick2/dist2 start
volume remove-brick start: success
ID: b8a17707-c38d-4ad1-9999-6403e0ae93c4

[root@tettnang glusterfs]# gluster v status dist2
Status of volume: dist2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick tettnang:/rhs/brick2/dist2            49165     0          Y       24799
Brick tettnang:/rhs/brick1/dist2            49166     0          Y       24821
NFS Server on localhost                     N/A       N/A        N       N/A
NFS Server on zod                           N/A       N/A        N       N/A

Task Status of Volume dist2
------------------------------------------------------------------------------
Task           : Remove brick
ID             : b8a17707-c38d-4ad1-9999-6403e0ae93c4
Removed bricks:
yarrow:/rhs/brick2/dist2
Status         : not started

The remove-brick command starts off successfully, but the remove-brick status shows "not started".

7) Now issue a commit of the remove-brick:

volume remove-brick start: failed: An earlier remove-brick task exists for volume dist2. Either commit it or stop it before starting a new task.
[root@tettnang glusterfs]# gluster v remove-brick dist2 yarrow:/rhs/brick2/dist2 commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success
Check the removed bricks to ensure all files are migrated.

Actual Result:
==============
Commit passes even though the start command has not even completed.

Expected Result:
================
Commit should pass only after remove-brick start has completed.

Another Note:
=============
Also, the rebalance log itself was never created.
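A practical safeguard until the fix is in: before issuing remove-brick, confirm that glusterd on every peer hosting a brick of the volume is reachable. A minimal sketch, assuming the volume name dist2 from this report (the awk parsing of the Bricks: lines is illustrative only and may need adjusting):

# List the hosts that carry bricks of the volume
gluster volume info dist2 | awk '/^Brick[0-9]+:/ {split($2, a, ":"); print a[1]}' | sort -u

# Every host printed above should show "Connected" here before remove-brick is
# attempted; a "Disconnected" entry means its glusterd is down and the
# remove-brick has to wait.
gluster pool list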
Created attachment 1043491 [details] server#1 logs sosreports
Created attachment 1043495 [details] server#2 logs sosreports
Sent one possible fix at http://review.gluster.org/#/c/10954/. With this change glusterd checks that the brick is in the started state; if it is not, the op is failed.
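For context, an operator-side check on the CLI output alone would be fragile: when only a brick process dies the brick still shows up in 'gluster v status' with Online = N, but when the hosting glusterd is down the brick row disappears entirely (as in step 5 of the report), which is why the guard belongs in glusterd's staging of the op. A hedged sketch of what such a CLI-side check would look like, using the brick from this report:

# No row at all for the brick means its glusterd may be down - treat that as "do not remove yet".
gluster volume status dist2 | grep -F 'yarrow:/rhs/brick2/dist2' \
    || echo 'brick not reported by status - hold the remove-brick'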
Upstream patch link : http://review.gluster.org/11726
Downstream patch https://code.engineering.redhat.com/gerrit/#/c/56353/ posted for review
Tested with glusterfs-3.7.1-13.el7rhgs.

Removing a brick on a node where glusterd is down is no longer allowed, so 'remove-brick start' fails with a proper error message. In a cluster of host1 and host2, with glusterd on host2 down, you get the following error:

[root@ ~]# gluster volume remove-brick drvol host1:/rhs/brick1/b1 host2:/rhs/brick1/b1 start
volume remove-brick start: failed: Host node of the brick host2:/rhs/brick1/b1 is down

Marking this bug as VERIFIED.
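With the fix, the way forward when this error is hit is to bring glusterd back on the affected node and only then retry the removal. A minimal sketch using the hostnames and volume from the comment above (commit only once status reports the migration as completed):

# On host2, bring the management daemon back (RHEL 7 / el7rhgs):
systemctl start glusterd

# Then, from any node in the cluster:
gluster volume remove-brick drvol host1:/rhs/brick1/b1 host2:/rhs/brick1/b1 start
gluster volume remove-brick drvol host1:/rhs/brick1/b1 host2:/rhs/brick1/b1 status
gluster volume remove-brick drvol host1:/rhs/brick1/b1 host2:/rhs/brick1/b1 commit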
Atin, I've made a few minor edits to the doc text. Please review and sign off.
Doc text looks good to me.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1845.html