Description of problem:
=======================
Had a two node cluster (node-1 and node-2) with a Distributed volume (1*2), mounted it as FUSE and started IO. While IO was in progress, started a remove-brick operation and restarted glusterd on the node hosting the brick being removed. After the glusterd restart the rebalance info ("Rebalanced-files", "size", "scanned", etc.) is no longer reported correctly; all of these are shown as zeros.

Version-Release number of selected component (if applicable):
==============================================================
glusterfs-3.7.5-14

How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Have a two node cluster (node-1 and node-2).
2. Create a Distributed volume using bricks from both nodes (1*2).
3. Mount the volume as FUSE and start IO.
4. While IO is in progress, start the remove-brick of the node-2 brick (see the command sketch under Additional info below).
5. Check the remove-brick status            // it shows the rebalance info
6. Stop and start glusterd on node-2.
7. Check the remove-brick status again on both nodes   // it no longer shows the rebalance info

Actual results:
===============
No rebalance info is displayed after the glusterd restart.

Expected results:
=================
The rebalance info should be shown even after a glusterd restart.

Additional info:
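For reference, a minimal shell sketch of the steps above, under illustrative assumptions: the volume name "distvol", the brick paths under /bricks/brick1, the mount point /mnt/distvol and the IO workload are made up for the example; only the gluster, mount and service commands themselves are standard CLI.

# create and start a plain distribute volume with one brick per node (1*2)
gluster volume create distvol node-1:/bricks/brick1/b1 node-2:/bricks/brick1/b2
gluster volume start distvol

# FUSE mount and start some IO in the background
mount -t glusterfs node-1:/distvol /mnt/distvol
cp -r /usr/share/doc /mnt/distvol/ &

# while IO is running, start removing the node-2 brick and check the status
gluster volume remove-brick distvol node-2:/bricks/brick1/b2 start
gluster volume remove-brick distvol node-2:/bricks/brick1/b2 status    # rebalance counters are shown

# restart glusterd on node-2, then re-check the status on both nodes
systemctl restart glusterd    # on node-2 (or 'service glusterd restart' on older releases)
gluster volume remove-brick distvol node-2:/bricks/brick1/b2 status    # counters now show as zeros (the bug)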
To add to this, detach-tier status doesn't show any stats either, though the log shows the progress of files being migrated:

[root@transformers ~]# gluster v detach-tier dpvol status
Node                              Rebalanced-files    size      scanned   failures   skipped   status        run time in secs
---------                         ----------------    -----     -------   --------   -------   ------------  ----------------
tettnang.lab.eng.blr.redhat.com   0                   0Bytes    0         0          0         in progress   0.00
[root@transformers ~]#
RCA: Whether a remove-brick operation is in progress is tracked by a flag, 'decommission_is_in_progress', on the volume. This flag is not persisted, so on a glusterd restart that information is lost and the validations that block a remove-brick commit while rebalance is still in progress get skipped.

I agree with QE that this is a potential data loss situation and should be considered a *blocker*. I've posted a fix upstream: http://review.gluster.org/#/c/13323/
Oops, I made a mistake here; that analysis was meant for another bug. Please ignore #comment 9. Moving the status back to 'New'.
Upstream mainline : http://review.gluster.org/14827
Upstream 3.8      : http://review.gluster.org/14856

The fix is available in rhgs-3.2.0 as part of the rebase to GlusterFS 3.8.4.
Verified this BZ against glusterfs version 3.8.4-2.el7rhgs.x86_64. Here are the steps that were performed:

1) Created a Distributed-Replicate volume and started it.
2) FUSE-mounted the volume and started IO.
3) While IO was in progress, started removing a few bricks.
4) Checked the remove-brick status; it shows the rebalance info.
5) Stopped and started glusterd on the nodes whose bricks were being removed.
6) Checked the remove-brick status again on all the nodes; the rebalance info is still displayed (see the status-check sketch below).

Hence, moving this BZ to Verified.
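For completeness, a minimal sketch of the status check used in steps 4-6 above, reusing the same illustrative names as the sketch in the description (volume "distvol", node-2 brick /bricks/brick1/b2); the actual volume and brick names from the verification setup are not recorded here.

# before the restart: rebalance counters are populated
gluster volume remove-brick distvol node-2:/bricks/brick1/b2 status

# restart glusterd on the node(s) whose bricks are being removed
systemctl restart glusterd

# after the restart: on the fixed build the Rebalanced-files / size / scanned
# columns remain populated instead of resetting to zeros
gluster volume remove-brick distvol node-2:/bricks/brick1/b2 status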
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html