Description of problem:
=======================
Have a two-node cluster with a Distributed-Replicate volume mounted as FUSE, with enough data on it. Started removing a replica brick set, which triggered a rebalance. While the rebalance was in progress, restarted glusterd on the node from which data migration was happening, then tried to commit the remove-brick. The commit succeeded even though data migration had not completed.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.5-17

How reproducible:
=================
Every time

Steps to Reproduce:
===================
1. Have a two-node cluster with a Distributed-Replicate volume (2 x 2)
2. Mount the volume as FUSE and write enough data
3. Start removing a replica brick set // triggers data migration
4. Using remove-brick status, identify the brick node from which data migration is happening
5. Restart glusterd on the node identified in step 4 while the rebalance is in progress
6. Try to commit the remove-brick // commit succeeds without failure

Actual results:
===============
remove-brick commit succeeds even though the rebalance has not completed.

Expected results:
=================
remove-brick commit should not happen while the rebalance is in progress.

Additional info:
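For reference, the reproduction steps above map to a command sequence along these lines (volume name, brick paths, and host names are assumptions for illustration; run step 5 on the node identified in step 4):

```shell
# 3. Start removing one replica brick pair -- this triggers data migration
gluster volume remove-brick testvol node1:/bricks/b3 node2:/bricks/b4 start

# 4. Identify the node from which data migration is happening
gluster volume remove-brick testvol node1:/bricks/b3 node2:/bricks/b4 status

# 5. On the node found in step 4, restart glusterd mid-rebalance
systemctl restart glusterd

# 6. Attempt the commit -- before the fix this wrongly succeeds
gluster volume remove-brick testvol node1:/bricks/b3 node2:/bricks/b4 commit
```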
RCA: Whether a remove-brick operation is in progress is tracked by a per-volume flag, 'decommission_is_in_progress'. This flag is not persisted, so on a glusterd restart the information is lost and all the validations that block remove-brick commit while a rebalance is in progress are skipped. I agree with QE that this is a potential data-loss situation and should be considered a *blocker*. I've posted a fix upstream: http://review.gluster.org/#/c/13323/
Workaround for this bug: after restarting glusterd and before performing remove-brick commit, the user should check remove-brick status. If the status shows the remove-brick still in progress, the user should not perform the remove-brick commit operation.
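The workaround can be scripted as a guard around the commit. A minimal sketch, assuming the status text contains "in progress" while migration is running and "completed" when done (the helper name and sample status strings are hypothetical; in practice `$status` would come from `gluster volume remove-brick ... status`):

```shell
# Return 0 (safe to commit) only when the captured remove-brick
# status text reports completion; otherwise refuse.
safe_to_commit() {
  case "$1" in
    *"in progress"*) return 1 ;;  # migration still running: do not commit
    *completed*)     return 0 ;;  # all nodes done: commit is safe
    *)               return 1 ;;  # anything unexpected: err on the safe side
  esac
}

safe_to_commit "node1: rebalance in progress" && echo "commit now" || echo "wait"   # prints: wait
safe_to_commit "node1: completed"             && echo "commit now" || echo "wait"   # prints: commit now
```

Note that, as discussed in the following comments, this check is only trustworthy once the status statistics themselves are reliable after a glusterd restart.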
I don't think comment #5 is valid unless we pull in https://bugzilla.redhat.com/show_bug.cgi?id=1302968. As per the current code, after a restart glusterd can never reconnect to the ongoing rebalance daemon, which means the statistics are stale. So executing remove-brick status after a glusterd restart cannot indicate the rebalance completion status of all the nodes with the current code.
Yes Atin, comment #5 is valid only once https://bugzilla.redhat.com/show_bug.cgi?id=1302968 is pulled in.
Looks good now :)
The fix is now available in rhgs-3.1.3 branch, hence moving the state to Modified.
Verified this bug using the build "glusterfs-3.7.9-1". Repeated the reproduction steps mentioned in the description section. The fix is working properly: it no longer allows committing the remove-brick operation while data migration is in progress after a glusterd restart, and the rebalance continues after the glusterd restart as well. With these details, moving this bug to the next state.
LGTM :) But why was the flag moved to '?'?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240