Description of problem:
On a replica 3 volume, which is the default volume type supported for CNS, running device remove on all three devices backing the volume before self-heal has completed can result in data loss. Heketi performs a replace brick force internally without checking whether a heal is in progress, so when the last brick holding the latest data is also replaced forcefully, that data is lost. Heketi therefore has to validate that no heal is active on the volume before proceeding with the replace brick.

Version-Release number of selected component (if applicable):
heketi-client-4.0.0-1.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a 100 GB PVC and write 50 GB of data.
2. Find the three devices on which the volume is built.
3. Run heketi device remove <node1device1> && heketi device remove <node1device2> && heketi device remove <node1device3>. This can also be run serially, waiting for each command to complete, but before self-heal is done (a manual heal check is sketched below).

Actual results:
Device remove proceeds without checking for an ongoing self-heal.

Expected results:
Device remove should not proceed while a self-heal is in progress on the volume.

Additional info:
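For illustration, one way to confirm manually that self-heal has finished before removing the next device is to query the heal info for the volume and only continue once every brick reports zero pending entries. The volume name and device ID below are placeholders, and the exact heketi-cli invocation may vary by version; this is a sketch of the manual check, not part of the original report:

  # List pending heal entries for the volume backing the PVC
  gluster volume heal <volname> info
  # Proceed with the next removal only when every brick shows
  # "Number of entries: 0" in the output above
  heketi-cli device remove <deviceid>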
Note that it would *actually* be Gluster's job to do this protection! We should file an RFE for gluster itself. (And then possibly still add the protection to heketi until that is available.)
Filed an RHGS bug: https://bugzilla.redhat.com/show_bug.cgi?id=1432969
https://github.com/heketi/heketi/pull/718
(In reply to Raghavendra Talur from comment #8)
> https://github.com/heketi/heketi/pull/718

Merged upstream.
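For reference, below is a minimal Go sketch of the kind of guard the fix adds: refuse the replace brick while heal entries are pending. It is not the actual code from PR 718; it assumes heketi can run "gluster volume heal <volume> info" on a node and that the per-brick "Number of entries: N" lines indicate outstanding heals. The package, function, and parameter names are hypothetical.

  package healguard

  import (
          "fmt"
          "os/exec"
          "regexp"
          "strconv"
  )

  // pendingHealEntries runs "gluster volume heal <volume> info" and sums the
  // "Number of entries: N" counts reported for each brick. Hypothetical helper,
  // not the code merged in PR 718.
  func pendingHealEntries(volume string) (int, error) {
          out, err := exec.Command("gluster", "volume", "heal", volume, "info").CombinedOutput()
          if err != nil {
                  return 0, fmt.Errorf("heal info failed for %s: %v", volume, err)
          }
          re := regexp.MustCompile(`Number of entries: (\d+)`)
          total := 0
          for _, m := range re.FindAllStringSubmatch(string(out), -1) {
                  n, err := strconv.Atoi(m[1])
                  if err != nil {
                          return 0, err
                  }
                  total += n
          }
          return total, nil
  }

  // replaceBrickIfHealed performs the validation this bug asks for: it returns
  // an error instead of replacing a brick while any heal entries are pending.
  func replaceBrickIfHealed(volume, oldBrick, newBrick string) error {
          pending, err := pendingHealEntries(volume)
          if err != nil {
                  return err
          }
          if pending > 0 {
                  return fmt.Errorf("self-heal pending on %s (%d entries); not replacing brick", volume, pending)
          }
          // Placeholder for the replace-brick operation heketi would issue here.
          fmt.Printf("replace-brick %s -> %s on volume %s\n", oldBrick, newBrick, volume)
          return nil
  }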
Verified on builds heketi-client-5.0.0-11.el7rhgs.x86_64 and cns-deploy-5.0.0-37.el7rhgs.x86_64.
I have provided the doc text; please review.
Doc text looks good to me.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:2879