Description of problem:
=======================
On a cifs mount with a data set of empty directories plus directories containing files, a remove-brick operation was started for a few bricks. When the remove-brick status command was issued, the rebalance time estimate showed negative values. The status command was issued about 21 times during the remove-brick rebalance, and every time it showed negative values. On the 22nd attempt the estimate finally showed positive values (by that point, the rebalance had been running for almost 24 minutes).

[root@dhcp47-127 samba]# gluster v remove-brick distrep 10.70.47.127:/bricks/brick6/b6 10.70.46.181:/bricks/brick6/b6 10.70.46.47:/bricks/brick6/b6 10.70.47.140:/bricks/brick6/b6 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                2         9.5KB             6             0             0            completed        0:15:16
       dhcp46-181.lab.eng.blr.redhat.com                0        0Bytes             0             0             0          in progress        0:21:32
        dhcp46-47.lab.eng.blr.redhat.com                0        0Bytes             0             0             0          in progress        0:00:00
       dhcp47-140.lab.eng.blr.redhat.com                0        0Bytes             0             0             0          in progress        0:21:21
Estimated time left for rebalance to complete : 2023406814:-21:-32

Version-Release number of selected component (if applicable):
3.8.4-25.el7rhgs.x86_64

How reproducible:
=================
1/1

Steps to Reproduce:
===================
1) Create a distributed-replicate volume and start it.
2) cifs mount the volume on a client.
3) Create a data set of empty directories plus directories with files.
4) Remove a few bricks.
5) Keep running the remove-brick status command and check the "Estimated time left for rebalance to complete" output.

Actual results:
===============
The rebalance time estimate sometimes shows negative values (see the arithmetic sketch after this report).

Expected results:
=================
The rebalance time estimate should never show negative values.
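The mixed-sign estimate above (a large hours field followed by negative minutes and seconds) is consistent with a negative "time left" value being split naively into h:m:s fields. Below is a minimal, self-contained sketch of that arithmetic, not the actual GlusterFS code; it assumes the remaining time is computed as estimated total runtime minus elapsed runtime, and the hypothetical values are chosen to match the 0:21:32 elapsed time in the output above. (The huge hours field in the real output additionally suggests the value passes through an unsigned type somewhere, which this sketch does not try to reproduce.)

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    /* Hypothetical numbers matching the report: the rebalance has been
     * running for 21 min 32 s (1292 s) on a node, but nothing has been
     * migrated yet, so the estimated total runtime is still 0. */
    int64_t elapsed   = 1292;
    int64_t estimated = 0;
    int64_t time_left = estimated - elapsed;    /* -1292 */

    /* Splitting a negative seconds value into h:m:s fields reproduces
     * the mixed-sign output seen in the status command. */
    printf("Estimated time left for rebalance to complete : "
           "%" PRId64 ":%" PRId64 ":%" PRId64 "\n",
           time_left / 3600, (time_left % 3600) / 60, time_left % 60);
    return 0;
}

Compiled with gcc, this prints "Estimated time left for rebalance to complete : 0:-21:-32", i.e. exactly the negative minutes and seconds fields reported above.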
upstream patch : https://review.gluster.org/17448
As updated in https://bugzilla.redhat.com/show_bug.cgi?id=1462181, I still see negative values for the rebalance ETA on 3.8.4-28, just before it fails:

[root@gqas013 glusterfs]# gluster v rebalance testvol status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             0             0          in progress        0:00:00
      gqas005.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0          in progress        0:00:00
      gqas006.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0          in progress        0:00:00
      gqas008.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0          in progress        0:00:00
volume rebalance: testvol: success

[root@gqas013 glusterfs]# gluster v rebalance testvol status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             0             0          in progress        0:00:03
      gqas005.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0          in progress        0:00:00
      gqas006.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0          in progress        0:00:00
      gqas008.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0          in progress        0:00:00
Estimated time left for rebalance to complete : 2023406815:00:-3
volume rebalance: testvol: success

[root@gqas013 glusterfs]# gluster v rebalance testvol status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             0             0          in progress        0:00:06
      gqas005.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0          in progress        0:00:01
      gqas006.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0          in progress        0:00:01
      gqas008.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0          in progress        0:00:01
Estimated time left for rebalance to complete : 2023406815:00:-1
volume rebalance: testvol: success

[root@gqas013 glusterfs]# gluster v rebalance testvol status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             0             0          in progress        0:00:08
      gqas005.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0          in progress        0:00:03
      gqas006.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0          in progress        0:00:03
      gqas008.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0          in progress        0:00:03
Estimated time left for rebalance to complete : 2023406815:00:-3
volume rebalance: testvol: success

[root@gqas013 glusterfs]# gluster v rebalance testvol status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             0             0          in progress        0:00:09
      gqas005.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0          in progress        0:00:04
      gqas006.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0          in progress        0:00:04
      gqas008.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0          in progress        0:00:04
Estimated time left for rebalance to complete : 2023406815:00:-4
volume rebalance: testvol: success

[root@gqas013 glusterfs]# gluster v rebalance testvol status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             0             0          in progress        0:00:10
      gqas005.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0          in progress        0:00:05
      gqas006.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0          in progress        0:00:05
      gqas008.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0          in progress        0:00:05
Estimated time left for rebalance to complete : 2023406815:00:-5
volume rebalance: testvol: success

rpm -qa|grep glus
glusterfs-3.8.4-28.el7rhgs.x86_64

I am moving this back to Dev for a relook.
upstream patch : https://review.gluster.org/#/c/17564/
With the fix, rebalance status will not show the estimate if the rebalance process cannot calculate it. This can happen when the rebalance process is unable to determine the rate at which files are being processed, for example just before a failure as in the test in comment#10.
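In other words, the new behaviour amounts to a guard around the estimate: if no usable processing rate is available yet, the "Estimated time left" line is simply not printed. The following is only an illustrative sketch of that idea, not the actual cli/glusterd code; print_estimate, files_processed, total_files and elapsed_seconds are hypothetical names introduced for the illustration.

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Hypothetical helper: print the estimate only when a usable rate is
 * available, otherwise stay silent (the behaviour described above). */
static void print_estimate(uint64_t files_processed, uint64_t total_files,
                           uint64_t elapsed_seconds)
{
    if (files_processed == 0 || elapsed_seconds == 0 ||
        total_files <= files_processed) {
        /* No rate information yet (or nothing left to do): skip the
         * "Estimated time left" line entirely. */
        return;
    }

    double   rate      = (double)files_processed / elapsed_seconds;
    uint64_t time_left = (uint64_t)((total_files - files_processed) / rate);

    printf("Estimated time left for rebalance to complete : "
           "%" PRIu64 ":%02" PRIu64 ":%02" PRIu64 "\n",
           time_left / 3600, (time_left % 3600) / 60, time_left % 60);
}

int main(void)
{
    print_estimate(0, 1000, 5);      /* no files processed yet: prints nothing */
    print_estimate(250, 1000, 300);  /* 250 of 1000 files in 300 s: 0:15:00    */
    return 0;
}

The first call corresponds to the situation in comment#10 (no files processed before the failure) and prints nothing; the second prints a well-formed, non-negative estimate.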
upstream patch : https://review.gluster.org/17863
upstream 3.12 patch : https://review.gluster.org/17882
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/113576
Neither Prasad nor I could hit this in our testing on the latest downstream bits. I am moving this BZ to Verified, and will reopen it if I hit the issue again at a later time.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774