Bug 1464110 - [Scale] : Rebalance ETA (towards the end) may be inaccurate, even on a moderately large data set.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On: 1457731
Blocks:
 
Reported: 2017-06-22 12:48 UTC by Nithya Balachandran
Modified: 2017-09-05 17:34 UTC (History)

Fixed In Version: glusterfs-3.12.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1457731
Environment:
Last Closed: 2017-09-05 17:34:44 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Comment 1 Nithya Balachandran 2017-06-22 12:52:00 UTC
+++ This bug was initially created as a clone of Bug #1457731 +++

Description:
------------
 
Added bricks to a distributed-replicate volume, then ran rebalance.
 
These are the rebalance ETAs at different intervals:
 
[T4 > T3 > T2 > T1]
 
**At time T1**
 
 
[root@gqas014 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            63949         9.8GB        295287             0             0          in progress        0:34:57
                                 server2            64644         9.9GB        300745             0             0          in progress        0:34:57
Estimated time left for rebalance to complete :        0:00:38
volume rebalance: butcher: success
 
 
**At time T2**
 
[root@server1 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            64010         9.8GB        295597             0             0          in progress        0:34:58
                                 server2            64705         9.9GB        300918             0             0          in progress        0:34:58
Estimated time left for rebalance to complete :        0:01:09
 
 
**At time T3**
 
[root@server1 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            68057        10.0GB        313569             0             0          in progress        0:36:46
                                 server2            68904        10.2GB        319823             0             0          in progress        0:36:46
Estimated time left for rebalance to complete :        0:00:09
volume rebalance: butcher: success
[root@server1 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            68110        10.0GB        313882             0             0          in progress        0:36:48
                                 server2            68958        10.2GB        319948             0             0          in progress        0:36:48
Estimated time left for rebalance to complete :        0:01:10
volume rebalance: butcher: success
 
 
 
**At time T4** (when it finally completed)
 
[root@server1 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            74885       104.4GB        345001             0             0            completed        1:12:32
                                 server2            74658        10.5GB        345747             0             0            completed        0:39:54
volume rebalance: butcher: success
[root@server1 ~]#
[root@server1 ~]#
 
 
 
So at interval T1, the reported ETA for completion is 38 seconds.
 
At T2 it suddenly increased to slightly more than a minute.
 
The same thing happens at the T3 interval.
 
So the ETA keeps looping for a while: it starts at about 1:10, counts down to 0, and then starts at 1:10 again.
 
This continued for about another half hour, after which the rebalance finally completed (see the difference in the run time column across the intervals).
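 
The run time differences can be checked with a few lines of Python. This is only a convenience sketch: the helper names `hms_to_seconds` and `seconds_to_hms` are made up for illustration, and the run times are copied by hand from the status tables above.

```python
def hms_to_seconds(hms):
    """Convert an 'h:m:s' run time string from the status output to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_hms(total):
    """Format a number of seconds back into 'h:mm:ss'."""
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Run times copied from the tables above: last T3 query vs. T4 (completed).
run_times = {
    "localhost": ("0:36:48", "1:12:32"),
    "server2": ("0:36:48", "0:39:54"),
}

for node, (at_t3, at_t4) in run_times.items():
    delta = hms_to_seconds(at_t4) - hms_to_seconds(at_t3)
    print(f"{node}: {seconds_to_hms(delta)} between the last T3 query and completion")
```

With the values above, localhost ran for a further 0:35:44 after the last T3 query, while server2 finished 0:03:06 after it.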
 
 
##NUM_FILES##
[root@gqac011 gluster-mount]# find . -mindepth 1 -type f | wc -l
 
352120


--- Additional comment from Nithya Balachandran on 2017-06-22 06:38:54 EDT ---

RCA:

The rebalance process calculates the file count once at the beginning and then uses the value throughout.

If files are created during the rebalance, the number of files scanned may end up exceeding the initially estimated number of files. In that case, rebalance used to just increase the estimate by 10K and continue. At the scan rate of the setup on which the bug was filed, 10K files works out to about 1 min 10 s, which is why the ETA kept restarting from that value.
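
The looping ETA can be reproduced with a minimal sketch of this kind of estimator. It is not the actual DHT code: the function name `estimate_eta_seconds`, its parameters, the exact condition for bumping the total, and the stale estimate of 295000 are assumptions for illustration. Only the 10K increment and the T1 figures (295287 scanned in 0:34:57) come from this report.

```python
BUMP = 10_000  # increment applied once the scan overtakes the estimate (per the RCA)


def estimate_eta_seconds(total_estimate, processed, elapsed_seconds):
    """Return (eta_seconds, total_estimate) for one status query.

    Hypothetical model of the old behaviour: the total is computed once
    and is only pushed forward by BUMP when the scan has already passed it.
    """
    if elapsed_seconds <= 0 or processed <= 0:
        return None, total_estimate  # no scan rate available yet

    rate = processed / elapsed_seconds  # entries scanned per second

    if processed >= total_estimate:
        # The stale estimate has been overtaken: assume 10K entries remain.
        total_estimate = processed + BUMP

    remaining = total_estimate - processed
    return remaining / rate, total_estimate


# T1 figures for localhost: 295287 entries scanned in 0:34:57.
elapsed = 34 * 60 + 57
stale_total = 295_000  # assumed initial estimate, already overtaken by the scan
eta, new_total = estimate_eta_seconds(stale_total, 295_287, elapsed)
print(f"rate = {295_287 / elapsed:.1f} entries/s, ETA after bump = {eta:.0f} s")
```

At roughly 141 entries per second, the 10K bump translates to an ETA of about 71 seconds, which is in line with the 1:09 and 1:10 values seen in the status output. Each time the scan catches up with the bumped total, the same bump is applied again and the countdown restarts.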


Now the rebalance process will periodically update the file count. However, this need not make the estimates more accurate, as newly added files may not be processed if their parent directories have already been processed.
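
A minimal sketch of the periodic refresh idea, under stated assumptions: `count_files` stands in for whatever mechanism the rebalance process uses to obtain the file count, and `REFRESH_INTERVAL` is a made-up value. Neither the class nor the interval is taken from the patch.

```python
import time

REFRESH_INTERVAL = 600.0  # seconds between recounts; hypothetical value


class FileCountEstimator:
    """Caches a file count and refreshes it once it is older than the interval."""

    def __init__(self, count_files, clock=time.monotonic):
        self._count_files = count_files  # callable returning the current file count
        self._clock = clock
        self._total = count_files()
        self._last_refresh = clock()

    def total(self):
        """Return the cached count, recounting first if the cache is stale."""
        now = self._clock()
        if now - self._last_refresh >= REFRESH_INTERVAL:
            self._total = self._count_files()
            self._last_refresh = now
        return self._total

    def eta_seconds(self, processed, elapsed_seconds):
        """Estimate the time left from the current total and the scan rate so far."""
        if processed <= 0 or elapsed_seconds <= 0:
            return None
        rate = processed / elapsed_seconds
        remaining = max(self.total() - processed, 0)
        return remaining / rate
```

Even with a refreshed total, the estimate can be too high: files created under directories that have already been processed are counted but never scanned, so the remaining work is overstated.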

Comment 2 Worker Ant 2017-06-22 12:53:37 UTC
REVIEW: https://review.gluster.org/17607 (cluster/dht: rebalance gets file count periodically) posted (#1) for review on master by N Balachandran (nbalacha)

Comment 3 Worker Ant 2017-06-23 10:12:21 UTC
COMMIT: https://review.gluster.org/17607 committed in master by Raghavendra G (rgowdapp) 
------
commit d66fb14a952729caf51c8328448a548c4d198082
Author: N Balachandran <nbalacha>
Date:   Thu Jun 22 15:56:28 2017 +0530

    cluster/dht: rebalance gets file count periodically
    
    The rebalance used to get the file count in the beginning
    and not update it. This caused estimates to fail
    if the number changed during the rebalance.
    
    The rebalance now updates the file count periodically.
    
    Change-Id: I1667ee69e8a1d7d6bc6bc2f060fad7f989d19ed4
    BUG: 1464110
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: https://review.gluster.org/17607
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra G <rgowdapp>

Comment 4 Shyamsundar 2017-09-05 17:34:44 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/

