1460936 – [Scale] : Rebalance ETA shows the initial estimate to be ~140 days, finishes within 18 hours though.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	distribute
Sub Component:
Version:	rhgs-3.3
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	RHGS 3.3.0
Assignee:	Nithya Balachandran
QA Contact:	Ambarish
Docs Contact:
URL:
Whiteboard:	3.3.0-devel-freeze-exception
Depends On:
Blocks:	1417151 1467209 1475192

Reported:	2017-06-13 08:05 UTC by Ambarish
Modified:	2017-09-21 04:59 UTC
CC List:	8 users
Fixed In Version:	glusterfs-3.8.4-36
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1467209
Environment:
Last Closed:	2017-09-21 04:59:42 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2017:2774	0	normal	SHIPPED_LIVE	glusterfs bug fix and enhancement update	2017-09-21 08:16:29 UTC

Description Ambarish 2017-06-13 08:05:51 UTC

Description of problem:
-----------------------

This is slightly different than https://bugzilla.redhat.com/show_bug.cgi?id=1457731.

Rebalance ETA showed the initial estimate to be ~140 days at one point :

[root@gqas014 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost             3899        40.0GB          8162             0             0          in progress        0:33:53
      gqas015.sbu.lab.eng.bos.redhat.com                6       150.0GB           508             0             0          in progress        0:33:53
Estimated time left for rebalance to complete :     3301:23:54
volume rebalance: butcher: success
[root@gqas014 ~]# 


It finished within 18 hours though :

[root@gqas014 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost          1058854        72.1GB       5320040             0             0            completed       18:44:51
      gqas015.sbu.lab.eng.bos.redhat.com          1062859       451.6GB       4843484             0             0            completed       18:44:51
volume rebalance: butcher: success
[root@gqas014 ~]# 
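For reference, the arithmetic behind the "~140 days" figure can be checked with a short sketch. It uses only the values from the two status outputs above and assumes both outputs belong to the same rebalance run; the helper name `hms_to_hours` is illustrative.

```python
def hms_to_hours(hms: str) -> float:
    """Convert an 'h:m:s' string (hours may exceed 24) to fractional hours."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours + minutes / 60 + seconds / 3600


# Values copied from the two status outputs above.
estimated_left = hms_to_hours("3301:23:54")    # ETA reported at run time 0:33:53
elapsed_at_estimate = hms_to_hours("0:33:53")
total_run_time = hms_to_hours("18:44:51")      # run time once status was 'completed'

# Time that actually remained when the estimate was printed.
actual_left = total_run_time - elapsed_at_estimate

print(f"estimated: {estimated_left / 24:.1f} days")
print(f"actual:    {actual_left:.1f} hours")
print(f"overestimate factor: {estimated_left / actual_left:.0f}x")
```

The estimate works out to a little under 140 days, against roughly 18 hours that actually remained.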



Version-Release number of selected component (if applicable):
-------------------------------------------------------------

3.8.4-27

How reproducible:
-----------------

1/1

Additional info:
---------------

[root@gqas014 ~]# gluster v info
 
Volume Name: butcher
Type: Distribute
Volume ID: f297fb8e-f276-4f96-8a58-a1215112d3b2
Status: Started
Snapshot Count: 0
Number of Bricks: 24
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick2: gqas015.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick3: gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick4: gqas015.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick5: gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick6: gqas015.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick7: gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick8: gqas015.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick9: gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick10: gqas015.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick11: gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick12: gqas015.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick13: gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick14: gqas015.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick15: gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick16: gqas015.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick17: gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick18: gqas015.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick19: gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Brick20: gqas015.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Brick21: gqas014.sbu.lab.eng.bos.redhat.com:/bricks11/A1
Brick22: gqas015.sbu.lab.eng.bos.redhat.com:/bricks11/A1
Brick23: gqas014.sbu.lab.eng.bos.redhat.com:/bricks12/A1
Brick24: gqas015.sbu.lab.eng.bos.redhat.com:/bricks12/A1
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 50000
[root@gqas014 ~]#

Comment 3 Ambarish 2017-06-13 08:19:05 UTC

This is the rebalance ETA at different intervals :

*Interval 1* :

[root@gqas014 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost              421        40.0GB           973             0             0          in progress        0:12:18
      gqas015.sbu.lab.eng.bos.redhat.com                2        20.0GB           508             0             0          in progress        0:12:18
Estimated time left for rebalance to complete :     1198:26:30
volume rebalance: butcher: success
[root@gqas014 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost              421        40.0GB           973             0             0          in progress        0:12:20
      gqas015.sbu.lab.eng.bos.redhat.com                2        20.0GB           508             0             0          in progress        0:12:20
Estimated time left for rebalance to complete :     1201:41:22
volume rebalance: butcher: success
[root@gqas014 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost              421        40.0GB           973             0             0          in progress        0:12:22
      gqas015.sbu.lab.eng.bos.redhat.com                2        20.0GB           508             0             0          in progress        0:12:22
Estimated time left for rebalance to complete :     1204:56:14
volume rebalance: butcher: success
[root@gqas014 ~]# 




*Interval 2* :


[root@gqas014 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost             1958        40.0GB          4137             0             0          in progress        0:21:10
      gqas015.sbu.lab.eng.bos.redhat.com                2        20.0GB           508             0             0          in progress        0:21:10
Estimated time left for rebalance to complete :     2062:21:32
volume rebalance: butcher: success
[root@gqas014 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost             1990        40.0GB          4144             0             0          in progress        0:21:11
      gqas015.sbu.lab.eng.bos.redhat.com                2        20.0GB           508             0             0          in progress        0:21:11
Estimated time left for rebalance to complete :     2063:58:58
volume rebalance: butcher: success
[root@gqas014 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost             2052        40.0GB          4210             0             0          in progress        0:21:17
      gqas015.sbu.lab.eng.bos.redhat.com                2        20.0GB           508             0             0          in progress        0:21:17
Estimated time left for rebalance to complete :     2073:43:34
volume rebalance: butcher: success
[root@gqas014 ~]# 


*Interval 3* :

[root@gqas014 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost             3894        40.0GB          8096             0             0          in progress        0:33:50
      gqas015.sbu.lab.eng.bos.redhat.com                6       150.0GB           508             0             0          in progress        0:33:50
Estimated time left for rebalance to complete :     3296:31:35
volume rebalance: butcher: success
[root@gqas014 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost             3898        40.0GB          8102             0             0          in progress        0:33:52
      gqas015.sbu.lab.eng.bos.redhat.com                6       150.0GB           508             0             0          in progress        0:33:52
Estimated time left for rebalance to complete :     3299:46:28
volume rebalance: butcher: success
[root@gqas014 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost             3899        40.0GB          8162             0             0          in progress        0:33:53
      gqas015.sbu.lab.eng.bos.redhat.com                6       150.0GB           508             0             0          in progress        0:33:53
Estimated time left for rebalance to complete :     3301:23:54
volume rebalance: butcher: success
[root@gqas014 ~]# 








*Interval 4* :

[root@gqas014 glusterfs]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            12439        40.1GB         26304             0             0          in progress        1:01:17
      gqas015.sbu.lab.eng.bos.redhat.com             5840       420.8GB         15875             0             0          in progress        1:01:17
Estimated time left for rebalance to complete :      190:05:11
volume rebalance: butcher: success
[root@gqas014 glusterfs]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            12451        40.1GB         26390             0             0          in progress        1:01:20
      gqas015.sbu.lab.eng.bos.redhat.com             5852       420.8GB         15897             0             0          in progress        1:01:20
Estimated time left for rebalance to complete :      189:58:36
volume rebalance: butcher: success
[root@gqas014 glusterfs]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            12459        40.1GB         26391             0             0          in progress        1:01:22
      gqas015.sbu.lab.eng.bos.redhat.com             5857       420.8GB         15907             0             0          in progress        1:01:22
Estimated time left for rebalance to complete :      189:57:35
volume rebalance: butcher: success
[root@gqas014 glusterfs]# 




*Interval 5* :

[root@gqas014 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            64709        41.4GB        137889             0             0          in progress        1:35:49
      gqas015.sbu.lab.eng.bos.redhat.com            63986       422.8GB        165367             0             0          in progress        1:35:49
Estimated time left for rebalance to complete :       32:47:45
volume rebalance: butcher: success
[root@gqas014 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            64767        41.4GB        137939             0             0          in progress        1:35:50
      gqas015.sbu.lab.eng.bos.redhat.com            64014       422.8GB        165526             0             0          in progress        1:35:50
Estimated time left for rebalance to complete :       32:47:21
volume rebalance: butcher: success
[root@gqas014 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            64784        41.4GB        137947             0             0          in progress        1:35:59
      gqas015.sbu.lab.eng.bos.redhat.com            64039       422.8GB        165583             0             0          in progress        1:35:59
Estimated time left for rebalance to complete :       32:50:19
volume rebalance: butcher: success
[root@gqas014 ~]# 


As you can see, it shows 3k+ hours for nearly an hour (until Interval 4).
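Tabulating the last sample of each interval makes the trend easier to see. The sketch below uses only numbers copied from the outputs above; the helper name is illustrative. In the first three intervals the ETA is almost exactly a constant multiple (roughly 5846) of the elapsed run time. That is what a linear extrapolation would produce if the progress counter it depends on stayed fixed, and the scanned column for gqas015 does stay at 508 in those samples. This reading is an inference from the numbers, not something confirmed against the glusterfs code.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert 'h:m:s' (hours may exceed 24) to whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (run time, estimated time left): last sample of each interval above.
samples = [
    ("0:12:22", "1204:56:14"),
    ("0:21:17", "2073:43:34"),
    ("0:33:53", "3301:23:54"),
    ("1:01:22", "189:57:35"),
    ("1:35:59", "32:50:19"),
]

ratios = []
for run_time, eta in samples:
    # How many times larger the ETA is than the time already spent.
    ratio = hms_to_seconds(eta) / hms_to_seconds(run_time)
    ratios.append(ratio)
    print(f"run time {run_time:>8}  ETA {eta:>11}  ETA/elapsed = {ratio:8.1f}")
```

The ratio only drops (Intervals 4 and 5) once the counters on both nodes start moving.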

Comment 8 Atin Mukherjee 2017-07-03 09:01:10 UTC

upstream patch : https://review.gluster.org/17668

Comment 9 Atin Mukherjee 2017-07-11 04:41:40 UTC

downstream patch : https://code.engineering.redhat.com/gerrit/#/c/111921

Comment 11 Ambarish 2017-07-20 10:30:47 UTC

Current observation on the latest Downstream build : glusterfs-3.8.4-34.el7rhgs.x86_64 :


Rebalance process took 18 hours to finish.

* For the first 2 hours, the ETA for rebalance was > 7000 hours (again, an exponential error percentage)

* For the last 5 hours, the rebalance ETA was ~1500 hours

* In the last hour, it dropped from 1800 hours to 0 (finish).

                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            22544         2.1GB        358400             0             0          in progress       10:48:55
      gqas007.sbu.lab.eng.bos.redhat.com            22652         5.3GB        357258             0             0          in progress       10:48:55
      gqas016.sbu.lab.eng.bos.redhat.com            20291         2.0GB        400722             0             0            completed        9:13:33
      gqas009.sbu.lab.eng.bos.redhat.com            20310         1.9GB        403938             0             0            completed        9:12:10
Estimated time left for rebalance to complete :     1826:59:28
volume rebalance: butcher: success  

                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            22623         2.1GB        412447             0             0            completed       11:13:22
      gqas007.sbu.lab.eng.bos.redhat.com            22812         5.3GB        412699             0             0            completed       11:13:23
      gqas016.sbu.lab.eng.bos.redhat.com            20291         2.0GB        400722             0             0            completed        9:13:33
      gqas009.sbu.lab.eng.bos.redhat.com            20310         1.9GB        403938             0             0            completed        9:12:10
volume rebalance: butcher: success



Moving this back to Dev for a re-look.

Comment 12 Nithya Balachandran 2017-07-21 05:37:53 UTC

(In reply to Ambarish from comment #11)


The data set consists of very small files (Linux untars), which were processed at the beginning, so based on the size and amount of data moved, the calculations were in fact correct. It did not help that the throttle option was set to normal, so only 2 threads were migrating files.

The process probably sped up once the throttle value was set to aggressive.

I believe the earlier numbers-based approach was more accurate, and I would like to call out that such a data set is probably still too small - we are targeting scenarios where the rebalance takes days to complete.

I would like to go back to the original approach, use a larger default number of threads for the normal throttle setting, and then have QE rerun the tests. We can then compare the difference in calculations.
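As a rough illustration of what a count-based estimate would have looked like for the comment 11 run, here is a minimal sketch of a generic linear extrapolation. It is not the glusterfs estimator: the function `linear_eta_seconds` is hypothetical, it treats the scanned column as the progress counter, and it takes the final scanned count as the total, which is only known in hindsight. Whether this matches the "numbers-based approach" referred to above is an assumption.

```python
def linear_eta_seconds(done: float, total: float, elapsed_s: float) -> float:
    """Remaining time if the average rate so far holds: (total - done) / (done / elapsed)."""
    if done <= 0:
        raise ValueError("no progress yet, rate is undefined")
    rate = done / elapsed_s
    return (total - done) / rate


# Numbers from comment 11, localhost row.
elapsed_s = 10 * 3600 + 48 * 60 + 55      # run time 10:48:55
scanned_then = 358400                     # 'scanned' at 10:48:55
scanned_final = 412447                    # 'scanned' at completion (hindsight only)
actual_left_s = (11 * 3600 + 13 * 60 + 22) - elapsed_s   # completed at 11:13:22

eta_s = linear_eta_seconds(scanned_then, scanned_final, elapsed_s)
print(f"count-based estimate: {eta_s / 3600:.2f} h")
print(f"actual remaining:     {actual_left_s / 3600:.2f} h")
print("reported by the CLI:  1826:59:28")
```

With these inputs the count-based figure lands within a couple of hours of the actual remaining time, whereas the reported ETA was over 1800 hours. A real estimator would have to guess the total up front, so this comparison is optimistic.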

Comment 15 Atin Mukherjee 2017-07-25 09:09:35 UTC

upstream patch : https://review.gluster.org/#/c/17867/

Comment 18 Nithya Balachandran 2017-07-26 08:05:10 UTC

Upstream release-3.12 patch: https://review.gluster.org/#/c/17873

Comment 23 errata-xmlrpc 2017-09-21 04:59:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774
