Bug 1381142 - The rebal-throttle setting does not work as expected
Summary: The rebal-throttle setting does not work as expected
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Susant Kumar Palai
QA Contact: Ambarish
URL:
Whiteboard:
Depends On:
Blocks: 1369781 1417145 1417147 1420166
 
Reported: 2016-10-03 08:50 UTC by tsugano
Modified: 2023-09-14 03:31 UTC (History)
16 users (show)

Fixed In Version: glusterfs-3.8.4-26
Doc Type: Bug Fix
Doc Text:
Rebalance throttling previously had three modes; however, two of these modes had similar performance because of bottlenecks in thread management. This has been corrected. In addition, rebalance throttling can now be configured by specifying a number of migration threads to use as well as with the previous lazy, normal, and aggressive modes.
Clone Of:
: 1420166 (view as bug list)
Environment:
Last Closed: 2017-09-21 04:28:23 UTC
Embargoed:


Attachments (Terms of Use)
Disk/network stat for rebalance in aggressive mode (395.61 KB, application/x-gzip)
2016-11-02 07:24 UTC, Susant Kumar Palai
no flags Details
Disk/network/cpu stat for lazy,normal,aggressive (272.23 KB, application/x-gzip)
2017-01-31 04:21 UTC, Susant Kumar Palai
no flags Details
iostat-report-for-10threads(normal) (3.49 KB, application/x-gzip)
2017-01-31 04:40 UTC, Susant Kumar Palai
no flags Details
Rebalance perf report(iostat/sar/cpu) for 1GB data (4.80 MB, application/x-gzip)
2017-02-02 11:12 UTC, Susant Kumar Palai
no flags Details
Rebal-stat with 4 server run for server 45,46 (17.65 MB, application/x-tar)
2017-02-23 12:51 UTC, Susant Kumar Palai
no flags Details
Rebal-stat with 4 server run for server 47,48 (17.65 MB, application/x-tar)
2017-02-23 12:53 UTC, Susant Kumar Palai
no flags Details
iostat-report-for-20threads with crawl patch (102.57 KB, text/plain)
2017-03-01 08:51 UTC, Susant Kumar Palai
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:2774 0 normal SHIPPED_LIVE glusterfs bug fix and enhancement update 2017-09-21 08:16:29 UTC

Description tsugano 2016-10-03 08:50:27 UTC
Description of problem:
After adding 160TB, consisting of 4 bricks on 2 servers, to an 800TB volume with 400TB populated, I started a rebalance on the volume and then changed the cluster.rebal-throttle setting from normal to aggressive/lazy while the rebalance was running.

Version-Release number of selected component (if applicable):
RHGS 3.1.3

How reproducible:


Steps to Reproduce:
1. Create a 800TB distributed replicated volume with 100s of millions of files populated.
2. Add 160TB bricks to the volume
3. Start rebalance in the volume.
4. Change cluster.rebal-throttle from/to any values.
5. Watch the output from gluster volume rebalance <vol> status (a command-level sketch of these steps follows below).
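
A minimal command-level sketch of the steps above (volume name, server names and brick paths are placeholders; the real volume was far larger):

# 1. create and populate a distributed-replicated volume (replica 2 shown for brevity)
gluster volume create bigvol replica 2 \
    server1:/bricks/b1/bigvol server2:/bricks/b1/bigvol \
    server3:/bricks/b1/bigvol server4:/bricks/b1/bigvol
gluster volume start bigvol
# ... populate with data ...

# 2. add new bricks (this report added 4 bricks across 2 new servers)
gluster volume add-brick bigvol server5:/bricks/b1/bigvol server6:/bricks/b1/bigvol

# 3. start rebalance
gluster volume rebalance bigvol start

# 4. change the throttle while rebalance is running
gluster volume set bigvol cluster.rebal-throttle aggressive    # or lazy / normal

# 5. watch progress
gluster volume rebalance bigvol status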

Actual results:
Changing the throttle will make no difference on the pace of rebalance.

Expected results:
The speed of rebalance gets faster as the throttle is changed from 'lazy' via 'normal' to 'aggressive', and vice versa.

Additional info:

Comment 2 Susant Kumar Palai 2016-10-13 09:48:55 UTC
(In reply to tsugano from comment #0)
> Description of problem:
> After adding 160TB, which consists of 4 bricks on 2 servers, to an 800TB
> 
> Actual results:
> Changing the throttle will make no difference on the pace of rebalance.

  Do you mean that even when you change the throttle from lazy to normal/aggressive you don't see any change in rebalance pace? What was the measure of rebalance pace here?
> 
> Expected results:
> The speed of rebalance gets faster as the throttle is changed from 'lazy'
> via 'normal' to 'aggressive', and vice versa.
> 
> Additional info:

Hi,
 Can you post the rebalance logs here?

Regards,
Susant

Comment 3 Susant Kumar Palai 2016-10-13 09:55:21 UTC
Tsugano,
  I would also like to know the initial cluster setup (volume configuration) and whether the machines were serving other workloads (if yes, a brief description of what kind of workload).

Comment 4 Susant Kumar Palai 2016-10-13 11:19:16 UTC
Just to confirm: I did a local check to make sure that the throttle kicks in when the option is set, and it is behaving as expected on rhgs-3.1.3.

Comment 5 Susant Kumar Palai 2016-10-14 10:22:04 UTC
Forgot to mention the steps for comment 4.

1- Created 20k files on a 2 * 1 distribute-only volume, so that there is time to change the throttle option.

2- Added 10 bricks so that lots of files will be eligible for migration.

Logs confirming rebalance went to "normal" mode.
[2016-10-13 11:11:15.895624] I [dht-rebalance.c:3586:gf_defrag_start_crawl] 0-DHT: Thread[0] creation successful
[2016-10-13 11:11:15.895653] I [dht-rebalance.c:2102:gf_defrag_task] 0-DHT: Thread sleeping. defrag->current_thread_count: 3
[2016-10-13 11:11:15.895659] I [dht-rebalance.c:3586:gf_defrag_start_crawl] 0-DHT: Thread[1] creation successful
[2016-10-13 11:11:15.895688] I [dht-rebalance.c:2102:gf_defrag_task] 0-DHT: Thread sleeping. defrag->current_thread_count: 2

* My machine has 2 cores, so it starts with 4 threads and normal mode will bring it down to 2 threads.

3- Now set the throttle option to aggressive.

Logs from rebalance

[2016-10-13 11:11:49.761538] I [dht-rebalance.c:2113:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 3
[2016-10-13 11:11:49.761906] I [dht-rebalance.c:2113:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 4

Now thread count is 4, confirming that throttle did work.
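
For reference, a sketch of how such a check can be done from the shell (volume name is a placeholder; the rebalance log normally lives at /var/log/glusterfs/<volname>-rebalance.log, but verify the path on your install):

gluster volume set test1 cluster.rebal-throttle aggressive

# the rebalance process logs every thread sleep/wake transition
grep -E "Thread (sleeping|wokeup)" /var/log/glusterfs/test1-rebalance.log | tail -5

# the last such message carries the current migration thread count
grep "current_thread_count" /var/log/glusterfs/test1-rebalance.log | tail -1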

Comment 6 Susant Kumar Palai 2016-10-14 12:48:41 UTC
Did a similar test to verify a transition from aggressive to lazy mode.

1- Started rebalance with aggressive mode.

logs: [2016-10-14 12:41:43.274571] I [dht-rebalance.c:3586:gf_defrag_start_crawl] 0-DHT: Thread[0] creation successful
[2016-10-14 12:41:43.274598] I [dht-rebalance.c:3586:gf_defrag_start_crawl] 0-DHT: Thread[1] creation successful
[2016-10-14 12:41:43.274622] I [dht-rebalance.c:3586:gf_defrag_start_crawl] 0-DHT: Thread[2] creation successful
[2016-10-14 12:41:43.274643] I [dht-rebalance.c:3586:gf_defrag_start_crawl] 0-DHT: Thread[3] creation successful

4 threads were created successfully (since my machine has 2 cores).


2-  Now set the throttle option to lazy.

Could see thread number going down to 1.

logs: [2016-10-14 12:41:47.437288] I [dht-rebalance.c:2102:gf_defrag_task] 0-DHT: Thread sleeping. defrag->current_thread_count: 3
[2016-10-14 12:41:47.437960] I [MSGID: 109022] [dht-rebalance.c:1587:dht_migrate_file] 0-test1-dht: completed migration of /file333 from subvolume test1-client-3 to test1-client-16
[2016-10-14 12:41:47.438623] I [MSGID: 109022] [dht-rebalance.c:1587:dht_migrate_file] 0-test1-dht: completed migration of /file313 from subvolume test1-client-4 to test1-client-23
[2016-10-14 12:41:47.439367] I [dht-rebalance.c:2102:gf_defrag_task] 0-DHT: Thread sleeping. defrag->current_thread_count: 2
[2016-10-14 12:41:47.440420] I [dht-rebalance.c:2102:gf_defrag_task] 0-DHT: Thread sleeping. defrag->current_thread_count: 1

Only 1 thread is active.

So this at least verifies that the throttle kicks in properly.

Will do further analysis and update after getting perf machines.

Comment 7 tsugano 2016-10-17 01:13:51 UTC
>  Do you mean even when you change throttle from lazy to normal/aggressive you dont see any change in rebalance pace? What was the measure of rebalance pace here?

Right, we were watching the output from 'gluster volume rebalance status' every 15min during rebalance. See the history below:

https://docs.google.com/a/redhat.com/spreadsheets/d/1IWwl5cXl26FMVLzrqxbncLnNYtFPWHhan4Za9yDdEi8/edit?usp=sharing

Sampling data was picked at 1-hour intervals only, but I can share more granular data if needed.
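
For the record, a history like that can be captured with a trivial loop (a sketch; interval and file name are arbitrary):

while true; do
    date -u >> rebal-status-history.log
    gluster volume rebalance <vol> status >> rebal-status-history.log
    sleep 900    # 15 minutes
done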

> Can you post the rebalance logs here?

Can you check the support#01707385 below and you will find attached sosreports of the RHGS nodes we configured for the testing:

https://access.redhat.com/support/cases/#/case/01707385

Regarding a local check, we did not do that due to time constraints. However, I don't believe the number of rebalance threads was too low; rather, the pace of rebalance did not change even after the appropriate number of threads was spawned by changing the throttle, probably because the scanner in rebalance was the bottleneck, according to the engineer who supported our PoC. Please contact Manoj and/or Sankarshan in the BU for more details.

Comment 9 Susant Kumar Palai 2016-10-24 07:23:48 UTC
So here is my initial test result from a single server with two bricks. 

Created 50k files and renamed them so that every file has a linkto file and will be migrated for sure with the "force" option.
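
A sketch of that data-set preparation (illustrative; the actual run used a pre-generated list of 50K names chosen so that the hashed subvolume changes on rename):

# create 50k 1MB files on the fuse mount of the volume
for i in $(seq 1 50000); do
    dd if=/dev/zero of=/mnt/test1/file$i bs=1M count=1
done

# rename them so that the hashed and cached subvolumes differ,
# which leaves a linkto file on the new hashed subvolume
# (the report used names guaranteed to hash to a different subvolume)
for i in $(seq 1 50000); do
    mv /mnt/test1/file$i /mnt/test1/moved-file$i
done

# "force" makes rebalance migrate every file whose cached subvolume
# differs from its hashed subvolume
gluster volume rebalance test1 start force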

Machine: 24 cores
[root@gprfs039 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 45
Model name:            Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Stepping:              7
CPU MHz:               1199.921
BogoMIPS:              4003.69
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23


Here are the results, one from "normal" and the other from "aggressive".

normal: (This will run on (24-4) / 2 = 10  threads)
[root@gprfs039 mnt]# gluster v rebalance test1 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            50000        46.6GB        100000             0             0            completed        0:21:55
volume rebalance: test1: success


aggressive: (this will run on (24-4) = 20 threads)
   [root@gprfs039 mnt]# gluster v rebalance test1 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            50000        46.6GB        100000             0             0            completed        0:21:33
volume rebalance: test1: success


From the test it looks like doubling the thread count is surely not helping migration in any way. Since 20 parallel threads induce 40 parallel I/Os (disk/network), I will do the same test on a two-node setup, with one brick per node, and see whether that makes a difference.

Will update here soon.
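
A minimal sketch of the thread-count math implied above, purely as inferred from the numbers in this report (24 cores: lazy 1, normal 10, aggressive 20); the 2-core box in comment 5 clearly uses a different floor, so treat this as illustrative, not as the actual glusterfs formula:

cores=$(nproc)
lazy=1
normal=$(( (cores - 4) / 2 ))      # (24 - 4) / 2 = 10 on this machine
aggressive=$(( cores - 4 ))        # 24 - 4 = 20 on this machine
echo "lazy=$lazy normal=$normal aggressive=$aggressive"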

Comment 10 Susant Kumar Palai 2016-10-24 09:39:43 UTC
Here are the test results following on from comment 9.


Aggressive mode:

  [root@gprfs039 ~]# gluster v rebalance test1 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               gprfs039            50000        46.6GB         50000             0             0            completed        0:14:49
                                gprfs040                0        0Bytes         50000             0             0            completed        0:0:48
volume rebalance: test1: success

Normal mode:
[root@gprfs039 mnt]# gluster v rebalance test1 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            47198        44.0GB         47663             0             0          in progress        0:14:1
                                gprfs040                0        0Bytes         50000             0             0            completed        0:0:34
volume rebalance: test1: success

So there is clearly no improvement (though the performance is better than the single node setup).

But in addition I have also collected the starvation count from the above runs.
This count is the number of times a thread complained about not finding any entry in the global migration queue. And the results are interesting.

Normal mode starvation count: 10

Aggressive mode starvation count: 29986 (Woho!)

So this clearly indicates the CRAWL is not fast enough for the aggressive mode and is not able to feed files to the migration threads fast enough.

While I am updating this report, I am running the same tests another time to confirm the starvation count (as the difference is huge) and will update here.

Stay tuned :)

Comment 11 Susant Kumar Palai 2016-10-24 12:15:49 UTC
Well, the second run result for aggressive is here ("normal" will be updated).

aggressive:
 [root@gprfs039 ~]# gluster v rebalance test1 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            50000        46.6GB         50000             0             0            completed        0:17:36
                                gprfs040                0        0Bytes         50000             0             0            completed        0:0:38
volume rebalance: test1: success

starvation count: [2016-10-24 12:10:31.646428] I [dht-rebalance.c:3860:gf_defrag_start_crawl] 0-test1-dht: Starvation count:34465

This confirms the data from comment 10 and indicates we definitely need to re-engineer the crawler. I will keep updating on the progress.

Regards,
Susant

Comment 12 Susant Kumar Palai 2016-10-24 15:52:01 UTC
Here is the result from "normal" run.

[root@gprfs039 ~]# gluster v rebalance test1 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            50000        46.6GB         50000             0             0            completed        0:17:4
                                gprfs040                0        0Bytes         50000             0             0            completed        0:0:31
volume rebalance: test1: success
[root@gprfs039 ~]# 

Starvation count:10

Comment 13 Susant Kumar Palai 2016-10-26 05:59:49 UTC
To remove the bottleneck in the crawl (just to verify), I removed the lookup needed before pushing an entry to the migration queue. This does not hamper rebalance in my test case, since all the entries on one node will be migrated for sure.

Then when I ran rebalance in _aggressive_ mode, it still did not give any performance improvement over _normal_ mode.

But the interesting fact is that the starvation count in _aggressive_ mode came down to 20. Since there was still no perf boost in aggressive mode, we need to figure out whether the disk and network bandwidth were already saturated, leaving no headroom for a boost.

Will discuss this with Manoj to get info on how to test the disk/network usage.

Regards,
Susant

Comment 14 Manoj Pillai 2016-10-26 06:14:28 UTC
(In reply to Susant Kumar Palai from comment #13)

I checked the setup and the current config is using a single disk and using the 1G network. You should change to using the 10GbE link and a 12-disk RAID-6 device for performance tests.

But the part about starvation count dropping drastically is good to hear.

Comment 15 Susant Kumar Palai 2016-10-26 06:30:31 UTC
(In reply to Manoj Pillai from comment #14)
> (In reply to Susant Kumar Palai from comment #13)
> 
> I checked the setup and the current config is using a single disk and using
> the 1G network. You should change to using the 10GbE link and a 12-disk
> RAID-6 device for performance tests.

Will test out again after the right config. Thanks Manoj for pointing this out.

> 
> But the part about starvation count dropping drastically is good to hear.
Yes, this tells us that lookups are slowing down the crawl. We can't avoid them but can optimize for sure.

Comment 16 Susant Kumar Palai 2016-10-27 13:21:22 UTC
After updating the config to a 10GbE link and 12-disk RAID-6, here is the result.

aggressive: 
 [root@gprfs039 Containerized_gluster]# grs
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            50000        46.6GB         50000             0             0            completed        0:6:59
                            172.17.40.40                0        0Bytes         50000             0             0            completed        0:0:46
volume rebalance: test1: success
Starvation count:20

normal:
[root@gprfs039 glusterfs]# grs
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            50000        46.6GB         50000             0             0            completed        0:8:11
                            172.17.40.40                0        0Bytes         50000             0             0            completed        0:0:43
volume rebalance: test1: success
Starvation count:10


"Aggressive" seems to be little bit faster than "normal". 

Will check disk-usage and network-usage during aggressive mode and update here.

Comment 17 Susant Kumar Palai 2016-11-02 07:24:34 UTC
Created attachment 1216375 [details]
Disk/network stat for rebalance in aggressive mode

The stats were collected on the same setup as before (two nodes, 50K files of 1MB each, lookups removed before entries are added to the migration queue).

Comment 18 Manoj Pillai 2016-11-02 08:22:09 UTC
Now that your setup has been changed to use 10GbE and RAID-6 brick, can you please post the rebalance times for the following cases?

Time for Rebalance to complete with existing code:
Normal mode:
Aggressive mode:

Time for Rebalance to complete with lookup removed:
Normal mode:
Aggressive mode:

Comment 19 Susant Kumar Palai 2016-11-02 08:59:33 UTC
(In reply to Manoj Pillai from comment #18)
> Now that your setup has been changed to use 10GbE and RAID-6 brick, can you
> please post the rebalance times for the following cases?
> 
> Time for Rebalance to complete with existing code:
> Normal mode:
> Aggressive mode:
Will update this one soon.
> 
> Time for Rebalance to complete with lookup removed:
> Normal mode:
> Aggressive mode:
This is there in comment 16.

Comment 20 Susant Kumar Palai 2016-11-03 06:54:45 UTC
Ran tests with existing code and new config as asked by Manoj.

Here is the result.

aggressive:
[root@gprfs039 ~]# grs
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            50000        46.6GB         50000             0             0            completed        0:7:37
                            172.17.40.40                0        0Bytes         50000             0             0            completed        0:0:49
volume rebalance: test1: success

[2016-11-02 11:17:05.197401] I [dht-rebalance.c:3860:gf_defrag_start_crawl] 0-test1-dht: Starvation count:16760


normal:
[root@gprfs039 ~]# grs
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            50000        46.6GB         50000             0             0            completed        0:6:42
                            172.17.40.40                0        0Bytes         50000             0             0            completed        0:0:48
volume rebalance: test1: success

starvation count: 10

* Aggressive performing worse than normal.


I also ran the same test with a chunk size of 1MB (existing code). Here is the result.

aggressive

[root@gprfs039 glusterfs]# grs
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            50000        46.6GB         50000             0             0            completed        0:6:42
                            172.17.40.40                0        0Bytes         50000             0             0            completed        0:0:52
volume rebalance: test1: success
Starvation count:50000

normal

[root@gprfs039 ~]# grs
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            50000        46.6GB         50000             0             0            completed        0:7:46
                            172.17.40.40                0        0Bytes         50000             0             0            completed        0:0:43
volume rebalance: test1: success

starvation count: 10

* Seems chunk size helped aggressive a little.


To get a better comparison, running a test with 50k files of 100MB each (about 5TB of data in total), no lookups, 1MB chunk size.


Will update the result soon.

Comment 21 Susant Kumar Palai 2016-11-04 05:18:05 UTC
Test: no lookup, 100MB file size, 50000 files

aggressive:
[root@gprfs039 ~]# grs
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            50000         4.5TB         50000             0             0            completed        4:43:34
                            172.17.40.40                0        0Bytes         50000             0             0            completed        0:1:13
volume rebalance: test1: success

[2016-11-03 03:39:18.802543] I [dht-rebalance.c:3846:gf_defrag_start_crawl] 0-test1-dht: Starvation count:20

normal:
[root@gprfs039 ~]# grs
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            50000         4.5TB         50000             0             0            completed        4:52:4
                            172.17.40.40                0        0Bytes         50000             0             0            completed        0:1:22
volume rebalance: test1: success
[root@gprfs039 ~]# 


starvation count: 10

Comment 22 Susant Kumar Palai 2016-11-04 07:18:00 UTC
Let me summarize what we have tested so far, why, and the conclusions.

The setup I have taken is 2-node, pure distribute (to keep the lower translators out of the picture).

What I wanted is for all the existing files to be migrated, so that the rebalance stats are absolute and comparable. It did not seem time-efficient to create, e.g., 100K files, add a brick and expect 50K files to be migrated, since there would be time spent in creation plus 50K extra lookups. So I preferred to create 50K files and rename them such that every file has a different hashed and cached subvolume (I ran a script to generate such 50K file names and have saved it for future use); hence, 50K migration-eligible files. So at the end, one node has all 50K data/cached files and the other node has all 50K linkto files. The expectation would be that _aggressive_ finishes the job twice as fast as _normal_.

One advantage of the above setup during these tests was that I could avoid the lookups needed to figure out whether files are eligible for migration, since I know that all 50K files from one node will be migrated for sure.

I also collected the "starvation count". Any time a migration thread does not find an entry in the queue, it is recorded; the number of such events throughout the rebalance is the "starvation count".

I will summarize only the tests executed on the 10Gbps link + RAID-6 setup.

Machine: 24 cores
aggressive thread count: 20
normal thread count: 10

Test-case 1: Existing code(no change), size-of-file: 1MB, file-number: 50K
 -aggressive:
  -time 0:7:37
  -starvation count 16760

 -normal:
  -time 0:6:42
  -starvation count 10

 -conclusion:
  -one- this confirms the bug: going from normal to aggressive throttling does not give a perf boost.
  -two- the huge starvation count in aggressive mode suggests the crawler is not able to feed the queue fast enough.

From the above test case I guessed that maybe the lookups (there are two: one to figure out whether the file is a linkto file and one to figure out whether the file is eligible for migration) are slowing down the crawler, so I removed those two lookups (as I mentioned earlier, the above setup helped in this case).

Test-case 2: Modified code to remove the two lookups, size-of-file: 1MB, file-number: 50K

 -aggressive:
  -time  0:6:59
  -starvation count 20

 -normal:
  -time 0:8:11
  -starvation count 10

 -conclusion: The starvation count in aggressive mode came down to 20 from 16760, which confirms that the lookups were slowing down the crawler. BUT the perf difference was still not very high.

The lack of a perf difference suggested the problem is somewhere else. The rebalance file read size was 128KB; maybe it would be better to try a 1MB read size.

Test-case 3: Test-case 2, with chunk size: 1MB, file-number: 50K, size-of-file: 100MB leading to 5TB data

 -aggressive: 
  -time  4:43:34
  -starvation count 20

 -normal:
  -time 4:52:4
  -starvation count 10

 -conclusion: So clearly chunk size is not to be blamed. 

So the whole picture suggests the crawler, the lookups and the chunk size are definitely not the bottlenecks.

Another thing I am thinking about is whether creating 20 threads on a 24-core setup is ideal. With 20 threads there will be 20 parallel disk/network I/Os. There are also sync threads, which can scale up to 16 in the rebalance process, and the I/O threads on the bricks serving requests. There may be other variables as well that we need to figure out.

Shyam, would like your input on this.

Thanks,
Susant

Comment 23 Shyamsundar 2016-11-07 15:40:05 UTC
(In reply to Susant Kumar Palai from comment #22)
> Let's summarize why and what we have tested so far and the conclusions.
> 
> The setup I have taken is 2 node, pure distribute(let's eliminate lower
> translators from the picture).
> 
> I will summarize only the tests been executed on 10Gbps Link + Raid-6 setup.
> 
> Machine: 24 cores
> aggressive thread count: 20
> normal thread count: 10
> 
> Test-case 1: Existing code(no change), size-of-file: 1MB, file-number: 50K
>  -aggressive:
>   -time 0:7:37
>   -starvation count 16760
> 
>  -normal:
>   -time 0:6:42
>   -starvation count 10
> 
>  -conclusion:
>   -one- this info confirms the bug that normal to aggressive throttling does
> not give perf boost.
>   -two- huge starvation count in aggressive suggested crawler is not able to
> feed the queue fast enough.
> 
> From the above test case I guessed may be lookups (there are two, one to
> figure out if the file is a linkto file and the second to figure out
> eligibility of the file for migration) are slowing down the crawler and
> let's remove the above two lookups(as I mentioned earlier the above setup
> helped in this case).
> 
> Test-case 2: Modified code to remove the two lookups, size-of-file: 1MB,
> file-number: 50K
> 
>  -aggressive:
>   -time  0:6:59
>   -starvation count 20
> 
>  -normal:
>   -time 0:8:11
>   -starvation count 10
> 
>  -conclusion: The starvation count in aggressive mode reduced to 20 from
> 16760. Well that justifies lookup slowing down crawler. BUT the perf
> difference was not very high.
> 
> No perf difference suggested the problem is somewhere else. The rebalance
> file read size was 128K. May be it would be better to try with 1MB read
> size. 
> 
> Test-case 3: Test-case 2, with chunk size: 1MB, file-number: 50K,
> size-of-file: 100MB leading to 5TB data
> 
>  -aggressive: 
>   -time  4:43:34
>   -starvation count 20
> 
>  -normal:
>   -time 4:52:4
>   -starvation count 10
> 
>  -conclusion: So clearly chunk size is not to be blamed. 
> 
> So the whole picture suggests crawler, lookup, chunk size are not the
> bottlenecks for sure.
> 
> Another thing I am thinking is to figure out whether creating 20 threads in
> a 24 core setup is ideal. With 20 threads there will be 20 parallel
> Disk/Network IO. There will be sync threads which can scale up to 16 in
> rebal process, IO threads for bricks for serving requests. There may be
> other variables as well which we need to figure out.

Was there any bottleneck? Like disk IO being 100% busy or other (network usage etc.). Why I ask is, if we are bottlenecked somewhere else, then the thread count changes, or attempting more work may not solve the problem.

Do we have that data?

<Still leaving this as NEEDINFO, so that we can discuss>

Comment 24 Susant Kumar Palai 2016-11-08 09:53:37 UTC
Shyam, we have the data. The disk and network stats are attached in comment 17. The files are prefixed with the server id, e.g. 39-iostat_rebal represents server 39.

All the source/data files were present on 39, and the destination was 40.

Pasting some of the data collected:
sdb was where the bricks resided on both 39 and 40, and both servers use p4p1 as the network interface for the 10GbE link.

On average on machine 39 (source):
disk read speed: 101128.40 KB/s =~ 100 MB/s
network writes: 103045.13 KB/s

On average on machine 40 (destination):
disk write speed: 104006.95 KB/s
network reads: 104006.95 KB/s

I don't see any network/disk bandwidth congestion.

Please review once as I might have missed something.

Thanks,
Susant
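
For anyone re-deriving these averages from the attached iostat files, a rough sketch (assumes iostat -tkdx-style output where rkB/s is column 6; verify against the header line of your sysstat version):

# mean read throughput on sdb from the attached 39-iostat_rebal
awk '$1 == "sdb" { sum += $6; n++ } END { if (n) printf "%.2f kB/s\n", sum / n }' 39-iostat_rebal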

Comment 25 Susant Kumar Palai 2016-11-08 10:07:19 UTC
Posted a patch for crawler perf improvement here: http://review.gluster.org/#/c/15781/.

Comment 26 Shyamsundar 2016-11-13 23:44:06 UTC
(In reply to Susant Kumar Palai from comment #24)
> Shyam, we have the data. The disk and network stats are attached in comment
> 17. The files are prefixed with server id. e.g. 39-iostat_rebal represents
> server 39.
> 
> All the source/data files were present in 39. And the destination was 40.
> 
> Pasting some part of the data collected.
> sdb was where the bricks resided in both 39 and 40. And the both the servers
> has p4p1 as network interface for 10G Ethernet link.
> 
> On average on Machine 39(source)
> disk read speed: 101128.40 KB/s =~ 100MB/s 
> network writes : 103045.13KB/s
> 
> On average on Machine 40(source)
> disk write speed: 104006.95 KB/s
> network reads: 104006.95 KB/s
> 
> I don't see any network/disk bandwidth congestion.
> 
> Please review once as I might have missed something.
> 
> Thanks,
> Susant

Thanks.

That data shows nothing is busy; we would need to dig into this more to see where the problem could lie.

Nithya and I are going to run similar tests to determine where we are starved and to get closer to the RCA if possible.

Comment 29 Bipin Kunal 2016-11-30 18:18:04 UTC
Thanks, Susant, for the info. I wanted brief information since there will be a GSS representative in the meeting as well.

I know Nithya is aware of everything and can cover it in the meeting, but having the information with GSS is also important.

Comment 31 Susant Kumar Palai 2017-01-03 06:30:47 UTC
Update:
   
Currently rebalance creates the configured number of threads, each of which dequeues one entry at a time from the global queue of migratable files filled by the crawler and submits each migration job as a synctask. But there is overhead in synctasks: the tasks are queued into the task list and threads then pick tasks up from this list.

To eliminate this overhead, I tried doing the migrations directly in migration threads spawned at the beginning of rebalance.

And the results are interesting.
10threads, 50k files, 1MB each, master code:
---------------------------------------------
[root@gprfs039 nost]# gluster v rebalance nost status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            50000        46.6GB         50000             0             0            completed        0:6:37
                            172.17.40.40                0        0Bytes         50000             0             0            completed        0:0:49
volume rebalance: nost: success


10threads, 50k files, 1MB each, no_synctasks:
------------------------------------------
[root@gprfs039 nost]# gluster v rebalance nost status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            50000        46.6GB         50000             0             0            completed        0:03:27
                            172.17.40.40                0        0Bytes         50000             0             0            completed        0:00:14
Estimated time for rebalance to complete :        0:02:48
volume rebalance: nost: success
[root@gprfs039 nost]# 

There is a 100% surge in perf. Will do another set of runs to make sure the data is consistent.

Comment 32 Susant Kumar Palai 2017-01-24 11:35:56 UTC
Update:
    Patch for above perf enhancement: https://review.gluster.org/#/c/16427/
Tried the above test case and got similar numbers.

With patch:
[root@gprfs039 glusterfs]# grs
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            50000        46.6GB         50000             0             0            completed        0:3:56
                            172.17.40.40                0        0Bytes         50000             0             0            completed        0:0:35
volume rebalance: test1: success


Without patch: (normal mode)
[root@gprfs039 glusterfs]# gluster v rebalance test1 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            50000        46.6GB         50000             0             0            completed        0:7:29
                            172.17.40.40                0        0Bytes         50000             0             0            completed        0:0:36
volume rebalance: test1: success

Data speed was ~250 MBps.


What else is to be done further?
- To eliminate disk contention from the picture, I will replace the hard disks with a ram-disk (a brick-setup sketch follows this list). Should there be any improvement with the ram-disk setup, I will run tests with a multi-disk setup (possibly with SSDs).

- Will need help from the perf team with suggestions about possible bottlenecks.
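
A sketch of one way a ram-backed brick could be set up for such a test (assumes the brd module is available; size and paths are placeholders and must fit in RAM):

# 50GB ram block device; rd_size is in KB
modprobe brd rd_nr=1 rd_size=$(( 50 * 1024 * 1024 ))
mkfs.xfs -f /dev/ram0
mkdir -p /bricks/ramdisk
mount /dev/ram0 /bricks/ramdisk
# then create/extend the test volume with a brick under /bricks/ramdisk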

Adding needinfo on Manoj so that he is in the loop for this thread.

-Susant

Comment 33 Susant Kumar Palai 2017-01-24 13:01:23 UTC
Forgot to add one more thing.

- Will run a program that does I/O similar to rebalance. This will give an idea of whether there can be further improvement. Will update results soon.

Comment 34 Manoj Pillai 2017-01-30 06:01:26 UTC
Susant, could you report numbers for all 3 throttle settings -- lazy, normal and aggressive -- with the patch that eliminates synctasks from the migration path?

For each run please collect disk (iostat -N -tkdx 10), thread-level cpu (top -bH -d 10) and network stats (sar -n DEV 10), in order to check for bottlenecks.
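
A sketch of how these collectors could be left running for the duration of a run (output file names are arbitrary; the jobspec kill assumes an interactive shell, otherwise capture $! per command):

iostat -N -tkdx 10 > iostat-rebal.log 2>&1 &
top -bH -d 10      > top-threads.log  2>&1 &
sar -n DEV 10      > sar-net.log      2>&1 &

gluster volume rebalance test1 start
# poll `gluster volume rebalance test1 status` until it reports completed, then:
kill %1 %2 %3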

Comment 35 Susant Kumar Palai 2017-01-30 06:14:20 UTC
(In reply to Manoj Pillai from comment #34)
> Susant, could you report numbers for all 3 throttle settings -- lazy, normal
> and agressive -- with the patch that eliminates synctasks from the migration
> path?
> 
> For each run please collect disk (iostat -N -tkdx 10), thread-level cpu (top
> -bH -d 10) and network stats (sar -n DEV 10), in order to check for
> bottlenecks.

Yes Manoj, should be done by EOD. In my previous runs I found the perf. numbers saturate at 5 threads. Will update once done.

Thanks,
Susant

Comment 36 Susant Kumar Palai 2017-01-31 04:14:13 UTC
Updating with the numbers for the 3 settings. Somehow the iostat reports for normal and lazy have just one sample in them, so I will have to run those tests again.

perf numbers for all three settings:

20-threads(aggressive):
[root@gprfs039 ~]# grs
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            50000        46.6GB         50000             0             0            completed        0:3:53
                            172.17.40.40                0        0Bytes         50000             0             0            completed        0:0:36
volume rebalance: test1: success


10 threads(normal):
[root@gprfs040 glusterfs]# grs
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes         50000             0             0            completed        0:0:33
                            172.17.40.39            50000        46.6GB         50000             0             0            completed        0:3:54
volume rebalance: test1: success


1 thread (lazy):
[root@gprfs039 ~]# grs
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            50000        46.6GB         50000             0             0            completed        0:16:47
                            172.17.40.40                0        0Bytes         50000             0             0            completed        0:1:27
volume rebalance: test1: success


Will create an attachment of all the reports that are available.

Comment 37 Susant Kumar Palai 2017-01-31 04:21:43 UTC
Created attachment 1246088 [details]
Disk/network/cpu stat for lazy,normal,aggressive

Comment 38 Susant Kumar Palai 2017-01-31 04:40:40 UTC
Created attachment 1246089 [details]
iostat-report-for-10threads(normal)

Comment 39 Manoj Pillai 2017-01-31 18:09:03 UTC
From the stats uploaded in comment #37 and comment #38, there is no visible CPU, network or disk saturation. 

Disk bandwidth during rebalance in normal or aggressive mode is < 220 MB/s. On sequential I/O the underlying RAID-6 device can give about 1GB/s. But since the file size here is 1MB, which is less than the RAID-6 data stripe size, we should not expect anything close to the peak here. I think it makes sense to gather data for a large file data set (say 1GB file size) as well to see how close we are able to get to 1GB/s. Or alternatively, estimate what throughput your storage h/w can give for 1MB file size.

In the collected stats, I see almost no disk activity on server 39. I'm guessing this is because data is already cached when rebalance starts. For a clearer picture, and to model customer case better, I'd suggest dropping caches before the run.

Also, for brick on RAID-6, I'd recommend server.event-threads=4, if you're not already using it.
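
Those two suggestions in command form (a sketch; volume name is a placeholder):

# drop page/dentry/inode caches on each brick server before starting the run
sync; echo 3 > /proc/sys/vm/drop_caches

# bump brick-side event threads for the RAID-6 backed bricks
gluster volume set test1 server.event-threads 4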

Comment 40 Susant Kumar Palai 2017-02-02 11:01:50 UTC
Update (wrt comment 39, Thanks Manoj for pointing out the test case):

- Perf numbers for rebalance:

- With Patch https://review.gluster.org/#/c/16427/

- Aggressive (20-threads):
[root@gprfs039 glusterfs]# gluster v rebalance test2 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost             1001       954.6GB          1001             0             0            completed        0:23:41
                            172.17.40.40                0        0Bytes          1001             0             0            completed        0:0:1

- With existing code (synctasks):

- Aggressive (20 threads):
[root@gprfs039 with_syncop]# gluster v rebalance test2 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost             1001       954.6GB          1001             0             0            completed        0:56:3
                            172.17.40.40                0        0Bytes          1001             0             0            completed        0:0:1
volume rebalance: test2: success

So clearly there is a 100% perf improvement :) .

- With Patch:
Normal (10 threads):
[root@gprfs039 glusterfs]# gluster v rebalance test3 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost             1001       954.6GB          1001             0             0            completed        0:38:3
                            172.17.40.40                0        0Bytes          1001             0             0            completed        0:0:1
volume rebalance: test3: success

- With existing code:
Normal (10 threads):
[root@gprfs039 glusterfs]# grs
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost             1001       953.7GB          1001             0             0            completed        0:55:30
                            172.17.40.40                0        0Bytes          1001             0             0            completed        0:0:1
volume rebalance: test1: success


So the data says the objective of getting more performance in _AGGRESSIVE_ mode compared to _NORMAL_ mode is achieved. And with the above patch we are able to hit the disk limit (700~900 MBps from the iostat report), which was the expectation.

Will be attaching the iostat/sar/cpu report for Manoj to analyze and update.


Thanks,
Susant

Comment 41 Susant Kumar Palai 2017-02-02 11:12:37 UTC
Created attachment 1247051 [details]
Rebalance perf report(iostat/sar/cpu) for 1GB data

Comment 42 Atin Mukherjee 2017-02-20 13:02:05 UTC
upstream patch : https://review.gluster.org/16427

Comment 44 Susant Kumar Palai 2017-02-23 12:51:58 UTC
Created attachment 1256908 [details]
Rebal-stat with 4 server run for server 45,46

Comment 45 Susant Kumar Palai 2017-02-23 12:53:42 UTC
Created attachment 1256909 [details]
Rebal-stat with 4 server run for server 47,48

Comment 46 Manoj Pillai 2017-03-01 07:34:31 UTC
Discussed results with Susant. For smaller file sizes (1MB and 128KB), disk utilization is well below saturation even with 20 threads, suggesting a bottleneck elsewhere. Susant reported improved rebalance times (and higher disk utilization) with a patch that improved crawler efficiency.

Comment 48 Susant Kumar Palai 2017-03-01 08:39:39 UTC
(In reply to Manoj Pillai from comment #46)
> Discussed results with Susant. For smaller file sizes (1MB and 128KB), disk
> utilization is well below saturation even with 20 threads, suggesting a
> bottleneck elsewhere. Susant reported improved rebalance times (and higher
> disk utilization) with a patch that improved crawler efficiency.

With the crawl improvement patch [1] here are the numbers.
setup: 4*1 to 8*1
file size: 1MB (1 million files)
                                          normal      aggressive    lazy         files migrated
Patched                                   0:36:59     0:36:22       2:08:19      249484
Patched (file crawl improvement patch)    0:28:49     0:24:44       not tried    249484

With the crawl patch there is (for this setup) a 1.5x improvement. Plus, there is now a perf difference between normal and aggressive mode.



[1] https://review.gluster.org/#/c/15781/

Comment 49 Susant Kumar Palai 2017-03-01 08:49:06 UTC
Attaching iostat info for the above run (with the crawl patch).

Comment 50 Susant Kumar Palai 2017-03-01 08:51:13 UTC
Created attachment 1258605 [details]
iostat-report-for-20threads with crawl patch

Comment 64 errata-xmlrpc 2017-09-21 04:28:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774


Comment 66 Red Hat Bugzilla 2023-09-14 03:31:43 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

