Bug 1695480 - Global Thread Pool
Summary: Global Thread Pool
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: mainline
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Xavi Hernandez
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-03 08:09 UTC by Xavi Hernandez
Modified: 2020-03-12 13:00 UTC
CC List: 2 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2020-03-12 13:00:31 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Xavi Hernandez 2019-04-03 08:09:22 UTC
Description of problem:

The Global Thread Pool provides lower contention and increased performance in some cases, but it has been observed that sometimes there is a huge increase in the number of requests going to the disks in parallel, which seems to cause a performance degradation.

In fact, sending the same number of requests from fewer threads seems to give higher performance.

The current implementation already does some dynamic adjustment of the number of active threads based on the current number of requests, but it doesn't consider the load on the back-end file systems. This means that as more requests arrive, the number of threads is scaled accordingly, which can have a negative impact if the back-end is already saturated.

The way to control this in the current version is to manually adjust the maximum number of threads that can be used, which effectively limits the load on the back-end file systems even if more requests arrive, but this is only useful for volumes with a homogeneous and constant workload.
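
For illustration only, here is a minimal sketch of the behaviour described in the two paragraphs above (this is not the actual GlusterFS code; pool_t, pool_target_threads and all values are made up): the number of active workers grows with the number of pending requests, and the only limit is a manually configured cap, with no feedback from the back-end.

    /*
     * Minimal sketch (not GlusterFS code) of scaling workers with queue
     * depth, bounded only by a static, manually configured maximum.
     */
    #include <stdio.h>

    typedef struct {
        int active_threads;   /* workers currently running */
        int max_threads;      /* manually configured hard cap */
    } pool_t;

    /* Decide how many workers should run for a given queue depth. */
    static int
    pool_target_threads(const pool_t *pool, int pending_requests)
    {
        int target = pending_requests;      /* scale 1:1 with demand */
        if (target > pool->max_threads)
            target = pool->max_threads;     /* only limit is the static cap */
        if (target < 1)
            target = 1;
        return target;
    }

    int
    main(void)
    {
        pool_t pool = { .active_threads = 1, .max_threads = 16 };
        int queue_depths[] = { 2, 8, 64, 256 };

        for (unsigned i = 0; i < sizeof(queue_depths) / sizeof(queue_depths[0]); i++) {
            pool.active_threads = pool_target_threads(&pool, queue_depths[i]);
            /* Note: back-end saturation never enters the decision. */
            printf("pending=%3d -> threads=%d\n", queue_depths[i],
                   pool.active_threads);
        }
        return 0;
    }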

To make it more versatile, the maximum number of threads needs to be self-adjusted automatically, so that it adapts dynamically to the current load and is useful in the general case.
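
As a hypothetical sketch of such self-adjustment (again, not existing GlusterFS code; adaptive_pool_t, pool_adjust_cap, the latency metric and the thresholds are all assumptions chosen for the example), the cap itself could be tuned from back-end feedback instead of being a fixed value:

    /*
     * Hypothetical feedback loop: shrink the thread cap when the back-end
     * looks saturated (latency well above target), grow it when there is
     * headroom. Thresholds are illustrative only.
     */
    #include <stdio.h>

    typedef struct {
        int    max_threads;      /* dynamically adjusted cap */
        int    hard_limit;       /* absolute upper bound */
        double latency_target;   /* back-end latency considered healthy (ms) */
    } adaptive_pool_t;

    /* Adjust the cap after each measurement interval. */
    static void
    pool_adjust_cap(adaptive_pool_t *pool, double avg_latency_ms)
    {
        if (avg_latency_ms > pool->latency_target * 1.5) {
            /* Back-end looks saturated: reduce parallelism. */
            if (pool->max_threads > 1)
                pool->max_threads /= 2;
        } else if (avg_latency_ms < pool->latency_target) {
            /* Back-end has headroom: allow more concurrency. */
            if (pool->max_threads < pool->hard_limit)
                pool->max_threads++;
        }
    }

    int
    main(void)
    {
        adaptive_pool_t pool = { .max_threads = 16, .hard_limit = 64,
                                 .latency_target = 5.0 };
        double samples[] = { 3.0, 4.0, 12.0, 20.0, 6.0, 3.5 };

        for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
            pool_adjust_cap(&pool, samples[i]);
            printf("latency=%5.1fms -> max_threads=%d\n", samples[i],
                   pool.max_threads);
        }
        return 0;
    }

Any real implementation would of course need a better saturation signal than a single latency average, but the structure of the control loop would be similar.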

Version-Release number of selected component (if applicable): mainline


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Xavi Hernandez 2019-06-25 18:16:55 UTC
The autoscaling implementation was not giving the expected results. I implemented a new approach that works much better automatically. However, it doesn't beat the current implementation in all cases.

Results can be found here: https://docs.google.com/spreadsheets/d/19JqvuFKZxKifgrhLF-5-bgemYj8XKldUox1QwsmGj2k/edit?usp=sharing

One thing I've observed is that both the old and new implementations show a huge variation in some tests. I cannot explain why right now. My plan is to add instrumentation to the Gluster code so I can analyze these cases in more detail and see where the delays come from.
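
The kind of instrumentation meant here could be as simple as the following sketch (not existing GlusterFS code; req_timing_t and now_us are invented for the example): timestamping a request when it is queued, when a worker picks it up, and when it completes, so queue wait can be separated from back-end time.

    /* Illustrative per-request timing, to split queue wait from I/O time. */
    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    static uint64_t
    now_us(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
    }

    typedef struct {
        uint64_t queued_us;     /* when the request entered the queue */
        uint64_t dispatched_us; /* when a worker picked it up */
        uint64_t completed_us;  /* when the back-end call returned */
    } req_timing_t;

    int
    main(void)
    {
        req_timing_t t;

        t.queued_us = now_us();
        /* ... request waits in the queue ... */
        t.dispatched_us = now_us();
        /* ... back-end I/O happens here ... */
        t.completed_us = now_us();

        printf("queue wait: %llu us, back-end time: %llu us\n",
               (unsigned long long)(t.dispatched_us - t.queued_us),
               (unsigned long long)(t.completed_us - t.dispatched_us));
        return 0;
    }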

Comment 2 Worker Ant 2020-03-12 13:00:31 UTC
This bug is moved to https://github.com/gluster/glusterfs/issues/983, and will be tracked there from now on. Visit the GitHub issue URL for further details.

