Bug 2140203

Summary: tuned throughput-performance's scheduler plugin usage yields high CPU usage
Product: Red Hat Enterprise Linux 8
Reporter: Georg Sauthoff <georg.sauthoff>
Component: tuned
Assignee: Jaroslav Škarvada <jskarvad>
Status: NEW
QA Contact: Robin Hack <rhack>
Severity: unspecified
Priority: unspecified
Version: 8.6
CC: dsaha, duge, gnaik, jeder, jskarvad, mmatsuya
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified

Description Georg Sauthoff 2022-11-04 19:28:42 UTC
Description of problem:
With RHEL 8.6, the tuned throughput-performance profile uses the scheduler plugin for some settings that it previously (e.g. in RHEL 7.9) applied via the sysctl plugin.

Version-Release number of selected component (if applicable):
tuned-2.18.0-2.el8_6.1.noarch

How reproducible:
always


Steps to Reproduce:
1. make sure that the throughput-performance tuned profile is activated (otherwise: `tuned-adm profile throughput-performance`)
2. increase the fork rate of the system until the tuned process uses 30% CPU or more (e.g. with the loop sketched after this list)
3. `perf trace -s -p $(pgrep tuned) -- sleep 60`
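
For step 2, a minimal fork-load sketch (an assumption on my part; any loop that spawns short-lived processes will do):

```
# each iteration forks and execs a short-lived process;
# run this in a separate shell while observing tuned
while true; do /bin/true; done
```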

Actual results:
tuned CPU usage increases with the fork rate, easily reaching 30% and more
perf trace output shows high syscall rates for one tuned thread, namely for poll(), read(), openat(), lseek(), ioctl(), close() and fstat()
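
One way to watch tuned's CPU usage while the fork load runs (a sketch, assuming the sysstat package providing pidstat is installed):

```
# report tuned's CPU utilization every 5 seconds;
# pgrep -o picks the oldest (main) matching PID
pidstat -u -p "$(pgrep -o tuned)" 5
```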

Expected results:
tuned CPU usage is very low (just a few percent) and is independent of the fork rate of the system.

Additional info:
This is caused by the way the scheduler plugin polls for process creation events, even when the plugin's configuration doesn't contain any process-matching declarations, as is the case with the throughput-performance profile. Each such event is then amplified by tuned invoking multiple syscalls on pseudo-files under /proc/$pid/.

Looking at a syscall trace in detail shows that several of the syscalls issued to read files under /proc/$pid/ are superfluous or even pointless (even if there were process-matching declarations in the config), e.g.:
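
A per-syscall listing like the one below can be captured with strace, for example (a sketch; the trace here may have been produced with a different tool):

```
# attach to all tuned threads and log the relevant syscalls
strace -f -e trace=openat,fstat,ioctl,lseek,read,close -p "$(pgrep -o tuned)"
```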

```
196436 openat(AT_FDCWD, "/proc/3678736/cmdline", O_RDONLY|O_CLOEXEC) = 28</proc/3678736/cmdline>
196436 fstat(28</proc/3678736/cmdline>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
196436 ioctl(28</proc/3678736/cmdline>, TCGETS, 0x7f1113ffd410) = -1 ENOTTY (Inappropriate ioctl for device)
196436 lseek(28</proc/3678736/cmdline>, 0, SEEK_CUR) = 0
196436 ioctl(28</proc/3678736/cmdline>, TCGETS, 0x7f1113ffd3f0) = -1 ENOTTY (Inappropriate ioctl for device)
196436 lseek(28</proc/3678736/cmdline>, 0, SEEK_CUR) = 0
196436 read(28</proc/3678736/cmdline>, "/opt/xyz/bin/foobar\0foobar\0", 8192) = 23
196436 read(28</proc/3678736/cmdline>, "", 8192) = 0
196436 close(28</proc/3678736/cmdline>) = 0
```
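
The fstat/ioctl(TCGETS)/lseek preamble before the actual read() is characteristic of CPython's default buffered open(); assuming tuned reads these pseudo-files via plain open(), the pattern can be reproduced like this (a sketch):

```
# reproduce the syscall preamble of a default buffered open() in CPython
strace -e trace=openat,fstat,ioctl,lseek,read,close \
    python3 -c 'open("/proc/self/cmdline", "rb").read()'
```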


A simple fix for the throughput-performance profile (which is activated by default on RHEL systems) is to convert the scheduler plugin settings back to sysctl ones, e.g. like this:

```
--- /usr/lib/tuned/throughput-performance/tuned.conf    2022-06-08 11:48:16.000000000 +0200
+++ new/throughput-performance/tuned.conf  2022-11-04 18:03:05.468461294 +0100
@@ -58,12 +58,11 @@
 # and move them to swap cache
 vm.swappiness=10
 
-[scheduler]
 # ktune sysctl settings for rhel6 servers, maximizing i/o throughput
 #
 # Minimal preemption granularity for CPU-bound tasks:
 # (default: 1 msec#  (1 + ilog(ncpus)), units: nanoseconds)
-sched_min_granularity_ns = 10000000
+kernel.sched_min_granularity_ns = 10000000
 
 # SCHED_OTHER wake-up granularity.
 # (default: 1 msec#  (1 + ilog(ncpus)), units: nanoseconds)
@@ -71,7 +70,7 @@
 # This option delays the preemption effects of decoupled workloads
 # and reduces their over-scheduling. Synchronous workloads will still
 # have immediate wakeup/sleep latencies.
-sched_wakeup_granularity_ns = 15000000
+kernel.sched_wakeup_granularity_ns = 15000000
 
 # Marvell ThunderX
 [sysctl.thunderx]
@@ -81,8 +80,8 @@
 kernel.numa_balancing=0
 
 # AMD
-[scheduler.amd]
-type=scheduler
+[sysctl.amd]
+type=sysctl
 uname_regex=x86_64
 cpuinfo_regex=${amd_cpuinfo_regex}
-sched_migration_cost_ns=5000000
+kernel.sched_migration_cost_ns=5000000
```

Comment 3 Jaroslav Škarvada 2023-03-23 14:11:33 UTC
Comment 3 Jaroslav Škarvada 2023-03-23 14:11:33 UTC
The CPU usage of the scheduler plugin depends on the number of forked processes, because by default the scheduler plugin (if enabled and configured) classifies and tunes newly created/forked processes. While we are working on optimizing the "process classifier", the underlying code depends on the 'perf' (performance counters) subsystem and its Python bindings, so it has performance limits. If customers don't need TuneD to tune newly created processes (mostly process priority and scheduling policy), they can run with this feature disabled, which will improve performance under heavy process forking. To configure it this way, add 'runtime=0' to the 'scheduler' plugin in the TuneD profile, i.e.:
```
[scheduler]
runtime=0
```

It's also possible to create a custom TuneD overlay profile with this setting that changes the behavior of the stock TuneD profile we ship.
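
A sketch of such an overlay profile (the profile name is my own choice; the paths follow the standard TuneD layout):

```
mkdir -p /etc/tuned/throughput-performance-noforks
cat > /etc/tuned/throughput-performance-noforks/tuned.conf <<'EOF'
[main]
include=throughput-performance

[scheduler]
runtime=0
EOF
tuned-adm profile throughput-performance-noforks
```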

Comment 5 Georg Sauthoff 2023-05-24 20:07:39 UTC
Well, why don't you add runtime=0 to the throughput-performance profile then?

Or even better: why don't you simply set the two sched_min_granularity_ns/sched_wakeup_granularity_ns settings via the sysctl plugin?

I mean, as-is, the scheduler plugin seems to be designed primarily for changing scheduling parameters of processes as they are created.
Which is great if you need it.

But in the throughput-performance profile you don't use it that way, at all.
You use it for **system-wide** settings - which are sysctls, for which the sysctl plugin already exists; that worked fine in RHEL 7 and still works fine in RHEL 8.
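
These really are ordinary sysctls - on a RHEL 8 kernel their current values can be checked directly, e.g.:

```
# on RHEL 8 these scheduler knobs are regular /proc/sys entries
sysctl kernel.sched_min_granularity_ns kernel.sched_wakeup_granularity_ns
```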


AFAICS, there is really no advantage to introducing the scheduler plugin in the throughput-performance profile, at all!
There is just the big disadvantage of wasting significant CPU on a system that forks a lot!

So what's the point of introducing a regression like this into the throughput-performance profile (which is activated by default)?


Sure, I could write and maintain my own throughput-performance-fixed profile which derives from throughput-performance via `include=throughput-performance` and sets `runtime=0` to work around your regression.

But this would be pointless work!


Just fix it at the source.