Bug 2140203
| Summary: | tuned throughput-performance's scheduler plugin usage yields high CPU usage | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Georg Sauthoff <georg.sauthoff> |
| Component: | tuned | Assignee: | Jaroslav Škarvada <jskarvad> |
| Status: | NEW --- | QA Contact: | Robin Hack <rhack> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.6 | CC: | dsaha, duge, gnaik, jeder, jskarvad, mmatsuya |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
The CPU usage of the scheduler plugin depends on the number of forked processes, because by default the scheduler plugin (if enabled and configured) classifies and tunes newly created/forked processes. While we are working on optimizing the "process classifier", the underlying code depends on the 'perf' (performance counters) subsystem and its Python bindings, so it has performance limits.

If customers don't need TuneD to tune newly created processes (mostly process priority and scheduling policy), they can run with this feature disabled, which improves performance under heavy forking. To configure it this way, add 'runtime=0' to the 'scheduler' plugin in the TuneD profile, i.e.:

```
[scheduler]
runtime=0
```

It's also possible to create a custom TuneD overlay profile with this setting, which changes the behavior of the stock TuneD profile we ship.

---

Well, why don't you add runtime=0 to the throughput-performance profile then? Or even better: why don't you simply set the two sched_min_granularity_ns/sched_wakeup_granularity_ns settings via the sysctl plugin?

I mean, as-is, the scheduler plugin seems to be designed primarily for changing the scheduling parameters of processes as they are created. Which is great if you need it. But the throughput-performance profile doesn't use it that way, at all. It uses it for **system-wide** settings - which are sysctls, for which the sysctl plugin already exists, which worked fine in RHEL 7 and still works fine in RHEL 8.

AFAICS, there are really no advantages to introducing the scheduler plugin in the throughput-performance profile, at all! There is just the big disadvantage of wasting significant CPU on a system that forks a lot! So what's the point of introducing a regression like this into the throughput-performance profile (which is activated by default)?

Sure, I could write and maintain my own throughput-performance-fixed profile which derives from throughput-performance via `include=throughput-performance` and sets `runtime=0` to work around your regression. But that would be pointless work! Just fix it at the source.
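For reference, a minimal version of the overlay/work-around profile both comments describe might look like the following sketch. The profile name `throughput-performance-fixed` and the path under /etc/tuned/ are taken from the discussion above and the usual TuneD conventions; this is not a shipped profile:

```
# /etc/tuned/throughput-performance-fixed/tuned.conf
[main]
# inherit everything from the stock profile
include=throughput-performance

[scheduler]
# disable runtime classification/tuning of newly forked processes
runtime=0
```

It would then be activated with `tuned-adm profile throughput-performance-fixed`.

---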
Description of problem:

With RHEL 8.6 the tuned throughput-performance profile uses the scheduler plugin for some settings for which it previously used the sysctl plugin (e.g. in RHEL 7.9).

Version-Release number of selected component (if applicable):

tuned-2.18.0-2.el8_6.1.noarch

How reproducible:

always

Steps to Reproduce:
1. Make sure that the throughput-performance tuned profile is activated (otherwise: `tuned-adm profile throughput-performance`)
2. Increase the fork rate of the system (until the tuned process uses 30 % CPU or more)
3. Run `perf trace -s -p $(pgrep tuned) -- sleep 60`

Actual results:

tuned CPU usage increases with the fork rate, easily up to 30 % and more. The perf trace output shows high syscall rates for one tuned thread, i.e. for poll(), read(), openat(), lseek(), ioctl(), close() and fstat().

Expected results:

tuned CPU usage is very low (just a few percent) and is independent of the fork rate of the system.

Additional info:

This is caused by how the scheduler plugin polls for process creation events, even when the plugin's configuration doesn't contain any process-matching declarations, as is the case with the throughput-performance profile. Each such event is then amplified by tuned invoking multiple syscalls on pseudo-files under /proc/$pid/. Looking at a syscall trace in detail shows that a number of the syscalls issued to read files under /proc/$pid/ are superfluous or even pointless (even if there were process-matching declarations in the config), e.g.:

```
196436 openat(AT_FDCWD, "/proc/3678736/cmdline", O_RDONLY|O_CLOEXEC) = 28</proc/3678736/cmdline>
196436 fstat(28</proc/3678736/cmdline>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
196436 ioctl(28</proc/3678736/cmdline>, TCGETS, 0x7f1113ffd410) = -1 ENOTTY (Inappropriate ioctl for device)
196436 lseek(28</proc/3678736/cmdline>, 0, SEEK_CUR) = 0
196436 ioctl(28</proc/3678736/cmdline>, TCGETS, 0x7f1113ffd3f0) = -1 ENOTTY (Inappropriate ioctl for device)
196436 lseek(28</proc/3678736/cmdline>, 0, SEEK_CUR) = 0
196436 read(28</proc/3678736/cmdline>, "/opt/xyz/bin/foobar\0foobar\0", 8192) = 23
196436 read(28</proc/3678736/cmdline>, "", 8192) = 0
196436 close(28</proc/3678736/cmdline>) = 0
```

A simple fix for the throughput-performance profile (which is activated by default on RHEL systems) is to convert the scheduler plugin settings back to sysctl ones, e.g. like this:

```
--- /usr/lib/tuned/throughput-performance/tuned.conf	2022-06-08 11:48:16.000000000 +0200
+++ new/throughput-performance/tuned.conf	2022-11-04 18:03:05.468461294 +0100
@@ -58,12 +58,11 @@
 # and move them to swap cache
 vm.swappiness=10
 
-[scheduler]
 # ktune sysctl settings for rhel6 servers, maximizing i/o throughput
 #
 # Minimal preemption granularity for CPU-bound tasks:
 # (default: 1 msec# (1 + ilog(ncpus)), units: nanoseconds)
-sched_min_granularity_ns = 10000000
+kernel.sched_min_granularity_ns = 10000000
 
 # SCHED_OTHER wake-up granularity.
 # (default: 1 msec# (1 + ilog(ncpus)), units: nanoseconds)
@@ -71,7 +70,7 @@
 # This option delays the preemption effects of decoupled workloads
 # and reduces their over-scheduling. Synchronous workloads will still
 # have immediate wakeup/sleep latencies.
-sched_wakeup_granularity_ns = 15000000
+kernel.sched_wakeup_granularity_ns = 15000000
 
 # Marvell ThunderX
 [sysctl.thunderx]
@@ -81,8 +80,8 @@
 kernel.numa_balancing=0
 
 # AMD
-[scheduler.amd]
-type=scheduler
+[sysctl.amd]
+type=sysctl
 uname_regex=x86_64
 cpuinfo_regex=${amd_cpuinfo_regex}
-sched_migration_cost_ns=5000000
+kernel.sched_migration_cost_ns=5000000
```
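As an aside on the trace above: the fstat/ioctl(TCGETS)/lseek calls are characteristic of CPython's buffered, text-mode open(); reading the pseudo-file through the raw os-level interface issues only the openat/read/close syscalls. A minimal sketch of the cheaper pattern (illustrative only, not a patch against tuned's actual classifier code):

```
import os

def read_cmdline(pid):
    # os.open/os.read/os.close map directly to openat/read/close,
    # avoiding the per-file fstat/ioctl/lseek overhead that a
    # buffered text-mode open() adds when reading /proc/$pid/cmdline
    fd = os.open("/proc/%d/cmdline" % pid, os.O_RDONLY)
    try:
        return os.read(fd, 8192)
    finally:
        os.close(fd)

if __name__ == "__main__":
    # arguments in cmdline are NUL-separated
    print(read_cmdline(os.getpid()).replace(b"\0", b" ").decode())
```

Running this under `perf trace` or strace shows three syscalls per /proc read instead of nine, which matters at the event rates described above.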