Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets there. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against the components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED".

If you cannot log in to RH Jira, please consult article #7032570. Failing that, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The e-mail creates a ServiceNow ticket with Red Hat.

Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and have "MigratedToJIRA" set in "Keywords". The link to the successor Jira issue will be found under "Links", will have a little "two-footprint" icon next to it, and will direct you to the "RHEL project" in Red Hat Jira (issue links are of the form "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). The same link will be available in a blue banner at the top of the page informing you that the bug has been migrated.

Bug 2140203

Summary: tuned throughput-performance's scheduler plugin usage yields high CPU usage
Product: Red Hat Enterprise Linux 8
Reporter: Georg Sauthoff <georg.sauthoff>
Component: tuned
Assignee: Jaroslav Škarvada <jskarvad>
Status: CLOSED MIGRATED
QA Contact: Robin Hack <rhack>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 8.6
CC: dsaha, duge, gnaik, jeder, jskarvad, mmatsuya
Target Milestone: rc
Keywords: MigratedToJIRA
Target Release: ---
Flags: pm-rhel: mirror+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-09-21 22:07:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Georg Sauthoff 2022-11-04 19:28:42 UTC
Description of problem:
With RHEL 8.6, the tuned throughput-performance profile uses the scheduler plugin for some settings for which it previously used the sysctl plugin (e.g. in RHEL 7.9).

Version-Release number of selected component (if applicable):
tuned-2.18.0-2.el8_6.1.noarch

How reproducible:
always


Steps to Reproduce:
1. make sure that the throughput-performance tuned profile is activated (otherwise: `tuned-adm profile throughput-performance`)
2. increase the fork-rate of the system (until the tuned process uses 30 % CPU or more)
3. `perf trace -s -p $(pgrep tuned) -- sleep 60`
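For step 2, one simple way to drive up the fork rate is a small Python script that forks short-lived children in a loop (a sketch; any fork-heavy workload will do):

```python
import os

def fork_burst(n):
    """Fork n short-lived children and reap each one,
    generating n process-creation events for TuneD to observe."""
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)        # child: exit immediately
        os.waitpid(pid, 0)     # parent: reap the child (no zombies)
    return n

if __name__ == "__main__":
    # Run this in a loop (or increase n) while watching tuned's CPU usage.
    print("forked", fork_burst(10000), "children")
```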

Actual results:
tuned CPU usage increases with the fork rate, easily up to 30 % and more
perf trace output shows high syscall rates for one tuned thread, i.e. for poll(), read(), openat(), lseek(), ioctl(), close() and fstat()

Expected results:
tuned CPU usage is very low (just a few percent) and is independent of the fork rate of the system.

Additional info:
This is caused by the way the scheduler plugin polls for process creation events, even when the plugin's configuration doesn't contain any process-matching declarations, as is the case with the throughput-performance profile. Each such event is then amplified by tuned invoking multiple syscalls on pseudo-files under /proc/$pid/.

Looking at a syscall trace in detail shows that a number of the syscalls used to read files under /proc/$pid/ are superfluous or even pointless (even if there were process-matching declarations in the config), e.g.:

```
196436 openat(AT_FDCWD, "/proc/3678736/cmdline", O_RDONLY|O_CLOEXEC) = 28</proc/3678736/cmdline>
196436 fstat(28</proc/3678736/cmdline>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
196436 ioctl(28</proc/3678736/cmdline>, TCGETS, 0x7f1113ffd410) = -1 ENOTTY (Inappropriate ioctl for device)
196436 lseek(28</proc/3678736/cmdline>, 0, SEEK_CUR) = 0
196436 ioctl(28</proc/3678736/cmdline>, TCGETS, 0x7f1113ffd3f0) = -1 ENOTTY (Inappropriate ioctl for device)
196436 lseek(28</proc/3678736/cmdline>, 0, SEEK_CUR) = 0
196436 read(28</proc/3678736/cmdline>, "/opt/xyz/bin/foobar\0foobar\0", 8192) = 23
196436 read(28</proc/3678736/cmdline>, "", 8192) = 0
196436 close(28</proc/3678736/cmdline>) = 0
```


A simple fix for the throughput-performance profile (which is activated, by default, on RHEL systems) is to convert the scheduler plugin settings back to sysctl ones, e.g. like this:

```
--- /usr/lib/tuned/throughput-performance/tuned.conf    2022-06-08 11:48:16.000000000 +0200
+++ new/throughput-performance/tuned.conf  2022-11-04 18:03:05.468461294 +0100
@@ -58,12 +58,11 @@
 # and move them to swap cache
 vm.swappiness=10
 
-[scheduler]
 # ktune sysctl settings for rhel6 servers, maximizing i/o throughput
 #
 # Minimal preemption granularity for CPU-bound tasks:
 # (default: 1 msec#  (1 + ilog(ncpus)), units: nanoseconds)
-sched_min_granularity_ns = 10000000
+kernel.sched_min_granularity_ns = 10000000
 
 # SCHED_OTHER wake-up granularity.
 # (default: 1 msec#  (1 + ilog(ncpus)), units: nanoseconds)
@@ -71,7 +70,7 @@
 # This option delays the preemption effects of decoupled workloads
 # and reduces their over-scheduling. Synchronous workloads will still
 # have immediate wakeup/sleep latencies.
-sched_wakeup_granularity_ns = 15000000
+kernel.sched_wakeup_granularity_ns = 15000000
 
 # Marvell ThunderX
 [sysctl.thunderx]
@@ -81,8 +80,8 @@
 kernel.numa_balancing=0
 
 # AMD
-[scheduler.amd]
-type=scheduler
+[sysctl.amd]
+type=sysctl
 uname_regex=x86_64
 cpuinfo_regex=${amd_cpuinfo_regex}
-sched_migration_cost_ns=5000000
+kernel.sched_migration_cost_ns=5000000
```

Comment 3 Jaroslav Škarvada 2023-03-23 14:11:33 UTC
The CPU usage of the scheduler plugin depends on the number of forked processes, because by default the scheduler plugin (if enabled and configured) classifies and tunes newly created/forked processes. While we are working on optimizing the process classifier, the underlying code depends on the 'perf' (performance counters) subsystem and its Python bindings, so it has performance limits. If customers don't need TuneD to tune newly created processes (mostly process priority and scheduling policy), they can run with this feature disabled, which will improve performance under heavy forking. To configure it this way, add 'runtime=0' to the '[scheduler]' section in the TuneD profile, i.e.:
[scheduler]
runtime=0

It's also possible to create a custom TuneD overlay profile with this setting that will change the behavior of the stock TuneD profiles we ship.
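Such an overlay profile can be sketched like this, assuming the standard /etc/tuned layout (the profile name "throughput-performance-noclassify" is illustrative):

```
# /etc/tuned/throughput-performance-noclassify/tuned.conf
[main]
include=throughput-performance

[scheduler]
# keep the profile's scheduler settings, but disable runtime
# process classification (the source of the per-fork overhead)
runtime=0
```

It would then be activated with `tuned-adm profile throughput-performance-noclassify`.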

Comment 5 Georg Sauthoff 2023-05-24 20:07:39 UTC
Well, why don't you add runtime=0 to the throughput-performance profile then?

Or even better: why don't you simply set the two sched_min_granularity_ns/sched_wakeup_granularity_ns settings via the sysctl plugin?

I mean, as-is, the scheduler plugin seems to be designed primarily for changing scheduling parameters of processes as they are created.
Which is great if you need it.

But in the throughput-performance profile you don't use it that way, at all.
You use it for **system-wide** settings - which are sysctls, for which the sysctl plugin already exists - and that plugin worked fine in RHEL 7 and still works fine in RHEL 8.


AFAICS, there is really no advantage to using the scheduler plugin in the throughput-performance profile, at all!
There is just the big disadvantage of wasting significant CPU on a system that forks a lot!

So what's the point of introducing a regression like this into the throughput-performance profile (which is activated by default)?


Sure, I could write and maintain my own throughput-performance-fixed profile which derives from throughput-performance via `include=throughput-performance` and sets `runtime=0` to work around your regression.

But this would be pointless work!


Just fix it at the source.

Comment 7 RHEL Program Management 2023-09-21 22:07:04 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 8 RHEL Program Management 2023-09-21 22:07:34 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it and will begin with "RHEL-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.