Bug 1877438 - many kidle_inject processes slowing things down
Summary: many kidle_inject processes slowing things down
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: thermald
Version: 33
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Benjamin Berg
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-09 15:51 UTC by Chris Murphy
Modified: 2021-11-30 17:58 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-30 16:24:20 UTC
Type: Bug
Embargoed:


Attachments
screenshot of top showing kidle_inject processes (265.31 KB, image/png)
2020-09-09 15:51 UTC, Chris Murphy
journal (2.21 MB, text/plain)
2020-09-09 16:01 UTC, Chris Murphy
journal2 (18.62 MB, text/plain)
2020-09-13 15:43 UTC, Chris Murphy


Links
Github intel thermal_daemon issues 271 (status: open, last updated 2020-10-19 15:26:05 UTC): Possible unstable thermal regulation on old MacBook (2011)

Description Chris Murphy 2020-09-09 15:51:57 UTC
Created attachment 1714306
screenshot of top showing kidle_inject processes

Description of problem:

While compiling, I see multiple kidle_inject kernel processes significantly slowing down the system. This may be intended but it also seems too aggressive because the system really does slow to a crawl. It does recover after a few minutes.


Version-Release number of selected component (if applicable):
thermald-2.3-2.fc33.x86_64
kernel-5.9.0-0.rc4.5.fc34.x86_64+debug


How reproducible:
Always, as described


Steps to Reproduce:
1. Workstation 33 installation
2. Install kernel 5.9.0-0.rc4.5.fc34.x86_64+debug
3. Compile chromium

Actual results:

Many kidle_inject processes competing for CPU with regular processes, slowing things down.


Expected results:

I'm not sure. Maybe not clamp down on CPU this aggressively? Here the laptop is plugged into wall power and is not on my lap, because it gets too hot to keep there.


Additional info:

I don't know if this happens with 5.8.x kernels.

Comment 1 Chris Murphy 2020-09-09 16:01:48 UTC
Created attachment 1714307
journal

--loglevel=debug is enabled on thermald

Comment 2 Chris Murphy 2020-09-09 16:05:39 UTC
By the way, the kidle_inject threads aren't always present; they come and go. I'm not certain whether they coincide with any of the stop and start idle injection kernel messages, because they appear at a low CPU percentage, gradually creep up to maybe 50%, then gradually drop and go away again.

Comment 3 Benjamin Berg 2020-09-09 16:14:06 UTC
Yeah, they will only be started when other ways to cool the system fail. And, looking at the log it does look like they get started when your CPU package reaches 94°C, so it really sounds like your machine is running way too hot at that point already.

Seriously, this is the order of cooling devices:
[ 1192.187027] thermald[2187]: - rapl_controller
[ 1192.187186] thermald[2187]: - intel_pstate
[ 1192.187341] thermald[2187]: - intel_powerclamp
[ 1192.187518] thermald[2187]: - cpufreq
[ 1192.187658] thermald[2187]: - Processor

intel_powerclamp is what is doing the idle injection. Before that, we have already limited the wattage of the CPU using RAPL (from 45W down to 22W) and turned off turbo boost.

What I am curious about is whether we might not be using active cooling devices sufficiently (i.e. could the fan spin up more).
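
For anyone wanting to see which cooling devices the kernel exposes and how far each one is engaged, here is a minimal Python sketch. It assumes the standard thermal sysfs layout under /sys/class/thermal (cooling_device*/type, cur_state, max_state); the function name list_cooling_devices is made up for illustration and is not part of thermald. As far as I understand the kernel documentation, for intel_powerclamp cur_state is the idle injection percentage and reads -1 while injection is inactive, but treat that as an assumption to check on your kernel.

```python
from pathlib import Path


def _read_attr(path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def list_cooling_devices(root="/sys/class/thermal"):
    """Return (name, type, cur_state, max_state) tuples for each cooling device.

    The root parameter exists only so the function can be pointed at a
    test directory; values are returned as the raw strings found in sysfs.
    """
    base = Path(root)
    if not base.is_dir():
        return []
    devices = []
    for dev in sorted(base.glob("cooling_device*")):
        devices.append((
            dev.name,
            _read_attr(dev / "type"),
            _read_attr(dev / "cur_state"),
            _read_attr(dev / "max_state"),
        ))
    return devices


if __name__ == "__main__":
    for name, kind, cur, top in list_cooling_devices():
        print(f"{name}: {kind} state {cur}/{top}")
```

Running this in a loop during a compile should show whether the fan-type devices reach their max_state before intel_powerclamp becomes active.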

Comment 4 Chris Murphy 2020-09-13 15:43:34 UTC
Created attachment 1714688
journal2

Journal time [  646.668403] to [18272.538535]

[17672.783059] is earlyoom kill, watch the cooldown happen (idle cpu should be >99%, busiest tasks are thermald+journald due to logging)


Temps at this time:

[17673.731822] thermald[774]: Sensor x86_pkg_temp :temp 93000
[17673.732069] thermald[774]: pref 0 type 3 temp 93000 trip 93000


about 1 minute later

[17731.820218] thermald[774]: Sensor x86_pkg_temp :temp 64000
[17731.820401] thermald[774]: pref 0 type 3 temp 64000 trip 93000


about 5 minutes later

[17972.136562] thermald[774]: Sensor x86_pkg_temp :temp 50000
[17972.136789] thermald[774]: pref 0 type 3 temp 50000 trip 93000


about 10 minutes later

[18272.537101] thermald[774]: Sensor x86_pkg_temp :temp 47000
[18272.537707] thermald[774]: pref 0 type 3 temp 47000 trip 93000


about 1 hour later
[21272.340628] thermald[774]: Sensor x86_pkg_temp :temp 43000
[21272.340931] thermald[774]: pref 0 type 3 temp 43000 trip 93000
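
To pull the same readings out of the full journal attachment, a small Python sketch like the following could be used. It assumes the debug lines keep the exact "Sensor NAME :temp VALUE" shape shown above and that VALUE is in millidegrees Celsius (the usual thermal sysfs convention); parse_sensor_lines is a hypothetical helper name, not something thermald ships.

```python
import re

# Matches lines such as:
# [17673.731822] thermald[774]: Sensor x86_pkg_temp :temp 93000
SENSOR_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] thermald\[\d+\]: Sensor (\S+) :temp (\d+)"
)


def parse_sensor_lines(lines):
    """Return (timestamp_seconds, sensor_name, temp_celsius) per sensor line.

    Lines that do not match (for example the "pref 0 type 3 ..." lines)
    are skipped. Temperatures are assumed to be millidegrees Celsius.
    """
    readings = []
    for line in lines:
        match = SENSOR_RE.search(line)
        if match:
            timestamp = float(match.group(1))
            sensor = match.group(2)
            temp_c = int(match.group(3)) / 1000.0
            readings.append((timestamp, sensor, temp_c))
    return readings


if __name__ == "__main__":
    sample = [
        "[17673.731822] thermald[774]: Sensor x86_pkg_temp :temp 93000",
        "[17673.732069] thermald[774]: pref 0 type 3 temp 93000 trip 93000",
        "[17731.820218] thermald[774]: Sensor x86_pkg_temp :temp 64000",
    ]
    for timestamp, sensor, temp_c in parse_sensor_lines(sample):
        print(f"{timestamp:.1f}s {sensor} {temp_c:.1f} C")
```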

Comment 5 Chris Murphy 2020-09-13 15:44:58 UTC
The test computer was fairly clean at the start for the original journal attachment; journal2 is following cleaning.

Comment 6 Benjamin Berg 2020-09-14 08:34:07 UTC
Well, journal2 looks better to me, so I think that the cooling capability has likely improved :)

Your machine says that the CPU package should always be allowed to draw around 22W. What seems to happen in principle is that the fans could not remove that much power. And rather than lowering the limit further using RAPL, thermald resorts to using intel_powerclamp (i.e. idle injection).

Another thing that I noticed is that thermald does not use the current power measurement in the regulation on your machine (which happens on some platforms). No idea if that would help though; I can imagine scenarios exist where RAPL throttling is too low initially and then thermald overreacts. But … that cannot explain the longer term idle injection that you were seeing.



Maybe we could collect more information about what happens when thermald is not running. Not sure of a good way to do that though, maybe just a dumb:

while sleep 4; do grep . /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj /sys/class/thermal/thermal_zone0/temp /sys/class/thermal/thermal_zone1/temp /sys/class/hwmon/hwmon3/temp1_input /sys/class/hwmon/hwmon3/temp2_input /sys/class/hwmon/hwmon3/temp3_input | logger; done

That would give us some temperature/power usage information. But, unfortunately I doubt it is worth spending a lot of energy on this :-/
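
The energy_uj value logged by that loop is a cumulative counter in microjoules, so power has to be derived from the difference between two samples. A minimal Python sketch of that arithmetic follows; average_power_watts is a hypothetical helper, and the wraparound handling assumes the counter wraps at the value reported in max_energy_range_uj next to energy_uj, which is only approximately right at the wrap boundary.

```python
def average_power_watts(energy_start_uj, energy_end_uj, seconds, max_range_uj=None):
    """Average power in watts between two energy_uj samples.

    energy_start_uj / energy_end_uj: cumulative energy in microjoules.
    seconds: elapsed time between the two samples.
    max_range_uj: counter range (assumed to come from max_energy_range_uj),
        needed only if the counter wrapped between the samples.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta_uj = energy_end_uj - energy_start_uj
    if delta_uj < 0:
        # The counter wrapped around between the two samples.
        if max_range_uj is None:
            raise ValueError("counter wrapped; max_range_uj is required")
        delta_uj += max_range_uj
    # Microjoules -> joules, then joules per second = watts.
    return delta_uj / 1e6 / seconds


if __name__ == "__main__":
    # Two made-up samples 4 seconds apart (the loop above sleeps 4 s).
    print(average_power_watts(1_000_000, 89_000_000, 4.0))
```

With samples taken 4 seconds apart, a difference of 88,000,000 µJ works out to 22 W, which is the package limit mentioned above.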

Comment 7 Ben Cotton 2021-11-04 17:32:21 UTC
This message is a reminder that Fedora 33 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 33 on 2021-11-30.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '33'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 33 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 8 Ben Cotton 2021-11-30 16:24:20 UTC
Fedora 33 changed to end-of-life (EOL) status on 2021-11-30. Fedora 33 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 9 Michael Catanzaro 2021-11-30 17:58:30 UTC
Chris, is this still relevant?

