Bug 1877438 - many kidle_inject processes slowing things down
Summary: many kidle_inject processes slowing things down
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: thermald
Version: 33
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Benjamin Berg
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-09 15:51 UTC by Chris Murphy
Modified: 2020-09-14 08:34 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug


Attachments
screenshot of top showing kidle_inject processes (265.31 KB, image/png)
2020-09-09 15:51 UTC, Chris Murphy
journal (2.21 MB, text/plain)
2020-09-09 16:01 UTC, Chris Murphy
journal2 (18.62 MB, text/plain)
2020-09-13 15:43 UTC, Chris Murphy


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | intel thermal_daemon issues 271 | 0 | None | open | Possible unstable thermal regulation on old MacBook (2011) | 2020-10-19 15:26:05 UTC

Description Chris Murphy 2020-09-09 15:51:57 UTC
Created attachment 1714306 [details]
screenshot of top showing kidle_inject processes

Description of problem:

While compiling, I see multiple kidle_inject kernel processes significantly slowing down the system. This may be intended but it also seems too aggressive because the system really does slow to a crawl. It does recover after a few minutes.


Version-Release number of selected component (if applicable):
thermald-2.3-2.fc33.x86_64
kernel-5.9.0-0.rc4.5.fc34.x86_64+debug


How reproducible:
Always, as described


Steps to Reproduce:
1. Workstation 33 installation
2. Install kernel 5.9.0-0.rc4.5.fc34.x86_64+debug
3. Compile chromium

Actual results:

Many kidle_inject processes competing for CPU with regular processes, slowing things down.


Expected results:

I'm not sure. Maybe don't clamp down on the CPU this aggressively? The laptop is plugged into wall power and is not on my lap, since it gets too hot for that.


Additional info:

I don't know if this happens with 5.8.x kernels.

Comment 1 Chris Murphy 2020-09-09 16:01:48 UTC
Created attachment 1714307 [details]
journal

--loglevel=debug is enabled on thermald

Comment 2 Chris Murphy 2020-09-09 16:05:39 UTC
By the way, the kidle_inject threads aren't always present; they come and go. I'm not certain whether they coincide with any of the kernel's start and stop idle injection messages, because they appear at a low CPU percentage, gradually creep up to maybe 50%, then gradually drop and go away again.

Comment 3 Benjamin Berg 2020-09-09 16:14:06 UTC
Yeah, they will only be started when other ways to cool the system fail. Looking at the log, they get started when your CPU package reaches 94°C, so it really sounds like your machine is already running way too hot at that point.

Seriously, this is the order of cooling devices:
[ 1192.187027] thermald[2187]: - rapl_controller
[ 1192.187186] thermald[2187]: - intel_pstate
[ 1192.187341] thermald[2187]: - intel_powerclamp
[ 1192.187518] thermald[2187]: - cpufreq
[ 1192.187658] thermald[2187]: - Processor

intel_powerclamp is what does the idle injection. Before that, we have already limited the CPU's power draw using RAPL (from 45W down to 22W) and turned off turbo boost.

What I am curious about is whether we might not be using active cooling devices sufficiently (i.e. could the fan spin up more).
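
To see what the kernel itself exposes as cooling devices, and how far each one is currently engaged, the thermal sysfs class can be read directly. This is only a minimal sketch: the function name list_cooling_devices is made up for illustration, and the type strings in sysfs are not guaranteed to match the names thermald prints in its own list above (rapl_controller, for example, is thermald's own name).

```shell
#!/bin/sh
# Print "<type> <cur_state>/<max_state>" for every thermal cooling device.
# The base directory is a parameter (hypothetical convenience) so the
# function can also be run against a saved copy of the sysfs tree.
list_cooling_devices() {
    base="${1:-/sys/class/thermal}"
    for dev in "$base"/cooling_device*; do
        # Skip the unexpanded glob when no cooling devices exist.
        [ -r "$dev/type" ] || continue
        printf '%s %s/%s\n' \
            "$(cat "$dev/type")" \
            "$(cat "$dev/cur_state")" \
            "$(cat "$dev/max_state")"
    done
}
```

Calling list_cooling_devices with no argument while the compile is running should show whether the intel_powerclamp entry and any fan entries are at or near their maximum state.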

Comment 4 Chris Murphy 2020-09-13 15:43:34 UTC
Created attachment 1714688 [details]
journal2

Journal time [  646.668403] to [18272.538535]

[17672.783059] is the earlyoom kill; watch the cooldown happen from there (CPU idle should be >99%; the busiest tasks are thermald and journald due to logging)


Temps at this time:

[17673.731822] thermald[774]: Sensor x86_pkg_temp :temp 93000
[17673.732069] thermald[774]: pref 0 type 3 temp 93000 trip 93000


about 1 minute later

[17731.820218] thermald[774]: Sensor x86_pkg_temp :temp 64000
[17731.820401] thermald[774]: pref 0 type 3 temp 64000 trip 93000


about 5 minutes later

[17972.136562] thermald[774]: Sensor x86_pkg_temp :temp 50000
[17972.136789] thermald[774]: pref 0 type 3 temp 50000 trip 93000


about 10 minutes later

[18272.537101] thermald[774]: Sensor x86_pkg_temp :temp 47000
[18272.537707] thermald[774]: pref 0 type 3 temp 47000 trip 93000


about 1 hour later
[21272.340628] thermald[774]: Sensor x86_pkg_temp :temp 43000
[21272.340931] thermald[774]: pref 0 type 3 temp 43000 trip 93000
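
The temperatures in these thermald lines are in millidegrees Celsius (93000 is 93°C). As a rough sketch, the package temperature series can be pulled out of a journal attachment like this; the function name pkg_temps is hypothetical and the pattern assumes the exact "Sensor x86_pkg_temp :temp" line format shown above.

```shell
#!/bin/sh
# Read journal text on stdin and print "<timestamp> <degrees C>" for each
# x86_pkg_temp sensor line logged by thermald in debug mode.
pkg_temps() {
    awk '/Sensor x86_pkg_temp :temp/ {
        ts = $0
        sub(/^\[ */, "", ts)   # drop the opening bracket and padding
        sub(/\].*$/, "", ts)   # drop everything from the closing bracket on
        printf "%s %.1f\n", ts, $NF / 1000
    }'
}
```

For example, pkg_temps < journal2 would give one line per sample, which is easier to plot or skim than the raw log.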

Comment 5 Chris Murphy 2020-09-13 15:44:58 UTC
The test computer was already fairly clean when the original journal attachment was captured; journal2 was captured after cleaning it.

Comment 6 Benjamin Berg 2020-09-14 08:34:07 UTC
Well, journal2 looks better to me, so I think that the cooling capability has likely improved :)

Your machine says that the CPU package should always be allowed to draw around 22W. What seems to happen is that the fans cannot dissipate that much power. And rather than lowering the limit further using RAPL, thermald resorts to using intel_powerclamp (i.e. idle injection).

Another thing I noticed is that thermald does not use the current power measurement for regulation on your machine (this happens on some platforms). No idea whether that would help, though. I can imagine scenarios where RAPL throttling is too weak initially and thermald then overreacts. But … that cannot explain the longer-term idle injection that you were seeing.



Maybe we could collect more information about what happens when thermald is not running. Not sure of a good way to do that though, maybe just a dumb:

while sleep 4; do
    grep . \
        /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj \
        /sys/class/thermal/thermal_zone0/temp \
        /sys/class/thermal/thermal_zone1/temp \
        /sys/class/hwmon/hwmon3/temp1_input \
        /sys/class/hwmon/hwmon3/temp2_input \
        /sys/class/hwmon/hwmon3/temp3_input |
        logger
done

That would give us some temperature/power usage information. But, unfortunately I doubt it is worth spending a lot of energy on this :-/
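
Note that energy_uj is a cumulative counter in microjoules, so the logged values only become a power figure once consecutive readings are subtracted and divided by the interval. A minimal sketch of that arithmetic follows; the function names avg_watts and log_power are made up for illustration, the wrap handling assumes the counter wraps at the value in max_energy_range_uj, and the sleep interval is used as an approximation of the real elapsed time. Reading energy_uj may require root on newer kernels.

```shell
#!/bin/sh
# Average power in watts between two energy_uj readings.
# Arguments: previous reading, current reading, seconds between them,
# counter range (max_energy_range_uj). A negative difference is treated
# as a single counter wrap.
avg_watts() {
    awk -v prev="$1" -v cur="$2" -v secs="$3" -v range="$4" 'BEGIN {
        delta = cur - prev
        if (delta < 0) delta += range
        printf "%.2f\n", delta / secs / 1000000
    }'
}

# Log the package power every few seconds until interrupted.
# Defaults: first RAPL domain, 4 second interval (as in the loop above).
log_power() {
    rapl="${1:-/sys/class/powercap/intel-rapl/intel-rapl:0}"
    interval="${2:-4}"
    range=$(cat "$rapl/max_energy_range_uj")
    prev=$(cat "$rapl/energy_uj")
    while sleep "$interval"; do
        cur=$(cat "$rapl/energy_uj")
        echo "pkg_power_w=$(avg_watts "$prev" "$cur" "$interval" "$range")" | logger
        prev=$cur
    done
}
```

As a worked example, two readings 88000000 microjoules apart over 4 seconds correspond to 22 W, which is the package limit mentioned above.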

