Created attachment 1714306 [details]
screenshot of top showing kidle_inject processes
Description of problem:
While compiling, I see multiple kidle_inject kernel processes significantly slowing down the system. This may be intended but it also seems too aggressive because the system really does slow to a crawl. It does recover after a few minutes.
Version-Release number of selected component (if applicable):
How reproducible:
Always, as described
Steps to Reproduce:
1. Workstation 33 installation
2. Install kernel 5.9.0-0.rc4.5.fc34.x86_64+debug
3. Compile chromium
Actual results:
Many kidle_inject processes compete for CPU with regular processes, slowing things down.
Expected results:
I'm not sure. Maybe don't clamp down on the CPU this aggressively? In this case the laptop is plugged into wall power and is not on my lap, because it gets too hot for that.
Additional info:
I don't know whether this happens with 5.8.x kernels.
Created attachment 1714307 [details]
--loglevel=debug is enabled on thermald
By the way, the kidle_inject threads aren't always present. They come and go. I'm not certain whether they coincide with any of the start and stop idle injection kernel messages, because they show up at a low CPU percentage, gradually creep up to maybe 50%, then gradually drop and go away again.
Yeah, they will only be started when other ways to cool the system fail. And, looking at the log it does look like they get started when your CPU package reaches 94°C, so it really sounds like your machine is running way too hot at that point already.
Seriously, this is the order of cooling devices:
[ 1192.187027] thermald: - rapl_controller
[ 1192.187186] thermald: - intel_pstate
[ 1192.187341] thermald: - intel_powerclamp
[ 1192.187518] thermald: - cpufreq
[ 1192.187658] thermald: - Processor
intel_powerclamp is what is doing the idle injection. Before that, we have already limited the wattage of the CPU using RAPL (from 45W down to 22W) and turned off turbo boost.
What I am curious about is whether we might not be using active cooling devices sufficiently (i.e. could the fan spin up more).
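To see which of these cooling devices is actually engaged at a given moment, the thermal sysfs tree can be read directly. This is only a rough sketch, assuming the usual type, cur_state and max_state files under /sys/class/thermal/cooling_device*; the function name list_cooling_devices and its optional base-directory argument are made up for illustration and are not part of thermald.

```shell
# Print "<type> <cur_state>/<max_state>" for every thermal cooling device.
# The base directory is an argument only so the function can be pointed at a
# copy of the tree; without an argument it reads the live sysfs path.
list_cooling_devices() {
    base="${1:-/sys/class/thermal}"
    for dev in "$base"/cooling_device*; do
        # Skip the unexpanded glob (no devices) and unreadable entries.
        [ -r "$dev/type" ] || continue
        printf '%s %s/%s\n' \
            "$(cat "$dev/type")" \
            "$(cat "$dev/cur_state")" \
            "$(cat "$dev/max_state")"
    done
}
```

Running list_cooling_devices in a loop during the compile should show whether the fan-type devices have reached their maximum state before intel_powerclamp leaves its idle state.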
Created attachment 1714688 [details]
Journal time [ 646.668403] to [18272.538535]
[17672.783059] is the earlyoom kill; watch the cooldown happen after it (idle CPU should be >99%; the busiest tasks are thermald and journald due to logging)
Temps at this time:
[17673.731822] thermald: Sensor x86_pkg_temp :temp 93000
[17673.732069] thermald: pref 0 type 3 temp 93000 trip 93000
about 1 minute later
[17731.820218] thermald: Sensor x86_pkg_temp :temp 64000
[17731.820401] thermald: pref 0 type 3 temp 64000 trip 93000
about 5 minutes later
[17972.136562] thermald: Sensor x86_pkg_temp :temp 50000
[17972.136789] thermald: pref 0 type 3 temp 50000 trip 93000
about 10 minutes later
[18272.537101] thermald: Sensor x86_pkg_temp :temp 47000
[18272.537707] thermald: pref 0 type 3 temp 47000 trip 93000
about 1 hour later
[21272.340628] thermald: Sensor x86_pkg_temp :temp 43000
[21272.340931] thermald: pref 0 type 3 temp 43000 trip 93000
The test computer was fairly clean at the start for the original journal attachment; journal2 was recorded after cleaning it.
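For anyone who wants the whole cooldown curve rather than the handful of samples above, the package temperature can be pulled out of the journal with a small filter. A minimal sketch, assuming the debug lines look exactly like the "Sensor x86_pkg_temp :temp" lines quoted here and that the value is in millidegrees Celsius; pkg_temps is a hypothetical helper name.

```shell
# Read thermald debug lines on stdin and print "<timestamp> <deg C>" for each
# x86_pkg_temp sample. The logged value is in millidegrees, hence the / 1000.
pkg_temps() {
    awk '/Sensor x86_pkg_temp :temp/ {
        ts = $0
        sub(/^[^[]*\[ */, "", ts)   # drop everything up to the opening bracket
        sub(/\].*/, "", ts)         # drop the closing bracket and the rest
        printf "%s %.1f\n", ts, $NF / 1000
    }'
}
```

Feeding the attached journal through pkg_temps gives two columns that can be plotted directly.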
Well, journal2 looks better to me, so I think that the cooling capability has likely improved :)
Your machine says that the CPU package should always be allowed to draw around 22W. What seems to happen in principle is that the fans could not remove that much power. And rather than lowering the limit further using RAPL, thermald resorts to using intel_powerclamp (i.e. idle injection).
Another thing that I noticed is that thermald does not use the current power measurement in the regulation on your machine (which happens on some platforms). No idea if that would help though; I can imagine scenarios exist where RAPL throttling is too low initially and then thermald overreacts. But … that cannot explain the longer term idle injection that you were seeing.
Maybe we could collect more information about what happens when thermald is not running. Not sure of a good way to do that though, maybe just a dumb:
while sleep 4; do grep . /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj /sys/class/thermal/thermal_zone0/temp /sys/class/thermal/thermal_zone1/temp /sys/class/hwmon/hwmon3/temp1_input /sys/class/hwmon/hwmon3/temp2_input /sys/class/hwmon/hwmon3/temp3_input | logger; done
That would give us some temperature/power usage information. But, unfortunately I doubt it is worth spending a lot of energy on this :-/
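Note that energy_uj is a cumulative counter in microjoules, so the logged values need to be differenced to get power. A rough post-processing sketch, assuming the lines end in "energy_uj:<value>" as the grep . output above would produce; rapl_watts is a made-up helper name, and the default of 4 seconds is just the nominal sleep time (the real spacing is slightly longer, so the result slightly overestimates).

```shell
# Turn successive energy_uj samples into average package power in watts:
# power = (energy delta in uJ) / 1e6 / (interval in seconds).
# The interval is the first argument and defaults to 4.
rapl_watts() {
    awk -v dt="${1:-4}" -F: '/energy_uj:/ {
        e = $NF
        if (seen && e >= prev)      # skip the sample right after a counter wrap
            printf "%.2f\n", (e - prev) / 1e6 / dt
        prev = e; seen = 1
    }'
}
```

For example, journalctl -t with the tag used by logger, piped into rapl_watts, should show whether the package really sits near the 22W limit while the temperature keeps climbing.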