Created attachment 1714306
screenshot of top showing kidle_inject processes

Description of problem:
While compiling, I see multiple kidle_inject kernel processes significantly slowing down the system. This may be intended, but it also seems too aggressive because the system really does slow to a crawl. It does recover after a few minutes.

Version-Release number of selected component (if applicable):
thermald-2.3-2.fc33.x86_64
kernel-5.9.0-0.rc4.5.fc34.x86_64+debug

How reproducible:
Always, as described

Steps to Reproduce:
1. Workstation 33 installation
2. Install kernel 5.9.0-0.rc4.5.fc34.x86_64+debug
3. Compile chromium

Actual results:
Many kidle_inject processes competing for CPU with regular processes, slowing things down.

Expected results:
I'm not sure. Maybe not clamp down on CPU this aggressively? In this case the laptop is plugged into wall power and is not on my lap, because it is too hot for that.

Additional info:
I don't know if this happens with 5.8.x kernels.
Created attachment 1714307
journal

--loglevel=debug is enabled on thermald
By the way, the kidle_inject threads aren't always present. They come and go. I'm not certain if they coincide with any of the stop and start idle injection kernel messages because they arrive on scene with a low percentage, and gradually creep up to maybe 50% and then gradually get lower and then go away again.
Yeah, they are only started when other ways to cool the system fail. And, looking at the log, they do seem to get started when your CPU package reaches 94°C, so it really sounds like your machine is already running way too hot at that point.

Seriously, this is the order of cooling devices:

[ 1192.187027] thermald[2187]: - rapl_controller
[ 1192.187186] thermald[2187]: - intel_pstate
[ 1192.187341] thermald[2187]: - intel_powerclamp
[ 1192.187518] thermald[2187]: - cpufreq
[ 1192.187658] thermald[2187]: - Processor

intel_powerclamp is what does the idle injection. Before that, we have already limited the wattage of the CPU using RAPL (from 45W down to 22W) and turned off turbo boost.

What I am curious about is whether we might not be using active cooling devices sufficiently (i.e. could the fan spin up more).
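To see which cooling devices the kernel itself exposes on a given machine, and how far each one is currently engaged, something like the following could be run. It is only a sketch: the function name is made up for illustration, and it reads the generic thermal sysfs class, so thermald's own controller names (rapl_controller, intel_pstate) may not show up under those names.

```shell
# list_cooling_devices [BASE_DIR]
# Hypothetical helper: prints "<device> <type> <cur_state>/<max_state>"
# for every cooling device found under BASE_DIR.
# BASE_DIR defaults to /sys/class/thermal and is only a parameter so the
# function can be pointed at a copy of the tree.
list_cooling_devices() {
    base=${1:-/sys/class/thermal}
    for dev in "$base"/cooling_device*; do
        # Skip the unexpanded glob and any device without a readable type.
        [ -r "$dev/type" ] || continue
        printf '%s %s %s/%s\n' \
            "${dev##*/}" \
            "$(cat "$dev/type")" \
            "$(cat "$dev/cur_state" 2>/dev/null)" \
            "$(cat "$dev/max_state" 2>/dev/null)"
    done
}

# Usage: list_cooling_devices
```

Running it once while the machine is idle and once during the compile would show whether the fan-type devices ever get close to their maximum state.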
Created attachment 1714688
journal2

Journal time [ 646.668403] to [18272.538535]

[17672.783059] is earlyoom kill, watch the cooldown happen (idle cpu should be >99%, busiest tasks are thermald+journald due to logging)

Temps at this time:
[17673.731822] thermald[774]: Sensor x86_pkg_temp :temp 93000
[17673.732069] thermald[774]: pref 0 type 3 temp 93000 trip 93000

about 1 minute later
[17731.820218] thermald[774]: Sensor x86_pkg_temp :temp 64000
[17731.820401] thermald[774]: pref 0 type 3 temp 64000 trip 93000

about 5 minutes later
[17972.136562] thermald[774]: Sensor x86_pkg_temp :temp 50000
[17972.136789] thermald[774]: pref 0 type 3 temp 50000 trip 93000

about 10 minutes later
[18272.537101] thermald[774]: Sensor x86_pkg_temp :temp 47000
[18272.537707] thermald[774]: pref 0 type 3 temp 47000 trip 93000

about 1 hour later
[21272.340628] thermald[774]: Sensor x86_pkg_temp :temp 43000
[21272.340931] thermald[774]: pref 0 type 3 temp 43000 trip 93000
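The x86_pkg_temp values in these log lines are in millidegrees Celsius, so 93000 is 93°C. A small filter along these lines could pull the package temperature out of a saved journal for a quick look at the cooldown curve. The function name is hypothetical, and the match pattern assumes the debug log format shown in this comment.

```shell
# pkg_temps_celsius
# Hypothetical helper: reads thermald debug log lines on stdin and prints
# the x86_pkg_temp reading of each matching line in whole degrees Celsius.
# Assumes the "Sensor x86_pkg_temp :temp <millidegrees>" format seen above.
pkg_temps_celsius() {
    awk '/Sensor x86_pkg_temp :temp/ { print int($NF / 1000) }'
}

# Usage: journalctl -b -u thermald | pkg_temps_celsius
```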
The test computer was fairly clean to begin with when the original journal attachment was captured; journal2 was captured after cleaning it.
Well, journal2 looks better to me, so I think that the cooling capability has likely improved :)

Your machine says that the CPU package should always be allowed to draw around 22W. What seems to happen in principle is that the fans cannot remove that much power. And rather than lowering the limit further using RAPL, thermald resorts to using intel_powerclamp (i.e. idle injection).

Another thing that I noticed is that thermald does not use the current power measurement in the regulation on your machine (which happens on some platforms). No idea if that would help though; I can imagine scenarios exist where RAPL throttling is too low initially and then thermald overreacts. But … that cannot explain the longer-term idle injection that you were seeing.

Maybe we could collect more information about what happens when thermald is not running. Not sure of a good way to do that though, maybe just a dumb:

    while sleep 4; do
        grep . /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj \
            /sys/class/thermal/thermal_zone0/temp \
            /sys/class/thermal/thermal_zone1/temp \
            /sys/class/hwmon/hwmon3/temp1_input \
            /sys/class/hwmon/hwmon3/temp2_input \
            /sys/class/hwmon/hwmon3/temp3_input \
            | logger
    done

That would give us some temperature/power usage information. But, unfortunately, I doubt it is worth spending a lot of energy on this :-/
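Since energy_uj is a running energy counter in microjoules rather than a power reading, two consecutive samples have to be subtracted and divided by the interval to get the package power. A minimal sketch of that arithmetic follows. Both function names and the RAPL_ENERGY_FILE variable are made up for illustration, counter wrap-around is ignored, and reading energy_uj may require root on newer kernels.

```shell
# rapl_milliwatts E1_UJ E2_UJ INTERVAL_S
# Hypothetical helper: average power in milliwatts between two energy_uj
# readings taken INTERVAL_S seconds apart. Integer arithmetic only;
# counter wrap-around is NOT handled.
rapl_milliwatts() {
    echo $(( ($2 - $1) / ($3 * 1000) ))
}

# sample_rapl [INTERVAL_S]
# Hypothetical helper: takes two live samples and prints the average power.
# RAPL_ENERGY_FILE is an assumed override; the default is the package
# domain path used in the loop above.
sample_rapl() {
    file=${RAPL_ENERGY_FILE:-/sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj}
    interval=${1:-4}
    e1=$(cat "$file") || return 1
    sleep "$interval"
    e2=$(cat "$file") || return 1
    rapl_milliwatts "$e1" "$e2" "$interval"
}

# Usage: sample_rapl 4
```

For example, a counter that advances by 88000000 µJ over 4 seconds corresponds to 22000 mW, i.e. the 22W limit mentioned above.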
This message is a reminder that Fedora 33 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 33 on 2021-11-30. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '33'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 33 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 33 changed to end-of-life (EOL) status on 2021-11-30. Fedora 33 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.
Chris, is this still relevant?