Created attachment 1813183 [details]
kernel logs

1. Please describe the problem:
I'm experiencing this on my Intel-based laptop (LG Gram). When the CPU is idle and cool, so that the CPU fan is off, and I start a CPU-demanding load (such as a compilation), the processor quickly overheats, reaching the critical temperature before the fan can reach maximum speed, and the kernel triggers a shutdown. It started to happen with the 5.13.x series. Briefly discussed here: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/YGESVMI6SIDMLRJKJEXJ7R3TEESS7BHU/
- thermald is not installed.
- intel_tcc_cooling is loaded, but removing it does not help.

2. What is the Version-Release number of the kernel:
It happens with the 5.13.x series. Tested with .4, .5 and .8.

3. Did it work previously in Fedora? If so, what kernel version did the issue *first* appear? Old kernels are available for download at https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
No issue with previous kernels. I'm currently running 5.12.7-300.fc34.x86_64 with no issues: the fan reaches maximum speed quickly enough to control the temperature.

4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below:
- Suspend the laptop and wait a few minutes until it cools down.
- Resume the session.
- Launch a compilation task when the sensors' output shows a temperature of ~40°C for the processor.

5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``:
Not tested yet.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
No.

7. Please attach the kernel logs. You can get the complete kernel log for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the issue occurred on a previous boot, use the journalctl ``-b`` flag.
Log attached.
The issue persists with kernel 5.14.0-0.rc5.20210813gitf8e6dfc64f61.46.fc36. The only difference is that I get an additional line in the logs compared to 5.13 (the second one below):
  ago 16 22:49:50 kernel: thermal thermal_zone0: acpitz: critical temperature reached, shutting down
  ago 16 22:49:50 kernel: reboot: HARDWARE PROTECTION shutdown (Temperature too high)
The laptop is basically unusable with any kernel >= 5.13.
I assume if you append thermal.off=1 to the grub command line, this goes away?
I don't know. I'm not willing to risk the computer. It shuts down *because* the CPU is actually reaching the critical temperature.
Force the fan to run at full speed instead of auto and see if it still shuts down. If it does, your laptop is built in a way that cannot handle the actual thermal load (not as uncommon as you might think). If it does not, we have an issue with kernel thermal management. Another thing worth trying is kernel-5.12.19: https://koji.fedoraproject.org/koji/buildinfo?buildID=1782372 There were no ACPI thermal updates in 5.13 at all, but a few fan updates did show up in the 5.14 merge window and got backported to stable with 5.13.2 and 5.12.17. Testing it would certainly help narrow down the patch that introduced this. The result with the current (bad) kernel and the fans forced to full is still an interesting data point though, as it could be that the new patches are correct and the hardware requires some specific finesse to keep it clocked lower.
Any idea how to force the fan at full speed? I see no way of controlling it, and pwmconfig says that "there are no pwm-capable sensor modules installed".
No issues with kernel 5.12.19.
That narrows it down a good bit. Can you give me the lsmod output on a 5.13 kernel please?
(In reply to Iñaki Ucar from comment #6)
> No issues with kernel 5.12.19.
Correction, I booted the wrong kernel: the issue *is* present with kernel 5.12.19.
And I found a better way to reproduce the issue:
- Run `stress --cpu 8`.
- When the temperature is stable, suspend & resume.
The temperature goes nuts with kernel >= 5.12.19 and the laptop shuts down.
Great, so it most likely came in with that patch set. What is the lsmod output?
Created attachment 1815118 [details] lsmod output for 5.12.7
Created attachment 1815119 [details] lsmod output for 5.12.19
Created attachment 1815120 [details] lsmod output for 5.13.8
lsmod output for three kernels attached. I see no differences between 5.12.7 and 5.12.19 (apart from the VirtualBox modules). So I suppose that changes in the following modules would be suspicious:
  acpi_pad
  acpi_thermal_rel
  ...
  coretemp
  ...
  int3400_thermal
  int3403_thermal
  int340x_thermal_zone
  intel_cstate
  intel_pch_thermal
  intel_pmc_bxt
  intel_powerclamp
  intel_rapl_common
  intel_rapl_msr
  intel_soc_dts_iosf
  intel_uncore
  ...
  pinctrl_cannonlake
  processor_thermal_device
  processor_thermal_mbox
  processor_thermal_rapl
  processor_thermal_rfim
  rapl
  ...
  x86_pkg_temp_thermal
Also, the output from sensors may be helpful:
  coretemp-isa-0000
  Adapter: ISA adapter
  Package id 0: +42.0°C (high = +100.0°C, crit = +100.0°C)
  Core 0: +42.0°C (high = +100.0°C, crit = +100.0°C)
  Core 1: +41.0°C (high = +100.0°C, crit = +100.0°C)
  Core 2: +41.0°C (high = +100.0°C, crit = +100.0°C)
  Core 3: +42.0°C (high = +100.0°C, crit = +100.0°C)
  CMB0-acpi-0
  Adapter: ACPI interface
  in0: 7.79 V
  iwlwifi_1-virtual-0
  Adapter: Virtual device
  temp1: +37.0°C
  pch_cannonlake-virtual-0
  Adapter: Virtual device
  temp1: +42.0°C
  acpitz-acpi-0
  Adapter: ACPI interface
  temp1: +32.0°C (crit = +119.0°C)
The coretemp temperature is the one that goes nuts from 5.12.19 on.
I experience exactly the same problem. It happened for the first time on Jul 26th, after a kernel upgrade from 5.12.15 to 5.13.4. After reading this issue, I tried 5.12.18-200.fc33.x86_64, but the problem is still there. So I think it's a change between 5.12.15 and 5.12.18. I'm using a Lenovo ThinkPad P1 Gen 2 (i7-9750H).
The log entry before shutdown:
  kernel: thermal thermal_zone0: acpitz: critical temperature reached, shutting down
sensors output:
  iwlwifi_1-virtual-0
  Adapter: Virtual device
  temp1: +43.0°C
  ucsi_source_psy_USBC000:001-isa-0000
  Adapter: ISA adapter
  in0: 0.00 V (min = +0.00 V, max = +0.00 V)
  curr1: 0.00 A (max = +0.00 A)
  thinkpad-isa-0000
  Adapter: ISA adapter
  fan1: 2468 RPM
  fan2: 2184 RPM
  temp1: +47.0°C
  temp2: +46.0°C
  temp3: +0.0°C
  temp4: +0.0°C
  temp5: +0.0°C
  temp6: +0.0°C
  temp7: +0.0°C
  temp8: N/A
  BAT0-acpi-0
  Adapter: ACPI interface
  in0: 17.07 V
  coretemp-isa-0000
  Adapter: ISA adapter
  Package id 0: +53.0°C (high = +100.0°C, crit = +100.0°C)
  Core 0: +51.0°C (high = +100.0°C, crit = +100.0°C)
  Core 1: +53.0°C (high = +100.0°C, crit = +100.0°C)
  Core 2: +46.0°C (high = +100.0°C, crit = +100.0°C)
  Core 3: +48.0°C (high = +100.0°C, crit = +100.0°C)
  Core 4: +46.0°C (high = +100.0°C, crit = +100.0°C)
  Core 5: +45.0°C (high = +100.0°C, crit = +100.0°C)
  ucsi_source_psy_USBC000:002-isa-0000
  Adapter: ISA adapter
  in0: 0.00 V (min = +0.00 V, max = +0.00 V)
  curr1: 0.00 A (max = +0.00 A)
  pch_cannonlake-virtual-0
  Adapter: Virtual device
  temp1: +44.0°C
  nvme-pci-0200
  Adapter: PCI adapter
  Composite: +38.9°C (low = -273.1°C, high = +83.8°C) (crit = +84.8°C)
  Sensor 1: +38.9°C (low = -273.1°C, high = +65261.8°C)
  Sensor 2: +38.9°C (low = -273.1°C, high = +65261.8°C)
  acpitz-acpi-0
  Adapter: ACPI interface
  temp1: +47.0°C (crit = +128.0°C)
Can you try this scratch build and see if it fixes the problem for you? https://koji.fedoraproject.org/koji/taskinfo?taskID=74457963
(In reply to Justin M. Forbes from comment #16)
> Can you try this scratch build and see if it fixes the problem for you?
>
> https://koji.fedoraproject.org/koji/taskinfo?taskID=74457963
Yes, it does! What was the issue?
I reverted:

commit fe6a6de6692e7f7159c1ff42b07ecd737df712b4
Author: Srinivas Pandruvada <srinivas.pandruvada.com>
Date: Mon Jun 28 14:58:03 2021 -0700

    thermal/drivers/int340x/processor_thermal: Fix tcc setting

    The following fixes are done for tcc sysfs interface:
    - TCC is 6 bits only from bit 29-24
    - TCC of 0 is valid
    - When BIT(31) is set, this register is read only
    - Check for invalid tcc value
    - Error for negative values

However, I don't see where the patch itself is incorrect, and it is changing sysfs exports. I would be surprised if thermald did not understand these changes, as that is the expected interface to work with int340x, so I would have assumed they tested changes there. Let me do some digging into the thermald code and see what the issue might be.
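For readers following along, the constraints listed in that commit message can be illustrated with a small model. This is only a sketch in Python under stated assumptions, not the driver code: the register is represented as a plain integer, and the names `TCC_SHIFT`, `TCC_MASK`, `LOCK_BIT`, `get_tcc_offset` and `set_tcc_offset` are hypothetical.

```python
# Hypothetical model of the rules in the commit message; not the kernel code.
TCC_SHIFT = 24        # TCC occupies bits 29-24
TCC_MASK = 0x3F       # 6 bits, so valid values are 0..63
LOCK_BIT = 1 << 31    # when BIT(31) is set, the register is read only


def get_tcc_offset(reg: int) -> int:
    """Extract the 6-bit TCC field from a register value."""
    return (reg >> TCC_SHIFT) & TCC_MASK


def set_tcc_offset(reg: int, tcc: int) -> int:
    """Return a new register value with the TCC field replaced."""
    if tcc < 0 or tcc > TCC_MASK:
        # Negative and out-of-range values are errors; 0 is valid.
        raise ValueError(f"invalid tcc value: {tcc}")
    if reg & LOCK_BIT:
        raise PermissionError("register is locked (bit 31 set)")
    reg &= ~(TCC_MASK << TCC_SHIFT)   # clear bits 29-24 only
    return reg | (tcc << TCC_SHIFT)
```

In this model `set_tcc_offset(reg, 0)` is accepted and really clears the field, which is the "TCC of 0 is valid" item from the list.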
But thermald is not present in my system.
I am pretty sure at this point that upstream expects thermald to be the primary method for maintaining temperature on a modern Intel-based laptop. Perhaps you should install it and see if that makes things work with a proper 5.13.13 build?
I disagree. thermald can certainly improve *performance* over the default thermal management, but the kernel cannot rely on an external userspace daemon to *work properly*; that would be completely unacceptable.
Well, as that patch seems to only be changing the sysfs interface, *something* in userspace is causing the behavior to change, as sysfs is how the kernel exports such things to userspace. If you are not using the userspace controller that is expected at this point, you might want to find out what you are using, and why it doesn't behave well with the kernel changes for error checking. There are plenty of instances where the kernel provides mechanism, and depends on userspace to provide policy.
Then that revert is not necessary, and it must be something else after 5.13.8, because I'm pretty sure I'm not using anything in userspace.
(In reply to Iñaki Ucar from comment #23)
> Then that revert is not necessary, and it must be something else after
> 5.13.8, because I'm pretty sure I'm not using anything in userspace.
Nope. 5.13.13 in @updates-testing shows the same issue. It really is that commit, so it must be some side effect. In fact, the patch changes tcc_offset_update, and, AFAICT, that influences more than just the sysfs interface.
(In reply to Justin M. Forbes from comment #18)
> I reverted:
>
> commit fe6a6de6692e7f7159c1ff42b07ecd737df712b4
> Author: Srinivas Pandruvada <srinivas.pandruvada.com>
> Date: Mon Jun 28 14:58:03 2021 -0700
>
> thermal/drivers/int340x/processor_thermal: Fix tcc setting
>
> The following fixes are done for tcc sysfs interface:
> - TCC is 6 bits only from bit 29-24
> - TCC of 0 is valid
> - When BIT(31) is set, this register is read only
> - Check for invalid tcc value
> - Error for negative values
>
> However, I don't see where the patch itself is incorrect, and it is changing
> sysfs exports.
I'm having the same issue on my laptop. Looking at the above commit, if I got this correctly, I believe there's a kernel bug. The bug isn't in the commit itself, but was hidden before the change.
When looking at the suspend/resume logic in the driver, one global variable is used to store the current offset: tcc_offset_save. The variable is used in proc_thermal_resume as an argument to tcc_offset_update when the device resumes. The issue is this variable has a default value of 0 (which is not the h/w default) and is only set when userspace sets tcc_offset_degree_celsius. When userspace is not setting the value explicitly (on my system thermald deactivates itself[1]), tcc_offset_degree_celsius is set to 0 after a suspend/resume.
This can be reproduced (on a system where tcc_offset_degree_celsius was *not* set before, i.e. fresh boot, thermald/similar daemons not running) by:
1. Checking the value of tcc_offset_degree_celsius. In my case the h/w default is 3.
2. Perform any CPU intensive task (stress --cpu 12); the laptop does *not* shut down.
3. Suspend/resume.
4. tcc_offset_degree_celsius is now 0.
5. Perform any CPU intensive task (stress --cpu 12); the laptop now shuts down.
Setting tcc_offset_degree_celsius manually does fix the issue. Future suspend/resume calls would not set the value to 0.
This is because commit fe6a6de6692e changed a return condition in tcc_offset_update:

  -static int tcc_offset_update(int tcc)
  +static int tcc_offset_update(unsigned int tcc)
   {
          u64 val;
          int err;

  -       if (!tcc)
  +       if (tcc > 63)
                  return -EINVAL;

Before the change a value of 0 would not update the register behind tcc_offset_update.
(I don't believe reverting this is the right fix though, as 0 is a valid value. Setting tcc_offset_save to the register default value looks better. Or maybe adding a suspend helper to store the value instead of doing so when updating tcc_offset_update.)
[1] "[/sys/devices/platform/thinkpad_acpi/dytc_lapmode] present: Thermald can't run on this platform"
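The interaction described in this analysis can be replayed with a toy model. The following is a rough Python sketch, not the driver: it keeps only the saved value, the register offset and the return condition shown in the diff, and it assumes a h/w default of 3 as reported above. The class `TccModel` and its flag `zero_is_rejected` are hypothetical names used for illustration.

```python
EINVAL = 22
HW_DEFAULT_OFFSET = 3  # h/w default reported in the comment above (assumed)


class TccModel:
    """Toy model of tcc_offset_save / tcc_offset_update across a resume."""

    def __init__(self, zero_is_rejected: bool):
        self.register_offset = HW_DEFAULT_OFFSET  # what the hardware holds
        self.tcc_offset_save = 0                   # global default, not the h/w default
        self.zero_is_rejected = zero_is_rejected   # True models the pre-commit check

    def tcc_offset_update(self, tcc: int) -> int:
        if self.zero_is_rejected:
            if tcc == 0:            # old check: if (!tcc) return -EINVAL
                return -EINVAL
        elif tcc > 63:              # new check: if (tcc > 63) return -EINVAL
            return -EINVAL
        self.register_offset = tcc
        return 0

    def store_from_userspace(self, tcc: int) -> int:
        """Models a write to tcc_offset_degree_celsius."""
        err = self.tcc_offset_update(tcc)
        if err == 0:
            self.tcc_offset_save = tcc
        return err

    def resume(self) -> None:
        """Models proc_thermal_resume restoring the saved offset."""
        self.tcc_offset_update(self.tcc_offset_save)


old = TccModel(zero_is_rejected=True)
old.resume()                     # saved 0 is rejected, register untouched
new = TccModel(zero_is_rejected=False)
new.resume()                     # saved 0 is now written to the register
print(old.register_offset, new.register_offset)
```

With the old check the resume leaves the offset at the h/w default; with the new check the never-initialised 0 overwrites it, which matches steps 3 and 4 of the reproduction. Calling `store_from_userspace(3)` before `resume()` keeps the offset at 3 in both variants, which is why writing the value once by hand helps.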
Thank you for that analysis. Want to send that upstream and see if we can get a proper fix for this?
(In reply to Justin M. Forbes from comment #26)
> Want to send that upstream and see if we can get a proper fix for this?
Sure, I just sent a patch upstream: https://lore.kernel.org/linux-pm/20210908161632.15520-1-atenart@kernel.org/T/#u
In addition, here is a workaround (to be run after each cold boot):
  # echo $(cat tcc_offset_degree_celsius) > tcc_offset_degree_celsius
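The same write-back could also be scripted. The following is a minimal Python sketch of the one-liner above, under these assumptions: the directory that contains tcc_offset_degree_celsius is not given in this thread, so the attribute path is left as a parameter, the function name `rewrite_tcc_offset` is hypothetical, and writing to the real attribute needs root.

```python
from pathlib import Path


def rewrite_tcc_offset(attr: Path) -> int:
    """Read the current offset and write the same value back.

    Writing the value once makes the driver remember it, so a later
    suspend/resume restores it instead of 0 (per the analysis above).
    """
    value = int(attr.read_text().strip())
    attr.write_text(f"{value}\n")
    return value


# Example call (the path is a placeholder to be filled in for your system):
# rewrite_tcc_offset(Path("<sysfs device dir>/tcc_offset_degree_celsius"))
```

Like the shell version, this has to be repeated after each cold boot until a fixed kernel is installed.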
I sent a v2 (the fix is the same, but only one part is now targeted for stable kernels to ease the backports), https://lore.kernel.org/linux-pm/20210909085613.5577-1-atenart@kernel.org/T/
(In reply to Antoine Tenart from comment #28)
> I sent a v2 (the fix is the same, but only one part is now targeted for
> stable kernels to ease the backports),
> https://lore.kernel.org/linux-pm/20210909085613.5577-1-atenart@kernel.org/T/
The fix is included in v5.15-rc3[1] and queued for stable.
[1] https://lore.kernel.org/linux-pm/163268466277.21680.15607448515937446683.pr-tracker-bot@kernel.org/T/
Will this be backported to other stable branches?
(In reply to Iñaki Ucar from comment #30)
> Will this be backported to other stable branches?
Yes, it is queued[1] for stable branches upstream. Next (impacted) stable releases should include the fix.
[1] Not in their git tree yet though.
Ok, thanks for looking into this and for the fix.
This is fixed in the following upstream stable kernels: 5.14.9, 5.10.70 and 5.4.150. (5.13.y is EOL).
Fixed in kernel-5.14.9.