1992706 – Kernel thermal misconfiguration makes CPU overheat

Summary: Kernel thermal misconfiguration makes CPU overheat

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-08-11 15:35 UTC by Iñaki Ucar
Modified:	2021-10-05 08:13 UTC
CC List:	25 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2021-10-05 08:13:32 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
kernel logs (103.29 KB, text/plain) 2021-08-11 15:35 UTC, Iñaki Ucar
lsmod output for 5.12.7 (7.92 KB, text/plain) 2021-08-18 09:27 UTC, Iñaki Ucar
lsmod output for 5.12.19 (7.81 KB, text/plain) 2021-08-18 09:27 UTC, Iñaki Ucar
lsmod output for 5.13.8 (7.90 KB, text/plain) 2021-08-18 09:28 UTC, Iñaki Ucar

Description Iñaki Ucar 2021-08-11 15:35:33 UTC

Created attachment 1813183 [details]
kernel logs

1. Please describe the problem:

I'm experiencing this on my Intel-based laptop (LG Gram). When the CPU is idle and cool, so that the CPU fan is off, and I start a CPU-demanding load (such as a compilation), the processor quickly overheats, reaching the critical temperature before the fan can reach maximum speed, and the kernel triggers a shutdown. It started happening with the 5.13.x series.

Briefly discussed here: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/YGESVMI6SIDMLRJKJEXJ7R3TEESS7BHU/

-  thermald is not installed.
-  intel_tcc_cooling is loaded, but removing it does not help.

2. What is the Version-Release number of the kernel:

It happens with the 5.13.x series. Tested with .4, .5 and .8.

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

No issue with previous kernels. I'm currently running 5.12.7-300.fc34.x86_64 with no issues: the fan reaches maximum speed quickly enough to control the temperature.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

-  Suspend the laptop and wait a few minutes until it cools down.
-  Resume the session.
-  Launch a compilation task when the sensors' output shows a temperature of ~40°C for the processor.

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Not tested yet.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Log attached.

Comment 1 Iñaki Ucar 2021-08-16 20:59:39 UTC

The issue persists with kernel 5.14.0-0.rc5.20210813gitf8e6dfc64f61.46.fc36. The only difference is that I get an additional line in the logs compared to 5.13 (the second one below):

  Aug 16 22:49:50 kernel: thermal thermal_zone0: acpitz: critical temperature reached, shutting down
  Aug 16 22:49:50 kernel: reboot: HARDWARE PROTECTION shutdown (Temperature too high)

The laptop is basically unusable with any kernel >= 5.13.

Comment 2 Justin M. Forbes 2021-08-17 14:53:57 UTC

I assume if you append thermal.off=1 to the grub command line, this goes away?

Comment 3 Iñaki Ucar 2021-08-17 15:06:47 UTC

I don't know. I'm not willing to risk the computer. It shuts down *because* the CPU is actually reaching the critical temperature.

Comment 4 Justin M. Forbes 2021-08-17 16:50:37 UTC

Force the fan to run at full speed instead of auto and see if it still shuts down. If so, it means your laptop is built in a way that cannot handle the actual thermal load (not as uncommon as you might think). If not, it means we have issues with kernel thermal management. Another thing worth trying is kernel-5.12.19: https://koji.fedoraproject.org/koji/buildinfo?buildID=1782372 There were no ACPI thermal updates in 5.13 at all, but a few fan updates did show up in the 5.14 merge window and were backported to stable in 5.13.2 and 5.12.17. Testing that build would certainly help narrow down the patch that introduced this. The result with the current (bad) kernel and the fans forced to full is still an interesting data point though, as it could be that the new patches are correct and the hardware requires some specific finesse to keep it clocked lower.

Comment 5 Iñaki Ucar 2021-08-17 17:03:26 UTC

Any idea how to force the fan to full speed? I see no way of controlling it, and pwmconfig says that "there are no pwm-capable sensor modules installed".

Comment 6 Iñaki Ucar 2021-08-17 18:06:48 UTC

No issues with kernel 5.12.19.

Comment 7 Justin M. Forbes 2021-08-17 18:19:29 UTC

That narrows it down a good bit. Can you give me the lsmod output on a 5.13 kernel, please?

Comment 8 Iñaki Ucar 2021-08-17 18:30:40 UTC

(In reply to Iñaki Ucar from comment #6)
> No issues with kernel 5.12.19.

Correction, I booted the wrong kernel: the issue *is* present with kernel 5.12.19.

Comment 9 Iñaki Ucar 2021-08-17 18:48:23 UTC

And I found a better way to reproduce the issue:

- Run `stress --cpu 8`.
- When the temperature is stable, suspend & resume.

Temperature goes nuts with kernel >= 5.12.19 and the laptop shuts down.

Comment 10 Justin M. Forbes 2021-08-17 18:56:34 UTC

Great, so it most likely came in with that patch set. What is the lsmod output?

Comment 11 Iñaki Ucar 2021-08-18 09:27:08 UTC

Created attachment 1815118 [details]
lsmod output for 5.12.7

Comment 12 Iñaki Ucar 2021-08-18 09:27:48 UTC

Created attachment 1815119 [details]
lsmod output for 5.12.19

Comment 13 Iñaki Ucar 2021-08-18 09:28:19 UTC

Created attachment 1815120 [details]
lsmod output for 5.13.8

Comment 14 Iñaki Ucar 2021-08-18 09:38:12 UTC

lsmod output for three kernels attached. I see no differences between 5.12.7 and 5.12.19 (apart from the VirtualBox modules). So I suppose that changes in the following modules would be suspicious:

  acpi_pad
  acpi_thermal_rel
  ...
  coretemp
  ...
  int3400_thermal
  int3403_thermal
  int340x_thermal_zone
  intel_cstate
  intel_pch_thermal
  intel_pmc_bxt
  intel_powerclamp
  intel_rapl_common
  intel_rapl_msr
  intel_soc_dts_iosf
  intel_uncore
  ...
  pinctrl_cannonlake
  processor_thermal_device
  processor_thermal_mbox
  processor_thermal_rapl
  processor_thermal_rfim
  rapl
  ...
  x86_pkg_temp_thermal

Also, the output from sensors may be helpful:

  coretemp-isa-0000
  Adapter: ISA adapter
  Package id 0:  +42.0°C  (high = +100.0°C, crit = +100.0°C)
  Core 0:        +42.0°C  (high = +100.0°C, crit = +100.0°C)
  Core 1:        +41.0°C  (high = +100.0°C, crit = +100.0°C)
  Core 2:        +41.0°C  (high = +100.0°C, crit = +100.0°C)
  Core 3:        +42.0°C  (high = +100.0°C, crit = +100.0°C)

  CMB0-acpi-0
  Adapter: ACPI interface
  in0:           7.79 V  

  iwlwifi_1-virtual-0
  Adapter: Virtual device
  temp1:        +37.0°C  

  pch_cannonlake-virtual-0
  Adapter: Virtual device
  temp1:        +42.0°C  

  acpitz-acpi-0
  Adapter: ACPI interface
  temp1:        +32.0°C  (crit = +119.0°C)

The coretemp temperature is the one that goes nuts from 5.12.19 on.

Comment 15 Robert Jaros 2021-08-22 15:35:19 UTC

I am experiencing exactly the same problem. It first happened on Jul 26th, after a kernel upgrade from 5.12.15 to 5.13.4.
After reading this issue, I tried 5.12.18-200.fc33.x86_64, but the problem is still there. So I think the change came in between 5.12.15 and 5.12.18.
I'm using Lenovo Thinkpad P1 Gen 2 (i7-9750H).

The log entry before shutdown:

kernel: thermal thermal_zone0: acpitz: critical temperature reached, shutting down


sensors output:

iwlwifi_1-virtual-0
Adapter: Virtual device
temp1:        +43.0°C  

ucsi_source_psy_USBC000:001-isa-0000
Adapter: ISA adapter
in0:           0.00 V  (min =  +0.00 V, max =  +0.00 V)
curr1:         0.00 A  (max =  +0.00 A)

thinkpad-isa-0000
Adapter: ISA adapter
fan1:        2468 RPM
fan2:        2184 RPM
temp1:        +47.0°C  
temp2:        +46.0°C  
temp3:         +0.0°C  
temp4:         +0.0°C  
temp5:         +0.0°C  
temp6:         +0.0°C  
temp7:         +0.0°C  
temp8:            N/A  

BAT0-acpi-0
Adapter: ACPI interface
in0:          17.07 V  

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +53.0°C  (high = +100.0°C, crit = +100.0°C)
Core 0:        +51.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:        +53.0°C  (high = +100.0°C, crit = +100.0°C)
Core 2:        +46.0°C  (high = +100.0°C, crit = +100.0°C)
Core 3:        +48.0°C  (high = +100.0°C, crit = +100.0°C)
Core 4:        +46.0°C  (high = +100.0°C, crit = +100.0°C)
Core 5:        +45.0°C  (high = +100.0°C, crit = +100.0°C)

ucsi_source_psy_USBC000:002-isa-0000
Adapter: ISA adapter
in0:           0.00 V  (min =  +0.00 V, max =  +0.00 V)
curr1:         0.00 A  (max =  +0.00 A)

pch_cannonlake-virtual-0
Adapter: Virtual device
temp1:        +44.0°C  

nvme-pci-0200
Adapter: PCI adapter
Composite:    +38.9°C  (low  = -273.1°C, high = +83.8°C)
                       (crit = +84.8°C)
Sensor 1:     +38.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +38.9°C  (low  = -273.1°C, high = +65261.8°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +47.0°C  (crit = +128.0°C)

Comment 16 Justin M. Forbes 2021-08-24 21:02:51 UTC

Can you try this scratch build and see if it fixes the problem for you?

https://koji.fedoraproject.org/koji/taskinfo?taskID=74457963

Comment 17 Iñaki Ucar 2021-08-29 21:49:33 UTC

(In reply to Justin M. Forbes from comment #16)
> Can you try this scratch build and see if it fixes the problem for you?
> 
> https://koji.fedoraproject.org/koji/taskinfo?taskID=74457963

Yes, it does! What was the issue?

Comment 18 Justin M. Forbes 2021-08-30 13:52:10 UTC

I reverted:

commit fe6a6de6692e7f7159c1ff42b07ecd737df712b4
Author: Srinivas Pandruvada <srinivas.pandruvada.com>
Date:   Mon Jun 28 14:58:03 2021 -0700

    thermal/drivers/int340x/processor_thermal: Fix tcc setting
    
    The following fixes are done for tcc sysfs interface:
    - TCC is 6 bits only from bit 29-24
    - TCC of 0 is valid
    - When BIT(31) is set, this register is read only
    - Check for invalid tcc value
    - Error for negative values

However, I don't see where the patch itself is incorrect, and it is changing sysfs exports. I would be surprised if thermald did not understand these changes, as that is the expected interface to work with int340x, so I would have assumed they tested changes there.  Let me do some digging into the thermald code and see what the issue might be.
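As a reading aid, the register layout described in that commit message can be sketched in user-space C. This is only an illustration under stated assumptions: the bit positions (a 6-bit field in bits 29-24, a read-only lock in bit 31) come from the commit message, while every macro and function name below is hypothetical and not taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; only the bit positions come from the commit message. */
#define TCC_OFFSET_SHIFT 24
#define TCC_OFFSET_MASK  (UINT64_C(0x3f) << TCC_OFFSET_SHIFT) /* bits 29-24 */
#define TCC_LOCK_BIT     (UINT64_C(1) << 31)                  /* read-only when set */

/* Extract the 6-bit offset field from a raw register value. */
unsigned int tcc_field_get(uint64_t reg)
{
    return (unsigned int)((reg & TCC_OFFSET_MASK) >> TCC_OFFSET_SHIFT);
}

/*
 * Store a new offset in the modelled register value.
 * Returns false when the value does not fit in 6 bits or the lock bit is set.
 * An offset of 0 is accepted, as the commit message says it is valid.
 */
bool tcc_field_set(uint64_t *reg, unsigned int tcc)
{
    assert(reg != NULL);

    if (tcc > 63)
        return false;
    if (*reg & TCC_LOCK_BIT)
        return false;

    *reg = (*reg & ~TCC_OFFSET_MASK) | ((uint64_t)tcc << TCC_OFFSET_SHIFT);
    return true;
}
```

With a raw value of 0x03000000, `tcc_field_get` yields 3; writing 0 is accepted, while writing 64 or writing to a value with bit 31 set is rejected.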

Comment 19 Iñaki Ucar 2021-08-30 22:24:29 UTC

But thermald is not present in my system.

Comment 20 Justin M. Forbes 2021-08-30 22:53:23 UTC

I am pretty sure at this point that upstream expects thermald to be the primary method for maintaining temperature on a modern Intel-based laptop. Perhaps you should install it and see if that makes things work with a proper 5.13.13 build?

Comment 21 Iñaki Ucar 2021-08-31 14:21:14 UTC

I disagree. thermald can certainly be an improvement in *performance* with respect to the default thermal management, but the kernel cannot rely on an external userspace daemon to *work properly*; that would be completely unacceptable.

Comment 22 Justin M. Forbes 2021-08-31 22:30:57 UTC

Well, as that patch seems to only be changing the sysfs interface, *something* in userspace is causing the behavior to change, as sysfs is how the kernel exports such things to userspace. If you are not using the userspace controller that is expected at this point, you might want to find out what you are using, and why it doesn't behave well with the kernel changes for error checking. There are plenty of instances where the kernel provides mechanism, and depends on userspace to provide policy.

Comment 23 Iñaki Ucar 2021-08-31 22:41:40 UTC

Then that revert is not necessary, and it must be something else after 5.13.8, because I'm pretty sure I'm not using anything in userspace.

Comment 24 Iñaki Ucar 2021-08-31 23:04:45 UTC

(In reply to Iñaki Ucar from comment #23)
> Then that revert is not necessary, and it must be something else after
> 5.13.8, because I'm pretty sure I'm not using anything in userspace.

Nope. 5.13.13 in @updates-testing shows the same issue. It really is that commit, so it must be some side effect. In fact, the patch changes tcc_offset_update, and, AFAICT, that influences more than just the sysfs interface.

Comment 25 Antoine Tenart 2021-09-08 10:04:17 UTC

(In reply to Justin M. Forbes from comment #18)
> I reverted:
> 
> commit fe6a6de6692e7f7159c1ff42b07ecd737df712b4
> Author: Srinivas Pandruvada <srinivas.pandruvada.com>
> Date:   Mon Jun 28 14:58:03 2021 -0700
> 
>     thermal/drivers/int340x/processor_thermal: Fix tcc setting
>     
>     The following fixes are done for tcc sysfs interface:
>     - TCC is 6 bits only from bit 29-24
>     - TCC of 0 is valid
>     - When BIT(31) is set, this register is read only
>     - Check for invalid tcc value
>     - Error for negative values
> 
> However, I don't see where the patch itself is incorrect, and it is changing
> sysfs exports.

I'm having the same issue on my laptop. Looking at the above commit, if I got this correctly, I believe there's a kernel bug. The bug isn't in the commit itself, but was hidden before the change.

When looking at the suspend/resume logic in the driver, one global variable is used to store the current offset: tcc_offset_save. The variable is used in proc_thermal_resume as an argument to tcc_offset_update when the device resumes. The issue is this variable has a default value of 0 (which is not the h/w default) and is only set when userspace sets tcc_offset_degree_celsius. When userspace is not setting the value explicitly (on my system thermald deactivates itself[1]), tcc_offset_degree_celsius is set to 0 after a suspend/resume.

This can be reproduced (on a system where tcc_offset_degree_celsius was *not* set before, i.e. fresh boot, thermald/similar daemons not running) by:

1. Check the value of tcc_offset_degree_celsius. In my case the h/w default is 3.
2. Run any CPU-intensive task (stress --cpu 12); the laptop does *not* shut down.
3. Suspend/resume.
4. tcc_offset_degree_celsius is now 0.
5. Run any CPU-intensive task (stress --cpu 12); the laptop now shuts down.

Setting tcc_offset_degree_celsius manually does fix the issue. Future suspend/resume calls would not set the value to 0.

This is because commit fe6a6de6692e changed a return condition in tcc_offset_update:

  -static int tcc_offset_update(int tcc)
  +static int tcc_offset_update(unsigned int tcc)
   {
          u64 val;
          int err;

  -       if (!tcc)
  +       if (tcc > 63)
                  return -EINVAL;

Before the change a value of 0 would not update the register behind tcc_offset_update.
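The effect of that changed check on the resume path can be modelled in a few lines of user-space C. This is a minimal sketch, not driver code: the struct and function names are hypothetical, the register is reduced to a plain integer, and the only behaviour modelled is what is described above (resume passes the saved offset, which is still 0 if userspace never wrote one, to the update function).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space model of the state involved. */
struct tcc_model {
    unsigned int reg_offset; /* offset currently held by the modelled register */
    unsigned int saved;      /* stands in for tcc_offset_save; 0 until userspace writes */
};

/* Check as it was before the commit: 0 is rejected, register left untouched. */
int update_old(struct tcc_model *m, int tcc)
{
    assert(m != NULL);
    if (!tcc)
        return -EINVAL;
    m->reg_offset = (unsigned int)tcc;
    return 0;
}

/* Check as it is after the commit: 0 is valid, only values above 63 are rejected. */
int update_new(struct tcc_model *m, unsigned int tcc)
{
    assert(m != NULL);
    if (tcc > 63)
        return -EINVAL;
    m->reg_offset = tcc;
    return 0;
}

/* Resume hands the saved value to the update function, whatever it is. */
void resume_old(struct tcc_model *m) { (void)update_old(m, (int)m->saved); }
void resume_new(struct tcc_model *m) { (void)update_new(m, m->saved); }
```

Starting from a register offset of 3 and a saved value of 0, `resume_old` leaves the offset at 3 (the update is rejected), whereas `resume_new` overwrites it with 0, which matches the observed behaviour after suspend/resume.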

(I don't believe reverting this is the right fix though, as 0 is a valid value. Setting tcc_offset_save to the register default value looks better. Or maybe adding a suspend helper to store the value instead of doing so when updating tcc_offset_update.)

[1] "[/sys/devices/platform/thinkpad_acpi/dytc_lapmode] present: Thermald can't run on this platform"
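The "suspend helper" alternative mentioned above can be sketched in user-space C as well. This only illustrates the idea of recording the offset at suspend time instead of at write time; it is not the patch that was sent upstream, all names are hypothetical, and whether the hardware keeps or loses the register contents across suspend is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; not driver code. */
struct tcc_pm_model {
    unsigned int reg_offset; /* offset held by the modelled register */
    unsigned int saved;      /* snapshot taken at suspend time */
};

/* Suspend helper: record whatever the register holds right now. */
void model_suspend(struct tcc_pm_model *m)
{
    assert(m != NULL);
    m->saved = m->reg_offset;
}

/* Resume: write the snapshot back, keeping the 0..63 range check. */
int model_resume(struct tcc_pm_model *m)
{
    assert(m != NULL);
    if (m->saved > 63)
        return -1;
    m->reg_offset = m->saved;
    return 0;
}
```

In this model a machine that boots with an offset of 3 and never has it set from userspace still resumes with 3, because the snapshot is taken from the register rather than from a variable initialised to 0.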

Comment 26 Justin M. Forbes 2021-09-08 14:53:15 UTC

Thank you for that analysis. Want to send that upstream and see if we can get a proper fix for this?

Comment 27 Antoine Tenart 2021-09-08 16:25:31 UTC

(In reply to Justin M. Forbes from comment #26)
> Want to send that upstream and see if we can get a proper fix for this?

Sure, I just sent a patch upstream:
https://lore.kernel.org/linux-pm/20210908161632.15520-1-atenart@kernel.org/T/#u

In addition, here is a workaround (to be run after each cold boot):
# echo $(cat tcc_offset_degree_celsius) > tcc_offset_degree_celsius
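For anyone who prefers a small helper over the shell one-liner, the same read-then-write-back step can be sketched in C. Assumptions: the function name is hypothetical, and the attribute path is passed in by the caller because the thread only gives the file name (tcc_offset_degree_celsius), not its full sysfs location. Writing to the real attribute requires root.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Read an integer from the file at `path` and write the same value back,
 * mirroring `echo $(cat FILE) > FILE`.
 * Returns 0 on success and -1 on any I/O or parse failure.
 */
int rewrite_attr(const char *path)
{
    FILE *f;
    int value;

    assert(path != NULL);

    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1) {
        fclose(f);
        return -1;
    }
    fclose(f);

    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    if (fprintf(f, "%d\n", value) < 0) {
        fclose(f);
        return -1;
    }
    /* fclose flushes the buffer, so a failed write surfaces here. */
    return fclose(f) == 0 ? 0 : -1;
}
```

The explicit write matters because, per the analysis in comment 25, the saved offset is only recorded when userspace sets the attribute; writing the current value back changes nothing in the hardware but makes later resumes restore that value instead of 0.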

Comment 28 Antoine Tenart 2021-09-09 09:02:25 UTC

I sent a v2 (the fix is the same, but only one part is now targeted for stable kernels to ease the backports),
https://lore.kernel.org/linux-pm/20210909085613.5577-1-atenart@kernel.org/T/

Comment 29 Antoine Tenart 2021-09-27 08:00:46 UTC

(In reply to Antoine Tenart from comment #28)
> I sent a v2 (the fix is the same, but only one part is now targeted for
> stable kernels to ease the backports),
> https://lore.kernel.org/linux-pm/20210909085613.5577-1-atenart@kernel.org/T/

The fix is included in v5.15-rc3[1] and queued for stable.

[1] https://lore.kernel.org/linux-pm/163268466277.21680.15607448515937446683.pr-tracker-bot@kernel.org/T/

Comment 30 Iñaki Ucar 2021-09-27 08:20:45 UTC

Will this be backported to other stable branches?

Comment 31 Antoine Tenart 2021-09-27 08:26:09 UTC

(In reply to Iñaki Ucar from comment #30)
> Will this be backported to other stable branches?

Yes, it is queued[1] for stable branches upstream. Next (impacted) stable releases should include the fix.

[1] Not in their git tree yet though.

Comment 32 Iñaki Ucar 2021-09-27 08:28:17 UTC

Ok, thanks for looking into this and for the fix.

Comment 33 Antoine Tenart 2021-09-30 12:22:57 UTC

This is fixed in the following upstream stable kernels: 5.14.9, 5.10.70 and 5.4.150. (5.13.y is EOL).

Comment 34 Antoine Tenart 2021-10-05 08:13:32 UTC

Fixed in kernel-5.14.9.
