Bug 924570

Summary:

regression, package temp above normal induced mce

Product:

[Fedora] Fedora

Reporter:

Chris Murphy <bugzilla>

Component:

kernel

Assignee:

Kernel Maintainer List <kernel-maint>

Status:

NEW ---

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

rawhide

CC:

aiden449, angystardust, barletz, bhubbard, bruno.cornec, bugzilla, bugzilla, cagney, choeger, c.justin88, dazo, dgsiegel, dr.diesel, eminguez, emmanuel.kowalski, euroelessar, fabrice, gansalmon, herrold, itamar, jarmofin, jesse, jnordell, john.mora, jonathan, jpittman, juha.heljoranta, kerncece, kernel-maint, konstantinos.smanis, lantw44, madam, madhu.chinakonda, mirosiko, mlombard, m.mcnutt, neteler, nfink95, nobody+385537, nphilipp, oholy, ormandj, paulo.fidalgo.pt, pcfe, peter, samuel-rhbugs, sean+rh, sergio, stanley.king, tadej.j, tchollingsworth, tflink, tommy, uckelman, vlee, vromanso, wmp

Target Milestone:

---

Keywords:

Reopened

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2016-07-19 10:07:59 UTC

Type:

Bug

Attachments:

Description	Flags
dmesg	none
syslog	none
dmesg 3.11.2-301.fc20.x86_64	none
journalctl -b for kernel 3.14 rc2	none
dmesg 3.16.0-0.rc7.git4.1	none

Description Chris Murphy 2013-03-22 05:51:35 UTC

Description of problem:
During CPU-intensive tasks, dmesg reports core and package temperatures above threshold, then clock throttling, then an MCE. This is a new problem with the 3.9.0 kernel: the same tasks on the same hardware did not trigger it with 3.8.x, 3.7.x, or 3.6.x kernels.

Version-Release number of selected component (if applicable):
kernel 3.9.0-0.rc3.git1.3.fc19.x86_64

How reproducible:
Always.


Steps to Reproduce:
1. BTRFS scrub

  
Actual results:
Will attach dmesg and syslog of results. No other obvious manifestations than the messages.

Expected results:
Not this.

Additional info:

Comment 1 Chris Murphy 2013-03-22 05:53:12 UTC

Created attachment 714304 [details]
dmesg

full dmesg. snippet with errors:


[ 3403.381085] CPU6: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.381087] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.381088] CPU5: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.381091] CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.381094] CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.381119] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.381120] CPU4: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.381122] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.381124] CPU7: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.381496] CPU6: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.382109] CPU0: Package temperature/speed normal
[ 3403.382111] CPU4: Package temperature/speed normal
[ 3403.382114] CPU5: Package temperature/speed normal
[ 3403.382115] CPU3: Package temperature/speed normal
[ 3403.382117] CPU7: Package temperature/speed normal
[ 3403.382118] CPU1: Package temperature/speed normal
[ 3403.382119] CPU6: Core temperature/speed normal
[ 3403.382120] CPU2: Core temperature/speed normal
[ 3403.382120] CPU6: Package temperature/speed normal
[ 3403.382121] CPU2: Package temperature/speed normal
[ 3598.152029] mce: [Hardware Error]: Machine check events logged
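
To compare kernels on more than a snippet, the throttle lines can be tallied per CPU and per level (core or package). The following is a minimal sketch, assuming the kernel message format shown above; the function name count_throttle_lines is made up for illustration and is not part of any tool mentioned in this report.

```python
import re
from collections import Counter

# Matches kernel lines such as:
# [ 3403.381085] CPU6: Core temperature above threshold, cpu clock throttled (total events = 1)
THROTTLE_RE = re.compile(
    r"CPU(?P<cpu>\d+): (?P<level>Core|Package) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<total>\d+)\)"
)

def count_throttle_lines(dmesg_text):
    """Return a Counter keyed by (cpu, level) with the number of matching lines."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = THROTTLE_RE.search(line)
        if match:
            counts[(int(match.group("cpu")), match.group("level"))] += 1
    return counts

# Sample lines copied from the dmesg snippet above.
sample = """\
[ 3403.381085] CPU6: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.381087] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.382119] CPU6: Core temperature/speed normal
[ 3598.152029] mce: [Hardware Error]: Machine check events logged
"""

for (cpu, level), n in sorted(count_throttle_lines(sample).items()):
    print(f"CPU{cpu} {level}: {n}")
```

Feeding it the full dmesg from a 3.8.x boot and a 3.9.x boot under the same workload would give a like-for-like count rather than an impression from the log.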

Comment 2 Chris Murphy 2013-03-22 05:56:45 UTC

Created attachment 714306 [details]
syslog

snippet from syslog

Comment 3 Josh Boyer 2013-09-18 20:55:50 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.11.1-200.fc19.  Please test this kernel update and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 4 Chris Murphy 2013-10-02 03:22:13 UTC

Still present with 3.11.2-301.fc20.x86_64

Comment 5 Chris Murphy 2013-10-02 03:23:36 UTC

Created attachment 806241 [details]
dmesg 3.11.2-301.fc20.x86_64

Comment 6 markusN 2014-02-11 00:10:52 UTC

Happens in 3.12.9-301.fc20.x86_64 - in previous kernels I did not have such
problems when compiling source code:

[136510.805360] CPU2: Core temperature above threshold, cpu clock throttled (total events = 112479)
[136510.805362] CPU0: Core temperature above threshold, cpu clock throttled (total events = 112478)
[136510.805366] CPU3: Package temperature above threshold, cpu clock throttled (total events = 157878)
[136510.805368] CPU1: Package temperature above threshold, cpu clock throttled (total events = 157878)
[136510.805369] CPU0: Package temperature above threshold, cpu clock throttled (total events = 157877)
[136510.805382] CPU2: Package temperature above threshold, cpu clock throttled (total events = 157878)
[136510.807341] CPU2: Core temperature/speed normal
[136510.807343] CPU0: Core temperature/speed normal
[136510.807345] CPU1: Package temperature/speed normal
[136510.807346] CPU3: Package temperature/speed normal
[136510.807347] CPU0: Package temperature/speed normal
[136510.807357] CPU2: Package temperature/speed normal
[136606.148339] mce: [Hardware Error]: Machine check events logged

Comment 7 Chris Murphy 2014-02-16 20:32:29 UTC

Created attachment 863832 [details]
journalctl -b for kernel 3.14 rc2

This still happens with 3.14.0-0.rc2.git0.1.fc21.x86_64. It's pretty much always triggered by yum or dnf getting busy and making the laptop hot while sounding like it's a hair dryer.

What makes me think this is bogus is that the trip temperature is exceeded at 900.69 seconds and the CPU is back below trip temperature at 900.70 seconds.


[  901.969534] f20c.localdomain kernel: mce: [Hardware Error]: Machine check events logged
[  900.690190] f20c.localdomain mcelog[581]: Hardware event. This is not a software error.
[  900.691170] f20c.localdomain mcelog[581]: MCE 0
[  900.691854] f20c.localdomain mcelog[581]: CPU 1 THERMAL EVENT TSC 1693342dfcc
[  900.692551] f20c.localdomain mcelog[581]: TIME 1392524773 Sat Feb 15 21:26:13 2014
[  900.693256] f20c.localdomain mcelog[581]: Processor 1 heated above trip temperature. Throttling enabled.
[  900.693927] f20c.localdomain mcelog[581]: Please check your system cooling. Performance will be impacted
[  900.694641] f20c.localdomain mcelog[581]: STATUS 880003c3 MCGSTATUS 0
[  900.695337] f20c.localdomain mcelog[581]: MCGCAP c09 APICID 2 SOCKETID 0
[  900.696015] f20c.localdomain mcelog[581]: CPUID Vendor Intel Family 6 Model 42
[  900.696678] f20c.localdomain mcelog[581]: Hardware event. This is not a software error.
[  900.697342] f20c.localdomain mcelog[581]: MCE 1
[  900.698027] f20c.localdomain mcelog[581]: CPU 5 THERMAL EVENT TSC 1693342fb66
[  900.698698] f20c.localdomain mcelog[581]: TIME 1392524773 Sat Feb 15 21:26:13 2014
[  900.699344] f20c.localdomain mcelog[581]: Processor 5 heated above trip temperature. Throttling enabled.
[  900.699947] f20c.localdomain mcelog[581]: Please check your system cooling. Performance will be impacted
[  900.700535] f20c.localdomain mcelog[581]: STATUS 880003c3 MCGSTATUS 0
[  900.701123] f20c.localdomain mcelog[581]: MCGCAP c09 APICID 3 SOCKETID 0
[  900.701720] f20c.localdomain mcelog[581]: CPUID Vendor Intel Family 6 Model 42
[  900.702391] f20c.localdomain mcelog[581]: Hardware event. This is not a software error.
[  900.702962] f20c.localdomain mcelog[581]: MCE 2
[  900.703579] f20c.localdomain mcelog[581]: CPU 1 THERMAL EVENT TSC 1693365cb04
[  900.704192] f20c.localdomain mcelog[581]: TIME 1392524773 Sat Feb 15 21:26:13 2014
[  900.704776] f20c.localdomain mcelog[581]: Processor 1 below trip temperature. Throttling disabled
[  900.705385] f20c.localdomain mcelog[581]: STATUS 88010282 MCGSTATUS 0
[  900.705980] f20c.localdomain mcelog[581]: MCGCAP c09 APICID 2 SOCKETID 0
[  900.706561] f20c.localdomain mcelog[581]: CPUID Vendor Intel Family 6 Model 42
[  900.707147] f20c.localdomain mcelog[581]: Hardware event. This is not a software error.
[  900.707716] f20c.localdomain mcelog[581]: MCE 3
[  900.708333] f20c.localdomain mcelog[581]: CPU 5 THERMAL EVENT TSC 16933662906
[  900.708923] f20c.localdomain mcelog[581]: TIME 1392524773 Sat Feb 15 21:26:13 2014
[  900.709557] f20c.localdomain mcelog[581]: Processor 5 below trip temperature. Throttling disabled
[  900.710102] f20c.localdomain mcelog[581]: STATUS 88010282 MCGSTATUS 0
[  900.710633] f20c.localdomain mcelog[581]: MCGCAP c09 APICID 3 SOCKETID 0
[  900.711160] f20c.localdomain mcelog[581]: CPUID Vendor Intel Family 6 Model 42
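
The bracketed timestamps in this excerpt appear to be the times at which mcelog wrote each line, so they are not a reliable measure of how long the CPU stayed above the trip point. Each event also carries a TSC value, and the difference between the "heated above" and "below" TSC values for the same CPU gives the duration in ticks. The sketch below does that arithmetic for CPU 1. The TSC frequency is an assumption (the thread never states it), so the millisecond figure is only indicative.

```python
def tsc_delta_ticks(tsc_start_hex, tsc_end_hex):
    """Difference between two TSC values as printed by mcelog (hex, no 0x prefix)."""
    return int(tsc_end_hex, 16) - int(tsc_start_hex, 16)

def ticks_to_ms(ticks, tsc_hz):
    """Convert ticks to milliseconds for an assumed constant TSC frequency."""
    return ticks / tsc_hz * 1000.0

# CPU 1 in the log above: "heated above" (MCE 0) and "below" (MCE 2).
ticks = tsc_delta_ticks("1693342dfcc", "1693365cb04")

# ASSUMPTION: 2.2 GHz is a placeholder; the real TSC rate is not given in this report.
ASSUMED_TSC_HZ = 2.2e9

print(f"{ticks} ticks, about {ticks_to_ms(ticks, ASSUMED_TSC_HZ):.2f} ms at the assumed rate")
```

With any plausible clock rate in the GHz range this comes out at roughly a millisecond, which is in line with the gap between the "above threshold" and "normal" lines in the dmesg snippet from comment 1.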

Comment 8 Andy Lawrence 2014-02-16 21:10:57 UTC

Chris, what are the actual core temps when this happens?

Comment 9 Chris Murphy 2014-02-16 22:24:05 UTC

On boot after installing lm_sensors, I get messages for each core:
kernel: CPU2: Package temperature above threshold, cpu clock throttled 

There is no mce event, and fans are noticeable but not loud. This is the result from the sensors command at that time:


# sensors
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +81.0°C  (high = +86.0°C, crit = +100.0°C)
Core 0:         +80.0°C  (high = +86.0°C, crit = +100.0°C)
Core 1:         +80.0°C  (high = +86.0°C, crit = +100.0°C)
Core 2:         +80.0°C  (high = +86.0°C, crit = +100.0°C)
Core 3:         +76.0°C  (high = +86.0°C, crit = +100.0°C)

pkg-temp-0-virtual-0
Adapter: Virtual device
temp1:        +80.0°C  

applesmc-isa-0300
Adapter: ISA adapter
Left side  : 3695 RPM  (min = 2000 RPM, max = 6200 RPM)
Right side : 3685 RPM  (min = 2000 RPM, max = 6200 RPM)
TB0T:         +30.2°C  
TB1T:         +30.2°C  
TB2T:         +29.2°C  
TC0C:         +77.0°C  
TC0D:         +78.2°C  
TC0E:         +88.0°C  
TC0F:         +90.0°C  
TC0P:         +66.5°C  
TC1C:         +75.0°C  
TC2C:         +75.0°C  
TC3C:         +75.0°C  
TC4C:         +74.0°C  
TCGC:         +75.0°C  
TCSA:         +75.0°C  
TCTD:          -1.0°C  
TG0D:         +74.2°C  
TG0P:         +71.5°C  
THSP:         +44.0°C  
TM0S:         +59.0°C  
TMBS:          +0.0°C  
TP0P:         +58.8°C  
TPCD:         +60.0°C  
TW0P:        -127.0°C  
Th1H:         +63.0°C
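
For anyone collecting the same data, the margin between each coretemp reading and its "high" and "crit" limits can be pulled out of the sensors output mechanically. This is a rough sketch that assumes the line format shown above; the function name headroom is invented for illustration, and lines without limits (such as the applesmc readings) are skipped.

```python
import re

# Matches coretemp lines such as:
# Core 0:         +80.0°C  (high = +86.0°C, crit = +100.0°C)
CORETEMP_RE = re.compile(
    r"^(?P<label>[^:]+):\s+\+?(?P<temp>-?\d+(?:\.\d+)?)°C\s+"
    r"\(high = \+?(?P<high>-?\d+(?:\.\d+)?)°C, crit = \+?(?P<crit>-?\d+(?:\.\d+)?)°C\)"
)

def headroom(sensors_text):
    """Map each label to (degrees below 'high', degrees below 'crit')."""
    result = {}
    for line in sensors_text.splitlines():
        match = CORETEMP_RE.match(line.strip())
        if match:
            temp = float(match.group("temp"))
            result[match.group("label").strip()] = (
                float(match.group("high")) - temp,
                float(match.group("crit")) - temp,
            )
    return result

# Sample lines copied from the sensors output above.
sample = """\
Physical id 0:  +81.0°C  (high = +86.0°C, crit = +100.0°C)
Core 0:         +80.0°C  (high = +86.0°C, crit = +100.0°C)
Core 3:         +76.0°C  (high = +86.0°C, crit = +100.0°C)
temp1:        +80.0°C
"""

for label, (to_high, to_crit) in headroom(sample).items():
    print(f"{label}: {to_high:.1f} C below high, {to_crit:.1f} C below crit")
```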

Comment 10 Andy Lawrence 2014-02-16 22:35:58 UTC

Looks like you have a genuine cooling issue, MCE is doing its job.

If you drop back to 3.8.x, 3.7.x, 3.6.x kernels, does the problem disappear?  Does top show a huge process, any chance your cooling intake sucked up a furball?

Comment 11 Andy Lawrence 2014-02-16 22:39:27 UTC

Sorry, hung process, not huge.

Comment 12 Chris Murphy 2014-02-16 22:59:43 UTC

Problem doesn't occur on older kernels. top shows fractional percent usages. The fans, intake, exhaust are all clean - this is a Macbook Pro laptop. I don't get anywhere near the amount of heat when running OS X as with linux. Even right now while idle it's never idling the fans and it's quite warm.

Comment 13 Chris Murphy 2014-02-16 23:17:46 UTC

So this might partially be a radeon driver issue. If I use nomodeset at boot, temperatures are ~10C cooler, and fans are idle. It ultimately doesn't solve the problem because even moving the mouse arrow around causes gnome-shell to hit 99% and X to hit 60+%, probably due to the use of llvmpipe, and then I get CPU temperature complaints.

Comment 14 Mukundan Ragavan 2014-02-17 18:21:16 UTC

I am also seeing these symptoms. Mine is a Thinkpad T520 (nvidia driver installed).

journalctl /usr/sbin/mcelog output (partial)

Jan 28 09:33:12 carbon mcelog[863]: CPU 1 THERMAL EVENT TSC 9da0d815f99
Jan 28 09:33:12 carbon mcelog[863]: TIME 1390923054 Tue Jan 28 09:30:54 2014
Jan 28 09:33:12 carbon mcelog[863]: Processor 1 heated above trip temperature. Throttling enabled.
Jan 28 09:33:12 carbon mcelog[863]: Please check your system cooling. Performance will be impacted
Jan 28 09:33:12 carbon mcelog[863]: STATUS 88030003 MCGSTATUS 0


This did not happen in the earlier kernels.

Comment 15 T.C. Hollingsworth 2014-02-18 02:37:46 UTC

I see something like this too.

Here's a particularly egregious example of it going off multiple times within the space of 10 minutes the other day.  I don't even think I was using this machine at the time...

% journalctl -b -u mcelog.service -o short-precise | grep 'Feb 15 16:'
Feb 15 16:12:57.589775 rustin mcelog[620]: Kernel does not support page offline interface
Feb 15 16:12:57.590260 rustin mcelog[620]: Hardware event. This is not a software error.
Feb 15 16:12:57.590726 rustin mcelog[620]: MCE 0
Feb 15 16:12:57.591177 rustin mcelog[620]: CPU 1 THERMAL EVENT TSC 37ba988f13b
Feb 15 16:12:57.591562 rustin mcelog[620]: TIME 1392505915 Sat Feb 15 16:11:55 2014
Feb 15 16:12:57.591954 rustin mcelog[620]: Processor 1 heated above trip temperature. Throttling enabled.
Feb 15 16:12:57.592368 rustin mcelog[620]: Please check your system cooling. Performance will be impacted
Feb 15 16:12:57.592791 rustin mcelog[620]: STATUS 88010003 MCGSTATUS 0
Feb 15 16:12:57.593167 rustin mcelog[620]: MCGCAP 806 APICID 1 SOCKETID 0
Feb 15 16:12:57.593535 rustin mcelog[620]: CPUID Vendor Intel Family 6 Model 23
Feb 15 16:12:57.593941 rustin mcelog[620]: Hardware event. This is not a software error.
Feb 15 16:12:57.594450 rustin mcelog[620]: MCE 1
Feb 15 16:12:57.594934 rustin mcelog[620]: CPU 1 THERMAL EVENT TSC 37ba99fd339
Feb 15 16:12:57.595306 rustin mcelog[620]: TIME 1392505915 Sat Feb 15 16:11:55 2014
Feb 15 16:12:57.595680 rustin mcelog[620]: Processor 1 below trip temperature. Throttling disabled
Feb 15 16:12:57.598325 rustin mcelog[620]: STATUS 88010002 MCGSTATUS 0
Feb 15 16:12:57.598920 rustin mcelog[620]: MCGCAP 806 APICID 1 SOCKETID 0
Feb 15 16:12:57.599317 rustin mcelog[620]: CPUID Vendor Intel Family 6 Model 23
Feb 15 16:15:27.590031 rustin mcelog[620]: Hardware event. This is not a software error.
Feb 15 16:15:27.590803 rustin mcelog[620]: MCE 0
Feb 15 16:15:27.591468 rustin mcelog[620]: CPU 0 THERMAL EVENT TSC 3c7146fe9ea
Feb 15 16:15:27.592133 rustin mcelog[620]: TIME 1392506081 Sat Feb 15 16:14:41 2014
Feb 15 16:15:27.592819 rustin mcelog[620]: Processor 0 heated above trip temperature. Throttling enabled.
Feb 15 16:15:27.593459 rustin mcelog[620]: Please check your system cooling. Performance will be impacted
Feb 15 16:15:27.594119 rustin mcelog[620]: STATUS 88010003 MCGSTATUS 0
Feb 15 16:15:27.594813 rustin mcelog[620]: MCGCAP 806 APICID 0 SOCKETID 0
Feb 15 16:15:27.595460 rustin mcelog[620]: CPUID Vendor Intel Family 6 Model 23
Feb 15 16:15:27.598247 rustin mcelog[620]: Hardware event. This is not a software error.
Feb 15 16:15:27.598960 rustin mcelog[620]: MCE 1
Feb 15 16:15:27.599595 rustin mcelog[620]: CPU 0 THERMAL EVENT TSC 3c71486d368
Feb 15 16:15:27.600289 rustin mcelog[620]: TIME 1392506081 Sat Feb 15 16:14:41 2014
Feb 15 16:15:27.601079 rustin mcelog[620]: Processor 0 below trip temperature. Throttling disabled
Feb 15 16:15:27.601774 rustin mcelog[620]: STATUS 88010002 MCGSTATUS 0
Feb 15 16:15:27.602407 rustin mcelog[620]: MCGCAP 806 APICID 0 SOCKETID 0
Feb 15 16:15:27.603071 rustin mcelog[620]: CPUID Vendor Intel Family 6 Model 23
Feb 15 16:19:12.589358 rustin mcelog[620]: Hardware event. This is not a software error.
Feb 15 16:19:12.589872 rustin mcelog[620]: MCE 0
Feb 15 16:19:12.590361 rustin mcelog[620]: CPU 1 THERMAL EVENT TSC 432b1a08a06
Feb 15 16:19:12.590856 rustin mcelog[620]: TIME 1392506298 Sat Feb 15 16:18:18 2014
Feb 15 16:19:12.591248 rustin mcelog[620]: Processor 1 heated above trip temperature. Throttling enabled.
Feb 15 16:19:12.591638 rustin mcelog[620]: Please check your system cooling. Performance will be impacted
Feb 15 16:19:12.592054 rustin mcelog[620]: STATUS 88010003 MCGSTATUS 0
Feb 15 16:19:12.592438 rustin mcelog[620]: MCGCAP 806 APICID 1 SOCKETID 0
Feb 15 16:19:12.592832 rustin mcelog[620]: CPUID Vendor Intel Family 6 Model 23
Feb 15 16:19:12.593381 rustin mcelog[620]: Hardware event. This is not a software error.
Feb 15 16:19:12.593927 rustin mcelog[620]: MCE 1
Feb 15 16:19:12.594464 rustin mcelog[620]: CPU 1 THERMAL EVENT TSC 432b1b769f3
Feb 15 16:19:12.596071 rustin mcelog[620]: TIME 1392506298 Sat Feb 15 16:18:18 2014
Feb 15 16:19:12.596546 rustin mcelog[620]: Processor 1 below trip temperature. Throttling disabled
Feb 15 16:19:12.597162 rustin mcelog[620]: STATUS 88010002 MCGSTATUS 0
Feb 15 16:19:12.597557 rustin mcelog[620]: MCGCAP 806 APICID 1 SOCKETID 0
Feb 15 16:19:12.597963 rustin mcelog[620]: CPUID Vendor Intel Family 6 Model 23

This machine has Intel graphics so it can't just be radeon, either.

I ought to have logs going back at least a year on this machine so I'll see if I can figure out when it started happening.

Comment 16 T.C. Hollingsworth 2014-02-18 02:49:01 UTC

(In reply to T.C. Hollingsworth from comment #15)
> I ought to have logs going back at least a year on this machine so I'll see
> if I can figure out when it started happening.

I lied, I guess I never adjusted logrotate on this machine, sorry.  :-(

Comment 17 markusN 2014-02-18 13:29:11 UTC

I tested two different kernels. The problem does not seem to appear in
kernel 3.12.8-300.fc20.x86_64, but it does appear in
kernel 3.12.9-301.fc20.x86_64


lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)

---- log ----

uname -a
Linux oboe.localdomain 3.12.9-301.fc20.x86_64 #1 SMP Wed Jan 29 15:56:22 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

journalctl  | grep temp | grep above | tail -10
Feb 10 23:25:29 oboe.localdomain mcelog[570]: Processor 0 heated above trip temperature. Throttling enabled.
Feb 10 23:25:29 oboe.localdomain mcelog[570]: Processor 2 heated above trip temperature. Throttling enabled.
Feb 15 14:42:28 oboe.localdomain kernel: CPU0: Core temperature above threshold, cpu clock throttled (total events = 124388)
Feb 15 14:42:28 oboe.localdomain kernel: CPU2: Core temperature above threshold, cpu clock throttled (total events = 124389)
Feb 15 14:42:28 oboe.localdomain kernel: CPU1: Package temperature above threshold, cpu clock throttled (total events = 171308)
Feb 15 14:42:29 oboe.localdomain kernel: CPU3: Package temperature above threshold, cpu clock throttled (total events = 171308)
Feb 15 14:42:29 oboe.localdomain kernel: CPU2: Package temperature above threshold, cpu clock throttled (total events = 171308)
Feb 15 14:42:29 oboe.localdomain kernel: CPU0: Package temperature above threshold, cpu clock throttled (total events = 171307)
Feb 15 14:44:04 oboe.localdomain mcelog[570]: Processor 2 heated above trip temperature. Throttling enabled.
Feb 15 14:44:04 oboe.localdomain mcelog[570]: Processor 0 heated above trip temperature. Throttling enabled.

###############################

Linux oboe.localdomain 3.12.8-300.fc20.x86_64 #1 SMP Thu Jan 16 01:07:50 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
acpitz-virtual-0
Adapter: Virtual device
temp1:        +79.0°C  (crit = +108.0°C)

asus-isa-0000
Adapter: ISA adapter
temp1:        +79.0°C  

coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +81.0°C  (high = +87.0°C, crit = +105.0°C)
Core 0:         +81.0°C  (high = +87.0°C, crit = +105.0°C)
Core 1:         +78.0°C  (high = +87.0°C, crit = +105.0°C)

pkg-temp-0-virtual-0
Adapter: Virtual device
temp1:        +81.0°C  

Tue Feb 18 12:16:57 CET 2014

[... compiling GRASS GIS on 4 cores...]

Tue Feb 18 12:17:14 CET 2014
acpitz-virtual-0
Adapter: Virtual device
temp1:        +79.0°C  (crit = +108.0°C)

asus-isa-0000
Adapter: ISA adapter
temp1:        +79.0°C  

coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +78.0°C  (high = +87.0°C, crit = +105.0°C)
Core 0:         +78.0°C  (high = +87.0°C, crit = +105.0°C)
Core 1:         +77.0°C  (high = +87.0°C, crit = +105.0°C)

pkg-temp-0-virtual-0
Adapter: Virtual device
temp1:        +78.0°C  

Tue Feb 18 12:17:42 CET 2014
acpitz-virtual-0
Adapter: Virtual device
temp1:        +77.0°C  (crit = +108.0°C)

asus-isa-0000
Adapter: ISA adapter
temp1:        +77.0°C  

coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +79.0°C  (high = +87.0°C, crit = +105.0°C)
Core 0:         +79.0°C  (high = +87.0°C, crit = +105.0°C)
Core 1:         +75.0°C  (high = +87.0°C, crit = +105.0°C)

pkg-temp-0-virtual-0
Adapter: Virtual device
temp1:        +79.0°C

--> no issues with 3.12.8-300.fc20.x86_64

Comment 18 markusN 2014-02-23 23:58:30 UTC

No such issue with kernel 3.12.8-300.fc20.x86_64 
but starting with kernel 3.12.9-301.fc20.x86_64. Confirmed also
with kernel 3.13.3-201.fc20.x86_64

Comment 19 Joel Uckelman 2014-02-25 16:16:08 UTC

I started getting loads of these messages with kernel-3.13.3-201.

Comment 20 Justin M. Forbes 2014-05-21 19:40:24 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs.

Fedora 20 has now been rebased to 3.14.4-200.fc20.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 21 Chris Murphy 2014-05-21 20:10:23 UTC

Yes it still happens with 3.14.4.

Comment 22 Mukundan Ragavan 2014-05-21 20:15:24 UTC

I see this as well with 3.14.4.

Comment 23 Chris Murphy 2014-07-01 03:31:03 UTC

Still happens with 3.16.0-0.rc2.git1.1.fc21.x86_64.

Comment 24 nmvega 2014-08-04 18:02:03 UTC

See also: https://bugzilla.redhat.com/show_bug.cgi?id=1050106

This was my issue on Fedora-19 and now still on Fedora-20 (the issue, which has similarity to this issue, was never resolved). Perhaps this additional information will be helpful to both bug filings.

Comment 25 Samuel Sieb 2014-08-05 01:37:14 UTC

Is this bug about the laptop running hot or the messages in the log?  I think the actual log messages are a new feature.  In my case, my laptop runs hot under just about any load, there's nothing new there.  But a few versions back, the kernel started reporting the MCE messages.  It would be nice if they could be disabled.  I know my laptop is hot, but I don't need constant log messages about it.  And worse, abrt keeps triggering on them, which is not helping the situation any...

Comment 26 T.C. Hollingsworth 2014-08-05 02:05:43 UTC

(In reply to Samuel Sieb from comment #25)
> Is this bug about the laptop running hot or the messages in the log?

This bug is about laptops running hot with kernels > 3.8 while older kernels work just fine.

IMHO this is a reasonable thing to log, but abrt really shouldn't go off every time, especially since it won't let you file a bug anyway.  >:-(  Might want to file a bug against abrt for that.

Comment 27 T.C. Hollingsworth 2014-08-05 02:23:50 UTC

Looking back on this I see I'm even more of an idiot than I thought; I switched this machine to use journal persistence before it was made default and that's why I never touched logrotate...

I can now confirm this also started when I upgraded to the 3.9.x kernels back when this machine ran F18.  Specifically, kernel-3.8.11-200.fc18.x86_64 -> kernel-3.9.11-200.fc18.x86_64.  (Yeah, I'm bad at updating sometimes.  ;-)

I have about eight months of logs with not a single complaint of temperature problems and then after I rebooted into that kernel the flood started and continues to this day.

Comment 28 Samuel Sieb 2014-08-05 08:10:42 UTC

Are you sure that temperature problems started then?  Or maybe that is just when the kernel started logging them?

There's a potentially relevant commit around that time at:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/arch/x86/kernel/cpu/mcheck/therm_throt.c?id=25cdce170d28092e8e162f36702be3308973b19d

Comment 29 Chris Murphy 2014-08-05 17:38:10 UTC

I'm still seeing the temperature warnings with 3.16, and abrt is reporting dozens of mce events per day which seems much more aggressive than previously.

The original bug description is not specific enough. The underlying questions are: are the messages legitimate temperature warnings? If they are, why are they happening, and are they a risk to the hardware? I ask because I get these warnings on all the Macs I have, and they all run much hotter under Linux than under OS X. And I've had 1 of 3 machines die while running hot, inexplicably. Since it's dead (no startup chime, and no boot manager comes up, i.e. it is not making it to or not completing POST) I don't know if its death is coincidence or related to overheating.

Since I'm down to one Mac (one dead, one given away, one for day to day use) I'm reluctant to do further baremetal testing until I understand what the messages mean, what the risk assessment is, and why the machines all appear to be overheating, and only when running Linux.

Comment 30 Chris Murphy 2014-08-05 18:00:09 UTC

Created attachment 924266 [details]
dmesg 3.16.0-0.rc7.git4.1

Still the same system as originally reported: Apple Inc. MacBookPro8,2/Mac-94245A3940C91C80, BIOS    MBP81.88Z.0047.B27.1201241646 01/24/12

This dmesg captured during installation from USB stick made with Fedora-Live-LXDE-x86_64-21-20140804.iso.

Suspicious items possibly related to CPU or power management.

[    0.108157] perf_event_intel: PEBS disabled due to CPU errata, please upgrade microcode

I have no idea how to upgrade microcode.

[    3.243996] hpet: probe of PNP0103:00 failed with error -22

[    0.055391] CPU0: Thermal monitoring enabled (TM1)

I do not get this message for the other 7 CPUs, but CPU0-7 all have temperature above threshold messages, so this seems unrelated.

Anyway it seems to me something is wrong since it gets so hot, fans frequently go to max, and the kernel also reports high temps and mce events. So if there's a way to manually lower the CPU throttling threshold (kernel parameter maybe) at expense of performance, that would be a better-than-nothing work around. The current behavior is at best very undesirable, and at worst it might be burning up laptops.

Comment 31 Sergio Basto 2014-08-07 00:04:40 UTC

(In reply to Chris Murphy from comment #30)
> I have no idea how to upgrade microcode.

microcode_ctl-2:2.1-5.fc20.x86_64

Comment 32 Chris Murphy 2014-08-07 19:34:04 UTC

(In reply to Sergio Monteiro Basto from comment #31)
> (In reply to Chris Murphy from comment #30)
> > I have no idea how to upgrade microcode.
> 
> microcode_ctl-2:2.1-5.fc20.x86_64

I have microcode_ctl-2.1-6.fc21.x86_64 and yet I still get the message "PEBS disabled due to CPU errata, please upgrade microcode" so how do I upgrade the microcode when I already have the current version?

Comment 33 Chris Murphy 2014-08-08 00:55:52 UTC

Never mind. Looks like the microcode update has been applied this whole time, just later in boot.

[    0.046636] localhost.localdomain kernel: perf_event_intel: PEBS disabled due to CPU errata, please upgrade microcode

[snip]

[   17.972775] twenty1.localdomain kernel: perf_event_intel: PEBS enabled due to microcode update

Comment 34 Juha Heljoranta 2014-09-02 18:20:39 UTC

Disabling the intel turbo boost solved the problem for me.

# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo 

# lscpu | grep "Model name"
Model name:            Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
# cpupower frequency-info -p
analyzing CPU 0:
800000 4000000 powersave
# cpupower frequency-info -d
analyzing CPU 0:
intel_pstate
# uname -r
3.15.10-201.fc20.x86_64
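
A write to that sysfs file does not survive a reboot, so anyone testing this workaround may want to check the current state first. Below is a small sketch of the same read and write from Python. The path is the one from the command above and only exists when intel_pstate is the active driver; the helper names turbo_disabled and set_no_turbo are made up for illustration.

```python
from pathlib import Path

# Path taken from the command above; present only when intel_pstate is in use.
NO_TURBO = Path("/sys/devices/system/cpu/intel_pstate/no_turbo")

def turbo_disabled(path=NO_TURBO):
    """Return True if the file contains '1', False for other content, None if missing."""
    try:
        return path.read_text().strip() == "1"
    except FileNotFoundError:
        return None

def set_no_turbo(disabled, path=NO_TURBO):
    """Write '1' (turbo off) or '0' (turbo on); on a real system this needs root."""
    path.write_text("1" if disabled else "0")

if __name__ == "__main__":
    print("turbo disabled:", turbo_disabled())
```

The path argument is there so the helpers can be pointed at an ordinary file when trying them out without root.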

Comment 35 Aiden Bell 2014-09-13 18:02:06 UTC

Just a quick FYI on this, with my Macbook Pro I noticed the fans didn't come on to the same extent as OSX. It appears the fan control is broken, at least on my Fedora 20 install fully up to date. Lack of fans = frequency throttling when core temp goes above 85C

After looking into it, I found a daemon to manage fans on Apple systems:

https://github.com/dgraziotin/Fan-Control-Daemon

I'm sure this lack of functionality is either a missing feature or bug upstream, but for those in this thread, my fans now spin-up correctly when running that daemon.

Have fun!

Comment 36 Amir 2014-11-07 07:39:11 UTC

https://access.redhat.com/solutions/35494 

The solution there was to disable the "C States" in the BIOS, so that the CPU is always running at full power.
A hardware check did not reveal any issues. It was identified that the C state setting in the BIOS resulted in less power, and that this resulted in the errors.

Comment 37 Jaroslav Reznik 2015-03-03 14:54:06 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 22 development cycle.
Changing version to '22'.

More information and reason for this action is here:
https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora22

Comment 38 Tommy Surbakti 2015-06-07 04:47:06 UTC

this bug still appears on F22 (fresh install)

uname -a
Linux lomok2 4.0.4-303.fc22.x86_64 #1 SMP Thu May 28 12:37:06 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

lscpu | grep "Model name"
Model name:            Intel(R) Core(TM) i3-2330M CPU @ 2.20GHz

Comment 39 Nils Philippsen 2015-06-15 12:12:29 UTC

Same here on a Lenovo ThinkPad T540p (upgraded installation):

uname -a
Linux gibraltar 4.0.4-303.fc22.x86_64 #1 SMP Thu May 28 12:37:06 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

lscpu | grep "Model name"
Model name:            Intel(R) Core(TM) i7-4900MQ CPU @ 2.80GHz

Comment 40 Tomer Barletz 2015-08-03 17:25:01 UTC

I am seeing this on F22 with a vanilla kernel.org kernel (4.2.0-rc4), so it looks like this issue is not a regression in the Fedora kernel, but rather a more general one off of mainline.

Comment 41 James 2015-08-13 11:17:47 UTC

I am having the same problem in F22 with kernel 4.1.3-200.fc22.x86_64 on a Lenovo W550s. 


====================================================================
Aug 13 14:09:56 localhost kernel: mce: [Hardware Error]: Machine check events logged
Aug 13 14:09:56 localhost mcelog: Hardware event. This is not a software error.
Aug 13 14:09:56 localhost mcelog: MCE 0
Aug 13 14:09:56 localhost mcelog: CPU 2 THERMAL EVENT TSC 5a1094fd777
Aug 13 14:09:56 localhost mcelog: TIME 1439464080 Thu Aug 13 14:08:00 2015
Aug 13 14:09:56 localhost mcelog: Processor 2 heated above trip temperature. Throttling enabled.
Aug 13 14:09:56 localhost mcelog: Please check your system cooling. Performance will be impacted
Aug 13 14:09:56 localhost mcelog: STATUS 88200803 MCGSTATUS 0
Aug 13 14:09:56 localhost mcelog: MCGCAP 1000c07 APICID 2 SOCKETID 0
Aug 13 14:09:56 localhost mcelog: CPUID Vendor Intel Family 6 Model 61
Aug 13 14:09:56 localhost mcelog: Hardware event. This is not a software error.
Aug 13 14:09:56 localhost mcelog: MCE 1
Aug 13 14:09:56 localhost mcelog: CPU 3 THERMAL EVENT TSC 5a1095025dd
Aug 13 14:09:56 localhost mcelog: TIME 1439464080 Thu Aug 13 14:08:00 2015
Aug 13 14:09:56 localhost mcelog: Processor 3 heated above trip temperature. Throttling enabled.
Aug 13 14:09:56 localhost mcelog: Please check your system cooling. Performance will be impacted
Aug 13 14:09:56 localhost mcelog: STATUS 88200803 MCGSTATUS 0
Aug 13 14:09:56 localhost mcelog: MCGCAP 1000c07 APICID 3 SOCKETID 0
Aug 13 14:09:56 localhost mcelog: CPUID Vendor Intel Family 6 Model 61
Aug 13 14:09:56 localhost mcelog: Hardware event. This is not a software error.
Aug 13 14:09:56 localhost mcelog: MCE 2
Aug 13 14:09:56 localhost mcelog: CPU 3 THERMAL EVENT TSC 5a109990284
Aug 13 14:09:56 localhost mcelog: TIME 1439464080 Thu Aug 13 14:08:00 2015
Aug 13 14:09:56 localhost mcelog: Processor 3 below trip temperature. Throttling disabled
Aug 13 14:09:56 localhost mcelog: STATUS 88210802 MCGSTATUS 0
Aug 13 14:09:56 localhost mcelog: MCGCAP 1000c07 APICID 3 SOCKETID 0
Aug 13 14:09:56 localhost mcelog: CPUID Vendor Intel Family 6 Model 61
Aug 13 14:09:56 localhost mcelog: Hardware event. This is not a software error.
Aug 13 14:09:56 localhost mcelog: MCE 3
Aug 13 14:09:56 localhost mcelog: CPU 2 THERMAL EVENT TSC 5a10999272f
Aug 13 14:09:56 localhost mcelog: TIME 1439464080 Thu Aug 13 14:08:00 2015
Aug 13 14:09:56 localhost mcelog: Processor 2 below trip temperature. Throttling disabled
Aug 13 14:09:56 localhost mcelog: STATUS 88210802 MCGSTATUS 0
Aug 13 14:09:56 localhost mcelog: MCGCAP 1000c07 APICID 2 SOCKETID 0
Aug 13 14:09:56 localhost mcelog: CPUID Vendor Intel Family 6 Model 61
=======================================================================

I am monitoring CPU core temperatures with lm_sensors and xfce4-sensors-plugin. The temperatures seem to be varying between 46C and 58C.
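
To see which processors trip most often, the mcelog lines can be tallied per CPU. The following is only a sketch: the function name `count_trip_events` is made up for illustration, and it assumes the log text keeps the "Processor N heated above trip temperature" wording shown above.

```shell
#!/bin/sh
# Count "heated above trip temperature" events per processor in
# mcelog-style syslog lines read from standard input.
# count_trip_events is a hypothetical helper name, not part of mcelog.
count_trip_events() {
    awk '/heated above trip temperature/ {
             # The processor number is the field right after "Processor".
             for (i = 1; i < NF; i++)
                 if ($i == "Processor") n[$(i + 1)]++
         }
         END { for (cpu in n) print cpu, n[cpu] }' | sort -n
}

# Sample input copied from the log above; on a real system the input
# could come from `journalctl -b` or /var/log/messages instead.
count_trip_events <<'EOF'
Aug 13 14:09:56 localhost mcelog: Processor 2 heated above trip temperature. Throttling enabled.
Aug 13 14:09:56 localhost mcelog: Processor 3 heated above trip temperature. Throttling enabled.
Aug 13 14:09:56 localhost mcelog: Processor 3 below trip temperature. Throttling disabled
EOF
```

Each output line is a processor number followed by its event count; "below trip temperature" lines are ignored.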

Comment 42 Christoph Höger 2015-08-17 07:41:38 UTC

Same here with a 3rd gen X1 Carbon. 

1 choeger@oxide ~ % cat /proc/cpuinfo | grep "model name"
model name	: Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
model name	: Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
model name	: Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
model name	: Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz

I can trigger this behaviour by running

md5sum /dev/urandom

This uses one core at 100% and should not cause thermal issues IMO. So it seems either fancontrol or turbo boost goes awry.

I also noticed the following during the stress test:

choeger@oxide ~ % cat /proc/cpuinfo | grep MH
cpu MHz		: 3100.398
cpu MHz		: 3100.093
cpu MHz		: 3199.218
cpu MHz		: 3143.054

I am not an expert regarding CPUs, but given just one job at 100%, shouldn't the other cores be idle? It seems as if _all_ cores go into turbo mode (which makes thermal problems quite likely).
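
A quick way to check how many logical CPUs sit above the nominal clock is to count the "cpu MHz" lines over a threshold. This is a rough sketch, not a definitive tool: `cores_above_mhz` is a hypothetical helper, and the 2600 MHz threshold is taken from the "2.60GHz" in the model name above.

```shell
#!/bin/sh
# Print how many logical CPUs report a clock above a threshold (in MHz).
# $1 = threshold, $2 = cpuinfo-format file (defaults to /proc/cpuinfo,
# "-" reads standard input). cores_above_mhz is a hypothetical name.
cores_above_mhz() {
    awk -F: -v limit="$1" '
        /^cpu MHz/ { total++; if ($2 + 0 > limit + 0) fast++ }
        END { printf "%d of %d\n", fast, total }' "${2:-/proc/cpuinfo}"
}

# The four readings quoted above, fed in on standard input.
cores_above_mhz 2600 - <<'EOF'
cpu MHz		: 3100.398
cpu MHz		: 3100.093
cpu MHz		: 3199.218
cpu MHz		: 3143.054
EOF
```

On a live system, `cores_above_mhz 2600` with no second argument reads /proc/cpuinfo directly, assuming the kernel exposes "cpu MHz" lines there.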

Comment 43 James 2015-08-18 13:46:47 UTC

I can confirm the same as Christoph Höger above: all cores going into turbo mode with 

md5sum /dev/urandom

To protect my computer I adopted the (hopefully) temporary solution by Juha Heljoranta above, that is, added the following line into /etc/rc.d/rc.local

echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

I have not seen the error since even though I tried to reproduce it. Also, as expected, none of the cores now go into turbo mode.
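
For anyone putting the workaround into rc.local, a slightly more defensive version might look like the sketch below. The function name `disable_turbo` and its optional path argument are assumptions added for illustration (the argument only exists so the function can be tried against a scratch file); the sysfs path is the one used above.

```shell
#!/bin/sh
# Disable turbo boost through the intel_pstate sysfs knob used above.
# $1 = path to the knob (defaults to the real sysfs file).
# disable_turbo is a hypothetical helper name.
disable_turbo() {
    knob="${1:-/sys/devices/system/cpu/intel_pstate/no_turbo}"
    if [ ! -w "$knob" ]; then
        echo "cannot write $knob (not root, or intel_pstate not in use?)" >&2
        return 1
    fi
    echo 1 > "$knob" && echo "no_turbo is now $(cat "$knob")"
}

# Dry run against a scratch file, so nothing on the system is changed.
scratch=$(mktemp) && echo 0 > "$scratch"
disable_turbo "$scratch"
rm -f "$scratch"
```

Calling `disable_turbo` with no argument as root would apply the setting for real; it does not persist across reboots unless run at boot.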

Comment 44 Justin M. Forbes 2015-10-20 19:39:34 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 22 kernel bugs.

Fedora 22 has now been rebased to 4.2.3-200.fc22.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 23, and are still experiencing this issue, please change the version to Fedora 23.

If you experience different issues, please open a new bug report for those.

Comment 45 Tomer Barletz 2015-10-20 19:59:48 UTC

Still seen on 4.2.3-200.fc22.x86_64 running on a ThinkPad T440p (i7-4600M).

Comment 46 Persona non grata 2015-10-21 19:35:33 UTC

Still failing in F22 i7-5600U

Comment 47 Nils Philippsen 2015-10-23 15:25:32 UTC

Still happens on 4.2.3-300.fc23.x86_64, same hardware as I mentioned in comment #39.

Comment 48 OE1FEU 2015-10-26 13:22:11 UTC

Still happens on 4.0.4-301.fc22.x86_64 on a Thinkpad X1 Carbon 3rd gen.

However, the solution as posted in https://bugzilla.redhat.com/show_bug.cgi?id=924570#c34 works fine here, so it's obviously related to Intel Turbo Boost. After deactivating it I can run 4 cores at full load without any MCE being triggered.

Comment 49 Pavel Kajaba 2015-11-13 06:58:03 UTC

Still happens on 4.2.5-300.fc23.x86_64 on Thinkpad X1 Carbon 3rd gen i7-5600U

Comment 50 Marcel Wysocki 2015-11-21 14:41:14 UTC

Same problem here on X1 Carbon 3rd gen.

Comment 51 Jonas Nordell 2016-01-04 13:44:24 UTC

I'm having this problem on an X1 Carbon 3rd gen and Fedora 23.

# dmidecode -t system | grep Version
	Version: ThinkPad X1 Carbon 3rd

# uname -a
Linux jmoon 4.2.8-300.fc23.x86_64 #1 SMP Tue Dec 15 16:49:06 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Comment 52 Chris Murphy 2016-01-04 17:27:50 UTC

This is a sufficiently old bug, with no response from kernel maintainers as to what it means or what the user should do about it, that it probably needs to go upstream and be asked about on lkml. Right now the warnings are sufficiently dire, with no workaround other than not using Linux at all on the hardware. Clearly something is wrong if the manufacturer's diagnostics say the hardware is OK and yet the kernel is claiming (unspecific) hardware errors, but only after the CPU is allowed to get too hot. I don't get such overheating and crazy fan speeds running either OS X or Windows on the same hardware, so at the moment I think the burden is on kernel and CPU microcode experts to say what these messages mean.

Comment 53 Michael Adam 2016-01-12 10:35:46 UTC

Problem exists on X240 with Latest F23.

Comment 54 Michael Adam 2016-01-12 10:56:40 UTC

Also exists on Thinkpad W541 with latest F23.

Comment 55 Cagney 2016-03-23 01:10:12 UTC

i5-2467M Samsung 530U3 laptop, running 4.4.5-300.fc23.x86-64

echo 1 > ...../intel_pstate/no_turbo

The workaround works.

Comment 56 Sean Flanigan 2016-03-29 01:55:49 UTC

This sounds like it might be related: https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.6-Thermal-Updates

I hope there's a back-port soon.

Comment 57 Fedora End Of Life 2016-07-19 10:07:59 UTC

Fedora 22 changed to end-of-life (EOL) status on 2016-07-19. Fedora 22 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 58 Cagney 2016-09-09 15:25:37 UTC

Still occurs in f24, and see also the recently opened bugs #1301739 #1373881 #1284144

Comment 59 Chris Murphy 2016-09-09 15:37:33 UTC

Yep it's present through 4.8.0-0.rc5.git1.1.fc25.x86_64 and then I also get four of these:


Sep 07 10:08:01 f24m mcelog[824]: Hardware event. This is not a software error.
Sep 07 10:08:01 f24m mcelog[824]: MCE 0
Sep 07 10:08:01 f24m mcelog[824]: CPU 5 THERMAL EVENT TSC a8b749118
Sep 07 10:08:01 f24m mcelog[824]: TIME 1473264479 Wed Sep  7 10:07:59 2016
Sep 07 10:08:01 f24m mcelog[824]: Processor 5 heated above trip temperature. Throttling enabled.
Sep 07 10:08:01 f24m mcelog[824]: Please check your system cooling. Performance will be impacted
Sep 07 10:08:01 f24m mcelog[824]: STATUS 880003c3 MCGSTATUS 0
Sep 07 10:08:01 f24m mcelog[824]: MCGCAP c09 APICID 3 SOCKETID 0
Sep 07 10:08:01 f24m mcelog[824]: CPUID Vendor Intel Family 6 Model 42

Comment 60 David Sommerseth 2016-10-13 11:29:32 UTC

This is also seen on Scientific Linux 7.2 (RHEL 7.2 clone) with 3.10.0-327.36.1.el7.x86_64 on ThinkPad T460s

------------------------------------------------------------
Oct 13 13:21:49 aurelius mcelog: Hardware event. This is not a software error.
Oct 13 13:21:49 aurelius mcelog: MCE 0
Oct 13 13:21:49 aurelius mcelog: CPU 2 THERMAL EVENT TSC 1e7192b3a89c
Oct 13 13:21:49 aurelius mcelog: TIME 1476357675 Thu Oct 13 13:21:15 2016
Oct 13 13:21:49 aurelius mcelog: Processor 2 heated above trip temperature. Throttling enabled.
Oct 13 13:21:49 aurelius mcelog: Please check your system cooling. Performance will be impacted
Oct 13 13:21:49 aurelius mcelog: STATUS 8809080b MCGSTATUS 0
Oct 13 13:21:49 aurelius mcelog: MCGCAP 1000c07 APICID 2 SOCKETID 0
Oct 13 13:21:49 aurelius mcelog: CPUID Vendor Intel Family 6 Model 61
Oct 13 13:21:49 aurelius mcelog: Hardware event. This is not a software error.
Oct 13 13:21:49 aurelius mcelog: MCE 1
Oct 13 13:21:49 aurelius mcelog: CPU 3 THERMAL EVENT TSC 1e7192b3ebc2
Oct 13 13:21:49 aurelius mcelog: TIME 1476357675 Thu Oct 13 13:21:15 2016
Oct 13 13:21:49 aurelius mcelog: Processor 3 heated above trip temperature. Throttling enabled.
Oct 13 13:21:49 aurelius mcelog: Please check your system cooling. Performance will be impacted
Oct 13 13:21:49 aurelius mcelog: STATUS 8809080b MCGSTATUS 0
Oct 13 13:21:49 aurelius mcelog: MCGCAP 1000c07 APICID 3 SOCKETID 0
Oct 13 13:21:49 aurelius mcelog: CPUID Vendor Intel Family 6 Model 61
Oct 13 13:21:49 aurelius mcelog: Hardware event. This is not a software error.
Oct 13 13:21:49 aurelius mcelog: MCE 2
Oct 13 13:21:49 aurelius mcelog: CPU 2 THERMAL EVENT TSC 1e7192dab9ac
Oct 13 13:21:49 aurelius mcelog: TIME 1476357675 Thu Oct 13 13:21:15 2016
Oct 13 13:21:49 aurelius mcelog: Processor 2 below trip temperature. Throttling disabled
Oct 13 13:21:49 aurelius mcelog: STATUS 880a080a MCGSTATUS 0
Oct 13 13:21:49 aurelius mcelog: MCGCAP 1000c07 APICID 2 SOCKETID 0
Oct 13 13:21:49 aurelius mcelog: CPUID Vendor Intel Family 6 Model 61
Oct 13 13:21:49 aurelius mcelog: Hardware event. This is not a software error.
Oct 13 13:21:49 aurelius mcelog: MCE 3
Oct 13 13:21:49 aurelius mcelog: CPU 3 THERMAL EVENT TSC 1e7192dae606
Oct 13 13:21:49 aurelius mcelog: TIME 1476357675 Thu Oct 13 13:21:15 2016
Oct 13 13:21:49 aurelius mcelog: Processor 3 below trip temperature. Throttling disabled
Oct 13 13:21:49 aurelius mcelog: STATUS 880a080a MCGSTATUS 0
Oct 13 13:21:49 aurelius mcelog: MCGCAP 1000c07 APICID 3 SOCKETID 0
Oct 13 13:21:49 aurelius mcelog: CPUID Vendor Intel Family 6 Model 61
-------------------------------------------------------

vendor_id	: GenuineIntel
cpu family	: 6
model		: 61
model name	: Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
stepping	: 4

-------------------------------------------------------

kernel-3.10.0-327.36.1.el7.x86_64
microcode_ctl-2.1-12.el7_2.1.x86_64
mcelog-120-3.e7e0ac1.el7.x86_64


Trying the /sys/devices/system/cpu/intel_pstate/no_turbo workaround to see if that helps, I presume it will.

Comment 61 Dan Yasny 2016-12-20 19:50:35 UTC

The issue consistently reproduces on two different X1 Carbon laptops (types 20BT and 20A7), across Fedora 22, 23, 24 and now 25 too. The mcelog and dmesg output is consistent across the two machines, all the Fedora versions, and the outputs already posted above.

Comment 62 Brad Hubbard 2016-12-20 20:13:13 UTC

Still happening for me on ThinkPad W541 4.8.14-300.fc25.x86_64

Comment 63 Paulo Fidalgo 2016-12-29 16:14:57 UTC

Well, unfortunately I'm also experiencing this issue. I've installed
https://github.com/dgraziotin/Fan-Control-Daemon
and opened a bug asking what work is needed to get this into an upstream project:
https://github.com/dgraziotin/mbpfan/issues/99

I'm not sure of what upstream project it would be, but at least let's open the discussion, because in the current state the experience is sub-optimal in Macs.

Comment 64 Dan Yasny 2016-12-29 16:20:57 UTC

(In reply to Paulo Fidalgo from comment #63)
> Well unfortunately I'm also experienced this issue, I've installed 
> https://github.com/dgraziotin/Fan-Control-Daemon
> and opened a bug asking for the work needed to do to get this into a
> upstream project:
> https://github.com/dgraziotin/mbpfan/issues/99
> 
> I'm not sure of what upstream project it would be, but at least let's open
> the discussion, because in the current state the experience is sub-optimal
> in Macs.

Thanks for the link, but it seems like it's specific for macbooks. This BZ talks mostly about IBM/Lenovo laptops.


BTW, I have the issue mitigated temporarily by disabling turbo boost:

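# Disable turbo boost on every logical CPU by writing MSR 0x1a0
# (IA32_MISC_ENABLE) with wrmsr/rdmsr from msr-tools; needs the msr
# kernel module loaded. Bit 38 set means turbo is disabled.
# Note: 0x4000850089 overwrites the whole register with a fixed value.
cores=$(awk '/^processor/ {print $3}' /proc/cpuinfo)
for core in $cores; do
    sudo wrmsr -p"${core}" 0x1a0 0x4000850089
    state=$(sudo rdmsr -p"${core}" 0x1a0 -f 38:38)
    if [[ $state -eq 1 ]]; then
        echo "core ${core}: turbo disabled"
    else
        echo "core ${core}: turbo enabled"
    fi
done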

However, this isn't a solution, or even a workaround, just an ugly hack.

Comment 65 Chris Murphy 2016-12-29 17:09:18 UTC

I recommend thermald. There's an older version in copr that should still work, but lately I'm just building it myself from upstream.
https://github.com/01org/thermal_daemon

Comment 66 Laura Abbott 2017-01-17 01:22:32 UTC

*********** MASS BUG UPDATE **************
We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 25 kernel bugs.
 
Fedora 25 has now been rebased to 4.9.3-200.fc25.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 26, and are still experiencing this issue, please change the version to Fedora 26.
 
If you experience different issues, please open a new bug report for those.

Comment 67 Paulo Fidalgo 2017-01-17 10:54:17 UTC

At least for Macbook Pro 12,1 the bug is still present in the kernel 4.9.3-200.fc25.x86_64. I've disabled the mbpfan service and after a while I've started to see the messages related to high temperatures.

Comment 68 David Orman 2017-01-22 16:24:47 UTC

Linux hostnamehere 4.9.4-201.fc25.x86_64 #1 SMP Tue Jan 17 18:58:54 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Model name:            Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz

System: Lenovo X1 Carbon 4th generation

Disabling turbo-boost via the proc mentioned above results in no core ever going higher than 800MHz which is a non-starter. With turbo-boost enabled:

No md5sum /dev/urandom:

[ormandj@ormandj-laptop ~]$ cat /proc/cpuinfo |grep MHz
cpu MHz		: 496.972
cpu MHz		: 476.123
cpu MHz		: 451.513
cpu MHz		: 431.860

md5sum /dev/urandom:

[ormandj@ormandj-laptop ~]$ cat /proc/cpuinfo |grep MHz
cpu MHz		: 3387.548
cpu MHz		: 3190.161
cpu MHz		: 3399.511
cpu MHz		: 3151.196

CPU | sys      80%  | user     25%  | irq       1%  |               | idle    296%  | wait      1% |               |  steal     0% |  guest     0% |  curf 3.19GHz |  curscal  93% |
cpu | sys      78%  | user     21%  | irq       1%  |               | idle      0%  | cpu000 w  0% |               |  steal     0% |  guest     0% |  curf 3.40GHz |  curscal  99% |
cpu | sys       2%  | user      3%  | irq       0%  |               | idle     95%  | cpu001 w  0% |               |  steal     0% |  guest     0% |  curf 3.05GHz |  curscal  89% |
cpu | sys       0%  | user      2%  | irq       0%  |               | idle     98%  | cpu003 w  0% |               |  steal     0% |  guest     0% |  curf 3.10GHz |  curscal  91% |

However, I'm not seeing the thermal run-away issue re: MCE logged or alerting. The fans spin up to max and even after letting it run like this for 5 minutes, I've seen no alerts in dmesg.

On the other hand, if I use KVM/virtualization and tax the cpu, I'll see:

[ 3678.935724] kvm [6814]: vcpu0, guest rIP: 0xffffffff81060d56 unhandled rdmsr: 0x611
[ 3678.935734] kvm [6814]: vcpu0, guest rIP: 0xffffffff81060d56 unhandled rdmsr: 0x639
[ 3678.935738] kvm [6814]: vcpu0, guest rIP: 0xffffffff81060d56 unhandled rdmsr: 0x641
[ 3678.935742] kvm [6814]: vcpu0, guest rIP: 0xffffffff81060d56 unhandled rdmsr: 0x619
[ 3678.992547] kvm [6814]: vcpu0, guest rIP: 0xffffffff81060d56 unhandled rdmsr: 0x611
[ 3678.992556] kvm [6814]: vcpu0, guest rIP: 0xffffffff81060d56 unhandled rdmsr: 0x639
[ 3678.992561] kvm [6814]: vcpu0, guest rIP: 0xffffffff81060d56 unhandled rdmsr: 0x641
[ 3678.992566] kvm [6814]: vcpu0, guest rIP: 0xffffffff81060d56 unhandled rdmsr: 0x619
[ 3678.997241] kvm [6814]: vcpu0, guest rIP: 0xffffffff81060d56 unhandled rdmsr: 0x60d
[ 3678.997247] kvm [6814]: vcpu0, guest rIP: 0xffffffff81060d56 unhandled rdmsr: 0x3f8
[ 3849.664358] CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 3849.664359] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 3849.664360] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3849.664361] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3849.664363] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3849.664367] CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3849.664367] mce: [Hardware Error]: Machine check events logged
[ 3849.664369] mce: [Hardware Error]: Machine check events logged
[ 3849.665295] CPU2: Core temperature/speed normal
[ 3849.665295] CPU0: Core temperature/speed normal
[ 3849.665296] CPU3: Package temperature/speed normal
[ 3849.665297] CPU1: Package temperature/speed normal
[ 3849.665298] CPU0: Package temperature/speed normal
[ 3849.665299] CPU2: Package temperature/speed normal

In my case, at least, it appears the problem only occurs when using KVM for virtualization at this point with this version of the kernel.

Comment 69 Dan Yasny 2017-01-22 17:37:58 UTC

I do not need KVM to reproduce the issue; a large number of Chrome tabs is enough, especially when using something like BlueJeans web video conferencing.

Comment 70 Chris Murphy 2017-02-22 18:38:24 UTC

Still a problem with 4.10.0-0.rc8.git0.1.fc26.x86_64 on MacbookPro 8,2

Comment 71 Matt 2017-02-28 02:33:17 UTC

It may be fixed in kernel 4.11. I noticed that there were some P-State changes in the pull requests. I have installed the vanilla mainline kernel listed on the Fedora wiki (rc0.git4) and have not been able to trigger an MCE log error (I usually get a couple every hour during a work day).

Comment 72 Chris Murphy 2017-02-28 05:35:02 UTC

I see it during Fedora installs and I'm not seeing it with Fedora-Workstation-Live-x86_64-Rawhide-20170226.n.0.iso which has 4.11.0-0.rc0.git4.1.fc26.x86_64.

Comment 73 Nils Philippsen 2017-03-21 09:54:21 UTC

Still see it with 4.10.4-200.fc25.x86_64.

Comment 74 Marcel Wysocki 2017-03-21 10:47:19 UTC

I can confirm issue seems to be gone using the kernel from
https://dl.fedoraproject.org/pub/alt/rawhide-kernel-nodebug/x86_64

Comment 75 Justin Chiu 2017-04-29 08:58:03 UTC

I can also confirm that the issue has gone away with the latest rawhide kernel (4.11-rc8 at the time of writing), on Fedora 25. I am using a Thinkpad T470 with an i7-7600U. Previously, almost any load -- browsing with Firefox, mprime, VMs -- would cause the messages to appear.

Comment 76 jesse 2017-05-01 13:12:06 UTC

I still see the issue with 4.10.13-200.fc25.x86_64. 

kernel 4.11.0-0.rc8.git4.1.fc27 did resolve the MCE issue though there still were periodic thermal throttling messages logged.

One note of caution - with the 4.11 kernel I was experiencing random BTRFS filesystem errors that would cause the filesystem to remount read-only. This happened multiple times. After each error I would reboot and scrub the filesystem to check for errors. None were found. I'm running BTRFS in a non-raid configuration on an encrypted partition.

I reverted to 4.10 and haven't experienced any BTRFS issues.

Comment 77 Nils Philippsen 2017-05-29 09:32:54 UTC

As with Jesse in comment #76, this seems to have resolved the MCE issue, leaving the thermal throttling messages:

[  232.898650] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
[  232.898651] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1)
[  232.898652] CPU4: Package temperature above threshold, cpu clock throttled (total events = 1)
[  232.898655] CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
[  232.898656] CPU5: Package temperature above threshold, cpu clock throttled (total events = 1)
[  232.898657] CPU6: Package temperature above threshold, cpu clock throttled (total events = 1)
[  232.898658] CPU7: Package temperature above threshold, cpu clock throttled (total events = 1)
[  232.898659] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
[  232.898660] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[  232.898666] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[  232.899629] CPU1: Core temperature/speed normal
[  232.899629] CPU0: Core temperature/speed normal
[  232.899630] CPU2: Package temperature/speed normal
[  232.899631] CPU3: Package temperature/speed normal
[  232.899632] CPU6: Package temperature/speed normal
[  232.899633] CPU4: Package temperature/speed normal
[  232.899634] CPU5: Package temperature/speed normal
[  232.899634] CPU7: Package temperature/speed normal
[  232.899635] CPU0: Package temperature/speed normal
[  232.899636] CPU1: Package temperature/speed normal

This is on a Thinkpad T540p, i7-4900MQ running Fedora 25 with kernel-4.11.3-200.fc25.x86_64 from updates-testing, when building a fairly large C++ project with -j7 (build time a little over 4 mins, not cleaned out beforehand). I watched the frequencies of the cores and the temperature while the build was running: frequencies went up to 3.8G, then down again when the temperature approached the critical limit of 100°C. Just this one block of messages was logged shortly after the build was started.

Comment 78 Nils Philippsen 2017-05-29 09:59:33 UTC

I repeated the build with a cleaned-out source repo. It took a bit more than 13 minutes, and the throttling message came almost precisely every 5 minutes/300 seconds (looks like log throttling to me):

[ 4456.634244] CPU1: Core temperature above threshold, cpu clock throttled (total events = 36076)
[ 4456.634244] CPU0: Core temperature above threshold, cpu clock throttled (total events = 36076)
[ 4456.634246] CPU3: Package temperature above threshold, cpu clock throttled (total events = 43249)
[ 4456.634249] CPU7: Package temperature above threshold, cpu clock throttled (total events = 43249)
[ 4456.634250] CPU2: Package temperature above threshold, cpu clock throttled (total events = 43249)
[ 4456.634251] CPU4: Package temperature above threshold, cpu clock throttled (total events = 43249)
[ 4456.634252] CPU6: Package temperature above threshold, cpu clock throttled (total events = 43249)
[ 4456.634253] CPU5: Package temperature above threshold, cpu clock throttled (total events = 43249)
[ 4456.634254] CPU0: Package temperature above threshold, cpu clock throttled (total events = 43249)
[ 4456.634260] CPU1: Package temperature above threshold, cpu clock throttled (total events = 43249)
[ 4456.635249] CPU0: Core temperature/speed normal
[ 4456.635250] CPU1: Core temperature/speed normal
[ 4456.635251] CPU2: Package temperature/speed normal
[ 4456.635252] CPU5: Package temperature/speed normal
[ 4456.635253] CPU4: Package temperature/speed normal
[ 4456.635254] CPU7: Package temperature/speed normal
[ 4456.635255] CPU6: Package temperature/speed normal
[ 4456.635255] CPU3: Package temperature/speed normal
[ 4456.635256] CPU1: Package temperature/speed normal
[ 4456.635256] CPU0: Package temperature/speed normal
[ 4756.651133] CPU1: Core temperature above threshold, cpu clock throttled (total events = 77049)
[ 4756.651134] CPU0: Core temperature above threshold, cpu clock throttled (total events = 77049)
[ 4756.651136] CPU2: Package temperature above threshold, cpu clock throttled (total events = 119697)
[ 4756.651138] CPU5: Package temperature above threshold, cpu clock throttled (total events = 119697)
[ 4756.651139] CPU4: Package temperature above threshold, cpu clock throttled (total events = 119697)
[ 4756.651142] CPU7: Package temperature above threshold, cpu clock throttled (total events = 119697)
[ 4756.651143] CPU6: Package temperature above threshold, cpu clock throttled (total events = 119697)
[ 4756.651144] CPU3: Package temperature above threshold, cpu clock throttled (total events = 119697)
[ 4756.651145] CPU0: Package temperature above threshold, cpu clock throttled (total events = 119697)
[ 4756.651152] CPU1: Package temperature above threshold, cpu clock throttled (total events = 119697)
[ 4756.652136] CPU0: Core temperature/speed normal
[ 4756.652137] CPU1: Core temperature/speed normal
[ 4756.652138] CPU5: Package temperature/speed normal
[ 4756.652139] CPU3: Package temperature/speed normal
[ 4756.652140] CPU2: Package temperature/speed normal
[ 4756.652141] CPU6: Package temperature/speed normal
[ 4756.652142] CPU7: Package temperature/speed normal
[ 4756.652143] CPU4: Package temperature/speed normal
[ 4756.652143] CPU1: Package temperature/speed normal
[ 4756.652144] CPU0: Package temperature/speed normal
[ 5027.128441] perf: interrupt took too long (2501 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 5056.675053] CPU1: Core temperature/speed normal
[ 5056.675053] CPU0: Core temperature/speed normal
[ 5056.675055] CPU4: Package temperature/speed normal
[ 5056.675056] CPU6: Package temperature/speed normal
[ 5056.675058] CPU2: Package temperature/speed normal
[ 5056.675059] CPU5: Package temperature/speed normal
[ 5056.675060] CPU3: Package temperature/speed normal
[ 5056.675061] CPU7: Package temperature/speed normal
[ 5056.675061] CPU0: Package temperature/speed normal
[ 5056.675062] CPU1: Package temperature/speed normal
[ 5061.001186] CPU2: Core temperature above threshold, cpu clock throttled (total events = 100482)
[ 5061.001187] CPU3: Core temperature above threshold, cpu clock throttled (total events = 100482)
[ 5061.002181] CPU2: Core temperature/speed normal
[ 5061.002182] CPU3: Core temperature/speed normal

Still no MCE errors :).
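
The roughly 300-second spacing can be checked from the dmesg timestamps themselves. This is a sketch under assumptions: `throttle_gaps` is a hypothetical helper, and it expects raw dmesg lines with the bracketed seconds-since-boot prefix shown above (journalctl output with wall-clock prefixes would need different parsing).

```shell
#!/bin/sh
# Print the gap in whole seconds between successive
# "Core temperature above threshold" messages for one CPU.
# $1 = CPU number (default 0); dmesg-format lines arrive on stdin.
# throttle_gaps is a hypothetical helper name.
throttle_gaps() {
    awk -v cpu="CPU${1:-0}:" '
        index($0, cpu " Core temperature above threshold") {
            ts = $0
            sub(/^\[ */, "", ts)   # drop the leading "[" and padding
            sub(/\].*/, "", ts)    # drop everything from "]" onward
            if (seen) printf "%.0f\n", ts - prev
            prev = ts; seen = 1
        }'
}

# CPU0 lines copied from the log above; prints 300.
throttle_gaps 0 <<'EOF'
[ 4456.634244] CPU0: Core temperature above threshold, cpu clock throttled (total events = 36076)
[ 4456.634254] CPU0: Package temperature above threshold, cpu clock throttled (total events = 43249)
[ 4756.651134] CPU0: Core temperature above threshold, cpu clock throttled (total events = 77049)
EOF
```

Only the core messages are matched; the package messages for the same CPU carry nearly identical timestamps and would just add zero-length gaps.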

Comment 79 Bill Perkins 2017-08-11 22:16:42 UTC

I have this very same type of problem.  I have two J1900 mini-ITX systems that I got from two different vendors in China.  I also had a J1800 mini-ITX before I sent its motherboard back to exchange it for a J1900 motherboard.  The J1800 would throttle back and not return to normal speed until it was rebooted.  The J1900s will indicate that they are throttling back and then, in the same or the next second, indicate that the speed has returned to normal.  All three of these systems will freeze after running for one hour, several hours, or even some days.  They will freeze eventually in all cases.

The two J1900s are both running Fedora 25 and are properly updated to current packages.

uname -a
Linux gandalf.localnet 4.11.12-200.fc25.x86_64 #1 SMP Fri Jul 21 16:41:43 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 55
Model name:            Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz
Stepping:              8
CPU MHz:               1999.200
CPU max MHz:           1999.2000
CPU min MHz:           1332.8000
BogoMIPS:              3998.40
Virtualization:        VT-x
L1d cache:             24K
L1i cache:             32K
L2 cache:              1024K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer rdrand lahf_lm 3dnowprefetch epb tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm arat

$ lspci
00:00.0 Host bridge: Intel Corporation Atom Processor Z36xxx/Z37xxx Series SoC Transaction Register (rev 0e)
00:02.0 VGA compatible controller: Intel Corporation Atom Processor Z36xxx/Z37xxx Series Graphics & Display (rev 0e)
00:13.0 SATA controller: Intel Corporation Atom Processor E3800 Series SATA AHCI Controller (rev 0e)
00:14.0 USB controller: Intel Corporation Atom Processor Z36xxx/Z37xxx, Celeron N2000 Series USB xHCI (rev 0e)
00:1a.0 Encryption controller: Intel Corporation Atom Processor Z36xxx/Z37xxx Series Trusted Execution Engine (rev 0e)
00:1b.0 Audio device: Intel Corporation Atom Processor Z36xxx/Z37xxx Series High Definition Audio Controller (rev 0e)
00:1c.0 PCI bridge: Intel Corporation Atom Processor E3800 Series PCI Express Root Port 1 (rev 0e)
00:1c.1 PCI bridge: Intel Corporation Atom Processor E3800 Series PCI Express Root Port 2 (rev 0e)
00:1c.2 PCI bridge: Intel Corporation Atom Processor E3800 Series PCI Express Root Port 3 (rev 0e)
00:1c.3 PCI bridge: Intel Corporation Atom Processor E3800 Series PCI Express Root Port 4 (rev 0e)
00:1f.0 ISA bridge: Intel Corporation Atom Processor Z36xxx/Z37xxx Series Power Control Unit (rev 0e)
00:1f.3 SMBus: Intel Corporation Atom Processor E3800 Series SMBus Controller (rev 0e)
01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 07)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 07)

A slice example from /var/log/messages

Aug 11 18:12:02 gandalf kernel: CPU0: Core temperature above threshold, cpu clock throttled (total events = 24)
Aug 11 18:12:02 gandalf kernel: CPU1: Core temperature above threshold, cpu clock throttled (total events = 24)
Aug 11 18:12:02 gandalf kernel: CPU2: Core temperature above threshold, cpu clock throttled (total events = 24)
Aug 11 18:12:05 gandalf kernel: CPU3: Core temperature above threshold, cpu clock throttled (total events = 24)
Aug 11 18:12:05 gandalf kernel: CPU0: Core temperature/speed normal
Aug 11 18:12:05 gandalf kernel: CPU1: Core temperature/speed normal
Aug 11 18:12:05 gandalf kernel: CPU2: Core temperature/speed normal
Aug 11 18:12:05 gandalf kernel: CPU3: Core temperature/speed normal

$ uptime
 18:15:06 up  2:58, 13 users,  load average: 1.84, 1.63, 1.53

$ sensors
acpitz-virtual-0
Adapter: Virtual device
temp1:        +26.8°C  (crit = +90.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +34.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:       +34.0°C  (high = +105.0°C, crit = +105.0°C)
Core 2:       +34.0°C  (high = +105.0°C, crit = +105.0°C)
Core 3:       +34.0°C  (high = +105.0°C, crit = +105.0°C)

Comment 81 Bill Perkins 2017-09-20 01:32:20 UTC

I upgraded the two J1900 systems I mention in Comment 79 from Fedora 25 to Fedora 26.  I still see the temperature above threshold errors in the messages file, but neither system hangs any more.  Go figure!

Comment 82 Paulo Fidalgo 2018-06-15 04:37:17 UTC

Today, with 4.16.14-300.fc28.x86_64 and a Dell XPS with an Intel(R) Core(TM) i7-8550U CPU, I still have this issue.
The system does not hang, but I have a lot of messages in dmesg.
[19655.510079] CPU7: Package temperature above threshold, cpu clock throttled (total events = 212)
[19655.510080] CPU3: Package temperature above threshold, cpu clock throttled (total events = 212)
[19655.510082] CPU1: Package temperature above threshold, cpu clock throttled (total events = 212)
[19655.510085] CPU5: Package temperature above threshold, cpu clock throttled (total events = 212)
[19655.510112] CPU0: Package temperature above threshold, cpu clock throttled (total events = 212)
[19655.510113] CPU6: Package temperature above threshold, cpu clock throttled (total events = 212)
[19655.510113] CPU4: Package temperature above threshold, cpu clock throttled (total events = 212)
[19655.510114] CPU2: Package temperature above threshold, cpu clock throttled (total events = 212)

Comment 83 Stan King 2019-07-21 19:25:25 UTC

I'm seeing this with kernel 5.1.18-300.fc30.x86_64 on a fanless Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz, in a Zotac CI660.

The messages seem to come from the kernel, as mcelog doesn't record any problem.

It seems a bit flaky, as I can run tests with "stress -c N" on other Intel CPUs, observe their throttling down with "grep MHz </proc/cpuinfo", but without kernel spamming like this.

Here's a typical message, issued for each logical CPU, each time it happens:

kernel: mce: CPU1: Package temperature/speed normal

The system has hung once, but not under abnormally high load, just the screensaver under xfce.

Comment 84 Tomer Barletz 2022-06-16 09:17:43 UTC

Haven't experienced this in a while. Currently on Fedora 36 (5.17.13).
Recommend closing this 9-year-old bug (which I reproed 7 years ago).