Bug 924570
| Field | Value |
|---|---|
| Summary | regression, package temp above normal induced mce |
| Product | [Fedora] Fedora |
| Component | kernel |
| Version | rawhide |
| Hardware | x86_64 |
| OS | Linux |
| Status | NEW |
| Severity | unspecified |
| Priority | unspecified |
| Keywords | Reopened |
| Reporter | Chris Murphy <bugzilla> |
| Assignee | Kernel Maintainer List <kernel-maint> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | aiden449, angystardust, barletz, bhubbard, bruno.cornec, bugzilla, bugzilla, cagney, choeger, c.justin88, dazo, dgsiegel, dr.diesel, eminguez, emmanuel.kowalski, euroelessar, fabrice, gansalmon, herrold, itamar, jarmofin, jesse, jnordell, john.mora, jonathan, jpittman, juha.heljoranta, kerncece, kernel-maint, konstantinos.smanis, lantw44, madam, madhu.chinakonda, mirosiko, mlombard, m.mcnutt, neteler, nfink95, nobody+385537, nphilipp, oholy, ormandj, paulo.fidalgo.pt, pcfe, peter, samuel-rhbugs, sean+rh, sergio, stanley.king, tadej.j, tchollingsworth, tflink, tommy, uckelman, vlee, vromanso, wmp |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2016-07-19 10:07:59 UTC |

Description
Chris Murphy
2013-03-22 05:51:35 UTC
Created attachment 714304 [details]
dmesg
full dmesg. snippet with errors:
[ 3403.381085] CPU6: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.381087] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.381088] CPU5: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.381091] CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.381094] CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.381119] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.381120] CPU4: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.381122] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.381124] CPU7: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.381496] CPU6: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3403.382109] CPU0: Package temperature/speed normal
[ 3403.382111] CPU4: Package temperature/speed normal
[ 3403.382114] CPU5: Package temperature/speed normal
[ 3403.382115] CPU3: Package temperature/speed normal
[ 3403.382117] CPU7: Package temperature/speed normal
[ 3403.382118] CPU1: Package temperature/speed normal
[ 3403.382119] CPU6: Core temperature/speed normal
[ 3403.382120] CPU2: Core temperature/speed normal
[ 3403.382120] CPU6: Package temperature/speed normal
[ 3403.382121] CPU2: Package temperature/speed normal
[ 3598.152029] mce: [Hardware Error]: Machine check events logged
Created attachment 714306 [details]
syslog
snippet from syslog
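For readers who want to quantify how short these throttle episodes are, here is a minimal sketch that pairs each "above threshold" line with the next "temperature/speed normal" line for the same CPU and level, and prints the elapsed time. It assumes dmesg-style lines with a bracketed seconds timestamp, as in the snippet above; the function name `throttle_durations` is made up for this illustration and is not part of any tool.

```shell
#!/bin/sh
# Sketch only: reads kernel log text on stdin and reports, per CPU and level
# (Core or Package), the time between an "above threshold" message and the
# following "temperature/speed normal" message.
# Lines without a "[seconds.micros]" timestamp are ignored.
throttle_durations() {
    awk '
    match($0, /\[ *[0-9]+\.[0-9]+\]/) {
        # Strip the brackets to get the timestamp in seconds.
        ts = substr($0, RSTART + 1, RLENGTH - 2) + 0
        rest = substr($0, RSTART + RLENGTH)
        if (match(rest, /CPU[0-9]+: (Core|Package) temperature/)) {
            # key looks like "CPU0: Package"
            key = substr(rest, RSTART, RLENGTH - length(" temperature"))
            if (rest ~ /above threshold/) {
                start[key] = ts
            } else if (rest ~ /speed normal/ && (key in start)) {
                printf "%s throttled for %.3f ms\n", key, (ts - start[key]) * 1000
                delete start[key]
            }
        }
    }'
}

# Example use (may need root to read the kernel log):
#   dmesg | throttle_durations
```

On the lines quoted in this report, the interval for CPU0 at package level comes out at roughly one millisecond, which matches the impression that the events clear almost immediately.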
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.11.1-200.fc19. Please test this kernel update and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Still present with 3.11.2-301.fc20.x86_64

Created attachment 806241 [details]
dmesg 3.11.2-301.fc20.x86_64
Happens in 3.12.9-301.fc20.x86_64 - in previous kernels I did not have such problems when compiling source code:
[136510.805360] CPU2: Core temperature above threshold, cpu clock throttled (total events = 112479)
[136510.805362] CPU0: Core temperature above threshold, cpu clock throttled (total events = 112478)
[136510.805366] CPU3: Package temperature above threshold, cpu clock throttled (total events = 157878)
[136510.805368] CPU1: Package temperature above threshold, cpu clock throttled (total events = 157878)
[136510.805369] CPU0: Package temperature above threshold, cpu clock throttled (total events = 157877)
[136510.805382] CPU2: Package temperature above threshold, cpu clock throttled (total events = 157878)
[136510.807341] CPU2: Core temperature/speed normal
[136510.807343] CPU0: Core temperature/speed normal
[136510.807345] CPU1: Package temperature/speed normal
[136510.807346] CPU3: Package temperature/speed normal
[136510.807347] CPU0: Package temperature/speed normal
[136510.807357] CPU2: Package temperature/speed normal
[136606.148339] mce: [Hardware Error]: Machine check events logged

Created attachment 863832 [details]
journalctl -b for kernel 3.14 rc2
This still happens with 3.14.0-0.rc2.git0.1.fc21.x86_64. It's pretty much always triggered by yum or dnf getting busy and making the laptop hot while sounding like it's a hair dryer.
What makes me think this is bogus is that the trip temperature is exceeded at 900.69 seconds and the processor is already back below trip temperature at 900.70 seconds.
[ 901.969534] f20c.localdomain kernel: mce: [Hardware Error]: Machine check events logged
[ 900.690190] f20c.localdomain mcelog[581]: Hardware event. This is not a software error.
[ 900.691170] f20c.localdomain mcelog[581]: MCE 0
[ 900.691854] f20c.localdomain mcelog[581]: CPU 1 THERMAL EVENT TSC 1693342dfcc
[ 900.692551] f20c.localdomain mcelog[581]: TIME 1392524773 Sat Feb 15 21:26:13 2014
[ 900.693256] f20c.localdomain mcelog[581]: Processor 1 heated above trip temperature. Throttling enabled.
[ 900.693927] f20c.localdomain mcelog[581]: Please check your system cooling. Performance will be impacted
[ 900.694641] f20c.localdomain mcelog[581]: STATUS 880003c3 MCGSTATUS 0
[ 900.695337] f20c.localdomain mcelog[581]: MCGCAP c09 APICID 2 SOCKETID 0
[ 900.696015] f20c.localdomain mcelog[581]: CPUID Vendor Intel Family 6 Model 42
[ 900.696678] f20c.localdomain mcelog[581]: Hardware event. This is not a software error.
[ 900.697342] f20c.localdomain mcelog[581]: MCE 1
[ 900.698027] f20c.localdomain mcelog[581]: CPU 5 THERMAL EVENT TSC 1693342fb66
[ 900.698698] f20c.localdomain mcelog[581]: TIME 1392524773 Sat Feb 15 21:26:13 2014
[ 900.699344] f20c.localdomain mcelog[581]: Processor 5 heated above trip temperature. Throttling enabled.
[ 900.699947] f20c.localdomain mcelog[581]: Please check your system cooling. Performance will be impacted
[ 900.700535] f20c.localdomain mcelog[581]: STATUS 880003c3 MCGSTATUS 0
[ 900.701123] f20c.localdomain mcelog[581]: MCGCAP c09 APICID 3 SOCKETID 0
[ 900.701720] f20c.localdomain mcelog[581]: CPUID Vendor Intel Family 6 Model 42
[ 900.702391] f20c.localdomain mcelog[581]: Hardware event. This is not a software error.
[ 900.702962] f20c.localdomain mcelog[581]: MCE 2
[ 900.703579] f20c.localdomain mcelog[581]: CPU 1 THERMAL EVENT TSC 1693365cb04
[ 900.704192] f20c.localdomain mcelog[581]: TIME 1392524773 Sat Feb 15 21:26:13 2014
[ 900.704776] f20c.localdomain mcelog[581]: Processor 1 below trip temperature. Throttling disabled
[ 900.705385] f20c.localdomain mcelog[581]: STATUS 88010282 MCGSTATUS 0
[ 900.705980] f20c.localdomain mcelog[581]: MCGCAP c09 APICID 2 SOCKETID 0
[ 900.706561] f20c.localdomain mcelog[581]: CPUID Vendor Intel Family 6 Model 42
[ 900.707147] f20c.localdomain mcelog[581]: Hardware event. This is not a software error.
[ 900.707716] f20c.localdomain mcelog[581]: MCE 3
[ 900.708333] f20c.localdomain mcelog[581]: CPU 5 THERMAL EVENT TSC 16933662906
[ 900.708923] f20c.localdomain mcelog[581]: TIME 1392524773 Sat Feb 15 21:26:13 2014
[ 900.709557] f20c.localdomain mcelog[581]: Processor 5 below trip temperature. Throttling disabled
[ 900.710102] f20c.localdomain mcelog[581]: STATUS 88010282 MCGSTATUS 0
[ 900.710633] f20c.localdomain mcelog[581]: MCGCAP c09 APICID 3 SOCKETID 0
[ 900.711160] f20c.localdomain mcelog[581]: CPUID Vendor Intel Family 6 Model 42
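The STATUS values in these records can be picked apart to see what the hardware reported. The sketch below is a reading aid, not an authoritative decoder: it assumes that the STATUS field of an mcelog THERMAL EVENT carries the raw IA32_THERM_STATUS register value, and that the bit layout follows the Intel SDM (bit 0 thermal status, bit 1 its sticky log, bits 22:16 digital readout in degrees below the throttle temperature, bit 31 reading valid). The function name `decode_therm_status` is hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumption: the hex STATUS printed by mcelog for a thermal
# event is the IA32_THERM_STATUS MSR value. Pass the hex digits without "0x".
decode_therm_status() {
    therm_status=$((0x$1))
    echo "throttling_active=$(( therm_status & 1 ))"                # bit 0
    echo "throttling_logged=$(( (therm_status >> 1) & 1 ))"         # bit 1
    echo "degrees_below_tjmax=$(( (therm_status >> 16) & 0x7f ))"   # bits 22:16
    echo "reading_valid=$(( (therm_status >> 31) & 1 ))"            # bit 31
}

# Example:
#   decode_therm_status 880003c3
#   decode_therm_status 88010282
```

Under those assumptions, 880003c3 decodes to a valid reading of 0 degrees below the throttle point with throttling active, and 88010282 decodes to 1 degree below it with throttling no longer active. That would suggest the sensor really was at its limit for an instant, though it says nothing about why.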
Chris, what are the actual core temps when this happens?

On boot after installing lm_sensors, I get messages for each core:
kernel: CPU2: Package temperature above threshold, cpu clock throttled
There is no mce event, and fans are noticeable but not loud. This is the result from the sensors command at that time:
# sensors
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +81.0°C (high = +86.0°C, crit = +100.0°C)
Core 0: +80.0°C (high = +86.0°C, crit = +100.0°C)
Core 1: +80.0°C (high = +86.0°C, crit = +100.0°C)
Core 2: +80.0°C (high = +86.0°C, crit = +100.0°C)
Core 3: +76.0°C (high = +86.0°C, crit = +100.0°C)

pkg-temp-0-virtual-0
Adapter: Virtual device
temp1: +80.0°C

applesmc-isa-0300
Adapter: ISA adapter
Left side : 3695 RPM (min = 2000 RPM, max = 6200 RPM)
Right side : 3685 RPM (min = 2000 RPM, max = 6200 RPM)
TB0T: +30.2°C
TB1T: +30.2°C
TB2T: +29.2°C
TC0C: +77.0°C
TC0D: +78.2°C
TC0E: +88.0°C
TC0F: +90.0°C
TC0P: +66.5°C
TC1C: +75.0°C
TC2C: +75.0°C
TC3C: +75.0°C
TC4C: +74.0°C
TCGC: +75.0°C
TCSA: +75.0°C
TCTD: -1.0°C
TG0D: +74.2°C
TG0P: +71.5°C
THSP: +44.0°C
TM0S: +59.0°C
TMBS: +0.0°C
TP0P: +58.8°C
TPCD: +60.0°C
TW0P: -127.0°C
Th1H: +63.0°C

Looks like you have a genuine cooling issue, MCE is doing its job. If you drop back to 3.8.x, 3.7.x, 3.6.x kernels, does the problem disappear? Does top show a huge process, any chance your cooling intake sucked up a furball?

Sorry, hung process, not huge.

Problem doesn't occur on older kernels. top shows fractional percent usages. The fans, intake, exhaust are all clean - this is a Macbook Pro laptop. I don't get anywhere near the amount of heat when running OS X as with linux. Even right now while idle it's never idling the fans and it's quite warm.

So this might partially be a radeon driver issue. If I use nomodeset at boot, temperatures are ~10C cooler, and fans are idle.
It ultimately doesn't solve the problem because even moving the mouse arrow around causes gnome-shell to hit 99% and X to hit 60+%, probably due to the use of llvmpipe, and then I get CPU temperature complaints.

I am also seeing these symptoms. Mine is a Thinkpad T520 (nvidia driver installed).
journalctl /usr/sbin/mcelog output (partial)
Jan 28 09:33:12 carbon mcelog[863]: CPU 1 THERMAL EVENT TSC 9da0d815f99
Jan 28 09:33:12 carbon mcelog[863]: TIME 1390923054 Tue Jan 28 09:30:54 2014
Jan 28 09:33:12 carbon mcelog[863]: Processor 1 heated above trip temperature. Throttling enabled.
Jan 28 09:33:12 carbon mcelog[863]: Please check your system cooling. Performance will be impacted
Jan 28 09:33:12 carbon mcelog[863]: STATUS 88030003 MCGSTATUS 0
This did not happen in the earlier kernels.

I see something like this too. Here's a particularly egregious example of it going off multiple times within the space of 10 minutes the other day. I don't even think I was using this machine at the time...
% journalctl -b -u mcelog.service -o short-precise | grep 'Feb 15 16:'
Feb 15 16:12:57.589775 rustin mcelog[620]: Kernel does not support page offline interface
Feb 15 16:12:57.590260 rustin mcelog[620]: Hardware event. This is not a software error.
Feb 15 16:12:57.590726 rustin mcelog[620]: MCE 0
Feb 15 16:12:57.591177 rustin mcelog[620]: CPU 1 THERMAL EVENT TSC 37ba988f13b
Feb 15 16:12:57.591562 rustin mcelog[620]: TIME 1392505915 Sat Feb 15 16:11:55 2014
Feb 15 16:12:57.591954 rustin mcelog[620]: Processor 1 heated above trip temperature. Throttling enabled.
Feb 15 16:12:57.592368 rustin mcelog[620]: Please check your system cooling. Performance will be impacted
Feb 15 16:12:57.592791 rustin mcelog[620]: STATUS 88010003 MCGSTATUS 0
Feb 15 16:12:57.593167 rustin mcelog[620]: MCGCAP 806 APICID 1 SOCKETID 0
Feb 15 16:12:57.593535 rustin mcelog[620]: CPUID Vendor Intel Family 6 Model 23
Feb 15 16:12:57.593941 rustin mcelog[620]: Hardware event. This is not a software error.
Feb 15 16:12:57.594450 rustin mcelog[620]: MCE 1
Feb 15 16:12:57.594934 rustin mcelog[620]: CPU 1 THERMAL EVENT TSC 37ba99fd339
Feb 15 16:12:57.595306 rustin mcelog[620]: TIME 1392505915 Sat Feb 15 16:11:55 2014
Feb 15 16:12:57.595680 rustin mcelog[620]: Processor 1 below trip temperature. Throttling disabled
Feb 15 16:12:57.598325 rustin mcelog[620]: STATUS 88010002 MCGSTATUS 0
Feb 15 16:12:57.598920 rustin mcelog[620]: MCGCAP 806 APICID 1 SOCKETID 0
Feb 15 16:12:57.599317 rustin mcelog[620]: CPUID Vendor Intel Family 6 Model 23
Feb 15 16:15:27.590031 rustin mcelog[620]: Hardware event. This is not a software error.
Feb 15 16:15:27.590803 rustin mcelog[620]: MCE 0
Feb 15 16:15:27.591468 rustin mcelog[620]: CPU 0 THERMAL EVENT TSC 3c7146fe9ea
Feb 15 16:15:27.592133 rustin mcelog[620]: TIME 1392506081 Sat Feb 15 16:14:41 2014
Feb 15 16:15:27.592819 rustin mcelog[620]: Processor 0 heated above trip temperature. Throttling enabled.
Feb 15 16:15:27.593459 rustin mcelog[620]: Please check your system cooling. Performance will be impacted
Feb 15 16:15:27.594119 rustin mcelog[620]: STATUS 88010003 MCGSTATUS 0
Feb 15 16:15:27.594813 rustin mcelog[620]: MCGCAP 806 APICID 0 SOCKETID 0
Feb 15 16:15:27.595460 rustin mcelog[620]: CPUID Vendor Intel Family 6 Model 23
Feb 15 16:15:27.598247 rustin mcelog[620]: Hardware event. This is not a software error.
Feb 15 16:15:27.598960 rustin mcelog[620]: MCE 1
Feb 15 16:15:27.599595 rustin mcelog[620]: CPU 0 THERMAL EVENT TSC 3c71486d368
Feb 15 16:15:27.600289 rustin mcelog[620]: TIME 1392506081 Sat Feb 15 16:14:41 2014
Feb 15 16:15:27.601079 rustin mcelog[620]: Processor 0 below trip temperature. Throttling disabled
Feb 15 16:15:27.601774 rustin mcelog[620]: STATUS 88010002 MCGSTATUS 0
Feb 15 16:15:27.602407 rustin mcelog[620]: MCGCAP 806 APICID 0 SOCKETID 0
Feb 15 16:15:27.603071 rustin mcelog[620]: CPUID Vendor Intel Family 6 Model 23
Feb 15 16:19:12.589358 rustin mcelog[620]: Hardware event. This is not a software error.
Feb 15 16:19:12.589872 rustin mcelog[620]: MCE 0
Feb 15 16:19:12.590361 rustin mcelog[620]: CPU 1 THERMAL EVENT TSC 432b1a08a06
Feb 15 16:19:12.590856 rustin mcelog[620]: TIME 1392506298 Sat Feb 15 16:18:18 2014
Feb 15 16:19:12.591248 rustin mcelog[620]: Processor 1 heated above trip temperature. Throttling enabled.
Feb 15 16:19:12.591638 rustin mcelog[620]: Please check your system cooling. Performance will be impacted
Feb 15 16:19:12.592054 rustin mcelog[620]: STATUS 88010003 MCGSTATUS 0
Feb 15 16:19:12.592438 rustin mcelog[620]: MCGCAP 806 APICID 1 SOCKETID 0
Feb 15 16:19:12.592832 rustin mcelog[620]: CPUID Vendor Intel Family 6 Model 23
Feb 15 16:19:12.593381 rustin mcelog[620]: Hardware event. This is not a software error.
Feb 15 16:19:12.593927 rustin mcelog[620]: MCE 1
Feb 15 16:19:12.594464 rustin mcelog[620]: CPU 1 THERMAL EVENT TSC 432b1b769f3
Feb 15 16:19:12.596071 rustin mcelog[620]: TIME 1392506298 Sat Feb 15 16:18:18 2014
Feb 15 16:19:12.596546 rustin mcelog[620]: Processor 1 below trip temperature. Throttling disabled
Feb 15 16:19:12.597162 rustin mcelog[620]: STATUS 88010002 MCGSTATUS 0
Feb 15 16:19:12.597557 rustin mcelog[620]: MCGCAP 806 APICID 1 SOCKETID 0
Feb 15 16:19:12.597963 rustin mcelog[620]: CPUID Vendor Intel Family 6 Model 23
This machine has Intel graphics so it can't just be radeon, either.

I ought to have logs going back at least a year on this machine so I'll see if I can figure out when it started happening.

(In reply to T.C. Hollingsworth from comment #15)
> I ought to have logs going back at least a year on this machine so I'll see
> if I can figure out when it started happening.

I lied, I guess I never adjusted logrotate on this machine, sorry.
:-(

I tested two different kernels: while it does not seem to appear in kernel 3.12.8-300.fc20.x86_64, the problem appeared in kernel 3.12.9-301.fc20.x86_64
lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
---- log ----
uname -a
Linux oboe.localdomain 3.12.9-301.fc20.x86_64 #1 SMP Wed Jan 29 15:56:22 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
journalctl | grep temp | grep above | tail -10
Feb 10 23:25:29 oboe.localdomain mcelog[570]: Processor 0 heated above trip temperature. Throttling enabled.
Feb 10 23:25:29 oboe.localdomain mcelog[570]: Processor 2 heated above trip temperature. Throttling enabled.
Feb 15 14:42:28 oboe.localdomain kernel: CPU0: Core temperature above threshold, cpu clock throttled (total events = 124388)
Feb 15 14:42:28 oboe.localdomain kernel: CPU2: Core temperature above threshold, cpu clock throttled (total events = 124389)
Feb 15 14:42:28 oboe.localdomain kernel: CPU1: Package temperature above threshold, cpu clock throttled (total events = 171308)
Feb 15 14:42:29 oboe.localdomain kernel: CPU3: Package temperature above threshold, cpu clock throttled (total events = 171308)
Feb 15 14:42:29 oboe.localdomain kernel: CPU2: Package temperature above threshold, cpu clock throttled (total events = 171308)
Feb 15 14:42:29 oboe.localdomain kernel: CPU0: Package temperature above threshold, cpu clock throttled (total events = 171307)
Feb 15 14:44:04 oboe.localdomain mcelog[570]: Processor 2 heated above trip temperature. Throttling enabled.
Feb 15 14:44:04 oboe.localdomain mcelog[570]: Processor 0 heated above trip temperature. Throttling enabled.
###############################
Linux oboe.localdomain 3.12.8-300.fc20.x86_64 #1 SMP Thu Jan 16 01:07:50 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
acpitz-virtual-0
Adapter: Virtual device
temp1: +79.0°C (crit = +108.0°C)
asus-isa-0000
Adapter: ISA adapter
temp1: +79.0°C
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +81.0°C (high = +87.0°C, crit = +105.0°C)
Core 0: +81.0°C (high = +87.0°C, crit = +105.0°C)
Core 1: +78.0°C (high = +87.0°C, crit = +105.0°C)
pkg-temp-0-virtual-0
Adapter: Virtual device
temp1: +81.0°C
Tue Feb 18 12:16:57 CET 2014
[... compiling GRASS GIS on 4 cores...]
Tue Feb 18 12:17:14 CET 2014
acpitz-virtual-0
Adapter: Virtual device
temp1: +79.0°C (crit = +108.0°C)
asus-isa-0000
Adapter: ISA adapter
temp1: +79.0°C
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +78.0°C (high = +87.0°C, crit = +105.0°C)
Core 0: +78.0°C (high = +87.0°C, crit = +105.0°C)
Core 1: +77.0°C (high = +87.0°C, crit = +105.0°C)
pkg-temp-0-virtual-0
Adapter: Virtual device
temp1: +78.0°C
Tue Feb 18 12:17:42 CET 2014
acpitz-virtual-0
Adapter: Virtual device
temp1: +77.0°C (crit = +108.0°C)
asus-isa-0000
Adapter: ISA adapter
temp1: +77.0°C
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +79.0°C (high = +87.0°C, crit = +105.0°C)
Core 0: +79.0°C (high = +87.0°C, crit = +105.0°C)
Core 1: +75.0°C (high = +87.0°C, crit = +105.0°C)
pkg-temp-0-virtual-0
Adapter: Virtual device
temp1: +79.0°C
--> no issues with 3.12.8-300.fc20.x86_64

No such issue with kernel 3.12.8-300.fc20.x86_64 but starting with kernel 3.12.9-301.fc20.x86_64. Confirmed also with kernel 3.13.3-201.fc20.x86_64

I started getting loads of these messages with kernel-3.13.3-201.

*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs.

Fedora 20 has now been rebased to 3.14.4-200.fc20.
Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Yes it still happens with 3.14.4.

I see this as well with 3.14.4.

Still happens with 3.16.0-0.rc2.git1.1.fc21.x86_64.

See also: https://bugzilla.redhat.com/show_bug.cgi?id=1050106
This was my issue on Fedora-19 and now still on Fedora-20 (the issue, which has similarity to this issue, was never resolved). Perhaps this additional information will be helpful to both bug filings.

Is this bug about the laptop running hot or the messages in the log? I think the actual log messages are a new feature. In my case, my laptop runs hot under just about any load, there's nothing new there. But a few versions back, the kernel started reporting the MCE messages. It would be nice if they could be disabled. I know my laptop is hot, but I don't need constant log messages about it. And worse, abrt keeps triggering on them, which is not helping the situation any...

(In reply to Samuel Sieb from comment #25)
> Is this bug about the laptop running hot or the messages in the log?

This bug is about laptops running hot with kernels > 3.8 while older kernels work just fine.

IMHO this is a reasonable thing to log, but abrt really shouldn't go off every time, especially since it won't let you file a bug anyway. >:-( Might want to file a bug against abrt for that.

Looking back on this I see I'm even more of an idiot than I thought; I switched this machine to use journal persistence before it was made default and that's why I never touched logrotate...

I can now confirm this also started when I upgraded to the 3.9.x kernels back when this machine ran F18. Specifically, kernel-3.8.11-200.fc18.x86_64 -> kernel-3.9.11-200.fc18.x86_64. (Yeah, I'm bad at updating sometimes.
;-) I have about eight months of logs with not a single complaint of temperature problems and then after I rebooted into that kernel the flood started and continues to this day.

Are you sure that temperature problems started then? Or maybe that is just when the kernel started logging them? There's a potentially relevant commit around that time at:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/arch/x86/kernel/cpu/mcheck/therm_throt.c?id=25cdce170d28092e8e162f36702be3308973b19d

I'm still seeing the temperature warnings with 3.16, and abrt is reporting dozens of mce events per day which seems much more aggressive than previously.

The original bug description is not specific enough. It was more about being confused about whether the messages are legitimate temperature warnings; if they are, why are they happening, and are they a risk to the hardware?

I ask because on all Macs I have, I get these warnings, and they all run much hotter running Linux than OS X. And I've had 1 of 3 machines die while running hot, inexplicably. Since it's dead (no startup chime, and no boot manager comes up, i.e. it is not making it to or not completing POST) I don't know if its death is coincidence or related to overheating. Since I'm down to one Mac (one dead, one given away, one for day to day use) I'm reluctant to do further baremetal testing until I understand what the messages mean, what the risk assessment is, and why the machines all appear to be overheating and only when running Linux.

Created attachment 924266 [details]
dmesg 3.16.0-0.rc7.git4.1
Still the same system as originally reported: Apple Inc. MacBookPro8,2/Mac-94245A3940C91C80, BIOS MBP81.88Z.0047.B27.1201241646 01/24/12
This dmesg captured during installation from USB stick made with Fedora-Live-LXDE-x86_64-21-20140804.iso.
Suspicious items possibly related to CPU or power management.
[ 0.108157] perf_event_intel: PEBS disabled due to CPU errata, please upgrade microcode
I have no idea how to upgrade microcode.
[ 3.243996] hpet: probe of PNP0103:00 failed with error -22
[ 0.055391] CPU0: Thermal monitoring enabled (TM1)
I do not get this message for the other 7 CPUs, but CPU0-7 all have temperature above threshold messages, so this seems unrelated.
Anyway it seems to me something is wrong since it gets so hot, fans frequently go to max, and the kernel also reports high temps and mce events. So if there's a way to manually lower the CPU throttling threshold (kernel parameter maybe) at expense of performance, that would be a better-than-nothing work around. The current behavior is at best very undesirable, and at worst it might be burning up laptops.
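On the question of trading performance for lower temperatures: the following is a minimal sketch, not a confirmed fix for this bug. It does not lower the thermal throttling threshold itself; it caps the maximum performance level so that less heat is generated in the first place. It assumes the intel_pstate driver is in use and exposes its `max_perf_pct` file in sysfs; the helper name `cap_perf_pct` and the value 70 are purely illustrative.

```shell
#!/bin/sh
# Sketch only: cap CPU performance to PERCENT of maximum through intel_pstate.
# Assumption: /sys/devices/system/cpu/intel_pstate/max_perf_pct exists, which
# is only the case when the intel_pstate driver is active. Needs root.
# The optional second argument overrides the file path (handy for dry runs).
cap_perf_pct() {
    pct=$1
    knob=${2:-/sys/devices/system/cpu/intel_pstate/max_perf_pct}
    case $pct in
        ''|*[!0-9]*) echo "usage: cap_perf_pct PERCENT [FILE]" >&2; return 2 ;;
    esac
    if [ "$pct" -lt 1 ] || [ "$pct" -gt 100 ]; then
        echo "PERCENT must be between 1 and 100" >&2
        return 2
    fi
    if [ ! -w "$knob" ]; then
        echo "$knob is not writable (not root, or intel_pstate not in use)" >&2
        return 1
    fi
    echo "$pct" > "$knob"
}

# Example (as root):
#   cap_perf_pct 70
```

The setting does not persist across reboots, and whether it is enough to stop the thermal events on a given machine would have to be tested.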
(In reply to Chris Murphy from comment #30)
> I have no idea how to upgrade microcode.

microcode_ctl-2:2.1-5.fc20.x86_64

(In reply to Sergio Monteiro Basto from comment #31)
> (In reply to Chris Murphy from comment #30)
> > I have no idea how to upgrade microcode.
>
> microcode_ctl-2:2.1-5.fc20.x86_64

I have microcode_ctl-2.1-6.fc21.x86_64 and yet I still get the message "PEBS disabled due to CPU errata, please upgrade microcode" so how do I upgrade the microcode when I already have the current version?

Never mind. Looks like it's been used this whole time.
[ 0.046636] localhost.localdomain kernel: perf_event_intel: PEBS disabled due to CPU errata, please upgrade microcode
[snip]
[ 17.972775] twenty1.localdomain kernel: perf_event_intel: PEBS enabled due to microcode update

Disabling the intel turbo boost solved the problem for me.
# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
# lscpu | grep "Model name"
Model name: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
# cpupower frequency-info -p
analyzing CPU 0:
800000 4000000 powersave
# cpupower frequency-info -d
analyzing CPU 0:
intel_pstate
# uname -r
3.15.10-201.fc20.x86_64

Just a quick FYI on this, with my Macbook Pro I noticed the fans didn't come on to the same extent as OSX. It appears the fan control is broken, at least on my Fedora 20 install fully up to date. Lack of fans = frequency throttling when core temp goes above 85C.

After looking into it, I found a daemon to manage fans on Apple systems:
https://github.com/dgraziotin/Fan-Control-Daemon

I'm sure this lack of functionality is either a missing feature or bug upstream, but for those in this thread, my fans now spin up correctly when running that daemon. Have fun!

https://access.redhat.com/solutions/35494
Disabling the "C States" in the BIOS, so that the CPU is always running at full power. Hardware check did not reveal any issues. It was identified that C state in BIOS resulted in less power and this resulted in errors.
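For anyone trying the no_turbo workaround mentioned above, here is a small sketch for checking and changing the turbo state. It only uses the `intel_pstate/no_turbo` sysfs file already quoted in this report; the function names `turbo_state` and `disable_turbo` are invented for the example, and the optional path argument exists only so the functions can be tried against a scratch file.

```shell
#!/bin/sh
# Sketch only: report and change the intel_pstate turbo setting.
# Assumption: the intel_pstate driver is active, so the no_turbo file exists.
NO_TURBO_DEFAULT=/sys/devices/system/cpu/intel_pstate/no_turbo

# Print whether turbo is currently enabled or disabled.
turbo_state() {
    knob=${1:-$NO_TURBO_DEFAULT}
    if [ ! -r "$knob" ]; then
        echo "unknown"
        return 1
    fi
    case $(cat "$knob") in
        1) echo "turbo disabled" ;;
        0) echo "turbo enabled" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

# Write 1 to no_turbo, which disables turbo boost. Needs root.
disable_turbo() {
    knob=${1:-$NO_TURBO_DEFAULT}
    if [ ! -w "$knob" ]; then
        echo "$knob is not writable (not root, or intel_pstate not in use)" >&2
        return 1
    fi
    echo 1 > "$knob"
}

# Example (as root):
#   turbo_state; disable_turbo; turbo_state
```

The value is reset at boot, so the write has to be repeated after every restart for the workaround to stay in effect.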
This bug appears to have been reported against 'rawhide' during the Fedora 22 development cycle. Changing version to '22'.

More information and reason for this action is here:
https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora22

this bug still appears on F22 (fresh install)
uname -a
Linux lomok2 4.0.4-303.fc22.x86_64 #1 SMP Thu May 28 12:37:06 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
lscpu | grep "Model name"
Model name: Intel(R) Core(TM) i3-2330M CPU @ 2.20GHz

Same here on a Lenovo ThinkPad T540p (upgraded installation):
uname -a
Linux gibraltar 4.0.4-303.fc22.x86_64 #1 SMP Thu May 28 12:37:06 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
lscpu | grep "Model name"
Model name: Intel(R) Core(TM) i7-4900MQ CPU @ 2.80GHz

I am seeing this on F22 with a vanilla kernel.org kernel (4.2.0-rc4), so it looks like this issue is not a regression in the Fedora kernel, but rather a more general one off of mainline.

I am having the same problem in F22 with kernel 4.1.3-200.fc22.x86_64 on a Lenovo W550s.
====================================================================
Aug 13 14:09:56 localhost kernel: mce: [Hardware Error]: Machine check events logged
Aug 13 14:09:56 localhost mcelog: Hardware event. This is not a software error.
Aug 13 14:09:56 localhost mcelog: MCE 0
Aug 13 14:09:56 localhost mcelog: CPU 2 THERMAL EVENT TSC 5a1094fd777
Aug 13 14:09:56 localhost mcelog: TIME 1439464080 Thu Aug 13 14:08:00 2015
Aug 13 14:09:56 localhost mcelog: Processor 2 heated above trip temperature. Throttling enabled.
Aug 13 14:09:56 localhost mcelog: Please check your system cooling. Performance will be impacted
Aug 13 14:09:56 localhost mcelog: STATUS 88200803 MCGSTATUS 0
Aug 13 14:09:56 localhost mcelog: MCGCAP 1000c07 APICID 2 SOCKETID 0
Aug 13 14:09:56 localhost mcelog: CPUID Vendor Intel Family 6 Model 61
Aug 13 14:09:56 localhost mcelog: Hardware event. This is not a software error.
Aug 13 14:09:56 localhost mcelog: MCE 1
Aug 13 14:09:56 localhost mcelog: CPU 3 THERMAL EVENT TSC 5a1095025dd
Aug 13 14:09:56 localhost mcelog: TIME 1439464080 Thu Aug 13 14:08:00 2015
Aug 13 14:09:56 localhost mcelog: Processor 3 heated above trip temperature. Throttling enabled.
Aug 13 14:09:56 localhost mcelog: Please check your system cooling. Performance will be impacted
Aug 13 14:09:56 localhost mcelog: STATUS 88200803 MCGSTATUS 0
Aug 13 14:09:56 localhost mcelog: MCGCAP 1000c07 APICID 3 SOCKETID 0
Aug 13 14:09:56 localhost mcelog: CPUID Vendor Intel Family 6 Model 61
Aug 13 14:09:56 localhost mcelog: Hardware event. This is not a software error.
Aug 13 14:09:56 localhost mcelog: MCE 2
Aug 13 14:09:56 localhost mcelog: CPU 3 THERMAL EVENT TSC 5a109990284
Aug 13 14:09:56 localhost mcelog: TIME 1439464080 Thu Aug 13 14:08:00 2015
Aug 13 14:09:56 localhost mcelog: Processor 3 below trip temperature. Throttling disabled
Aug 13 14:09:56 localhost mcelog: STATUS 88210802 MCGSTATUS 0
Aug 13 14:09:56 localhost mcelog: MCGCAP 1000c07 APICID 3 SOCKETID 0
Aug 13 14:09:56 localhost mcelog: CPUID Vendor Intel Family 6 Model 61
Aug 13 14:09:56 localhost mcelog: Hardware event. This is not a software error.
Aug 13 14:09:56 localhost mcelog: MCE 3
Aug 13 14:09:56 localhost mcelog: CPU 2 THERMAL EVENT TSC 5a10999272f
Aug 13 14:09:56 localhost mcelog: TIME 1439464080 Thu Aug 13 14:08:00 2015
Aug 13 14:09:56 localhost mcelog: Processor 2 below trip temperature. Throttling disabled
Aug 13 14:09:56 localhost mcelog: STATUS 88210802 MCGSTATUS 0
Aug 13 14:09:56 localhost mcelog: MCGCAP 1000c07 APICID 2 SOCKETID 0
Aug 13 14:09:56 localhost mcelog: CPUID Vendor Intel Family 6 Model 61
=======================================================================
I am monitoring CPU core temperatures with lm_sensors and xfce4-sensors-plugin. The temperatures seem to be varying between 46C and 58C.

Same here with a 3rd gen X1 Carbon.
1 choeger@oxide ~ % cat /proc/cpuinfo | grep "model name"
model name : Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
model name : Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
model name : Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
model name : Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz

I can trigger this behaviour by running md5sum /dev/urandom. This uses one core at 100% and should not cause thermal issues IMO. So it seems either fancontrol or turbo boost goes awry. I also noticed the following during the stress test:
choeger@oxide ~ % cat /proc/cpuinfo | grep MH
cpu MHz : 3100.398
cpu MHz : 3100.093
cpu MHz : 3199.218
cpu MHz : 3143.054
I am not an expert regarding CPUs, but given just one job at 100%, shouldn't the other cores be idle? It seems as if _all_ cores go into turbo mode (which makes thermal problems quite likely).

I can confirm the same as Christoph Höger above: all cores going into turbo mode with md5sum /dev/urandom

To protect my computer I adopted the (hopefully) temporary solution by Juha Heljoranta above, that is, added the following line into /etc/rc.d/rc.local
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
I have not seen the error since even though I tried to reproduce it. Also, as expected, none of the cores now go into turbo mode.

*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 22 kernel bugs.

Fedora 22 has now been rebased to 4.2.3-200.fc22. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 23, and are still experiencing this issue, please change the version to Fedora 23.

If you experience different issues, please open a new bug report for those.

Still seen on 4.2.3-200.fc22.x86_64 running on a ThinkPad T440p (i7-4600M).
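To repeat the check described above (whether every core clocks past its base frequency while a single-threaded load such as md5sum /dev/urandom is running), a rough sketch follows. It counts the "cpu MHz" lines in /proc/cpuinfo that exceed a given value. The function name `cores_above_mhz` is hypothetical, and 2600 is just the nominal 2.60GHz of the i7-5600U quoted above. Note that "cpu MHz" is only a snapshot, so the result is indicative at best.

```shell
#!/bin/sh
# Sketch only: count logical CPUs whose "cpu MHz" value in /proc/cpuinfo is
# above LIMIT (in MHz). The optional second argument names another file with
# the same format, for example a saved copy of /proc/cpuinfo.
cores_above_mhz() {
    limit=$1
    file=${2:-/proc/cpuinfo}
    awk -F: -v limit="$limit" '
        /^cpu MHz/ { total++; if ($2 + 0 > limit) above++ }
        END { printf "%d of %d logical CPUs above %d MHz\n", above, total, limit }
    ' "$file"
}

# Example, run in a second terminal while the load is active:
#   cores_above_mhz 2600
```

Running it once with turbo enabled and once after writing 1 to no_turbo gives a quick before and after comparison.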
Still failing in F22 i7-5600U

Still happens on 4.2.3-300.fc23.x86_64, same hardware as I mentioned in comment #39.

Still happens on 4.0.4-301.fc22.x86_64 on a Thinkpad X1 Carbon 3rd gen. However, the solution as posted in https://bugzilla.redhat.com/show_bug.cgi?id=924570#c34 works fine here, so it's obviously related to the Intel Turbo Boost. After deactivating this I can run 4 cores at full load without any MCE triggered.

Still happens on 4.2.5-300.fc23.x86_64 on Thinkpad X1 Carbon 3rd gen i7-5600U

Same problem here on X1 Carbon 3rd gen.

I'm having this problem on an X1 Carbon 3rd gen and Fedora 23

# dmidecode -t system | grep Version
Version: ThinkPad X1 Carbon 3rd
# uname -a
Linux jmoon 4.2.8-300.fc23.x86_64 #1 SMP Tue Dec 15 16:49:06 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

This is a sufficiently old bug, with no response from kernel maintainers as to what it means or what the user should do about it, that it probably needs to go upstream and be asked about on lkml. Right now the warnings are sufficiently dire, with no workaround other than to not use linux at all on the hardware. Clearly something is wrong if the manufacturer's diagnostics say the hardware is OK, and yet the kernel is claiming (nonspecific) hardware errors, but only after the CPU is allowed to get too hot. I don't get such overheating and crazy fan speeds running either OS X or Windows on the same hardware, so at the moment I think the burden is on kernel and cpu microcode experts to say what these messages mean.

Problem exists on X240 with latest F23.

Also exists on Thinkpad W541 with latest F23.

i5-2467M Samsung 530U3 laptop, running 4.4.5-300.fc23.x86-64

echo 1 > ...../intel_pstate/no_turbo

workaround works.

This sounds like it might be related:
https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.6-Thermal-Updates
I hope there's a back-port soon.

Fedora 22 changed to end-of-life (EOL) status on 2016-07-19.
Fedora 22 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed. Still occurs in f24, and see also the recently opened bugs #1301739 #1373881 #1284144 Yep it's present through 4.8.0-0.rc5.git1.1.fc25.x86_64 and then I also get four of these: Sep 07 10:08:01 f24m mcelog[824]: Hardware event. This is not a software error. Sep 07 10:08:01 f24m mcelog[824]: MCE 0 Sep 07 10:08:01 f24m mcelog[824]: CPU 5 THERMAL EVENT TSC a8b749118 Sep 07 10:08:01 f24m mcelog[824]: TIME 1473264479 Wed Sep 7 10:07:59 2016 Sep 07 10:08:01 f24m mcelog[824]: Processor 5 heated above trip temperature. Throttling enabled. Sep 07 10:08:01 f24m mcelog[824]: Please check your system cooling. Performance will be impacted Sep 07 10:08:01 f24m mcelog[824]: STATUS 880003c3 MCGSTATUS 0 Sep 07 10:08:01 f24m mcelog[824]: MCGCAP c09 APICID 3 SOCKETID 0 Sep 07 10:08:01 f24m mcelog[824]: CPUID Vendor Intel Family 6 Model 42 This is also seen on Scientific Linux 7.2 (RHEL 7.2 clone) with 3.10.0-327.36.1.el7.x86_64 on ThinkPad T460s ------------------------------------------------------------ Oct 13 13:21:49 aurelius mcelog: Hardware event. This is not a software error. Oct 13 13:21:49 aurelius mcelog: MCE 0 Oct 13 13:21:49 aurelius mcelog: CPU 2 THERMAL EVENT TSC 1e7192b3a89c Oct 13 13:21:49 aurelius mcelog: TIME 1476357675 Thu Oct 13 13:21:15 2016 Oct 13 13:21:49 aurelius mcelog: Processor 2 heated above trip temperature. Throttling enabled. Oct 13 13:21:49 aurelius mcelog: Please check your system cooling. 
Performance will be impacted Oct 13 13:21:49 aurelius mcelog: STATUS 8809080b MCGSTATUS 0 Oct 13 13:21:49 aurelius mcelog: MCGCAP 1000c07 APICID 2 SOCKETID 0 Oct 13 13:21:49 aurelius mcelog: CPUID Vendor Intel Family 6 Model 61 Oct 13 13:21:49 aurelius mcelog: Hardware event. This is not a software error. Oct 13 13:21:49 aurelius mcelog: MCE 1 Oct 13 13:21:49 aurelius mcelog: CPU 3 THERMAL EVENT TSC 1e7192b3ebc2 Oct 13 13:21:49 aurelius mcelog: TIME 1476357675 Thu Oct 13 13:21:15 2016 Oct 13 13:21:49 aurelius mcelog: Processor 3 heated above trip temperature. Throttling enabled. Oct 13 13:21:49 aurelius mcelog: Please check your system cooling. Performance will be impacted Oct 13 13:21:49 aurelius mcelog: STATUS 8809080b MCGSTATUS 0 Oct 13 13:21:49 aurelius mcelog: MCGCAP 1000c07 APICID 3 SOCKETID 0 Oct 13 13:21:49 aurelius mcelog: CPUID Vendor Intel Family 6 Model 61 Oct 13 13:21:49 aurelius mcelog: Hardware event. This is not a software error. Oct 13 13:21:49 aurelius mcelog: MCE 2 Oct 13 13:21:49 aurelius mcelog: CPU 2 THERMAL EVENT TSC 1e7192dab9ac Oct 13 13:21:49 aurelius mcelog: TIME 1476357675 Thu Oct 13 13:21:15 2016 Oct 13 13:21:49 aurelius mcelog: Processor 2 below trip temperature. Throttling disabled Oct 13 13:21:49 aurelius mcelog: STATUS 880a080a MCGSTATUS 0 Oct 13 13:21:49 aurelius mcelog: MCGCAP 1000c07 APICID 2 SOCKETID 0 Oct 13 13:21:49 aurelius mcelog: CPUID Vendor Intel Family 6 Model 61 Oct 13 13:21:49 aurelius mcelog: Hardware event. This is not a software error. Oct 13 13:21:49 aurelius mcelog: MCE 3 Oct 13 13:21:49 aurelius mcelog: CPU 3 THERMAL EVENT TSC 1e7192dae606 Oct 13 13:21:49 aurelius mcelog: TIME 1476357675 Thu Oct 13 13:21:15 2016 Oct 13 13:21:49 aurelius mcelog: Processor 3 below trip temperature. 
Throttling disabled Oct 13 13:21:49 aurelius mcelog: STATUS 880a080a MCGSTATUS 0 Oct 13 13:21:49 aurelius mcelog: MCGCAP 1000c07 APICID 3 SOCKETID 0 Oct 13 13:21:49 aurelius mcelog: CPUID Vendor Intel Family 6 Model 61 ------------------------------------------------------- vendor_id : GenuineIntel cpu family : 6 model : 61 model name : Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz stepping : 4 ------------------------------------------------------- kernel-3.10.0-327.36.1.el7.x86_64 microcode_ctl-2.1-12.el7_2.1.x86_64 mcelog-120-3.e7e0ac1.el7.x86_64 Trying the /sys/devices/system/cpu/intel_pstate/no_turbo workaround to see if that helps, I presume it will. Issue consistently reproduces on two different X1 carbon laptops (types 20BT and 20A7), across Fedora 22-23-24 and now 25 too. mcelog dmesg are consistent across the two machines, all the Fedora versions and the outputs already posted above. Still happening for me on ThinkPad W541 4.8.14-300.fc25.x86_64 Well unfortunately I'm also experienced this issue, I've installed https://github.com/dgraziotin/Fan-Control-Daemon and opened a bug asking for the work needed to do to get this into a upstream project: https://github.com/dgraziotin/mbpfan/issues/99 I'm not sure of what upstream project it would be, but at least let's open the discussion, because in the current state the experience is sub-optimal in Macs. (In reply to Paulo Fidalgo from comment #63) > Well unfortunately I'm also experienced this issue, I've installed > https://github.com/dgraziotin/Fan-Control-Daemon > and opened a bug asking for the work needed to do to get this into a > upstream project: > https://github.com/dgraziotin/mbpfan/issues/99 > > I'm not sure of what upstream project it would be, but at least let's open > the discussion, because in the current state the experience is sub-optimal > in Macs. Thanks for the link, but it seems like it's specific for macbooks. This BZ talks mostly about IBM/Lenovo laptops. 
BTW, I have the issue mitigated temporarily by disabling turbo boost:

# Set bit 38 (turbo disable) of IA32_MISC_ENABLE (MSR 0x1a0) on every core,
# then read the bit back to confirm the state.
cores=$(cat /proc/cpuinfo | grep processor | awk '{print $3}')
for core in $cores; do
    sudo wrmsr -p${core} 0x1a0 0x4000850089
    state=$(sudo rdmsr -p${core} 0x1a0 -f 38:38)
    if [[ $state -eq 1 ]]; then
        echo "core ${core}: disabled"
    else
        echo "core ${core}: enabled"
    fi
done

However, this isn't a solution, or even a workaround, just an ugly hack.

I recommend thermald. There's an older version in copr that should still work, but lately I'm just building it myself from upstream.
https://github.com/01org/thermal_daemon

*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 25 kernel bugs.

Fedora 25 has now been rebased to 4.9.3-200.fc25. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 26, and are still experiencing this issue, please change the version to Fedora 26.

If you experience different issues, please open a new bug report for those.

At least for Macbook Pro 12,1 the bug is still present in the kernel 4.9.3-200.fc25.x86_64. I've disabled the mbpfan service and after a while I've started to see the messages related to high temperatures.

Linux hostnamehere 4.9.4-201.fc25.x86_64 #1 SMP Tue Jan 17 18:58:54 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Model name: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
System: Lenovo X1 Carbon 4th generation

Disabling turbo-boost via the proc mentioned above results in no core ever going higher than 800MHz, which is a non-starter.
With turbo-boost enabled:

No md5sum /dev/urandom:
[ormandj@ormandj-laptop ~]$ cat /proc/cpuinfo |grep MHz
cpu MHz : 496.972
cpu MHz : 476.123
cpu MHz : 451.513
cpu MHz : 431.860

md5sum /dev/urandom:
[ormandj@ormandj-laptop ~]$ cat /proc/cpuinfo |grep MHz
cpu MHz : 3387.548
cpu MHz : 3190.161
cpu MHz : 3399.511
cpu MHz : 3151.196

CPU | sys 80% | user 25% | irq 1% | | idle 296% | wait 1% | | steal 0% | guest 0% | curf 3.19GHz | curscal 93% |
cpu | sys 78% | user 21% | irq 1% | | idle 0% | cpu000 w 0% | | steal 0% | guest 0% | curf 3.40GHz | curscal 99% |
cpu | sys 2% | user 3% | irq 0% | | idle 95% | cpu001 w 0% | | steal 0% | guest 0% | curf 3.05GHz | curscal 89% |
cpu | sys 0% | user 2% | irq 0% | | idle 98% | cpu003 w 0% | | steal 0% | guest 0% | curf 3.10GHz | curscal 91% |

However, I'm not seeing the thermal run-away issue re: MCE logged or alerting. The fans spin up to max and even after letting it run like this for 5 minutes, I've seen no alerts in dmesg. On the other hand, if I use KVM/virtualization and tax the cpu, I'll see:

[ 3678.935724] kvm [6814]: vcpu0, guest rIP: 0xffffffff81060d56 unhandled rdmsr: 0x611
[ 3678.935734] kvm [6814]: vcpu0, guest rIP: 0xffffffff81060d56 unhandled rdmsr: 0x639
[ 3678.935738] kvm [6814]: vcpu0, guest rIP: 0xffffffff81060d56 unhandled rdmsr: 0x641
[ 3678.935742] kvm [6814]: vcpu0, guest rIP: 0xffffffff81060d56 unhandled rdmsr: 0x619
[ 3678.992547] kvm [6814]: vcpu0, guest rIP: 0xffffffff81060d56 unhandled rdmsr: 0x611
[ 3678.992556] kvm [6814]: vcpu0, guest rIP: 0xffffffff81060d56 unhandled rdmsr: 0x639
[ 3678.992561] kvm [6814]: vcpu0, guest rIP: 0xffffffff81060d56 unhandled rdmsr: 0x641
[ 3678.992566] kvm [6814]: vcpu0, guest rIP: 0xffffffff81060d56 unhandled rdmsr: 0x619
[ 3678.997241] kvm [6814]: vcpu0, guest rIP: 0xffffffff81060d56 unhandled rdmsr: 0x60d
[ 3678.997247] kvm [6814]: vcpu0, guest rIP: 0xffffffff81060d56 unhandled rdmsr: 0x3f8
[ 3849.664358] CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 3849.664359] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 3849.664360] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3849.664361] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3849.664363] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3849.664367] CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 3849.664367] mce: [Hardware Error]: Machine check events logged
[ 3849.664369] mce: [Hardware Error]: Machine check events logged
[ 3849.665295] CPU2: Core temperature/speed normal
[ 3849.665295] CPU0: Core temperature/speed normal
[ 3849.665296] CPU3: Package temperature/speed normal
[ 3849.665297] CPU1: Package temperature/speed normal
[ 3849.665298] CPU0: Package temperature/speed normal
[ 3849.665299] CPU2: Package temperature/speed normal

In my case, at least, it appears the problem only occurs when using KVM for virtualization at this point with this version of the kernel.

I do not use KVM to reproduce the issue; a large number of chrome tabs is enough, especially when using something like bluejeans web video conferencing.

Still a problem with 4.10.0-0.rc8.git0.1.fc26.x86_64 on MacbookPro 8,2

It may be fixed in kernel 4.11 - I noticed that there were some P-State changes in the pull requests. I have installed the vanilla mainline listed on the Fedora wiki (rc0.git4) and have not been able to generate an MCE log error (I usually get a couple every hour during a work day).

I see it during Fedora installs and I'm not seeing it with Fedora-Workstation-Live-x86_64-Rawhide-20170226.n.0.iso which has 4.11.0-0.rc0.git4.1.fc26.x86_64.

Still see it with 4.10.4-200.fc25.x86_64.
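The throttle messages quoted in this thread follow a fixed pattern, so they can be tallied per CPU from saved kernel log output when comparing kernels. The following is a minimal sketch, assuming the message wording shown in the comments above ("CPUn: Core|Package temperature above threshold, cpu clock throttled (total events = N)"); the function name count_throttle_events is hypothetical.

```shell
#!/bin/sh
# Sketch: read kernel log text on stdin and print, for each CPU and
# sensor kind (Core or Package), the highest "total events" value seen.
# count_throttle_events is an illustrative name, not an existing tool.
count_throttle_events() {
    awk '
        /temperature above threshold, cpu clock throttled/ {
            cpu = ""; kind = ""; n = 0
            for (i = 1; i <= NF; i++) {
                # "CPU3:" is followed by the sensor kind.
                if ($i ~ /^CPU[0-9]+:$/) {
                    cpu = $i; sub(/:$/, "", cpu); kind = $(i + 1)
                }
                # "= 43249)" carries the running event counter.
                if ($i == "=") { n = $(i + 1); sub(/\)$/, "", n) }
            }
            key = cpu " " kind
            if (cpu != "" && n + 0 > max[key]) max[key] = n + 0
        }
        END { for (k in max) print k, max[k] }
    ' | LC_ALL=C sort
}
```

Typical use would be `dmesg | count_throttle_events` or `journalctl -k | count_throttle_events`. Because the parser keys on the "CPUn:" token rather than on the line prefix, it accepts both the raw dmesg form and the syslog-prefixed form that appear in this bug.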
I can confirm the issue seems to be gone using the kernel from https://dl.fedoraproject.org/pub/alt/rawhide-kernel-nodebug/x86_64

I can also confirm that the issue has gone away with the latest rawhide kernel (4.11-rc8 at the time of writing), on Fedora 25. I am using a Thinkpad T470 with an i7-7600U. Almost any load -- browsing with Firefox, mprime, VMs -- would cause the messages to appear. I still see the issue with 4.10.13-200.fc25.x86_64.

kernel 4.11.0-0.rc8.git4.1.fc27 did resolve the MCE issue, though there still were periodic thermal throttling messages logged. One note of caution - with the 4.11 kernel I was experiencing random BTRFS filesystem errors that would cause the filesystem to remount read-only. This happened multiple times. After each error I would reboot and scrub the filesystem to check for errors. None were found. I'm running BTRFS in a non-raid configuration on an encrypted partition. I reverted to 4.10 and haven't experienced any BTRFS issues.

As with Jesse in comment #76, this seems to have resolved the MCE issue, leaving the thermal throttling messages:

[ 232.898650] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 232.898651] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 232.898652] CPU4: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 232.898655] CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 232.898656] CPU5: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 232.898657] CPU6: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 232.898658] CPU7: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 232.898659] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 232.898660] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 232.898666] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 232.899629] CPU1: Core temperature/speed normal
[ 232.899629] CPU0: Core temperature/speed normal
[ 232.899630] CPU2: Package temperature/speed normal
[ 232.899631] CPU3: Package temperature/speed normal
[ 232.899632] CPU6: Package temperature/speed normal
[ 232.899633] CPU4: Package temperature/speed normal
[ 232.899634] CPU5: Package temperature/speed normal
[ 232.899634] CPU7: Package temperature/speed normal
[ 232.899635] CPU0: Package temperature/speed normal
[ 232.899636] CPU1: Package temperature/speed normal

This is on a Thinkpad T540p, i7-4900MQ running Fedora 25 with kernel-4.11.3-200.fc25.x86_64 from updates-testing, when building a fairly large C++ project with -j7 (build time a little over 4 mins, not cleaned out beforehand). I watched the frequencies of the cores and the temperature while the build was running; frequencies went up to 3.8G, then down again when the temperature approached the critical limit of 100°C. Just this one block of messages was logged shortly after the build was started.
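Watching core frequencies while a load runs, as several commenters did with grep on /proc/cpuinfo, can be reduced to a one-line summary per sample. This is a rough sketch only; max_cpu_mhz and watch_cpu_mhz are hypothetical helper names, and the only assumption about the system is the "cpu MHz" line format shown in the outputs quoted in this bug.

```shell
#!/bin/sh
# Sketch: report the highest per-core frequency from a cpuinfo-style file.
# max_cpu_mhz and watch_cpu_mhz are illustrative names.
max_cpu_mhz() {
    # Lines look like "cpu MHz : 3199.218"; take the largest value.
    awk -F: '/^cpu MHz/ { v = $2 + 0; if (v > max) max = v }
             END { printf "%.3f\n", max }' "${1:-/proc/cpuinfo}"
}

# Print one timestamped sample per second until interrupted with Ctrl-C.
watch_cpu_mhz() {
    while :; do
        printf '%s max %s MHz\n' "$(date +%T)" "$(max_cpu_mhz)"
        sleep 1
    done
}
```

Running `watch_cpu_mhz` in one terminal while starting the build or `md5sum /dev/urandom` in another shows whether the cores reach turbo frequencies and when they drop back.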
I repeated the build with a clean out source repo, it took a bit more than 13 minutes and the throttling message came almost precisely every 5 minutes/300 seconds (looks like log throttling to me): [ 4456.634244] CPU1: Core temperature above threshold, cpu clock throttled (total events = 36076) [ 4456.634244] CPU0: Core temperature above threshold, cpu clock throttled (total events = 36076) [ 4456.634246] CPU3: Package temperature above threshold, cpu clock throttled (total events = 43249) [ 4456.634249] CPU7: Package temperature above threshold, cpu clock throttled (total events = 43249) [ 4456.634250] CPU2: Package temperature above threshold, cpu clock throttled (total events = 43249) [ 4456.634251] CPU4: Package temperature above threshold, cpu clock throttled (total events = 43249) [ 4456.634252] CPU6: Package temperature above threshold, cpu clock throttled (total events = 43249) [ 4456.634253] CPU5: Package temperature above threshold, cpu clock throttled (total events = 43249) [ 4456.634254] CPU0: Package temperature above threshold, cpu clock throttled (total events = 43249) [ 4456.634260] CPU1: Package temperature above threshold, cpu clock throttled (total events = 43249) [ 4456.635249] CPU0: Core temperature/speed normal [ 4456.635250] CPU1: Core temperature/speed normal [ 4456.635251] CPU2: Package temperature/speed normal [ 4456.635252] CPU5: Package temperature/speed normal [ 4456.635253] CPU4: Package temperature/speed normal [ 4456.635254] CPU7: Package temperature/speed normal [ 4456.635255] CPU6: Package temperature/speed normal [ 4456.635255] CPU3: Package temperature/speed normal [ 4456.635256] CPU1: Package temperature/speed normal [ 4456.635256] CPU0: Package temperature/speed normal [ 4756.651133] CPU1: Core temperature above threshold, cpu clock throttled (total events = 77049) [ 4756.651134] CPU0: Core temperature above threshold, cpu clock throttled (total events = 77049) [ 4756.651136] CPU2: Package temperature above threshold, cpu clock 
throttled (total events = 119697) [ 4756.651138] CPU5: Package temperature above threshold, cpu clock throttled (total events = 119697) [ 4756.651139] CPU4: Package temperature above threshold, cpu clock throttled (total events = 119697) [ 4756.651142] CPU7: Package temperature above threshold, cpu clock throttled (total events = 119697) [ 4756.651143] CPU6: Package temperature above threshold, cpu clock throttled (total events = 119697) [ 4756.651144] CPU3: Package temperature above threshold, cpu clock throttled (total events = 119697) [ 4756.651145] CPU0: Package temperature above threshold, cpu clock throttled (total events = 119697) [ 4756.651152] CPU1: Package temperature above threshold, cpu clock throttled (total events = 119697) [ 4756.652136] CPU0: Core temperature/speed normal [ 4756.652137] CPU1: Core temperature/speed normal [ 4756.652138] CPU5: Package temperature/speed normal [ 4756.652139] CPU3: Package temperature/speed normal [ 4756.652140] CPU2: Package temperature/speed normal [ 4756.652141] CPU6: Package temperature/speed normal [ 4756.652142] CPU7: Package temperature/speed normal [ 4756.652143] CPU4: Package temperature/speed normal [ 4756.652143] CPU1: Package temperature/speed normal [ 4756.652144] CPU0: Package temperature/speed normal [ 5027.128441] perf: interrupt took too long (2501 > 2500), lowering kernel.perf_event_max_sample_rate to 79000 [ 5056.675053] CPU1: Core temperature/speed normal [ 5056.675053] CPU0: Core temperature/speed normal [ 5056.675055] CPU4: Package temperature/speed normal [ 5056.675056] CPU6: Package temperature/speed normal [ 5056.675058] CPU2: Package temperature/speed normal [ 5056.675059] CPU5: Package temperature/speed normal [ 5056.675060] CPU3: Package temperature/speed normal [ 5056.675061] CPU7: Package temperature/speed normal [ 5056.675061] CPU0: Package temperature/speed normal [ 5056.675062] CPU1: Package temperature/speed normal [ 5061.001186] CPU2: Core temperature above threshold, cpu clock 
throttled (total events = 100482) [ 5061.001187] CPU3: Core temperature above threshold, cpu clock throttled (total events = 100482) [ 5061.002181] CPU2: Core temperature/speed normal [ 5061.002182] CPU3: Core temperature/speed normal Still no MCE errors :). I have this very same type of the problem. I have two J1900 mini-ITX systems that I got from two different vendors in China. I also had a J1800 mini-ITX that I initially had before I sent its motherboard back to exchange it for a J1900 motherboard. The J1800 would throttle back and not return to normal speed until it was rebooted. The J1900s will indicate that they are throttling back and then in the same or next second would indicate that the speed had returned to normal. All three of these systems will freeze after running for one hour, several hours or even for some days. They will freeze eventually in all cases. The two J1900s are both running Fedora 25 and are properly updated to current packages. uname -a Linux gandalf.localnet 4.11.12-200.fc25.x86_64 #1 SMP Fri Jul 21 16:41:43 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Thread(s) per core: 1 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 55 Model name: Intel(R) Celeron(R) CPU J1900 @ 1.99GHz Stepping: 8 CPU MHz: 1999.200 CPU max MHz: 1999.2000 CPU min MHz: 1332.8000 BogoMIPS: 3998.40 Virtualization: VT-x L1d cache: 24K L1i cache: 32K L2 cache: 1024K NUMA node0 CPU(s): 0-3 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer rdrand lahf_lm 3dnowprefetch epb tpr_shadow vnmi 
flexpriority ept vpid tsc_adjust smep erms dtherm arat $ lspci 00:00.0 Host bridge: Intel Corporation Atom Processor Z36xxx/Z37xxx Series SoC Transaction Register (rev 0e) 00:02.0 VGA compatible controller: Intel Corporation Atom Processor Z36xxx/Z37xxx Series Graphics & Display (rev 0e) 00:13.0 SATA controller: Intel Corporation Atom Processor E3800 Series SATA AHCI Controller (rev 0e) 00:14.0 USB controller: Intel Corporation Atom Processor Z36xxx/Z37xxx, Celeron N2000 Series USB xHCI (rev 0e) 00:1a.0 Encryption controller: Intel Corporation Atom Processor Z36xxx/Z37xxx Series Trusted Execution Engine (rev 0e) 00:1b.0 Audio device: Intel Corporation Atom Processor Z36xxx/Z37xxx Series High Definition Audio Controller (rev 0e) 00:1c.0 PCI bridge: Intel Corporation Atom Processor E3800 Series PCI Express Root Port 1 (rev 0e) 00:1c.1 PCI bridge: Intel Corporation Atom Processor E3800 Series PCI Express Root Port 2 (rev 0e) 00:1c.2 PCI bridge: Intel Corporation Atom Processor E3800 Series PCI Express Root Port 3 (rev 0e) 00:1c.3 PCI bridge: Intel Corporation Atom Processor E3800 Series PCI Express Root Port 4 (rev 0e) 00:1f.0 ISA bridge: Intel Corporation Atom Processor Z36xxx/Z37xxx Series Power Control Unit (rev 0e) 00:1f.3 SMBus: Intel Corporation Atom Processor E3800 Series SMBus Controller (rev 0e) 01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 07) 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. 
RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 07)

A slice example from /var/log/messages

Aug 11 18:12:02 gandalf kernel: CPU0: Core temperature above threshold, cpu clock throttled (total events = 24)
Aug 11 18:12:02 gandalf kernel: CPU1: Core temperature above threshold, cpu clock throttled (total events = 24)
Aug 11 18:12:02 gandalf kernel: CPU2: Core temperature above threshold, cpu clock throttled (total events = 24)
Aug 11 18:12:05 gandalf kernel: CPU3: Core temperature above threshold, cpu clock throttled (total events = 24)
Aug 11 18:12:05 gandalf kernel: CPU0: Core temperature/speed normal
Aug 11 18:12:05 gandalf kernel: CPU1: Core temperature/speed normal
Aug 11 18:12:05 gandalf kernel: CPU2: Core temperature/speed normal
Aug 11 18:12:05 gandalf kernel: CPU3: Core temperature/speed normal

$ uptime
18:15:06 up 2:58, 13 users, load average: 1.84, 1.63, 1.53

$ sensors
acpitz-virtual-0
Adapter: Virtual device
temp1: +26.8°C (crit = +90.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Core 0: +34.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +34.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +34.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +34.0°C (high = +105.0°C, crit = +105.0°C)

I upgraded the two J1900 systems I mention in Comment 79 from Fedora 25 to Fedora 26. I still see the temperature above threshold errors in the messages file, but neither system hangs any more. Go figure!

Today, with 4.16.14-300.fc28.x86_64 and a Dell XPS with an Intel(R) Core(TM) i7-8550U CPU, I still have this issue. The system does not hang, but I have a lot of messages in dmesg.
[19655.510079] CPU7: Package temperature above threshold, cpu clock throttled (total events = 212)
[19655.510080] CPU3: Package temperature above threshold, cpu clock throttled (total events = 212)
[19655.510082] CPU1: Package temperature above threshold, cpu clock throttled (total events = 212)
[19655.510085] CPU5: Package temperature above threshold, cpu clock throttled (total events = 212)
[19655.510112] CPU0: Package temperature above threshold, cpu clock throttled (total events = 212)
[19655.510113] CPU6: Package temperature above threshold, cpu clock throttled (total events = 212)
[19655.510113] CPU4: Package temperature above threshold, cpu clock throttled (total events = 212)
[19655.510114] CPU2: Package temperature above threshold, cpu clock throttled (total events = 212)

I'm seeing this with kernel 5.1.18-300.fc30.x86_64 on a fanless Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz, in a Zotac CI660. The messages seem to come from the kernel, as mcelog doesn't record any problem. It seems a bit flaky, as I can run tests with "stress -c N" on other Intel CPUs and observe them throttling down with "grep MHz </proc/cpuinfo", but without kernel spamming like this. Here's a typical message, issued for each logical CPU, each time it happens:

kernel: mce: CPU1: Package temperature/speed normal

The system has hung once, but not under abnormally high load, just the screensaver under xfce.

Haven't experienced this in a while. Currently on Fedora 36 (5.17.13). Recommend closing this 9-year-old bug (which I reproduced 7 years ago).