Description of problem:
Running an intensive process that loads the CPU generates warning messages on my company-issued Lenovo T460s (specs below). I think this is due to an unhandled 'mce' event, per output of 'dmesg':

$ dmesg | tail -30
...
[ 9550.913754] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 9550.913755] CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 9550.913775] CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 9550.913777] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 9550.913779] mce: [Hardware Error]: Machine check events logged
[ 9550.913780] mce: [Hardware Error]: Machine check events logged
[ 9550.913781] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 9550.913782] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 9550.914747] CPU2: Core temperature/speed normal
[ 9550.914750] CPU2: Package temperature/speed normal
[ 9550.914752] CPU0: Core temperature/speed normal
[ 9550.914753] CPU3: Package temperature/speed normal
[ 9550.914753] CPU1: Package temperature/speed normal
[ 9550.914756] CPU0: Package temperature/speed normal
...

Once the processor has cooled down I stop seeing the errors.

Version-Release number of selected component (if applicable):
Linux redbox 4.6.6-300.fc24.x86_64 #1 SMP Wed Aug 10 21:07:35 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Always

Steps to Reproduce:
1. Run an intensive process, like 'tox' on the openstack nova project
   $ tox -e py27
2. Wait for error messages to appear

Actual results:
Error messages appear in the gnome notification area. These are not reportable as they are system errors.

Expected results:
N/A

Additional info:
CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
Mem: 19G (19965188 kB)
I should add that the warning messages come through the Gnome notification area: I discovered the dmesg output while trying to figure out what the issue is. 'mcelog' is no help:
$ sudo mcelog --client
$ sudo mcelog
mcelog: Family 6 Model 4e CPU: only decoding architectural errors
This happens all the time on a T450 as well, regardless of kernel version and with the latest BIOS/firmware. It cuts battery life to around 60-90 minutes.
Same issue with Lenovo X1 Carbon with below specs: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Thread(s) per core: 2 Core(s) per socket: 2 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 78 Model name: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz Stepping: 3 CPU MHz: 2100.000 CPU max MHz: 3400.0000 CPU min MHz: 400.0000 BogoMIPS: 5615.89 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 4096K NUMA node0 CPU(s): 0-3 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp Probably more info than needed, but: [41849.676814] thinkpad_acpi: EC reports that Thermal Table has changed [43683.991324] thinkpad_acpi: EC reports that Thermal Table has changed [62613.701635] perf: interrupt took too long (4086 > 4073), lowering kernel.perf_event_max_sample_rate to 48000 [78848.151700] thinkpad_acpi: EC reports that Thermal Table has changed [80415.114750] thinkpad_acpi: EC reports that Thermal Table has changed [95938.642762] ------------[ cut here ]------------ [95938.642802] WARNING: CPU: 1 PID: 723 at drivers/net/wireless/intel/iwlwifi/mvm/tx.c:1377 iwl_mvm_rx_tx_cmd+0x7fd/0xa10 [iwlmvm] [95938.642807] Modules linked in: rfcomm ccm fuse ip6t_REJECT nf_reject_ipv6 xt_conntrack ip6t_rpfilter ip_set nfnetlink ebtable_nat ebtable_broute bridge ip6table_nat 
nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw ip6table_mangle ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw iptable_mangle iptable_security ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep vfat fat snd_soc_skl snd_soc_skl_ipc snd_soc_sst_ipc snd_soc_sst_dsp snd_hda_ext_core snd_soc_sst_match intel_rapl arc4 snd_soc_core x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_codec_generic snd_compress kvm iTCO_wdt iTCO_vendor_support snd_pcm_dmaengine ac97_bus mei_wdt snd_hda_intel acer_wmi snd_hda_codec btusb [95938.642897] sparse_keymap btrtl btbcm iwlmvm btintel bluetooth mac80211 irqbypass crct10dif_pclmul uvcvideo crc32_pclmul snd_hda_core ghash_clmulni_intel snd_hwdep intel_cstate intel_rapl_perf snd_seq videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core iwlwifi snd_seq_device videodev snd_pcm media cfg80211 joydev rtsx_pci_ms memstick i2c_i801 thinkpad_acpi snd_timer mei_me snd mei shpchp soundcore rfkill wmi tpm_crb intel_pch_thermal tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace vboxpci(OE) vboxnetadp(OE) sunrpc vboxnetflt(OE) vboxdrv(OE) 8021q garp stp llc mrp i915 rtsx_pci_sdmmc mmc_core i2c_algo_bit drm_kms_helper e1000e crc32c_intel drm ptp serio_raw nvme pps_core nvme_core rtsx_pci video fjes [95938.643011] CPU: 1 PID: 723 Comm: irq/129-iwlwifi Tainted: G W OE 4.7.3-200.fc24.x86_64 #1 [95938.643016] Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET41W (1.15 ) 06/23/2016 [95938.643021] 0000000000000286 00000000a857e9b6 ffff88040d79bbb8 ffffffffb63d961f [95938.643030] 0000000000000000 0000000000000000 ffff88040d79bbf8 ffffffffb609faab [95938.643038] 000005610d79bc08 0000000000000000 0000000000000000 0000000000000600 [95938.643046] Call Trace: [95938.643062] [<ffffffffb63d961f>] dump_stack+0x63/0x84 [95938.643072] [<ffffffffb609faab>] __warn+0xcb/0xf0 [95938.643081] [<ffffffffb609fbdd>] 
warn_slowpath_null+0x1d/0x20 [95938.643108] [<ffffffffc0a4543d>] iwl_mvm_rx_tx_cmd+0x7fd/0xa10 [iwlmvm] [95938.643131] [<ffffffffc0a3b4bc>] iwl_mvm_rx_common+0x18c/0x2a0 [iwlmvm] [95938.643150] [<ffffffffc0a3b62b>] iwl_mvm_rx+0x5b/0x70 [iwlmvm] [95938.643167] [<ffffffffc088d6eb>] iwl_pcie_rx_handle+0x30b/0x860 [iwlwifi] [95938.643185] [<ffffffffc088f22d>] iwl_pcie_irq_handler+0x6ad/0xae0 [iwlwifi] [95938.643193] [<ffffffffb67e7102>] ? __schedule+0x2f2/0x780 [95938.643202] [<ffffffffb60fc7c0>] ? irq_forced_thread_fn+0x70/0x70 [95938.643210] [<ffffffffb60fc7e0>] irq_thread_fn+0x20/0x50 [95938.643218] [<ffffffffb60fca1d>] irq_thread+0x12d/0x1b0 [95938.643226] [<ffffffffb60fc840>] ? wake_threads_waitq+0x30/0x30 [95938.643234] [<ffffffffb60fc8f0>] ? irq_thread_dtor+0xb0/0xb0 [95938.643242] [<ffffffffb60bf4d8>] kthread+0xd8/0xf0 [95938.643251] [<ffffffffb67eba7f>] ret_from_fork+0x1f/0x40 [95938.643259] [<ffffffffb60bf400>] ? kthread_worker_fn+0x180/0x180 [95938.643265] ---[ end trace 9d3276affb4f208e ]--- [143127.395226] CPU2: Core temperature above threshold, cpu clock throttled (total events = 1) [143127.395227] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1) [143127.395228] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1) [143127.395228] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1) [143127.395231] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1) [143127.395234] CPU2: Package temperature above threshold, cpu clock throttled (total events = 1) [143127.395237] mce: [Hardware Error]: Machine check events logged [143127.395237] mce: [Hardware Error]: Machine check events logged
I have been getting these errors for a long time, even while running simple applications such as BlueJeans or Google Docs. Initially I thought it was a cooling issue. Lenovo replaced the heat sink, but the issue continues; only the number of occurrences has gone down.
----------------------------------------------------------------------------------------------------
Sep 25 11:35:09 mcelog: CPUID Vendor Intel Family 6 Model 78
Sep 25 11:35:09 mcelog: mcelog: Family 6 Model 4e CPU: only decoding architectural errors
Sep 25 11:35:09 mcelog: Hardware event. This is not a software error.
Sep 25 11:35:09 mcelog: MCE 1
Sep 25 11:35:09 mcelog: CPU 0 THERMAL EVENT TSC 13859093a8
Sep 25 11:35:09 mcelog: TIME 1474783509 Sun Sep 25 11:35:09 2016
Sep 25 11:35:09 mcelog: Processor 0 below trip temperature. Throttling disabled
Sep 25 11:35:09 mcelog: STATUS 881a2802 MCGSTATUS 0
Sep 25 11:35:09 mcelog: MCGCAP c08 APICID 0 SOCKETID 0
Sep 25 11:35:09 mcelog: CPUID Vendor Intel Family 6 Model 78
Sep 25 11:35:09 mcelog: mcelog: Family 6 Model 4e CPU: only decoding architectural errors
Sep 25 11:35:09 mcelog: Hardware event. This is not a software error.
Sep 25 11:35:09 mcelog: MCE 2
Sep 25 11:35:09 mcelog: CPU 2 THERMAL EVENT TSC 1387fe7336
Sep 25 11:35:09 mcelog: TIME 1474783509 Sun Sep 25 11:35:09 2016
Sep 25 11:35:09 mcelog: Processor 2 heated above trip temperature. Throttling enabled.
Sep 25 11:35:09 mcelog: Please check your system cooling. Performance will be impacted
Sep 25 11:35:09 mcelog: STATUS 88192803 MCGSTATUS 0
Sep 25 11:35:09 mcelog: MCGCAP c08 APICID 1 SOCKETID 0
Sep 25 11:35:09 mcelog: CPUID Vendor Intel Family 6 Model 78
----------------------------------------------------------------------------------------------------
[ 14.538652] CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 14.538653] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 14.538656] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 14.538674] mce: [Hardware Error]: Machine check events logged
[ 14.538680] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 14.538680] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 14.539633] CPU0: Core temperature/speed normal
[ 14.539634] CPU0: Package temperature/speed normal
[ 14.539635] mce: [Hardware Error]: Machine check events logged
[ 14.539661] CPU1: Package temperature/speed normal
[ 14.539661] CPU3: Package temperature/speed normal
[ 14.554147] CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 14.831765] EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
[ 14.842704] EXT4-fs (dm-7): mounted filesystem with ordered data mode. Opts: (null)
----------------------------------------------------------------------------------------------------
cat /etc/profile.d/thermal-mce.sh shows:
#!/bin/bash
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
This is on Fedora 23, and I have no idea when this was introduced as a "fix" for this bug. The fact is that despite this "fix" the error still occurs, albeit less frequently and only under higher load. But a fix it is not; quite the opposite. I'd really like a solution that doesn't simply disable a performance-enhancing hardware capability. So far, this problem has never turned up on LKML (please correct me if I'm wrong), so I'd really like to ask the fine people of Fedora to initiate a process that actually finds a solution, not a workaround.
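To check whether the workaround is active on a given machine, the knob written by thermal-mce.sh can simply be read back. The following is a minimal sketch, not part of any Fedora package: the function name turbo_state and the NO_TURBO_FILE variable are hypothetical, and only the sysfs path is taken from the script quoted above.

```shell
#!/bin/bash
# Sketch: report whether turbo is disabled through the intel_pstate knob.
# NO_TURBO_FILE is a hypothetical override so the function can be tried
# against an ordinary file; the default is the path used by thermal-mce.sh.
NO_TURBO_FILE="${NO_TURBO_FILE:-/sys/devices/system/cpu/intel_pstate/no_turbo}"

turbo_state() {
    local file="${1:-$NO_TURBO_FILE}"
    if [ ! -r "$file" ]; then
        echo "unknown (cannot read $file)"
        return 1
    fi
    # The knob holds a single digit: 1 means turbo is turned off.
    case "$(cat "$file")" in
        1) echo "turbo disabled" ;;
        0) echo "turbo enabled" ;;
        *) echo "unknown"; return 1 ;;
    esac
}
```

Calling turbo_state with no argument reads the real knob. Assuming the intel_pstate driver is in use, writing 0 to the same file as root (for example with `echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo`) should turn turbo back on until the next time the script runs.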
Same here - fc24 with Lenovo t460s.
Same here on two types of X1 Carbon, with Fedora 22, 23, 24 and, as of today, 25. I reinstalled one of the X1s with CentOS 7.3 and the error also exists there. In one of the older BZs for the same issue it was observed that the error started showing up after Fedora moved away from kernel 3.9.
Also on a t460p with f24
Do you all see the horrendous battery life as well? For me, the CPUs run at full speed all the time. I can manually slow them down and then battery life is as expected (I actually wrote a little script to do this for when I go on battery...).
No, that aspect is OK. The overheating really only happens with a CPU-intensive job.
I have the same problem on an x1c4 with both fedora 24 and 25. Mcelog gives me:
mcelog: Family 6 Model 4e CPU: only decoding architectural errors
mcelog: warning: 16 bytes ignored in each record
mcelog: consider an update
Interestingly, I am also seeing something very similar on a HP EliteBook 840 running Red Hat Enterprise Linux Workstation release 7.3 (Maipo). The fact that the CPUs ostensibly overheat and then again cool down within the same second doesn't really sound super-plausible to me, but what do I know... /var/log/messages: Feb 28 13:50:07 avosetti kernel: CPU3: Core temperature above threshold, cpu clock throttled (total events = 1087) Feb 28 13:50:07 avosetti kernel: CPU2: Core temperature above threshold, cpu clock throttled (total events = 1087) Feb 28 13:50:07 avosetti kernel: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1872) Feb 28 13:50:07 avosetti kernel: CPU0: Package temperature above threshold, cpu clock throttled (total events = 1872) Feb 28 13:50:07 avosetti kernel: CPU2: Package temperature above threshold, cpu clock throttled (total events = 1872) Feb 28 13:50:07 avosetti kernel: mce: [Hardware Error]: Machine check events logged Feb 28 13:50:07 avosetti kernel: CPU3: Package temperature above threshold, cpu clock throttled (total events = 1872) Feb 28 13:50:07 avosetti kernel: mce: [Hardware Error]: Machine check events logged Feb 28 13:50:07 avosetti kernel: CPU3: Core temperature/speed normal Feb 28 13:50:07 avosetti kernel: CPU2: Core temperature/speed normal Feb 28 13:50:07 avosetti kernel: CPU1: Package temperature/speed normal Feb 28 13:50:07 avosetti kernel: CPU0: Package temperature/speed normal Feb 28 13:50:07 avosetti kernel: CPU2: Package temperature/speed normal Feb 28 13:50:07 avosetti kernel: CPU3: Package temperature/speed normal Feb 28 13:50:08 avosetti sh: abrt-dump-oops: Found oopses: 1 Feb 28 13:50:08 avosetti sh: abrt-dump-oops: Creating problem directories Feb 28 13:50:08 avosetti sh: abrt-dump-oops: Not going to make dump directories world readable because PrivateReports is on Feb 28 13:50:09 avosetti abrt-dump-oops: Reported 1 kernel oopses to Abrt For slightly better time stamps: % dmesg -e [Feb28 
13:50] CPU3: Core temperature above threshold, cpu clock throttled (total events = 1087) [ +0,000002] CPU2: Core temperature above threshold, cpu clock throttled (total events = 1087) [ +0,000001] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1872) [ +0,000001] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1872) [ +0,000001] CPU2: Package temperature above threshold, cpu clock throttled (total events = 1872) [ +0,000002] mce: [Hardware Error]: Machine check events logged [ +0,000005] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1872) [ +0,000002] mce: [Hardware Error]: Machine check events logged [ +0,000978] CPU3: Core temperature/speed normal [ +0,000001] CPU2: Core temperature/speed normal [ +0,000000] CPU1: Package temperature/speed normal [ +0,000001] CPU0: Package temperature/speed normal [ +0,000001] CPU2: Package temperature/speed normal [ +0,000004] CPU3: Package temperature/speed normal % mcelog Hardware event. This is not a software error. MCE 0 CPU 3 THERMAL EVENT TSC 25c769db5d390 TIME 1487593124 Mon Feb 20 14:18:44 2017 Processor 3 heated above trip temperature. Throttling enabled. Please check your system cooling. Performance will be impacted STATUS 88010803 MCGSTATUS 0 MCGCAP 1000c07 APICID 3 SOCKETID 0 CPUID Vendor Intel Family 6 Model 61 Hardware event. This is not a software error. MCE 1 CPU 2 THERMAL EVENT TSC 25c769db6234d TIME 1487593124 Mon Feb 20 14:18:44 2017 Processor 2 heated above trip temperature. Throttling enabled. Please check your system cooling. Performance will be impacted STATUS 88010803 MCGSTATUS 0 MCGCAP 1000c07 APICID 2 SOCKETID 0 CPUID Vendor Intel Family 6 Model 61 Hardware event. This is not a software error. MCE 2 CPU 3 THERMAL EVENT TSC 25c769ddd59d2 TIME 1487593124 Mon Feb 20 14:18:44 2017 Processor 3 below trip temperature. 
Throttling disabled STATUS 88020802 MCGSTATUS 0 MCGCAP 1000c07 APICID 3 SOCKETID 0 CPUID Vendor Intel Family 6 Model 61 Hardware event. This is not a software error. MCE 3 CPU 2 THERMAL EVENT TSC 25c769ddd90dd TIME 1487593124 Mon Feb 20 14:18:44 2017 Processor 2 below trip temperature. Throttling disabled STATUS 88020802 MCGSTATUS 0 MCGCAP 1000c07 APICID 2 SOCKETID 0 CPUID Vendor Intel Family 6 Model 61 Hardware event. This is not a software error. MCE 4 CPU 1 THERMAL EVENT TSC 4b9047804e441 TIME 1487849371 Thu Feb 23 13:29:31 2017 Processor 1 heated above trip temperature. Throttling enabled. Please check your system cooling. Performance will be impacted STATUS 88010803 MCGSTATUS 0 MCGCAP 1000c07 APICID 1 SOCKETID 0 CPUID Vendor Intel Family 6 Model 61 Hardware event. This is not a software error. MCE 5 CPU 0 THERMAL EVENT TSC 4b9047805565a TIME 1487849371 Thu Feb 23 13:29:31 2017 Processor 0 heated above trip temperature. Throttling enabled. Please check your system cooling. Performance will be impacted STATUS 88010803 MCGSTATUS 0 MCGCAP 1000c07 APICID 0 SOCKETID 0 CPUID Vendor Intel Family 6 Model 61 Hardware event. This is not a software error. MCE 6 CPU 1 THERMAL EVENT TSC 4b904782d73cb TIME 1487849371 Thu Feb 23 13:29:31 2017 Processor 1 below trip temperature. Throttling disabled STATUS 88020802 MCGSTATUS 0 MCGCAP 1000c07 APICID 1 SOCKETID 0 CPUID Vendor Intel Family 6 Model 61 Hardware event. This is not a software error. MCE 7 CPU 0 THERMAL EVENT TSC 4b904782da5d2 TIME 1487849371 Thu Feb 23 13:29:31 2017 Processor 0 below trip temperature. Throttling disabled STATUS 88020802 MCGSTATUS 0 MCGCAP 1000c07 APICID 0 SOCKETID 0 CPUID Vendor Intel Family 6 Model 61 Hardware event. This is not a software error. MCE 8 CPU 0 THERMAL EVENT TSC 4b9bec23240d1 TIME 1487849680 Thu Feb 23 13:34:40 2017 Processor 0 heated above trip temperature. Throttling enabled. Please check your system cooling. 
Performance will be impacted STATUS 88010a83 MCGSTATUS 0 MCGCAP 1000c07 APICID 0 SOCKETID 0 CPUID Vendor Intel Family 6 Model 61 Hardware event. This is not a software error. MCE 9 CPU 1 THERMAL EVENT TSC 4b9bec23318c5 TIME 1487849680 Thu Feb 23 13:34:40 2017 Processor 1 heated above trip temperature. Throttling enabled. Please check your system cooling. Performance will be impacted STATUS 88010a83 MCGSTATUS 0 MCGCAP 1000c07 APICID 1 SOCKETID 0 CPUID Vendor Intel Family 6 Model 61 Hardware event. This is not a software error. MCE 10 CPU 0 THERMAL EVENT TSC 4b9bec281fc8c TIME 1487849680 Thu Feb 23 13:34:40 2017 Processor 0 below trip temperature. Throttling disabled STATUS 88020a82 MCGSTATUS 0 MCGCAP 1000c07 APICID 0 SOCKETID 0 CPUID Vendor Intel Family 6 Model 61 Hardware event. This is not a software error. MCE 11 CPU 1 THERMAL EVENT TSC 4b9bec282340e TIME 1487849680 Thu Feb 23 13:34:40 2017 Processor 1 below trip temperature. Throttling disabled STATUS 88020a82 MCGSTATUS 0 MCGCAP 1000c07 APICID 1 SOCKETID 0 CPUID Vendor Intel Family 6 Model 61 Hardware event. This is not a software error. MCE 12 CPU 2 THERMAL EVENT TSC 4ba76050d785f TIME 1487849983 Thu Feb 23 13:39:43 2017 Processor 2 heated above trip temperature. Throttling enabled. Please check your system cooling. Performance will be impacted STATUS 88010a83 MCGSTATUS 0 MCGCAP 1000c07 APICID 2 SOCKETID 0 CPUID Vendor Intel Family 6 Model 61 Hardware event. This is not a software error. MCE 13 CPU 3 THERMAL EVENT TSC 4ba76050d9420 TIME 1487849983 Thu Feb 23 13:39:43 2017 Processor 3 heated above trip temperature. Throttling enabled. Please check your system cooling. Performance will be impacted STATUS 88010a83 MCGSTATUS 0 MCGCAP 1000c07 APICID 3 SOCKETID 0 CPUID Vendor Intel Family 6 Model 61 Hardware event. This is not a software error. MCE 14 CPU 3 THERMAL EVENT TSC 4ba7605ab417c TIME 1487849983 Thu Feb 23 13:39:43 2017 Processor 3 below trip temperature. 
Throttling disabled STATUS 88020a82 MCGSTATUS 0 MCGCAP 1000c07 APICID 3 SOCKETID 0 CPUID Vendor Intel Family 6 Model 61 Hardware event. This is not a software error. MCE 15 CPU 2 THERMAL EVENT TSC 4ba7605ab9fe3 TIME 1487849983 Thu Feb 23 13:39:43 2017 Processor 2 below trip temperature. Throttling disabled STATUS 88020a82 MCGSTATUS 0 MCGCAP 1000c07 APICID 2 SOCKETID 0 CPUID Vendor Intel Family 6 Model 61 Hardware event. This is not a software error. MCE 16 CPU 2 THERMAL EVENT TSC 8b7225a721587 TIME 1488282607 Tue Feb 28 13:50:07 2017 Processor 2 heated above trip temperature. Throttling enabled. Please check your system cooling. Performance will be impacted STATUS 88010a83 MCGSTATUS 0 MCGCAP 1000c07 APICID 2 SOCKETID 0 CPUID Vendor Intel Family 6 Model 61 Hardware event. This is not a software error. MCE 17 CPU 3 THERMAL EVENT TSC 8b7225a7269e9 TIME 1488282607 Tue Feb 28 13:50:07 2017 Processor 3 heated above trip temperature. Throttling enabled. Please check your system cooling. Performance will be impacted STATUS 88010a83 MCGSTATUS 0 MCGCAP 1000c07 APICID 3 SOCKETID 0 CPUID Vendor Intel Family 6 Model 61 Hardware event. This is not a software error. MCE 18 CPU 2 THERMAL EVENT TSC 8b7225a994d17 TIME 1488282607 Tue Feb 28 13:50:07 2017 Processor 2 below trip temperature. Throttling disabled STATUS 88020a82 MCGSTATUS 0 MCGCAP 1000c07 APICID 2 SOCKETID 0 CPUID Vendor Intel Family 6 Model 61 Hardware event. This is not a software error. MCE 19 CPU 3 THERMAL EVENT TSC 8b7225a99845d TIME 1488282607 Tue Feb 28 13:50:07 2017 Processor 3 below trip temperature. Throttling disabled STATUS 88020a82 MCGSTATUS 0 MCGCAP 1000c07 APICID 3 SOCKETID 0 CPUID Vendor Intel Family 6 Model 61 % uname -a Linux avosetti.x.csc.fi 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux % cat /etc/redhat-release Red Hat Enterprise Linux Workstation release 7.3 (Maipo)
Happy 4th birthday, dear MCE-bug! Introduced in 2013, this bug is still there on Fedora 25. Today I received a kernel upgrade to 4.10.5-200.fc25.x86_64, but the bug is still there. There seems to have been a slight change in the systemd-journald package, which now spams ALL consoles, no matter which user, no matter whether you log in on a framebuffer console or in Konsole as an X11 application. I changed that behaviour by editing /etc/systemd/journald.conf:
ForwardToWall=no
ForwardToConsole=no
and then running:
systemctl restart systemd-journald
Still, isn't there a way to make this bug go away? Fedora has tried to mitigate the situation by disabling the Intel Turbo feature, yet the MCE notification is still there. In /etc/profile.d/thermal-mce.sh:
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
Is there any way we can make someone at Red Hat and/or Intel aware of this problem and find a solution? AFAICS this specific bug has not yet found its way to LKML. I'd rather not post there, so maybe a more experienced person with some good standing on LKML can address this problem. I'd really like to use all capabilities of my X1 3rd gen to their fullest extent, including the Intel Turbo.
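The same two journald settings can also be kept in a drop-in file rather than edited into the main configuration, so that package updates do not overwrite them. This is only a sketch: the function name write_journald_dropin and the file name no-forward.conf are made up for illustration, while the journald.conf.d directory and the two option names come from systemd's journald.conf documentation and the comment above.

```shell
#!/bin/bash
# Sketch: write the two forwarding settings into a journald drop-in file.
# The directory argument exists only so the function can be tried in a
# scratch directory; the default is the system-wide drop-in location.
write_journald_dropin() {
    local dir="${1:-/etc/systemd/journald.conf.d}"
    mkdir -p "$dir" || return 1
    # "no-forward.conf" is a hypothetical name; any *.conf name works.
    cat > "$dir/no-forward.conf" <<'EOF'
[Journal]
ForwardToWall=no
ForwardToConsole=no
EOF
}
```

Run as root without an argument, this writes the drop-in to the default location; journald still has to be restarted afterwards with `systemctl restart systemd-journald`, as in the comment above.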
Addendum to my previous posting: despite the change to the journald configuration, the system still logs all MCE notifications to Konsole. By now the whole problem has developed into a nightmare.
Got the same on a shiny new T460s. I also get some other memory errors; not sure if these are related or not.
[158748.687661] CPU2: Core temperature above threshold, cpu clock throttled (total events = 124)
[158748.687662] CPU0: Core temperature above threshold, cpu clock throttled (total events = 124)
[158748.687665] CPU0: Package temperature above threshold, cpu clock throttled (total events = 124)
[158748.687684] CPU2: Package temperature above threshold, cpu clock throttled (total events = 124)
[158748.687687] CPU3: Package temperature above threshold, cpu clock throttled (total events = 124)
[158748.687688] CPU1: Package temperature above threshold, cpu clock throttled (total events = 124)
[158748.687689] mce_notify_irq: 1 callbacks suppressed
[158748.687690] mce: [Hardware Error]: Machine check events logged
[158748.687708] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 128: 000000008819280b
[158748.687710] mce: [Hardware Error]: TSC 195756741aa83
[158748.687713] mce: [Hardware Error]: PROCESSOR 0:406e3 TIME 1491387946 SOCKET 0 APIC 0 microcode 9e
[158748.687715] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 128: 000000008819280b
[158748.687716] mce: [Hardware Error]: TSC 1957567428bb8
[158748.687719] mce: [Hardware Error]: PROCESSOR 0:406e3 TIME 1491387946 SOCKET 0 APIC 1 microcode 9e
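The "Bank 128" values above look like thermal status words rather than memory errors. As a rough sketch, they can be split into fields with shell arithmetic. The assumption here, which is not confirmed anywhere in this thread, is that the value follows the bit layout of Intel's IA32_THERM_STATUS register: bit 0 is the current "above trip point" flag, bits 22:16 are a digital readout in degrees C below the throttle temperature, and bit 31 marks the readout as valid. The function name decode_therm_status is hypothetical.

```shell
#!/bin/bash
# Sketch: split a thermal STATUS word into a few fields.
# Assumption: the word uses the IA32_THERM_STATUS bit layout.
decode_therm_status() {
    local status=$(( 0x${1#0x} ))
    local above=$(( status & 1 ))               # bit 0: above trip point now
    local readout=$(( (status >> 16) & 0x7f ))  # bits 22:16: degrees C below throttle point
    local valid=$(( (status >> 31) & 1 ))       # bit 31: readout is valid
    echo "above_trip=$above readout=$readout valid=$valid"
}

# Value taken from the Bank 128 lines above.
decode_therm_status 000000008819280b
```

The bit 0 reading at least agrees with the mcelog output quoted earlier in this bug: STATUS 88192803 was reported as "heated above trip temperature" and STATUS 881a2802 as "below trip temperature".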
I'm seeing the same issue on a Dell Inspiron 7378 with an i7-7500U. I re-pasted the heatsink thinking that might help, but it didn't. This issue does need to go to LKML.
Same problem here :(
System Information
Manufacturer: LENOVO
Product Name: 20BWS3D500
Version: ThinkPad T450s
Any news about a specific thread on LKML?
*********** MASS BUG UPDATE ************** We apologize for the inconvenience. There are a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 24 kernel bugs. Fedora 24 has now been rebased to 4.10.9-100.fc24. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel. If you have moved on to Fedora 26, and are still experiencing this issue, please change the version to Fedora 26. If you experience different issues, please open a new bug report for those.
Just tested the latest kernel (4.10.9-200.fc25.x86_64; I'm using the testing repos) and I still see the messages:
[ 2664.911050] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 2664.911051] CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 2664.911054] CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 2664.911055] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 2664.911061] mce: [Hardware Error]: Machine check events logged
[ 2664.911079] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 2664.911079] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 2664.912116] CPU0: Core temperature/speed normal
[ 2664.912117] CPU2: Core temperature/speed normal
[ 2664.912118] CPU1: Package temperature/speed normal
[ 2664.912118] CPU3: Package temperature/speed normal
[ 2664.912119] CPU0: Package temperature/speed normal
[ 2664.912120] mce: [Hardware Error]: Machine check events logged
PS: I'm changing version (F25) and bumping up the prio/sev
With kernel-4.10.8-200.fc25.x86_64 I'm no longer seeing the problem.
4.10.12-200.fc25.x86_64 and it's back.
I was able to recreate by running a very similar workflow to Stephen's within a container. I'm running Fedora 25 on a 5th gen X1 Carbon with kernel version 4.10.13-200.fc25.x86_64.
I confirm the bug on Thinkpad T470, running Fedora 25: Linux marek-t470 4.10.15-200.fc25.x86_64 #1 SMP Mon May 8 18:46:06 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Switched to F26 and never had this problem any more.
I'm also seeing this on a Thinkpad T470, running Fedora 25: Linux insh 4.10.15-200.fc25.x86_64 #1 SMP Mon May 8 18:46:06 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux About to upgrade to 4.10.16, but don't imagine it'll help based on the shortlog for this kernel.
Also seeing on a Thinkpad T470 running Fedora 25: 4.10.16-200.fc25.x86_64 #1 SMP Mon May 15 15:19:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Issue still exists on latest kernel for f25 4.11.3-200.fc25.x86_64
The machine check errors have disappeared for me. I'm on 4.11.3-200.fc25.x86_64. I'm still seeing thermal events, but that's to be expected. John
Gone away for me, with 4.11.6-201.fc25.x86_64
This message is a reminder that Fedora 25 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 25. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '25'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 25 reaches end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Still an issue with Fedora 26 and 27 and latest kernels.
Same issue with kernel 4.14.0-1 from Fedora 28.
It also happens on the Hewlett-Packard HP EliteBook 8470p / 179B, BIOS 68ICF Ver. F.46 01/17/2014:
[5567.811064] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 128: 0000000088020002
[5567.811066] mce: [Hardware Error]: TSC f3374ab78e5
[5567.811068] mce: [Hardware Error]: PROCESSOR 0: 306a9 TIME 1511441212 SOCKET 0 APIC 3 microcode 1c
[5567.811070] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 128: 0000000088020002
[5567.811071] mce: [Hardware Error]: TSC f3374ab834d
[5567.811073] mce: [Hardware Error]: PROCESSOR 0: 306a9 TIME 1511441212 SOCKET 0 APIC 2 microcode 1c
[6129.118759] CPU2: Core temperature above threshold, cpu clock throttled (total events = 77)
[6129.118761] CPU0: Package temperature above threshold, cpu clock throttled (total events = 77)
[6129.118762] CPU1: Package temperature above threshold, cpu clock throttled (total events = 77)
[6129.118763] CPU3: Core temperature above threshold, cpu clock throttled (total events = 77)
[6129.118767] CPU3: Package temperature above threshold, cpu clock throttled (total events = 77)
[6129.118768] mce_notify_irq: 1 callbacks suppressed
[6129.118769] mce: [Hardware Error]: Machine check events logged
[6129.118770] CPU2: Package temperature above threshold, cpu clock throttled (total events = 77)
[6129.118771] mce: [Hardware Error]: Machine check events logged
[6129.118782] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 128: 0000000088010283
[6129.118784] mce: [Hardware Error]: TSC 10baa371071f
[6129.118788] mce: [Hardware Error]: PROCESSOR 0: 306a9 TIME 1511441774 SOCKET 0 APIC 3 microcode 1c
[6129.118791] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 128: 0000000088010283
[6129.118793] mce: [Hardware Error]: TSC 10baa3713edb
[6129.118797] mce: [Hardware Error]: PROCESSOR 0: 306a9 TIME 1511441774 SOCKET 0 APIC 2 microcode 1c
Kernel 4.10.0-40-generic
Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04
Codename: xenial
After a bit of a hiatus in these, I've had a couple on 4.14.11-300.fc27.x86_64
We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. As kernel maintainers, we try to keep up with bugzilla but due to the rate at which the upstream kernel project moves, bugs may be fixed without any indication to us. Due to this, we are doing a mass bug update across all of the Fedora 27 kernel bugs. Fedora 27 has now been rebased to 4.15.3-300.fc27. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel. If you experience different issues, please open a new bug report for those.
I'm getting only the temperature warnings, a lot of them! - but no hardware events are logged (according to mcelog) on my ThinkPad T470 (Intel i7-7600U). Today's kern.log has this many messages on a current Fedora 27 system:
$ dmesg -t | grep temperature | cut -d' ' -f2-8 | sort | uniq -c | sort -n
    120 Core temperature above threshold, cpu clock throttled
    132 Core temperature/speed normal
    260 Package temperature above threshold, cpu clock throttled
    287 Package temperature/speed normal
But, as there are no hardware events logged on this particular machine, I'm not sure if this is even the same bug.
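The pipeline above counts by message type. A small variation, sketched here with a hypothetical function name (count_throttle_events), breaks the "above threshold" messages down per CPU and per level instead, which may help show whether a single core is responsible. It only assumes the message wording seen in the logs in this bug.

```shell
#!/bin/bash
# Sketch: count "above threshold" messages per CPU and level.
# Reads kernel log lines on stdin, prints "<count> <CPU> <Core|Package>".
count_throttle_events() {
    grep 'temperature above threshold' |
        # Keep only the CPU name and the level (Core or Package).
        sed -E 's/.*(CPU[0-9]+): (Core|Package) temperature above threshold.*/\1 \2/' |
        sort | uniq -c |
        # Normalise the padding that uniq -c puts before the count.
        awk '{print $1, $2, $3}'
}
```

It would typically be fed from the kernel log, for example `dmesg -t | count_throttle_events`.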
There really isn't anything to be done, this is working as expected. When the CPU temperature gets too hot, the correct behavior is to throttle the clock. It's annoying this gets logged but it's no longer generating an MCE log. I'm just going to close this bug.
As I said earlier, the fact that the CPUs ostensibly overheat and then cool down again within the same second doesn't really sound super-plausible to me, but what do I know...
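To put a number on that, here is a rough Python sketch that pairs each "above threshold" line with the next "speed normal" line for the same CPU and sensor, and reports the time between them. It assumes the absolute `[seconds.microseconds]` timestamp format seen in most of the logs here (not the `[ +0.000001]` delta format), and `throttle_durations_us` is a name invented for this example.

```python
import re

# Matches lines such as:
#   [ 9550.913754] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
#   [ 9550.914752] CPU0: Core temperature/speed normal
EVENT_RE = re.compile(
    r"\[\s*(\d+)\.(\d{6})\]\s+CPU(\d+): (Core|Package) temperature"
    r"( above threshold|/speed normal)"
)


def throttle_durations_us(lines):
    """Return (cpu, sensor, duration in microseconds) for each throttle/normal pair."""
    started = {}  # (cpu, sensor) -> timestamp of the pending "above threshold" line
    durations = []
    for line in lines:
        match = EVENT_RE.search(line)
        if not match:
            continue
        sec, usec, cpu, sensor, state = match.groups()
        # Integer microseconds avoid floating point rounding on the subtraction.
        timestamp = int(sec) * 1_000_000 + int(usec)
        key = (int(cpu), sensor)
        if state == " above threshold":
            started[key] = timestamp
        elif key in started:
            durations.append((key[0], key[1], timestamp - started.pop(key)))
    return durations
```

Fed the CPU0 core lines from the original report (9550.913754 and 9550.914752), this gives 998 microseconds, i.e. the "overheated" state lasted about one millisecond.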
Can the loglevel of these messages be adjusted though? I don't understand why these messages are logged with a priority of critical, when (if I parse Laura's reply correctly) it should be "debug" at most:
arch/x86/kernel/cpu/mcheck/therm_throt.c:187: pr_crit("CPU%d: %s temperature above threshold, cpu clock throttled (total events = %lu)\n",
arch/x86/kernel/cpu/mcheck/therm_throt.c:195: pr_info("CPU%d: %s temperature/speed normal\n", this_cpu,
Some stats over 90 minutes of usage (edited for readability)
===================================================
# atop -r -P CPU -b 10:12 -e 11:30 | grep -v ^SEP
TIME      TPS  #  SYS    USER   NICE  IDLE    WT   IRQ   SIRQ  S  G  FRQ   FPCT
10:55:36  100  4  9939   21586  22    204930  139  2136  820   0  0  2551  65
11:05:36  100  4  11653  30937  0     193369  140  2304  1212  0  0  8969  229
11:15:36  100  4  10987  36769  0     187918  468  2349  1086  0  0  2823  72
11:25:36  100  4  8158   19757  1     209369  83   1803  685   0  0  2750  70
# journalctl -l -t kernel | egrep -c "$(date +%b\ %d)"
363
# journalctl -l -t kernel | egrep -c "$(date +%b\ %d).*CPU"
86
# journalctl -l -p crit -t kernel | egrep -c "$(date +%b\ %d).*CPU[0-9]:"
42
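For reference, `pr_crit` logs at kernel level 2 (crit) and `pr_info` at level 6 (info). A small Python sketch for checking which level the thermal messages actually arrive with is below. It assumes input in the form produced by `dmesg --raw`, where each line starts with a `<N>` prefix; the function names are hypothetical.

```python
import re
from collections import Counter

# Kernel log levels 0..7 in order (KERN_EMERG .. KERN_DEBUG).
LEVEL_NAMES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]

# Assumed raw format: "<2>[ 9550.913754] CPU0: Core temperature above threshold, ..."
PREFIX_RE = re.compile(r"^<(\d+)>")


def level_of(raw_line):
    """Return the level name for a raw line, or None if there is no <N> prefix."""
    match = PREFIX_RE.match(raw_line)
    if not match:
        return None
    # The low three bits hold the level; any higher bits encode the facility.
    return LEVEL_NAMES[int(match.group(1)) & 7]


def count_thermal_by_level(raw_lines):
    """Count lines mentioning 'temperature', grouped by their log level name."""
    counts = Counter()
    for line in raw_lines:
        if "temperature" not in line:
            continue
        level = level_of(line)
        if level is not None:
            counts[level] += 1
    return counts
```

If the source lines quoted above are accurate, the "above threshold" messages should show up under crit and the "speed normal" ones under info.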
https://github.com/erpalma/lenovo-throttling-fix goes into more detail and certainly shows that indeed we have a bug and that CPU temperature wildly going up and down is not what's actually going on. I, too, would like to make use of the Intel turbo feature until the CPU actually reaches 100°C. Please re-open this as a bug.
I have the same problem on T480 - the temperature messages are the first ones when booting Fedora 28. Please reopen.
Same issue. I just ran some updates on a restart and now my computer is bricked. Fedora 28 on T480. I get temperature warnings as well.
(In reply to Dan.Kolbas from comment #43)
> Same issue. I just ran some updates on a restart and now my computer is
> bricked.
>
> Fedora 28 on T480
>
> I get temperature warnings as well.
Same here!!! F28 on T480 (Lenovo ThinkPad T480 (i7-8550U, MX150, FHD))
Same here! Dell Inspiron 15 7560
I have an HP Pavilion 5335KV running Fedora 29; the same problem is reported in dmesg:
```
[15106.139924] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
[15106.139945] CPU4: Core temperature above threshold, cpu clock throttled (total events = 1)
[15106.139947] CPU5: Package temperature above threshold, cpu clock throttled (total events = 11)
[15106.139948] CPU2: Package temperature above threshold, cpu clock throttled (total events = 11)
[15106.139949] CPU6: Package temperature above threshold, cpu clock throttled (total events = 11)
[15106.139950] CPU1: Package temperature above threshold, cpu clock throttled (total events = 11)
[15106.139952] CPU4: Package temperature above threshold, cpu clock throttled (total events = 11)
[15106.139953] CPU7: Package temperature above threshold, cpu clock throttled (total events = 11)
[15106.139954] CPU3: Package temperature above threshold, cpu clock throttled (total events = 11)
[15106.139961] CPU0: Package temperature above threshold, cpu clock throttled (total events = 11)
[15106.144987] CPU0: Core temperature/speed normal
[15106.144988] CPU4: Core temperature/speed normal
[15106.144989] CPU6: Package temperature/speed normal
[15106.144990] CPU2: Package temperature/speed normal
[15106.144991] CPU5: Package temperature/speed normal
[15106.144991] CPU1: Package temperature/speed normal
[15106.144992] CPU4: Package temperature/speed normal
[15106.144993] CPU7: Package temperature/speed normal
[15106.144994] CPU3: Package temperature/speed normal
[15106.144995] CPU0: Package temperature/speed normal
```
I'm facing the same issue on a Thinkpad P50 and running RHEL7:
[211952.288488] CPU7: Package temperature/speed normal
[211952.288488] CPU6: Core temperature/speed normal
[211952.288489] CPU1: Package temperature/speed normal
[211952.288490] CPU2: Core temperature/speed normal
[211952.288491] CPU5: Package temperature/speed normal
[211952.288491] CPU3: Package temperature/speed normal
[211952.288492] CPU6: Package temperature/speed normal
[211952.288494] CPU2: Package temperature/speed normal
[211952.288522] CPU0: Package temperature/speed normal
[211952.288523] CPU4: Package temperature/speed normal
[212365.270480] CPU4: Core temperature above threshold, cpu clock throttled (total events = 1860)
[212365.270481] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1860)
[212365.270483] CPU0: Package temperature above threshold, cpu clock throttled (total events = 2704)
[212365.270486] CPU4: Package temperature above threshold, cpu clock throttled (total events = 2704)
[212365.270521] CPU1: Package temperature above threshold, cpu clock throttled (total events = 2704)
[212365.270522] CPU6: Package temperature above threshold, cpu clock throttled (total events = 2704)
[212365.270523] CPU2: Package temperature above threshold, cpu clock throttled (total events = 2704)
[212365.270524] CPU7: Package temperature above threshold, cpu clock throttled (total events = 2704)
[212365.270525] CPU5: Package temperature above threshold, cpu clock throttled (total events = 2704)
[212365.270526] CPU3: Package temperature above threshold, cpu clock throttled (total events = 2704)
[212365.271474] CPU4: Core temperature/speed normal
[212365.271475] CPU1: Package temperature/speed normal
[212365.271475] CPU0: Core temperature/speed normal
[212365.271476] CPU5: Package temperature/speed normal
[212365.271477] CPU0: Package temperature/speed normal
[212365.271480] CPU3: Package temperature/speed normal
[212365.271480] CPU7: Package temperature/speed normal
[212365.271484] CPU4: Package temperature/speed normal
[212365.271505] CPU6: Package temperature/speed normal
[212365.271506] CPU2: Package temperature/speed normal
[212775.336445] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1909)
[212775.336446] CPU4: Core temperature above threshold, cpu clock throttled (total events = 1909)
[212775.336448] CPU4: Package temperature above threshold, cpu clock throttled (total events = 2770)
[212775.336450] CPU0: Package temperature above threshold, cpu clock throttled (total events = 2770)
[212775.336485] CPU1: Package temperature above threshold, cpu clock throttled (total events = 2770)
[212775.336486] CPU5: Package temperature above threshold, cpu clock throttled (total events = 2770)
[212775.336487] CPU7: Package temperature above threshold, cpu clock throttled (total events = 2770)
[212775.336488] CPU6: Package temperature above threshold, cpu clock throttled (total events = 2770)
[212775.336489] CPU2: Package temperature above threshold, cpu clock throttled (total events = 2770)
[212775.336490] CPU3: Package temperature above threshold, cpu clock throttled (total events = 2770)
[212775.337437] CPU2: Package temperature/speed normal
Red Hat 8.3 on an Acer Aspire V3 771G:
[ +0.015407] virbr0: port 1(virbr0-nic) entered blocking state
[ +0.000005] virbr0: port 1(virbr0-nic) entered disabled state
[ +0.000214] device virbr0-nic entered promiscuous mode
[ +4.619444] virbr0: port 1(virbr0-nic) entered blocking state
[ +0.000006] virbr0: port 1(virbr0-nic) entered listening state
[ +0.487898] virbr0: port 1(virbr0-nic) entered disabled state
[Apr18 14:30] nouveau 0000:01:00.0: therm: temperature (90 C) hit the 'fanboost' threshold
[ +46.394386] Bluetooth: RFCOMM TTY layer initialized
[ +0.000039] Bluetooth: RFCOMM socket layer initialized
[ +0.000153] Bluetooth: RFCOMM ver 1.11
[ +4.347952] rfkill: input handler disabled
[Apr18 14:32] nouveau 0000:01:00.0: therm: temperature (87 C) went below the 'fanboost' threshold
[Apr18 14:33] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1)
[ +0.000001] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
[ +0.000003] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
[ +0.000001] CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
[ +0.000005] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[ +0.000003] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[ +0.002007] CPU1: Core temperature/speed normal
[ +0.000001] CPU0: Core temperature/speed normal
[ +0.000002] CPU2: Package temperature/speed normal
[ +0.000001] CPU3: Package temperature/speed normal
[ +0.000001] CPU0: Package temperature/speed normal
[ +0.000001] CPU1: Package temperature/speed normal