Red Hat Bugzilla – Bug 1301739
Machine Check exceptions related to transient temperature spikes get reported to abrt.
Last modified: 2017-04-11 20:34:56 EDT
Created attachment 1118230
dmesg log from a Lenovo ThinkPad T450s
Description of problem:
abrt regularly reports machine-check errors about temperature over-threshold events that last for ~0.0011 seconds. This seems like something that should be reported via another method. It has gotten to the point that I routinely ignore the abrt messages, because they always seem to be this.
This happens several times per day.
Steps to Reproduce:
1. Cause the machine to heat up (this often happens running graphically intensive games).
2. Observe the MCE reported to abrt.
Actual result: an MCE is reported to abrt.
Ideally, the error would only be reported to abrt if it wasn't transient (or perhaps the threshold for throttling could be lower than the threshold for reporting). It would also be nice if incidents were recorded, so that one could see whether the frequency of occurrence is significant.
But, I'd be happy if abrt simply ignored these errors, as I have abrt-fatigue from them.
The messages are:
[4942478.364568] CPU3: Package temperature above threshold, cpu clock throttled (total events = 289096)
[4942478.364579] CPU0: Package temperature above threshold, cpu clock throttled (total events = 289098)
[4942478.364581] CPU1: Package temperature above threshold, cpu clock throttled (total events = 289098)
[4942478.364584] CPU2: Package temperature above threshold, cpu clock throttled (total events = 289098)
[4942478.365577] CPU3: Package temperature/speed normal
[4942478.365578] CPU2: Package temperature/speed normal
[4942478.365580] CPU0: Package temperature/speed normal
[4942478.365590] CPU1: Package temperature/speed normal
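To put a number on how short these events are, here is a minimal sketch that pairs each "above threshold" line with the following "normal" line for the same CPU and computes the elapsed time. It only assumes the two message formats quoted above; the function name `throttle_durations_us` and the `SAMPLE` constant are made up for illustration and are not part of the kernel or abrt.

```python
import re
from collections import defaultdict

# Patterns for the two message forms quoted in this report.
# Timestamps are parsed as integers (seconds, microseconds) to avoid float rounding.
THROTTLE = re.compile(r"\[\s*(\d+)\.(\d{6})\] CPU(\d+): Package temperature above threshold")
NORMAL = re.compile(r"\[\s*(\d+)\.(\d{6})\] CPU(\d+): Package temperature/speed normal")


def throttle_durations_us(lines):
    """Return {cpu: [duration_us, ...]} for each throttle/normal pair (hypothetical helper)."""
    started = {}                   # cpu -> timestamp in microseconds of the open throttle event
    durations = defaultdict(list)
    for line in lines:
        m = THROTTLE.search(line)
        if m:
            sec, usec, cpu = map(int, m.groups())
            started[cpu] = sec * 1_000_000 + usec
            continue
        m = NORMAL.search(line)
        if m:
            sec, usec, cpu = map(int, m.groups())
            if cpu in started:     # ignore a "normal" line with no matching start
                durations[cpu].append(sec * 1_000_000 + usec - started.pop(cpu))
    return dict(durations)


# The eight lines from this report, copied verbatim.
SAMPLE = """\
[4942478.364568] CPU3: Package temperature above threshold, cpu clock throttled (total events = 289096)
[4942478.364579] CPU0: Package temperature above threshold, cpu clock throttled (total events = 289098)
[4942478.364581] CPU1: Package temperature above threshold, cpu clock throttled (total events = 289098)
[4942478.364584] CPU2: Package temperature above threshold, cpu clock throttled (total events = 289098)
[4942478.365577] CPU3: Package temperature/speed normal
[4942478.365578] CPU2: Package temperature/speed normal
[4942478.365580] CPU0: Package temperature/speed normal
[4942478.365590] CPU1: Package temperature/speed normal
"""

durations = throttle_durations_us(SAMPLE.splitlines())
```

For the lines above, every CPU comes out at roughly 1000 microseconds, i.e. about a millisecond of throttling per event.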
I do experience this issue on a well-cooled, mostly idle desktop form-factor machine with an Intel(R) Core(TM) i5-2400S CPU @ 2.50GHz, running 4.2.6-301.fc23.x86_64.
So there are two issues here. The first is that the kernel is simply doing its job and is reporting the events. That they are of an extremely short duration and kind of spammy is a downside, but it isn't incorrect. The second issue is that abrt is triggering on them, but likely because of the MCE being logged, not because of the temperature messages themselves.
The most suitable workaround here is for abrt to not trigger on thermal events of such a short duration. However, I doubt it is even looking at what caused the MCE, and it might not be easy for abrt to do that. Will need to think some more.
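As a rough illustration of the "don't trigger on short thermal events" idea, the sketch below separates sustained events from transient ones while still counting everything, so the frequency stays visible. This is not abrt code and does not use any abrt API; `ThermalEvent`, `should_report`, `summarize`, and the 1.0 second default threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ThermalEvent:
    start_s: float  # when throttling began, in seconds
    end_s: float    # when temperature/speed returned to normal, in seconds


def should_report(event, min_duration_s=1.0):
    """Report only events that lasted at least min_duration_s (arbitrary example threshold)."""
    return (event.end_s - event.start_s) >= min_duration_s


def summarize(events, min_duration_s=1.0):
    """Count all events, but split them into reported (sustained) and suppressed (transient)."""
    sustained = sum(1 for e in events if should_report(e, min_duration_s))
    return {
        "total": len(events),
        "sustained": sustained,
        "suppressed": len(events) - sustained,
    }


# One millisecond-scale blip and one 2.5 second event (made-up values).
events = [ThermalEvent(0.0, 0.001), ThermalEvent(10.0, 12.5)]
summary = summarize(events)
```

With a policy like this, the millisecond-scale events in this report would be counted but not raised to the user, while a throttle that persists would still be reported.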
Just updating to note this is still present in Fedora 24, at least on my T460s.
(Also, hi Josh, long time no talk).
*********** MASS BUG UPDATE **************
We apologize for the inconvenience. There are a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 24 kernel bugs.
Fedora 25 has now been rebased to 4.10.9-100.fc24. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
If you have moved on to Fedora 26, and are still experiencing this issue, please change the version to Fedora 26.
If you experience different issues, please open a new bug report for those.
I still see these frequently with Fedora 25 and kernel-4.10.8-200.fc25.x86_64. I'll try 4.10.9 when it arrives.