Bug 525681 - Wrong temperature reported to the kernel causing Fedora to shutdown
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 11
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2009-09-25 09:39 UTC by Stephane GERARD
Modified: 2013-01-10 08:02 UTC (History)

Fixed In Version: 2.6.30.9-90.fc11
Doc Type: Bug Fix
Last Closed: 2009-10-30 17:57:47 UTC


Attachments
excerpt from /var/log/messages from power on to now (55.99 KB, application/octet-stream)
2009-09-27 19:30 UTC, Stephane GERARD
result of lsmod (2.96 KB, text/plain)
2009-09-29 08:26 UTC, Stephane GERARD
excerpt from /var/log/messages with the problem (129.03 KB, application/octet-stream)
2009-09-29 08:38 UTC, Stephane GERARD

Description Stephane GERARD 2009-09-25 09:39:25 UTC
Description of problem:

Sometimes Fedora shuts down suddenly. When I look in /var/log/messages to see what triggered the shutdown, I find a message like this:

Critical temperature reached (861929 C), shutting down.

This temperature is obviously wrong.


Version-Release number of selected component (if applicable):

Linux pc999.iihe.ac.be 2.6.30.5-43.fc11.x86_64

How reproducible:

It seems to occur randomly.

Additional info:

Excerpt from /var/log/messages:

----
Sep 25 10:02:52 pc999 kernel: ACPI: EC: missing confirmations, switch off interrupt mode.
Sep 25 10:02:52 pc999 kernel: ACPI Exception (evregion-0422): AE_TIME, Returned by Handler for [EmbeddedControl] [20090320]
Sep 25 10:02:52 pc999 kernel: ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.SBRG.EC0_.RCTP] (Node ffff88013fa4e280), AE_TIME
Sep 25 10:02:52 pc999 kernel: ACPI Error (psparse-0537): Method parse/execution failed [\_TZ_.RTMP] (Node ffff88013fa4ef60), AE_TIME
Sep 25 10:02:52 pc999 kernel: ACPI Error (psparse-0537): Method parse/execution failed [\_TZ_.TZ00._TMP] (Node ffff88013fa4f040), AE_TIME
Sep 25 10:02:52 pc999 kernel: Critical temperature reached (861929 C), shutting down.
----
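
A reading like the one above can be picked out of a log automatically. The following is a minimal sketch, not part of the original report: the function name and the 150 C plausibility limit are assumptions chosen for illustration, and the regular expression only matches the message format quoted in this bug.

```python
import re

# Matches the kernel message quoted in this report.
CRITICAL_RE = re.compile(r"Critical temperature reached \((\d+) C\)")

# Assumed upper bound for a believable reading, in degrees Celsius.
MAX_PLAUSIBLE_C = 150


def implausible_readings(lines, limit=MAX_PLAUSIBLE_C):
    """Return temperatures from critical-shutdown lines that exceed limit."""
    found = []
    for line in lines:
        match = CRITICAL_RE.search(line)
        if match and int(match.group(1)) > limit:
            found.append(int(match.group(1)))
    return found


if __name__ == "__main__":
    sample = [
        "Sep 25 10:02:52 pc999 kernel: Critical temperature reached "
        "(861929 C), shutting down.",
    ]
    print(implausible_readings(sample))
```

To scan a real log, pass an open file object (for example `open("/var/log/messages")`) as `lines`; reading that file usually requires root.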

Comment 1 Chuck Ebbert 2009-09-27 10:21:02 UTC
Looks like it can't read the temperature and is returning a random value. Does it make a difference whether you start from a powered-off state vs. rebooting a running system? And did this ever happen with the 2.6.29 kernel?

Sep 25 10:02:52 pc999 kernel: ACPI: EC: missing confirmations, switch off interrupt mode.
Sep 25 10:02:52 pc999 kernel: ACPI Exception (evregion-0422): AE_TIME, Returned by Handler for [EmbeddedControl] [20090320]
Sep 25 10:02:52 pc999 kernel: ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.SBRG.EC0_.RCTP] (Node ffff88013fa4e280), AE_TIME
Sep 25 10:02:52 pc999 kernel: ACPI Error (psparse-0537): Method parse/execution failed [\_TZ_.RTMP] (Node ffff88013fa4ef60), AE_TIME
Sep 25 10:02:52 pc999 kernel: ACPI Error (psparse-0537): Method parse/execution failed [\_TZ_.TZ00._TMP] (Node ffff88013fa4f040), AE_TIME
Sep 25 10:02:52 pc999 kernel: Critical temperature reached (861929 C), shutting down.

Comment 2 Matthew Garrett 2009-09-27 18:32:05 UTC
Do you have any temperature sensor modules loaded?

Comment 3 Stephane GERARD 2009-09-27 19:14:56 UTC
(In reply to comment #1)
> Looks like it can't read the temperature and is returning a random value. Does
> it make a difference whether you start from a powered-off state vs. rebooting a
> running system? And did this ever happen with the 2.6.29 kernel?
> 
> Sep 25 10:02:52 pc999 kernel: ACPI: EC: missing confirmations, switch off
> interrupt mode.
> Sep 25 10:02:52 pc999 kernel: ACPI Exception (evregion-0422): AE_TIME, Returned
> by Handler for [EmbeddedControl] [20090320]
> Sep 25 10:02:52 pc999 kernel: ACPI Error (psparse-0537): Method parse/execution
> failed [\_SB_.PCI0.SBRG.EC0_.RCTP] (Node ffff88013fa4e280), AE_TIME
> Sep 25 10:02:52 pc999 kernel: ACPI Error (psparse-0537): Method parse/execution
> failed [\_TZ_.RTMP] (Node ffff88013fa4ef60), AE_TIME
> Sep 25 10:02:52 pc999 kernel: ACPI Error (psparse-0537): Method parse/execution
> failed [\_TZ_.TZ00._TMP] (Node ffff88013fa4f040), AE_TIME
> Sep 25 10:02:52 pc999 kernel: Critical temperature reached (861929 C), shutting
> down.  

It can happen after a cold start and it can happen after a reboot. Sometimes it does not happen at all.

It never happened with 2.6.29 kernel.

Comment 4 Stephane GERARD 2009-09-27 19:18:15 UTC
(In reply to comment #2)
> Do you have any temperature sensor modules loaded?  

Yes, the temperature sensor modules must be loaded, since I can read the CPU temperature with this command:

[root@pc999 ~]# cat /proc/acpi/thermal_zone/TZ00/temperature 
temperature:             39 C
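
The same value can be read from a script. This is a minimal sketch, assuming the file has the single-line format shown above and that the kernel still exposes this /proc interface; the function names are made up for illustration.

```python
import re

# Format taken from the output shown above: "temperature:   39 C".
TEMP_RE = re.compile(r"^temperature:\s+(-?\d+)\s+C$")


def parse_acpi_temperature(text):
    """Return the temperature in degrees Celsius from one line of output."""
    match = TEMP_RE.match(text.strip())
    if match is None:
        raise ValueError("unexpected format: %r" % text)
    return int(match.group(1))


def read_zone(path="/proc/acpi/thermal_zone/TZ00/temperature"):
    """Read and parse the thermal zone file at path."""
    with open(path) as handle:
        return parse_acpi_temperature(handle.read())


if __name__ == "__main__":
    print(parse_acpi_temperature("temperature:             39 C"))
```

`read_zone()` will raise an error on systems where the path does not exist, so it is only called explicitly.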

Comment 5 Stephane GERARD 2009-09-27 19:30:13 UTC
Created attachment 362825
excerpt from /var/log/messages from power on to now

Comment 6 Chuck Ebbert 2009-09-28 17:02:05 UTC
(In reply to comment #4)

> > Do you have any temperature sensor modules loaded?  
> 
> Yes, the temperature sensor modules must be loaded, since I can have CPU
> temperature with this command :
> 
> [root@pc999 ~]# cat /proc/acpi/thermal_zone/TZ00/temperature 
> temperature:             39 C  

I think he means hwmon / lm-sensors drivers. Can you attach the output of 'lsmod' after getting a successful boot with 2.6.30?

Comment 7 Chuck Ebbert 2009-09-28 17:06:13 UTC
Can you attach the complete log from a session where it shut down because of the invalid temperature reading?

Comment 8 Stephane GERARD 2009-09-29 08:26:15 UTC
Created attachment 362974
result of lsmod

Result of the command lsmod after a normal boot

Comment 9 Stephane GERARD 2009-09-29 08:27:58 UTC
(In reply to comment #6)
> (In reply to comment #4)
> 
> > > Do you have any temperature sensor modules loaded?  
> > 
> > Yes, the temperature sensor modules must be loaded, since I can have CPU
> > temperature with this command :
> > 
> > [root@pc999 ~]# cat /proc/acpi/thermal_zone/TZ00/temperature 
> > temperature:             39 C  
> 
> I think he means hwmon / lm-sensors drivers. Can you attach the output of
> 'lsmod' after getting a successful boot with 2.6.30?  

I've just posted it in the attachments.

Comment 10 Stephane GERARD 2009-09-29 08:38:42 UTC
Created attachment 362975
excerpt from /var/log/messages with the problem

Comment 11 Stephane GERARD 2009-09-29 08:40:29 UTC
(In reply to comment #7)
> Can you attach the complete log from a session where it shut down because of
> the invalid temperature reading?  

See my third attachment.

Comment 12 Chuck Ebbert 2009-09-30 06:21:28 UTC
The EC starts in poll mode, switches to interrupt mode, then stops generating interrupts 17 minutes later:

Sep 29 10:11:12 pc999 kernel: ACPI: EC: GPE = 0x1c, I/O: command/status = 0x66, data = 0x62
Sep 29 10:11:12 pc999 kernel: ACPI: EC: driver started in poll mode
Sep 29 10:11:12 pc999 kernel: ACPI: EC: non-query interrupt received, switching to interrupt mode
Sep 29 10:28:25 pc999 kernel: ACPI: EC: missing confirmations, switch off interrupt mode.

Then the handlers time out trying to read the temperature, I think:

Sep 29 10:28:26 pc999 kernel: ACPI Exception (evregion-0422): AE_TIME, Returned by Handler for [EmbeddedControl] [20090320]
Sep 29 10:28:26 pc999 kernel: ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.SBRG.EC0_.RCTP] (Node ffff88013fa4e280), AE_TIME
Sep 29 10:28:26 pc999 kernel: ACPI Error (psparse-0537): Method parse/execution failed [\_TZ_.RTMP] (Node ffff88013fa4ef60), AE_TIME
Sep 29 10:28:26 pc999 kernel: ACPI Error (psparse-0537): Method parse/execution failed [\_TZ_.TZ00._TMP] (Node ffff88013fa4f040), AE_TIME
Sep 29 10:28:26 pc999 kernel: Critical temperature reached (1359203 C), shutting down.

It looks like a random temperature is returned...
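
The 17-minute gap can be checked from the timestamps in the lines quoted above. This is a small sketch, not a general syslog parser: it assumes the classic "Mon DD HH:MM:SS" prefix, an English locale for month names, and that both lines are from 2009 (syslog lines carry no year). The function names are made up for illustration.

```python
from datetime import datetime


def syslog_time(line, year=2009):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime("%d %s" % (year, stamp), "%Y %b %d %H:%M:%S")


def elapsed(first, second):
    """Return the time between two syslog lines as a timedelta."""
    return syslog_time(second) - syslog_time(first)


if __name__ == "__main__":
    switched_on = ("Sep 29 10:11:12 pc999 kernel: ACPI: EC: non-query "
                   "interrupt received, switching to interrupt mode")
    switched_off = ("Sep 29 10:28:25 pc999 kernel: ACPI: EC: missing "
                    "confirmations, switch off interrupt mode.")
    print(elapsed(switched_on, switched_off))
```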

Comment 13 Chuck Ebbert 2009-10-05 07:53:52 UTC
Could this be fixed by:

http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=e12ac3d018dd8f20a075f552020

The patch causes commands to be retried when the EC is in poll mode.
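
To illustrate the idea of retrying a timed-out command, here is a rough sketch. It is not the kernel patch and does not reflect the actual EC driver code: `ECTimeout`, `run_with_retry`, and the default of three attempts are all hypothetical names and values chosen for the example. The point is only that a timed-out read is tried again and, if it keeps failing, the failure is reported instead of a value being returned.

```python
class ECTimeout(Exception):
    """Hypothetical error raised when the controller does not answer."""


def run_with_retry(transaction, attempts=3):
    """Call transaction() until it succeeds or attempts are used up.

    The last ECTimeout is re-raised if every attempt fails, so the
    caller never receives a made-up reading.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return transaction()
        except ECTimeout as error:
            last_error = error
    raise last_error


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_read():
        # Simulated read: times out twice, then returns 39.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ECTimeout("no response")
        return 39

    print(run_with_retry(flaky_read))
```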

Comment 14 Chuck Ebbert 2009-10-06 07:35:48 UTC
Added these patches in 2.6.30.9-74:
acpi-ec-merge-irq-and-poll-modes.patch
acpi-ec-restart-command-even-if-no-interrupts-from-ec.patch
acpi-ec-use-burst-mode-only-for-msi-notebooks.patch

This should fix the problem.

Comment 15 Chuck Ebbert 2009-10-08 16:53:29 UTC
Can you test the build from koji?

http://koji.fedoraproject.org/koji/buildinfo?buildID=135601

Comment 16 Stephane GERARD 2009-10-12 12:30:23 UTC
(In reply to comment #15)
> Can you test the build from koji?
> 
> http://koji.fedoraproject.org/koji/buildinfo?buildID=135601  

I have been running this koji build since Friday and the problem has not come back, so it seems to be fixed. Thank you very much.

Comment 17 Fedora Update System 2009-10-18 01:56:46 UTC
kernel-2.6.30.9-90.fc11 has been submitted as an update for Fedora 11.
http://admin.fedoraproject.org/updates/kernel-2.6.30.9-90.fc11

Comment 18 Agustin Barto 2009-10-26 11:15:26 UTC
I have a Dell Studio 1555 on an x86 kernel with the same problem. After it comes back from a suspended state, all ACPI temperatures are 0° C, which is dangerous because the fans are never turned on (I had a couple of instant shutdowns due to overheating). The latest kernel from updates-testing still has the problem. Downgrading to 2.6.30.8-64.fc11.i686.PAE solves it.

Comment 19 Fedora Update System 2009-10-27 06:46:40 UTC
kernel-2.6.30.9-90.fc11 has been pushed to the Fedora 11 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 20 Agustin Barto 2009-10-27 11:01:54 UTC
Yup, the problem persists:

[abarto@roadrunner ~]$ uname -r
2.6.30.9-90.fc11.i686.PAE
[abarto@roadrunner ~]$ sudo cat /proc/acpi/thermal_zone/TZ01/temperature 
temperature:             43 C

(after suspend/resume)

[abarto@roadrunner ~]$ sudo cat /proc/acpi/thermal_zone/TZ01/temperature 
temperature:             0 C

Comment 21 Chuck Ebbert 2009-10-30 17:56:49 UTC
(In reply to comment #20)
> Yup, the problem persists:
> 
> [abarto@roadrunner ~]$ uname -r
> 2.6.30.9-90.fc11.i686.PAE
> [abarto@roadrunner ~]$ sudo cat /proc/acpi/thermal_zone/TZ01/temperature 
> temperature:             43 C
> 
> (after suspend/resume)
> 
> [abarto@roadrunner ~]$ sudo cat /proc/acpi/thermal_zone/TZ01/temperature 
> temperature:             0 C  


That is not the original bug that was reported.

