Bug 155496

Summary: ACPI: Critical temperature reached (65 C), shutting down.
Product: [Fedora] Fedora
Component: kernel
Version: rawhide
Hardware: All
OS: Linux
Status: CLOSED CANTFIX
Severity: high
Priority: medium
Reporter: Michael Schwendt <bugs.michael>
Assignee: Dave Jones <davej>
QA Contact: Brian Brock <bbrock>
CC: acpi-bugzilla, pfrields
Keywords: Reopened
Doc Type: Bug Fix
Last Closed: 2007-07-23 14:30:34 UTC
Attachments:
  DMI data for GA-7ZX BIOS release fg
  acpidump output (pmtools-20050926)
  dmesg
  lspci
  2.6.23-rc3+ patch to disable critical trip points on GA-7ZX

Description Michael Schwendt 2005-04-20 20:52:23 UTC
With Fedora Core 4 Test 2 (also with FC4T1, but I forgot about that problem
after I had customised /proc/acpi/thermal_zone/THRM/trip_points as a
work-around) I experience the following kernel log messages and a quick
emergency shutdown:

  Critical temperature reached (65 C), shutting down.
  Critical temperature reached (64 C), shutting down.

After a fresh FC4T2 installation I didn't customise /proc/.../trip_points and
hence ran into this multiple times today while transferring pictures from a
digital camera. On three consecutive attempts the kernel shut the machine down
after only a few minutes of uptime, before I realised the reason for the shutdown.

With Fedora Core 3 and older on the same hardware, I've never (!) seen this
before, and I have not customised a different critical temperature there.

In /proc/acpi/thermal_zone/THRM/ with Fedora Core 3 I see:

cooling_mode
polling_frequency
state
temperature
trip_points

$ cat cooling_mode 
<setting not supported>
cooling mode:   passive

$ cat polling_frequency 
<polling disabled>

$ cat state 
state:                   ok

$ cat temperature 
temperature:             63 C

$ cat trip_points 
critical (S5):           65 C
passive:                 55 C: tc1=2 tc2=4 tsp=50 devices=0xeffee600 

With Fedora Core 4 Test 2, the only differences are:

$ cat state
state:                   passive 

$ cat trip_points
critical (S5):           65 C
passive:                 55 C: tc1=2 tc2=4 tsp=50 devices=0xeffeeb1c 


As a first observation, the critical temperature of 65 degrees Celsius doesn't
match the "ACPI Shut Down Temp." setting in the BIOS. I have "90 C" there since
last summer.

As a second observation, 65 C looks like a bad default for AMD Athlon/Duron
chips, which are infamous for their high operating temperatures. Although this is
a low-end desktop machine not used for any high load, 60-63 C is reached easily
according to /proc/.../temperature, and the BIOS reports an even higher
temperature (no sensors configured and running).


$ cat /proc/cpuinfo 
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 7
model name      : AMD Duron(tm)
stepping        : 1
cpu MHz         : 1303.148
cache size      : 64 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 mtrr pge mca cmov pat pse36 mmx fxsr sse pni syscall mp mmxext 3dnowext 3dnow
bogomips        : 2580.48

The BIOS firmware is the latest release and is a few years old by now. :)

Comment 1 Michael Schwendt 2005-04-20 20:52:24 UTC
Created attachment 113437 [details]
DMI data for GA-7ZX BIOS release fg

Comment 2 Dave Jones 2005-10-06 05:29:27 UTC
Any luck with the latest errata kernel?


Comment 3 Dave Jones 2005-11-10 21:57:59 UTC
Mass update to all FC4 bugs:

An update has been released (2.6.14-1.1637_FC4) which rebases to a new upstream
kernel (2.6.13.2). As there were ~3500 changes upstream between this and the
previous kernel, it's possible your bug has been fixed already.

Please retest with this update, and update this bug if necessary.

Thanks.



Comment 4 Michael Schwendt 2005-11-11 12:48:46 UTC
Unchanged behaviour. Kernel still defaults to shutting down the machine at 65 C
unless I customise the trip_points. This max. value of 65 C doesn't match the
temperature threshold configured in the BIOS.

Comment 5 Konstantin Karasyov 2006-01-18 13:27:09 UTC
Could you please attach the output 
from acpidump, available in pmtools here:
http://ftp.kernel.org/pub/linux/kernel/people/lenb/acpi/utils/



Comment 6 Michael Schwendt 2006-01-18 16:44:45 UTC
Created attachment 123385 [details]
acpidump output (pmtools-20050926)

Comment 7 Dave Jones 2006-02-03 07:30:13 UTC
This is a mass-update to all currently open kernel bugs.

A new kernel update has been released (Version: 2.6.15-1.1830_FC4)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

Thank you.


Comment 8 John Thacker 2006-05-04 13:42:04 UTC
Closing per previous comment.

Comment 9 Michael Schwendt 2006-05-05 07:35:24 UTC
# uname -a
Linux faldor.intranet 2.6.16-1.2107_FC4 #1 Tue May 2 19:15:13 EDT 2006 i686 athlon i386 GNU/Linux
# cat /proc/acpi/thermal_zone/THRM/trip_points 
critical (S5):           65 C
passive:                 55 C: tc1=2 tc2=4 tsp=50 devices=0xeffec760 

By contrast:
BIOS ACPI Shutdown Temperature: 90 C

Comment 10 Dave Jones 2006-09-17 03:23:27 UTC
[This comment added as part of a mass-update to all open FC4 kernel bugs]

FC4 has now transitioned to the Fedora legacy project, which will continue to
release security related updates for the kernel.  As this bug is not security
related, it is unlikely to be fixed in an update for FC4, and has been migrated
to FC5.

Please retest with Fedora Core 5.

Thank you.


Comment 11 Dave Jones 2006-10-17 00:46:02 UTC
A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 12 Dave Jones 2006-11-24 23:09:20 UTC
This bug has been mass-closed along with all other bugs that
have been in NEEDINFO state for several months.

Due to the large volume of inactive bugs in bugzilla, this
is the only method we have of cleaning out stale bug reports
where the reporter has disappeared.

If you can reproduce this bug after installing all the
current updates, please reopen this bug.

If you are not the reporter, you can add a comment requesting
it be reopened, and someone will get to it asap.

Thank you.

Comment 13 Michael Schwendt 2007-03-13 10:13:54 UTC
$ rpm -q kernel
kernel-2.6.19-1.2911.6.5.fc6


Comment 14 Michael Schwendt 2007-07-07 23:44:14 UTC
Reproducible in Fedora 7 and Rawhide:

$ rpm -q kernel
kernel-2.6.21-1.3240.fc8


Additionally, in Rawhide I'm unable to set the trip points:

# echo -n "85:0:80:60:0" > /proc/acpi/thermal_zone/THRM/trip_points
-bash: echo: write error: Invalid argument

Because the machine keeps getting shut down, I cannot use or test Rawhide.


Comment 15 dominique 2007-07-09 09:46:18 UTC
The command is:
# echo  85:0:80:60:0 > /proc/acpi/thermal_zone/THRM/trip_points

Comment 16 Michael Schwendt 2007-07-09 10:01:24 UTC
That is no different except for the linefeed,
but it doesn't work either.

Comment 17 Michael Schwendt 2007-07-18 10:56:37 UTC
http://article.gmane.org/gmane.linux.acpi.devel/22750

The trip_points are read-only now. :-(

[...]

Where does the kernel take the wrong "65 C" value from?


Comment 18 Len Brown 2007-07-20 01:28:51 UTC
critical (S5):           65 C
passive:                 55 C: ...

mystery #1: 65C != BIOS SETUP ACPI critical shutdown temperature.

Can you actually change this field in the BIOS SETUP?
If yes, do the changes there have any effect at all
on what you see in the trip_points file?
(or any differences in the inb/outb command results requested below?)

mystery #2: FC3 didn't shut down, but FC4 and later do.
This was the 2.6.9 -> 2.6.12 period. It is quite possible
that something started working here that was broken in FC3.
Indeed, the fact that the FC3 output shows state=ok when the temperature
is 63 -- clearly above the 55C passive trip-point -- suggests
that it was FC3 that was actually broken.
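
The comparison behind mystery #2 can be written out as a small Python
sketch. The helper name expected_state is made up for illustration and is
not the kernel's logic; it assumes a zone counts as passive or critical as
soon as the reading reaches the corresponding trip point.

```python
def expected_state(temperature_c, passive_c, critical_c):
    """Return the state a thermal zone should report for a reading.

    Hypothetical helper for illustration only, not kernel code.
    """
    if temperature_c >= critical_c:
        return "critical"
    if temperature_c >= passive_c:
        return "passive"
    return "ok"


# Values reported under Fedora Core 3 in the description:
# temperature 63 C, passive trip point 55 C, critical trip point 65 C.
print(expected_state(63, passive_c=55, critical_c=65))
```

Under that assumption a reading of 63 C falls in the passive range, which
is what FC4 reports and what FC3 does not.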
 
BTW. Does this system have a fan?
Yes, 65C seems very low for a critical shutdown.
55C also seems quite low to throttle your processor.

Please attach the output from dmesg -s64000,
running the latest kernel you've got on hand.

Please paste the output from lspci.

Here is the ThermalZone in the DSDT:

    OperationRegion (FNOR, SystemIO, 0x084C, 0x04)
    Scope (\_TZ)
    {
        Name (THBF, Buffer (0x04)
        {
            0x00, 0x00, 0x00, 0x00
        })
        Method (KELV, 1, NotSerialized)
        {
            And (Arg0, 0xFF, Local0)
            Multiply (Local0, 0x0A, Local0)
            Add (Local0, 0x0AAC, Local0)
            Return (Local0)
        }

        Name (PLCY, 0x00)

// unclear why PLCY is a variable, as AML doesn't write it
// it is used below to select the passive trip-point

        OperationRegion (THOR, SystemIO, 0x72, 0x02)
        Field (THOR, ByteAcc, NoLock, Preserve)
        {
            ECMI,   8,
            ECMD,   8
        }

        IndexField (ECMI, ECMD, ByteAcc, NoLock, Preserve)
        {
                    Offset (0xF0),
            TMIN,   8,
            TMAX,   8,
            TCRT,   8
        }

// These are the fields we want to see

        Name (TSP, 0x05)
        Name (TC1, 0x02)
        Name (TC2, 0x04)
        OperationRegion (TSN1, SystemIO, 0x0C20, 0x01)
        Field (TSN1, ByteAcc, NoLock, Preserve)
        {
            CURT,   8
        }

        Method (TCHG, 0, NotSerialized)
        {
            Noop
        }

        Method (RTMP, 0, NotSerialized)
        {
            Not (CURT, Local0)
            Subtract (Local0, 0xB3, Local0)
            Not (Local0, Local0)
            Add (Local0, 0x01, Local0)
            And (Local0, 0xFF, Local0)
            Store (Local0, Local1)
            Divide (Local0, 0x0A, Local0, Local2)
            Subtract (Local1, Local2, Local0)
            ShiftRight (Local0, 0x01, Local0)
            Store (Local0, DBG8)
            Return (Local0)
        }
        ThermalZone (THRM)
        {
            Method (_CRT, 0, NotSerialized)
            {
                Return (KELV (TCRT))
            }

            Method (_TMP, 0, NotSerialized)
            {
                If (LEqual (TCRT, 0x4F))
                {
                    Return (KELV (0x1E))
                }
                Else
                {
                    Return (KELV (RTMP ()))
                }
            }

            Name (_PSL, Package (0x01)
            {
                \_PR.CPU1
            })
            Method (_TSP, 0, NotSerialized)
            {
                Multiply (TSP, 0x0A, Local0)
                Return (Local0)
            }

            Method (_TC1, 0, NotSerialized)
            {
                Return (TC1)
            }

            Method (_TC2, 0, NotSerialized)
            {
                Return (TC2)
            }

            Method (_PSV, 0, NotSerialized)
            {
                If (PLCY)
                {
                    Return (KELV (TMIN))
                }
                Else
                {
                    Return (KELV (TMAX))
                }
            }
        }
    }
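
For readers who don't read ASL, the arithmetic in KELV and RTMP can be
mirrored in Python. This is only a sketch of the two methods as listed
above; the function names are mine, the Store to DBG8 is left out, and
Python's unbounded integers stand in for AML integers, which is harmless
here because RTMP masks the value with 0xFF before using it.

```python
def kelv(raw):
    """Mirror of KELV: a Celsius byte converted to tenths of a kelvin."""
    return (raw & 0xFF) * 10 + 0x0AAC  # 0x0AAC = 2732, i.e. 273.2 K


def rtmp(curt):
    """Mirror of RTMP, one statement per AML line (DBG8 store omitted)."""
    local0 = ~curt            # Not (CURT, Local0)
    local0 = local0 - 0xB3    # Subtract (Local0, 0xB3, Local0)
    local0 = ~local0          # Not (Local0, Local0)
    local0 = local0 + 1       # Add (Local0, 0x01, Local0)
    local0 = local0 & 0xFF    # And (Local0, 0xFF, Local0)
    local1 = local0           # Store (Local0, Local1)
    local2 = local0 // 10     # Divide: the quotient ends up in Local2
    local0 = local1 - local2  # Subtract (Local1, Local2, Local0)
    return local0 >> 1        # ShiftRight (Local0, 0x01, Local0)


def decikelvin_to_celsius(value):
    """Convert tenths of a kelvin back to degrees Celsius."""
    return (value - 2732) / 10


# A TCRT byte of 65 is returned by _CRT as 3382 (338.2 K), i.e. 65 C.
print(kelv(65), decikelvin_to_celsius(kelv(65)))
```

So _CRT and _PSV pass the register bytes through unchanged as degrees
Celsius, while _TMP scales the raw sensor byte: after the mask, RTMP
returns (x - x/10) / 2 in integer arithmetic.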

please paste the output from these commands:
# outb 0xF0 0x72
# inb 0x73
this should give us TMIN
# outb 0xF1 0x72
# inb 0x73
this should give us TMAX
# outb 0xF2 0x72
# inb 0x73
This should give us TCRT 

If you repeat this, you should get the same answers.
It would be good to verify also that you get the same
answers with "acpi=off".

BTW. booting with "acpi=off" should work-around the symptom of this bug.
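
If no inb/outb utilities are at hand, the index/data access requested
above can be sketched in Python through /dev/port (root required). This
is an illustration only: the function names are made up, and the port
numbers 0x72/0x73 and offsets 0xF0-0xF2 are taken from the THOR region
and IndexField in the DSDT above.

```python
INDEX_PORT = 0x72  # ECMI in the DSDT
DATA_PORT = 0x73   # ECMD in the DSDT
REGISTERS = {"TMIN": 0xF0, "TMAX": 0xF1, "TCRT": 0xF2}


def read_indexed(ports, index):
    """Select a register via the index port, then read the data port."""
    ports.seek(INDEX_PORT)
    ports.write(bytes([index]))
    ports.seek(DATA_PORT)
    return ports.read(1)[0]


def read_trip_registers(ports):
    """Return the raw TMIN/TMAX/TCRT bytes as a dict."""
    return {name: read_indexed(ports, index)
            for name, index in REGISTERS.items()}


def dump_from_dev_port(path="/dev/port"):
    """Print the three registers; call this as root on the real machine."""
    with open(path, "r+b", buffering=0) as ports:
        for name, value in read_trip_registers(ports).items():
            print(f"{name}: {value} (0x{value:02X})")
```

Calling dump_from_dev_port() twice, and once more after booting with
"acpi=off", covers the repeat checks asked for above.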

Comment 19 Michael Schwendt 2007-07-20 09:09:05 UTC
Created attachment 159636 [details]
dmesg

Comment 20 Michael Schwendt 2007-07-20 09:09:38 UTC
Created attachment 159637 [details]
lspci

Comment 21 Michael Schwendt 2007-07-20 09:12:03 UTC
> mystery #1: 65C != BIOS SETUP ACPI critical shutdown temperature.
> 
> Can you actually change this field in the BIOS SETUP?

Yes, I can choose from "disabled, 70 C, 80 C and 90 C".

> If yes, do the changes there have any effect at all
> on what you see in the trip_points file?

Seems so.
90 C maps to 65 C critical, 55 C passive
80 C maps to 60 C critical, 50 C passive

Looks like factor /2 is involved somewhere.

> BTW. Does this system have a fan?

CPU fan and power fan.

> please paste the output from these commands:
> # outb 0xF0 0x72
> # inb 0x73

Where do I get the commands? I've done it in C as a work-around:

# ~misc/files/source/inb_outb 
55
55
65

> If you repeat this, you should get the same answers.

Yes.

> It would be good to verify also that you get the same
> answers with "acpi=off".

I do.

> BTW. booting with "acpi=off" should work-around the symptom of this bug.

Been doing that with Rawhide since it came into my mind, too.


Comment 22 Len Brown 2007-07-20 16:13:09 UTC
> # ~misc/files/source/inb_outb 
> 55
> 55
> 65

Assuming this is the case with 65 critical and 55 passive,
this confirms that Linux/ACPI/AML are correctly reading
and acting on the underlying memory locations where the BIOS
is storing these trip points.

> Yes, I can choose from "disabled, 70 C, 80 C and 90 C".

> 90 C maps to 65 C critical, 55 C passive
> 80 C maps to 60 C critical, 50 C passive

If you request 70 I assume you get a critical shutdown during boot?
What if you request 70, boot "acpi=off" and run inb_outb?
Let me guess, we get 55 critical and 45 passive?
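
The guess above is what a straight line through the two reported data
points gives. A quick Python check follows; the helper names are made up,
and assuming the mapping is linear is exactly that, an assumption, since
two points always fit some line.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1


def predict(points, x):
    """Extrapolate the line through `points` to a new x value."""
    slope, intercept = line_through(*points)
    return slope * x + intercept


# (BIOS SETUP value, value shown in trip_points), from comment 21.
critical = [(90, 65), (80, 60)]
passive = [(90, 55), (80, 50)]

print(line_through(*critical))  # slope 0.5, intercept 20.0
print(predict(critical, 70), predict(passive, 70))
```

Both lines have slope 0.5, which is the "factor /2" noticed earlier, and
extrapolating to a SETUP value of 70 gives 55 critical and 45 passive.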

What do you see if you request "disabled"?
What is the default setting for this parameter if you globally
reset the BIOS to SETUP defaults?

What do you see if you modify inb_outb to do this:

# outb 0xF0 0x72
# outb 0x55 0x73 // set TMIN to 85C = 0x55

# outb 0xF0 0x72
# inb 0x73 // this should give us TMIN

# outb 0xF1 0x72
# outb 0x56 0x73 // set TMAX to 86C = 0x56

# outb 0xF1 0x72
# inb 0x73 // this should give us TMAX

# outb 0xF2 0x72
# outb 0x5A 0x73 // set TCRT to 90C = 0x5A

# outb 0xF2 0x72
# inb 0x73 // this should give us TCRT
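
The write-then-read-back sequence can be sketched the same way in Python
through /dev/port (root required). The names below are made up for
illustration, the values are the ones named in the comments (85, 86 and
90 C), and whether the hardware accepts or retains the writes is exactly
what the experiment is meant to find out.

```python
INDEX_PORT = 0x72  # ECMI in the DSDT
DATA_PORT = 0x73   # ECMD in the DSDT

# Register offset -> new value: TMIN = 85 C, TMAX = 86 C, TCRT = 90 C.
NEW_VALUES = {0xF0: 0x55, 0xF1: 0x56, 0xF2: 0x5A}


def write_then_read(ports, index, value):
    """Write `value` to the indexed register, then read it back."""
    ports.seek(INDEX_PORT)
    ports.write(bytes([index]))
    ports.seek(DATA_PORT)
    ports.write(bytes([value]))
    # Re-select the register before reading, as in the command list.
    ports.seek(INDEX_PORT)
    ports.write(bytes([index]))
    ports.seek(DATA_PORT)
    return ports.read(1)[0]


def apply_all(ports):
    """Write every entry of NEW_VALUES; return what was read back."""
    return {index: write_then_read(ports, index, value)
            for index, value in NEW_VALUES.items()}
```

On the real machine this would be run as
apply_all(open("/dev/port", "r+b", buffering=0)); a read-back that differs
from NEW_VALUES means the register did not keep the written value.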

you might need a temperature event to coax Linux/ACPI to re-evaluate
these trip-points, which should re-read them from memory.
possibly setting /proc/acpi/thermal_zone/.../polling_frequency
to a non-zero value for a bit would be enough to make this happen.

However, the real question is where the EC/temperature-sensors
on this box are going to trip.  Are they tripping at the
BIOS SETUP points, the points in TMIN, TCRT, or some other
values that we don't see?

You should be able to determine this by disabling the critical shutdown:
# mv /sbin/poweroff /sbin/poweroff.orig
and enabling monitoring of ACPI events:
# killall acpid
# cat /proc/acpi/event

Then poll the temperature in /proc/acpi/thermal_zone,
run something to heat up the system, and see whether
an ACPI critical trip point event
actually occurs at the trip point specified.
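
A small Python sketch for the polling step follows. The function names
are made up; the file layout ("name:   NN C") is the one shown in the
outputs earlier in this report, and the THRM zone path is the reporter's.

```python
import re
import time

# Matches lines such as "temperature:             63 C"
# or "critical (S5):           65 C".
_VALUE = re.compile(r"^[ \t]*([^:\n]+):[ \t]+(\d+) C", re.MULTILINE)


def parse_celsius(text):
    """Map each 'name:   NN C' line to an integer number of degrees."""
    return {name.strip(): int(value) for name, value in _VALUE.findall(text)}


def poll(zone="/proc/acpi/thermal_zone/THRM", interval=2.0, count=10):
    """Print a timestamped temperature next to the current trip points."""
    for _ in range(count):
        with open(f"{zone}/temperature") as handle:
            temperature = parse_celsius(handle.read())["temperature"]
        with open(f"{zone}/trip_points") as handle:
            trips = parse_celsius(handle.read())
        print(time.strftime("%H:%M:%S"), temperature, trips)
        time.sleep(interval)
```

Running poll() in one terminal while watching /proc/acpi/event in another
makes it possible to line up any event with the temperature at that
moment.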

It is possible that your original critical shutdown issue
was not actually triggered by a critical shutdown event,
but a mundane temperature change event that caused Linux
to compare current temperature vs the (bogus) critical shutdown
point.

The other mystery, of course, is what Windows does.
It looks like Linux/ACPI are reading the hardware properly,
so it might be that Windows has a platform-specific workaround
that applies to this chipset or this BIOS.  I wonder if Windows
has a mechanism where they display the trip points -- if so,
it would be interesting to see what they display for each
of the BIOS SETUP selections...





Comment 23 Michael Schwendt 2007-07-23 14:30:34 UTC
Closing as CANTFIX: if it's a mainboard/BIOS bug, I don't have the
time to deal with it.

Comment 24 Len Brown 2007-07-23 17:45:19 UTC
Michael,
As we're not quite at the bottom of this sighting...
if you file a new sighting at bugzilla.kernel.org
vs ACPI/thermal, drop in the URL of this report,
we can resume poking at it there...

Comment 25 Len Brown 2007-08-20 22:52:27 UTC
Created attachment 161932 [details]
2.6.23-rc3+ patch to disable critical trip points on GA-7ZX

I've added the attached DMI entry to the acpi-test tree
to disable ACPI critical trip point actions on this board,
and expect it to ship upstream in 2.6.23.

Comment 26 Len Brown 2007-08-26 04:22:23 UTC
The patch in comment #25 shipped in Linux-2.6.23-rc3-git9

Comment 27 Michael Schwendt 2007-08-27 10:21:06 UTC
Works for me. Thank you!

$ uname -a
Linux rawhide.intranet 2.6.23-0.139.rc3.git10.fc8 #1 SMP Sun Aug 26 19:53:26 EDT 2007 i686 athlon i386 GNU/Linux

$ grep Giga /var/log/dmesg 
ACPI: Gigabyte GA-7ZX detected: disabling all critical thermal trip point actions.

$ cat /proc/acpi/thermal_zone/THRM/trip_points 
critical (S5):           65 C <disabled>
passive:                 55 C: tc1=2 tc2=4 tsp=50 devices=CPU1