Bug 1085785 - Kernel panic: Fatal Machine Check after booting >= 3.13.5-101.fc19.x86_64
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: xorg-x11-drv-ati
Version: 19
Hardware: x86_64
OS: Linux
Target Milestone: ---
Assignee: X/OpenGL Maintenance List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-04-09 11:10 UTC by Matthias
Modified: 2015-02-17 20:08 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2015-02-17 20:08:57 UTC
Type: Bug
Embargoed:


Attachments
messages output for booting into failing kernel and reboot into working kernel (255.18 KB, text/plain)
2014-04-09 11:10 UTC, Matthias

Description Matthias 2014-04-09 11:10:33 UTC
Created attachment 884475
messages output for booting into failing kernel and reboot into working kernel

Description of problem:
The screen freezes a few seconds after GNOME appears following boot. The error message (kernel panic: machine check exception, see below) only rarely makes it to the screen.

With 3.12.11-201.fc19.x86_64 and an otherwise identical setup, I do not see the panic (last working kernel). All later releases show the problem, from 3.13.5-101.fc19.x86_64 up to the current 3.13.9-100.fc19.x86_64.

Booting on different hardware (my laptop) does not produce the panic, and replacing the graphics card also avoids it. This strongly suggests a graphics-related problem.

My graphics card: Sapphire ATI Radeon HD 4830 (RV770 chip).

I tried booting into runlevel 3 (text mode), but the error persists.

I also noticed that with 3.13.9-100 the error always occurs right after logging in, rather than a few seconds after the GNOME screen appears as in earlier versions. Has something that used to be loaded before login been moved to after login?

I am unsure whether this is related, but I was also affected by the following bug: https://bugs.freedesktop.org/show_bug.cgi?id=44099

I attached the relevant part of /var/log/messages. The error occurs right after the entry timestamped Apr  9 11:43:28.

I also filed a kernel bug report:
http://lkml.iu.edu//hypermail/linux/kernel/1403.2/03734.html
Subject: PROBLEM: Fatal Machine Check >= 3.13.5-101.fc19.x86_64
Date: 21.03.2014


The Kernel Panic Screen Output:
[ 34.348483] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 0: b200004000000800
[ 44.468168] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff816901f0> {apic_timer_interrupt+0x0/0x80}
[ 44.468168] mce: [Hardware Error]: TSC 365779ad0c
[ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 2 microcode ba
[ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 44.468168] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 5: b200220024080400
[ 44.468168] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff816901f0> {apic_timer_interrupt+0x0/0x80}
[ 44.468168] mce: [Hardware Error]: TSC 365779ad0c
[ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 2 microcode ba
[ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 44.468168] mce: [Hardware Error]: CPU 1: Machine Check Exception: 4 Bank 0: b200004000000800
[ 44.468168] mce: [Hardware Error]: TSC 365779ad42
[ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 3 microcode ba
[ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 44.468168] mce: [Hardware Error]: CPU 1: Machine Check Exception: 4 Bank 5: b200220010040400
[ 44.468168] mce: [Hardware Error]: TSC 365779ad42
[ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 3 microcode ba
[ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 44.468168] mce: [Hardware Error]: CPU 2: Machine Check Exception: 4 Bank 0: b200004000000800
[ 44.468168] mce: [Hardware Error]: TSC 365779aeaa
[ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 1 microcode ba
[ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 44.468168] mce: [Hardware Error]: CPU 2: Machine Check Exception: 4 Bank 5: b200221010040400
[ 44.468168] mce: [Hardware Error]: TSC 365779aeaa
[ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 1 microcode ba
[ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 44.468168] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 5: b200221024080400
[ 44.468168] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff816901f0> {apic_timer_interrupt+0x0/0x80}
[ 44.468168] mce: [Hardware Error]: TSC 365779aece
[ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 0 microcode ba
[ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 44.468168] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 0: b200004000000800
[ 44.468168] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff816901f0> {apic_timer_interrupt+0x0/0x80}
[ 44.468168] mce: [Hardware Error]: TSC 365779aece
[ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 0 microcode ba
[ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 44.468168] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 44.468168] Kernel panic - not syncing: Fatal Machine check
[ 44.468168] drm_kms_helper: panic occurred, switching back to text console
[ 44.468168] Rebooting in 30 seconds..
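
For anyone who wants to compare the per-CPU records in the panic output above without reading them by eye, here is a minimal Python sketch that pulls CPU, bank, MCGSTATUS and STATUS out of the "Machine Check Exception" lines. The regular expression and the function name parse_mce_records are my own and only assume the line format shown above; this is not part of mcelog or the kernel.

```python
import re

# Matches lines such as:
# [ 34.348483] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 0: b200004000000800
MCE_LINE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check Exception: (?P<mcg>[0-9a-f]+) "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-f]{16})"
)

def parse_mce_records(log_text):
    """Return (cpu, bank, mcgstatus, status) tuples for every MCE header line."""
    records = []
    for line in log_text.splitlines():
        m = MCE_LINE.search(line)
        if m:
            records.append((
                int(m.group("cpu")),
                int(m.group("bank")),
                int(m.group("mcg"), 16),     # MCGSTATUS is printed in hex
                int(m.group("status"), 16),  # 64-bit bank STATUS value
            ))
    return records

# Three header lines copied from the panic output above.
sample = """\
[ 34.348483] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 0: b200004000000800
[ 44.468168] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 5: b200220024080400
[ 44.468168] mce: [Hardware Error]: CPU 1: Machine Check Exception: 4 Bank 0: b200004000000800
"""

records = parse_mce_records(sample)
for cpu, bank, mcg, status in records:
    print(f"CPU {cpu} bank {bank}: MCGSTATUS={mcg:#x} STATUS={status:#018x}")
```

Run over the full output, this should make it easy to see that every CPU reports the same Bank 0 value and only the Bank 5 values differ.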



MCElog output for the above:

Hardware event. This is not a software error.
CPU 3 BANK 0 
MCG status:RIPV MCIP 
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access Request-did-not-timeout Error
BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
timeout BINIT (ROB timeout). No micro-instruction retired for some time
STATUS b200004000000800 MCGSTATUS 5


Hardware event. This is not a software error.
CPU 3 BANK 5 
MCG status:RIPV MCIP 
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200220024080400 MCGSTATUS 5


Hardware event. This is not a software error.
CPU 1 BANK 0 
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access Request-did-not-timeout Error
BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
timeout BINIT (ROB timeout). No micro-instruction retired for some time
STATUS b200004000000800 MCGSTATUS 4


Hardware event. This is not a software error.
CPU 1 BANK 5 
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200220010040400 MCGSTATUS 4


Hardware event. This is not a software error.
CPU 2 BANK 0 
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access Request-did-not-timeout Error
BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
timeout BINIT (ROB timeout). No micro-instruction retired for some time
STATUS b200004000000800 MCGSTATUS 4


Hardware event. This is not a software error.
CPU 2 BANK 5 
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200221010040400 MCGSTATUS 4

Hardware event. This is not a software error.
CPU 0 BANK 5 
MCG status:RIPV MCIP 
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200221024080400 MCGSTATUS 5


Hardware event. This is not a software error.
CPU 0 BANK 0 
MCG status:RIPV MCIP 
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access Request-did-not-timeout Error
BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
timeout BINIT (ROB timeout). No micro-instruction retired for some time
STATUS b200004000000800 MCGSTATUS 5
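
The flag lines mcelog prints above ("Uncorrected error", "Error enabled", "Processor context corrupt", RIPV, MCIP) can be cross-checked by hand from the raw values. The following is a minimal Python sketch, assuming the architectural bit positions of IA32_MCi_STATUS and IA32_MCG_STATUS as I understand them from the Intel SDM; the helper names set_flags and split_status are made up for illustration.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM).
MCI_FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
             59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Flag bits of IA32_MCG_STATUS (the "MCGSTATUS" value in the output).
MCG_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}

def set_flags(value, table):
    """Return the names of all flags in `table` whose bit is set in `value`."""
    return [name for bit, name in table.items() if (value >> bit) & 1]

def split_status(status):
    """Split a bank STATUS value into flags, model-specific code and MCA code."""
    return {
        "flags": set_flags(status, MCI_FLAGS),
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_code": status & 0xFFFF,                     # bits 15:0
    }

bank0 = split_status(0xb200004000000800)
bank5 = split_status(0xb200220024080400)
print(bank0["flags"], hex(bank0["mca_code"]))
print(bank5["flags"], hex(bank5["mca_code"]), hex(bank5["model_specific_code"]))
print(set_flags(5, MCG_FLAGS), set_flags(4, MCG_FLAGS))
```

UC, EN and PCC correspond to the three status lines mcelog prints for each record, and MCGSTATUS 5 versus 4 is the difference between "RIPV MCIP" and "MCIP" alone.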




May be relevant:

On Fri, Mar 21, 2014 at 1:13 PM, Borislav Petkov <bp> wrote:
> Provided the decode is correct and I'm reading it right, this looks
> like the cores get to livelock for some reason without any forward
> progress. The MCEs signal that there hasn't been any instruction retired
> in relatively long time, thus a stall.

Agreed. There are some bus level errors (low 16 bits of STATUS 0x0800)
and some timeout (low bits 0x0400)

> You say, this happens when gnome starts. Does it also happen if you
> don't start gnome, i.e. don't start X at all? Try booting into a
> runlevel without graphics.
>
> Tony, any other ideas?

My best guess is graphics? driver making wild access to some i/o regs that
never respond.  If booting without graphics works, then that adds some
weight to the theory.

Other useful tests would be to check upstream kernels 3.12, 3.13 to
see if something is odd in the Fedora additions. And 3.14-rc7 to see
if it is already fixed upstream.

If upstream 3.12 works and 3.13 breaks (and not fixed in 3.14-rc7) ...
then bisecting between 3.12 and 3.13 would be helpful.

-Tony
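
The two low-16-bit patterns mentioned in the reply above (0x0800 for the bus error, 0x0400 for the timeout) can be taken apart mechanically. This is a rough Python sketch, assuming the compound MCA error code layout "0000 1PPT RRRR IILL" for bus/interconnect errors and the fixed code 0x0400 for internal timer errors as I read them in the Intel SDM; the function name and the lookup tables are my own, and the RRRR request field is left numeric on purpose.

```python
# Field encodings of a bus/interconnect MCA error code: 0000 1PPT RRRR IILL
# (assumed from the Intel SDM; the value-0 names match the mcelog text above).
PARTICIPATION = {0: "local CPU originated request",
                 1: "local CPU responded to request",
                 2: "local CPU observed error as third party",
                 3: "generic"}
TRANSACTION = {0: "memory access", 1: "reserved", 2: "I/O", 3: "other transaction"}
LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic"}

def classify_mca_code(status):
    """Classify the low 16 bits (MCA error code) of a bank STATUS value."""
    code = status & 0xFFFF
    if code == 0x0400:
        return {"kind": "internal timer"}
    if (code & 0xF800) == 0x0800:
        return {
            "kind": "bus/interconnect",
            "participation": PARTICIPATION[(code >> 9) & 0x3],  # PP
            "timeout": bool((code >> 8) & 0x1),                 # T
            "request": (code >> 4) & 0xF,                       # RRRR, 0 = generic
            "transaction": TRANSACTION[(code >> 2) & 0x3],      # II
            "level": LEVEL[code & 0x3],                         # LL
        }
    return {"kind": "other", "code": code}

bus = classify_mca_code(0xb200004000000800)
timer = classify_mca_code(0xb200220024080400)
print(bus)
print(timer)
```

For 0x0800 every field is zero, which lines up with the mcelog text "BUS Level-0 Local-CPU-originated-request Generic Memory-access Request-did-not-timeout".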

Comment 1 Matthias 2014-04-18 12:13:35 UTC
Fine-grained bisection result:

ab70b1dde73ff4525c3cd51090c233482c50f217 is the first bad commit
commit ab70b1dde73ff4525c3cd51090c233482c50f217
Author: Alex Deucher <alexander.deucher>
Date:   Fri Nov 1 15:16:02 2013 -0400

    drm/radeon: enable DPM by default on r7xx asics

    Seems to be stable on them.

    Signed-off-by: Alex Deucher <alexander.deucher>

:040000 040000 f3262029b868df4d882f64b4deba6b9230e307ea
1f1dfca42763703a56e3cc82bb103608a24be94e M	drivers

Patch that resolved the issue:


diff --git a/drivers/gpu/drm/radeon/radeon_pm.c b/drivers/gpu/drm/radeon/radeon_pm.c
index ee738a524639..af693c4746da 100644
--- a/drivers/gpu/drm/radeon/radeon_pm.c
+++ b/drivers/gpu/drm/radeon/radeon_pm.c
@@ -1257,6 +1257,10 @@ int radeon_pm_init(struct radeon_device *rdev)
 	case CHIP_RV670:
 	case CHIP_RS780:
 	case CHIP_RS880:
+	case CHIP_RV770:
+	case CHIP_RV730:
+	case CHIP_RV710:
+	case CHIP_RV740:
 	case CHIP_BARTS:
 	case CHIP_TURKS:
 	case CHIP_CAICOS:
@@ -1273,10 +1277,6 @@ int radeon_pm_init(struct radeon_device *rdev)
 		else
 			rdev->pm.pm_method = PM_METHOD_PROFILE;
 		break;
-	case CHIP_RV770:
-	case CHIP_RV730:
-	case CHIP_RV710:
-	case CHIP_RV740:
 	case CHIP_CEDAR:
 	case CHIP_REDWOOD:
 	case CHIP_JUNIPER:
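
To make the effect of the patch above easier to see: it moves the four RV7xx chip IDs out of the group where DPM is used unless switched off and into the group where the profile method is the default. The following Python sketch is a simplified model of that selection, not the kernel code. It assumes the radeon.dpm module parameter takes -1 (auto), 0 (off) or 1 (on), ignores any firmware checks the real radeon_pm_init() performs, and only covers the chip names visible in the diff context; the function name and the `patched` switch are hypothetical.

```python
# Chip groups as far as they are visible in the diff context above.
PROFILE_BY_DEFAULT = {"CHIP_RV670", "CHIP_RS780", "CHIP_RS880",
                      "CHIP_BARTS", "CHIP_TURKS", "CHIP_CAICOS"}
DPM_BY_DEFAULT = {"CHIP_CEDAR", "CHIP_REDWOOD", "CHIP_JUNIPER"}
RV7XX = {"CHIP_RV770", "CHIP_RV730", "CHIP_RV710", "CHIP_RV740"}

def pm_method(chip, dpm=-1, patched=True):
    """Model which power management method would be picked for `chip`.

    dpm:     assumed radeon.dpm value (-1 auto, 0 off, 1 on)
    patched: True models the kernel with the patch above applied,
             False models the kernel after commit ab70b1dde73f
    """
    if chip in PROFILE_BY_DEFAULT or (patched and chip in RV7XX):
        # DPM only when explicitly requested.
        return "dpm" if dpm == 1 else "profile"
    if chip in DPM_BY_DEFAULT or (not patched and chip in RV7XX):
        # DPM unless explicitly disabled.
        return "profile" if dpm == 0 else "dpm"
    raise ValueError(f"chip not covered by this sketch: {chip}")

print(pm_method("CHIP_RV770", patched=False))         # affected kernel, default
print(pm_method("CHIP_RV770", patched=True))          # with the patch
print(pm_method("CHIP_RV770", dpm=0, patched=False))  # affected kernel, dpm off
```

If this model is right, booting an affected kernel with radeon.dpm=0 would be expected to select the same method as the patched kernel on the RV770, though I have not verified that here.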

Comment 2 Fedora End Of Life 2015-01-09 21:17:38 UTC
This message is a notice that Fedora 19 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 19. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained. Approximately 4 (four) weeks from now this bug will
be closed as EOL if it remains open with a Fedora 'version' of '19'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 19 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 3 Fedora End Of Life 2015-02-17 20:08:57 UTC
Fedora 19 changed to end-of-life (EOL) status on 2015-01-06. Fedora 19 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

