Created attachment 884475
messages output for booting into failing kernel and reboot into working kernel

Description of problem:

The screen freezes a few seconds after GNOME appears (after booting). The error message (kernel panic: machine check exception, see below) is only rarely still printed to the screen.

Booting 3.12.11-201.fc19.x86_64 with otherwise the same setup, I do not see the panic (last working kernel). All later releases produce the problem (from 3.13.5-101.fc19.x86_64 to the current 3.13.9-100.fc19.x86_64).

Booting on different hardware (my laptop) does not produce the panic. Replacing the graphics card also avoids the panic. This strongly suggests a graphics-related problem. My graphics card: Sapphire ATI Radeon HD 4830 (RV770 chip).

I tried booting into runlevel 3 (text mode), but the error persists.

I also noticed that in 3.13.9-100 the error always occurs right after _logging in_, not a few seconds after the GNOME screen appears, as in earlier versions. Has something that used to be loaded before the login been moved so that it now loads only after the login?

I am unsure whether this is related, but I was also affected by the following bug:
https://bugs.freedesktop.org/show_bug.cgi?id=44099

I attached the relevant part of /var/log/messages. The error occurs right after Apr 9 11:43:28.

I also filed a kernel bug report:
http://lkml.iu.edu//hypermail/linux/kernel/1403.2/03734.html
Subject Name: PROBLEM: Fatal Machine Check >= 3.13.5-101.fc19.x86_64 21.03.2014

The Kernel Panic Screen Output:

[ 34.348483] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 0: b200004000000800
[ 44.468168] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff816901f0> {apic_timer_interrupt+0x0/0x80}
[ 44.468168] mce: [Hardware Error]: TSC 365779ad0c
[ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 2 microcode ba
[ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 44.468168] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 5: b200220024080400
[ 44.468168] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff816901f0> {apic_timer_interrupt+0x0/0x80}
[ 44.468168] mce: [Hardware Error]: TSC 365779ad0c
[ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 2 microcode ba
[ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 44.468168] mce: [Hardware Error]: CPU 1: Machine Check Exception: 4 Bank 0: b200004000000800
[ 44.468168] mce: [Hardware Error]: TSC 365779ad42
[ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 3 microcode ba
[ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 44.468168] mce: [Hardware Error]: CPU 1: Machine Check Exception: 4 Bank 5: b200220010040400
[ 44.468168] mce: [Hardware Error]: TSC 365779ad42
[ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 3 microcode ba
[ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 44.468168] mce: [Hardware Error]: CPU 2: Machine Check Exception: 4 Bank 0: b200004000000800
[ 44.468168] mce: [Hardware Error]: TSC 365779aeaa
[ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 1 microcode ba
[ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 44.468168] mce: [Hardware Error]: CPU 2: Machine Check Exception: 4 Bank 5: b200221010040400
[ 44.468168] mce: [Hardware Error]: TSC 365779aeaa
[ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 1 microcode ba
[ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 44.468168] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 5: b200221024080400
[ 44.468168] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff816901f0> {apic_timer_interrupt+0x0/0x80}
[ 44.468168] mce: [Hardware Error]: TSC 365779aece
[ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 0 microcode ba
[ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 44.468168] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 0: b200004000000800
[ 44.468168] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff816901f0> {apic_timer_interrupt+0x0/0x80}
[ 44.468168] mce: [Hardware Error]: TSC 365779aece
[ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 0 microcode ba
[ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 44.468168] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 44.468168] Kernel panic - not syncing: Fatal Machine check
[ 44.468168] drm_kms_helper: panic occurred, switching back to text console
[ 44.468168] Rebooting in 30 seconds..

MCElog output for the above:

Hardware event. This is not a software error.
CPU 3 BANK 0
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access Request-did-not-timeout Error
BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE timeout BINIT (ROB timeout). No micro-instruction retired for some time
STATUS b200004000000800 MCGSTATUS 5

Hardware event. This is not a software error.
CPU 3 BANK 5
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200220024080400 MCGSTATUS 5

Hardware event. This is not a software error.
CPU 1 BANK 0
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access Request-did-not-timeout Error
BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE timeout BINIT (ROB timeout). No micro-instruction retired for some time
STATUS b200004000000800 MCGSTATUS 4

Hardware event. This is not a software error.
CPU 1 BANK 5
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200220010040400 MCGSTATUS 4

Hardware event. This is not a software error.
CPU 2 BANK 0
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access Request-did-not-timeout Error
BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE timeout BINIT (ROB timeout). No micro-instruction retired for some time
STATUS b200004000000800 MCGSTATUS 4

Hardware event. This is not a software error.
CPU 2 BANK 5
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200221010040400 MCGSTATUS 4

Hardware event. This is not a software error.
CPU 0 BANK 5
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200221024080400 MCGSTATUS 5

Hardware event. This is not a software error.
CPU 0 BANK 0
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access Request-did-not-timeout Error
BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE timeout BINIT (ROB timeout). No micro-instruction retired for some time
STATUS b200004000000800 MCGSTATUS 5

May be relevant:

On Fri, Mar 21, 2014 at 1:13 PM, Borislav Petkov <bp> wrote:
> Provided the decode is correct and I'm reading it right, this looks
> like the cores get to livelock for some reason without any forward
> progress. The MCEs signal that there hasn't been any instruction retired
> in relatively long time, thus a stall.

Agreed. There are some bus level errors (low 16 bits of STATUS 0x0800) and some timeout (low bits 0x0400)

> You say, this happens when gnome starts. Does it also happen if you
> don't start gnome, i.e. don't start X at all? Try booting into a
> runlevel without graphics.
>
> Tony, any other ideas?

My best guess is graphics? driver making wild access to some i/o regs that never respond. If booting without graphics works, then that adds some weight to the theory.

Other useful tests would be to check upstream kernels 3.12, 3.13 to see if something is odd in the Fedora additions. And 3.14-rc7 to see if it is already fixed upstream.

If upstream 3.12 works and 3.13 breaks (and not fixed in 3.14-rc7) ... then bisecting between 3.12 and 3.13 would be helpful.

-Tony
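Tony's reading of the STATUS values can be reproduced by hand. The following is a minimal sketch, not part of mcelog: it assumes the architectural IA32_MCi_STATUS layout (bit 63 valid, bit 61 uncorrected, bit 60 enabled, bit 57 processor context corrupt, low 16 bits MCA error code) and the IA32_MCG_STATUS flags RIPV, EIPV and MCIP in bits 0 to 2. The function names are made up for illustration, and only the two error-code patterns that appear in this report are classified.

```python
def decode_mci_status(status: int) -> dict:
    """Split an MCi_STATUS value into the fields discussed in this report."""
    code = status & 0xFFFF  # architectural MCA error code (low 16 bits)
    if code == 0x0400:
        kind = "internal timer"            # 0000 0100 0000 0000
    elif (code & 0xF800) == 0x0800:
        kind = "bus/interconnect"          # 0000 1PPT RRRR IILL
    else:
        kind = "other"                     # not classified by this sketch
    return {
        "valid": bool((status >> 63) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "enabled": bool((status >> 60) & 1),
        "context_corrupt": bool((status >> 57) & 1),
        "mca_code": code,
        "model_code": (status >> 16) & 0xFFFF,  # model-specific error code
        "kind": kind,
    }


def decode_mcg_status(value: int) -> list:
    """Return the names of the flags set in an MCG_STATUS value."""
    flags = ((0, "RIPV"), (1, "EIPV"), (2, "MCIP"))
    return [name for bit, name in flags if (value >> bit) & 1]


if __name__ == "__main__":
    # STATUS values copied from the panic output above.
    for raw in (0xB200004000000800, 0xB200220024080400):
        fields = decode_mci_status(raw)
        print(hex(raw), fields["kind"], hex(fields["mca_code"]))
    print(decode_mcg_status(5), decode_mcg_status(4))
```

Under these assumptions every Bank 0 entry decodes as an uncorrected bus/interconnect error and every Bank 5 entry as an internal timer error, all with the context-corrupt bit set, which agrees with the mcelog output.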
Fine-grained bisection result:

ab70b1dde73ff4525c3cd51090c233482c50f217 is the first bad commit
commit ab70b1dde73ff4525c3cd51090c233482c50f217
Author: Alex Deucher <alexander.deucher>
Date:   Fri Nov 1 15:16:02 2013 -0400

    drm/radeon: enable DPM by default on r7xx asics

    Seems to be stable on them.

    Signed-off-by: Alex Deucher <alexander.deucher>

:040000 040000 f3262029b868df4d882f64b4deba6b9230e307ea 1f1dfca42763703a56e3cc82bb103608a24be94e M	drivers

Patch that resolved the issue:

diff --git a/drivers/gpu/drm/radeon/radeon_pm.c b/drivers/gpu/drm/radeon/radeon_pm.c
index ee738a524639..af693c4746da 100644
--- a/drivers/gpu/drm/radeon/radeon_pm.c
+++ b/drivers/gpu/drm/radeon/radeon_pm.c
@@ -1257,6 +1257,10 @@ int radeon_pm_init(struct radeon_device *rdev)
 	case CHIP_RV670:
 	case CHIP_RS780:
 	case CHIP_RS880:
+	case CHIP_RV770:
+	case CHIP_RV730:
+	case CHIP_RV710:
+	case CHIP_RV740:
 	case CHIP_BARTS:
 	case CHIP_TURKS:
 	case CHIP_CAICOS:
@@ -1273,10 +1277,6 @@ int radeon_pm_init(struct radeon_device *rdev)
 		else
 			rdev->pm.pm_method = PM_METHOD_PROFILE;
 		break;
-	case CHIP_RV770:
-	case CHIP_RV730:
-	case CHIP_RV710:
-	case CHIP_RV740:
 	case CHIP_CEDAR:
 	case CHIP_REDWOOD:
 	case CHIP_JUNIPER:
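To make the effect of the patch easier to see, here is a toy model of the chip grouping it changes. It is a sketch under assumptions, not the kernel's logic. The diff only shows that the first case group ends with a fallback to PM_METHOD_PROFILE. The model assumes that this group selects DPM only when it is explicitly requested, and that the second group selects DPM unless it is explicitly disabled. The `dpm_param` argument is a hypothetical stand-in for the driver's DPM setting (1 on, 0 off, -1 auto), and the firmware checks the real function may perform are left out.

```python
PROFILE, DPM = "profile", "dpm"

# The four chips the patch moves from one case group to the other.
RV7XX = {"RV770", "RV730", "RV710", "RV740"}


def pm_method(chip: str, dpm_param: int = -1, patched: bool = False) -> str:
    """Toy model of the power-management method chosen for a chip."""
    # Neighbouring case labels visible in the diff context.
    opt_in = {"RV670", "RS780", "RS880", "BARTS", "TURKS", "CAICOS"}
    opt_out = {"CEDAR", "REDWOOD", "JUNIPER"}
    if patched:
        opt_in = opt_in | RV7XX      # after the patch: DPM is opt-in
    else:
        opt_out = opt_out | RV7XX    # first bad commit: DPM by default
    if chip in opt_in:
        return DPM if dpm_param == 1 else PROFILE    # assumed condition
    if chip in opt_out:
        return PROFILE if dpm_param == 0 else DPM    # assumed condition
    return PROFILE


if __name__ == "__main__":
    print(pm_method("RV770"))                 # unpatched, default setting
    print(pm_method("RV770", patched=True))   # patched, default setting
```

In this model an RV770 card with default settings uses DPM before the patch and the profile method after it, which fits the bisection result pointing at the commit that enabled DPM by default on r7xx.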
This message is a notice that Fedora 19 is now at end of life. Fedora has stopped maintaining and issuing updates for Fedora 19. It is Fedora's policy to close all bug reports from releases that are no longer maintained. Approximately 4 (four) weeks from now this bug will be closed as EOL if it remains open with a Fedora 'version' of '19'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 19 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 19 changed to end-of-life (EOL) status on 2015-01-06. Fedora 19 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.