Bug 1047061 - F20 causes "hardware error" reports and crashes
Summary: F20 causes "hardware error" reports and crashes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 20
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2013-12-28 19:12 UTC by Göran Uddeborg
Modified: 2014-05-21 19:57 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2014-05-21 19:57:06 UTC
Type: Bug
Embargoed:


Description Göran Uddeborg 2013-12-28 19:12:02 UTC
Description of problem:
After upgrading the kernel and linux-firmware to F20, I started to get error messages about hardware errors.  The host has been running fine since March, so I didn't immediately believe those error messages.  But something was clearly wrong: the host started to crash intermittently.

I dug through the history and experimented with older kernels and packages.  If I run the latest F19 kernel I used, with /usr/lib/firmware holding the contents of the F19 linux-firmware I used, the machine is stable again and does not crash.  But if I try an F20 kernel, or the F19 kernel with firmware from F20, I get error messages and crashes.

Version-Release number of selected component (if applicable):
These fail:
  linux-firmware-20130724-31.git31f6b30.fc20.noarch
  kernel-3.12.5-302.fc20.x86_64
Using these the machine is stable:
  linux-firmware-20130418-0.1.gitb584174.fc19.src.rpm (only replacing /usr/lib/firmware, otherwise F20 package)
  kernel-3.10.11-200.fc19.x86_64


How reproducible:
The error messages start more or less immediately and come at irregular intervals.  The crash can come within an hour, or it can take a day.  I guess it is somehow related to how the machine is used, but I haven't figured out any pattern.


Additional info:
The typical error messages I get look like this:

dec 28 15:44:18 mimmi kernel: mce: [Hardware Error]: Machine check events logged

I tried mcelog to see if I could get more information.  When run, it says the CPU is unsupported, and "AMD Processor family 21: Please load edac_mce_amd module."  Doing a modprobe on that, I get error messages like this instead:

dec 27 23:12:17 mimmi kernel: [Hardware Error]: MC2 Error: WCC Data ECC error.
dec 27 23:12:18 mimmi kernel: [Hardware Error]: Error Status: Corrected error, no action required.
dec 27 23:12:18 mimmi kernel: [Hardware Error]: CPU:2 (15:10:1) MC2_STATUS[Over|CE|MiscV|-|AddrV|-|-|CECC]: 0xdc054000010a0145
dec 27 23:12:18 mimmi kernel: [Hardware Error]: MC2_ADDR: 0x0000000000000040
dec 27 23:12:18 mimmi kernel: [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DWR

These are the ones that the system survives.  I don't know what they would look like when it crashes.
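
For anyone decoding the MC2_STATUS value above by hand, here is a minimal Python sketch. It assumes the standard x86 MCi_STATUS layout (Val, Over, UC, En, MiscV, AddrV, PCC in bits 63 down to 57) plus the AMD-specific CECC/UECC bits at 46 and 45. The function name decode_mci_status is hypothetical, and this is not the kernel's own decoder.

```python
# Sketch: decode the flag bits of an x86 MCi_STATUS register value.
# Bit positions are an assumption based on the architectural MCA layout;
# bits 46/45 (CECC/UECC) are AMD-specific.
STATUS_BITS = {
    63: "Val",    # register contains a valid error
    62: "Over",   # an earlier error was overwritten
    61: "UC",     # uncorrected error (clear means corrected)
    60: "En",     # error reporting was enabled
    59: "MiscV",  # MCi_MISC holds valid data
    58: "AddrV",  # MCi_ADDR holds a valid address
    57: "PCC",    # processor context corrupt
    46: "CECC",   # correctable ECC error (AMD)
    45: "UECC",   # uncorrectable ECC error (AMD)
}


def decode_mci_status(status):
    """Return a dict mapping each flag name to True/False."""
    return {name: bool((status >> bit) & 1) for bit, name in STATUS_BITS.items()}


if __name__ == "__main__":
    flags = decode_mci_status(0xDC054000010A0145)  # value from the log above
    severity = "uncorrected" if flags["UC"] else "corrected"
    set_flags = [name for name, on in flags.items() if on]
    print(severity, set_flags)
```

With the value from the log, UC is clear and CECC is set, which agrees with the kernel's "Corrected error" line and the bracketed flag list [Over|CE|MiscV|-|AddrV|-|-|CECC].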

The CPU is an AMD with combined CPU and GPU, AMD A8-5600K APU.  So when it comes to firmware, both CPU and GPU firmware might be involved.

I do get some error messages in the log when booting the working configuration.  I am slightly confused here: I am using the F19 kernel and F19 firmware, but they don't quite seem to match.  The missing TAHITI_uvd.bin and microcode_amd_fam15h.bin both exist in the newer linux-firmware.  But the machine is stable!

dec 28 16:25:20 mimmi kernel: [drm] radeon: 512M of VRAM memory ready
dec 28 16:25:20 mimmi kernel: [drm] radeon: 512M of GTT memory ready.
dec 28 16:25:20 mimmi kernel: radeon 0000:00:01.0: radeon_uvd: Can't load firmware "radeon/TAHITI_uvd.bin"
dec 28 16:25:20 mimmi kernel: [drm] GART: num cpu pages 131072, num gpu pages 131072
dec 28 16:25:20 mimmi kernel: [drm] Loading ARUBA Microcode
dec 28 16:25:20 mimmi kernel: [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).

dec 28 16:25:29 mimmi kernel: microcode: failed to load file amd-ucode/microcode_amd_fam15h.bin
dec 28 16:25:29 mimmi kernel: microcode: CPU1: patch_level=0x06001116
dec 28 16:25:29 mimmi kernel: microcode: CPU2: patch_level=0x06001116
dec 28 16:25:29 mimmi kernel: microcode: CPU3: patch_level=0x06001116

When starting a newer kernel with the old firmware, I don't see these error messages.  (Which is also confusing me.  I thought the new kernel would try to use them.)
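
When comparing boots, the microcode lines in the log above can be pulled out mechanically to check whether every CPU reports the same patch level. The following is a rough Python sketch under the assumption that the log lines have exactly the "microcode: CPUn: patch_level=0x..." shape quoted above; the helper name patch_levels is hypothetical.

```python
import re

# Matches lines such as "microcode: CPU1: patch_level=0x06001116".
# The pattern is an assumption derived from the log excerpt in this report.
PATCH_RE = re.compile(r"microcode: CPU(\d+): patch_level=(0x[0-9a-fA-F]+)")


def patch_levels(log_text):
    """Map CPU number to microcode patch level for every matching line."""
    return {int(cpu): int(level, 16) for cpu, level in PATCH_RE.findall(log_text)}


SAMPLE_LOG = """\
dec 28 16:25:29 mimmi kernel: microcode: failed to load file amd-ucode/microcode_amd_fam15h.bin
dec 28 16:25:29 mimmi kernel: microcode: CPU1: patch_level=0x06001116
dec 28 16:25:29 mimmi kernel: microcode: CPU2: patch_level=0x06001116
dec 28 16:25:29 mimmi kernel: microcode: CPU3: patch_level=0x06001116
"""

if __name__ == "__main__":
    levels = patch_levels(SAMPLE_LOG)
    # One distinct value means all listed CPUs run the same microcode patch.
    print(sorted(levels), {hex(v) for v in levels.values()})
```

Running the same function over the log of a failing boot would show whether the newer firmware actually changed the patch level.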

Comment 1 Justin M. Forbes 2014-02-24 14:01:20 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs.

Fedora 20 has now been rebased to 3.13.4-200.fc20.  Please test this kernel update and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 2 Göran Uddeborg 2014-02-26 20:35:47 UTC
I tried

kernel-3.13.4-200.fc20.x86_64
linux-firmware-20140131-35.gitd7f8a7c8.fc20.noarch

It crashed when about to start up X.  (I run the boot without the graphics in order to get a glimpse of any error messages.)

I also tried

kernel-3.10.11-200.fc19.x86_64
linux-firmware-20140131-35.gitd7f8a7c8.fc20.noarch

That crashed too, slightly earlier in the boot process.

So unfortunately it doesn't seem like the problem has disappeared.

Comment 3 Justin M. Forbes 2014-05-21 19:40:15 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs.

Fedora 20 has now been rebased to 3.14.4-200.fc20.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 4 Göran Uddeborg 2014-05-21 19:57:06 UTC
I actually tried the then most recent kernel, 3.13.10-200.fc20.x86_64, a little while ago, and did manage to boot it.  After a couple of experiments with reboots, I am starting to feel that this was not just one-time luck.  I think this bug has been fixed.

