875194 – sbridge: HANDLING MCE MEMORY ERROR

Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Version:	17
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Linda Wang
QA Contact:	Fedora Extras Quality Assurance

Reported:	2012-11-09 18:55 UTC by joshua
Modified:	2018-05-03 01:31 UTC
CC List:	9 users
Last Closed:	2013-08-01 03:26:22 UTC
Type:	Bug

Description joshua 2012-11-09 18:55:21 UTC

Description of problem:

Fedora 17 x86_64 on a new server with 192 GB of memory. No actual problems to report, save these disturbing messages:

[ 3063.105128] sbridge: HANDLING MCE MEMORY ERROR
[ 3063.105139] CPU 4: Machine Check Exception: 0 Bank 5: 8c00004000010092
[ 3063.105141] TSC 0 ADDR 2f050eb080 MISC 24403ebe86 PROCESSOR 0:206d7 TIME 1352412000 SOCKET 1 APIC 20
[ 3064.017116] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Channel#2_DIMM#2 (channel:2 slot:2 page:0x2f050eb offset:0x80 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0092 socket:1 channel_mask:2 rank:8)
[ 5158.195757] sbridge: HANDLING MCE MEMORY ERROR
[ 5158.195765] CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010092
[ 5158.195767] TSC 0 ADDR 14a20fe680 MISC 2440242486 PROCESSOR 0:206d7 TIME 1352414098 SOCKET 0 APIC 0
[ 5158.997907] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#2_DIMM#0 (channel:2 slot:0 page:0x14a20fe offset:0x680 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0092 socket:0 channel_mask:2 rank:1)
[ 5390.677168] sbridge: HANDLING MCE MEMORY ERROR
[ 5390.677178] CPU 4: Machine Check Exception: 0 Bank 5: 8c00004000010092
[ 5390.677180] TSC 0 ADDR 2e1dabfd80 MISC 24404a4a86 PROCESSOR 0:206d7 TIME 1352414331 SOCKET 1 APIC 20
[ 5391.106704] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Channel#2_DIMM#0 (channel:2 slot:0 page:0x2e1dabf offset:0xd80 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0092 socket:1 channel_mask:8 rank:1)
etc.

memtest86+ runs for days with no errors... so is this just too much kernel verbosity, or something with which I should really be concerned?
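
For readers wondering how serious these entries are, the status word after "Bank 5:" can be decoded by hand. The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout from the Intel SDM and 4 KiB pages; the function names decode_mci_status and addr_to_page are hypothetical helpers, not part of any kernel tool. With UC=0 the event is a corrected error, which agrees with the "CE" in the EDAC lines.

```shell
#!/bin/sh
# Sketch only: decode_mci_status and addr_to_page are hypothetical helper names.
# Assumes the architectural IA32_MCi_STATUS layout (Intel SDM) and 4 KiB pages.

# decode_mci_status HEX16: HEX16 is the 16-digit value printed after "Bank N:".
# The value is split into two 32-bit halves so the arithmetic stays within the
# signed 64-bit range used by common shells.
decode_mci_status() {
    hi_hex=$(printf '%s' "$1" | cut -c1-8)     # bits 63..32
    lo_hex=$(printf '%s' "$1" | cut -c9-16)    # bits 31..0
    hi=$((0x$hi_hex))
    lo=$((0x$lo_hex))

    echo "VAL=$(( (hi >> 31) & 1 ))"           # bit 63: entry is valid
    echo "OVER=$(( (hi >> 30) & 1 ))"          # bit 62: an earlier error was overwritten
    echo "UC=$(( (hi >> 29) & 1 ))"            # bit 61: uncorrected error
    echo "EN=$(( (hi >> 28) & 1 ))"            # bit 60: error reporting enabled
    echo "MISCV=$(( (hi >> 27) & 1 ))"         # bit 59: MISC register is valid
    echo "ADDRV=$(( (hi >> 26) & 1 ))"         # bit 58: ADDR register is valid
    echo "PCC=$(( (hi >> 25) & 1 ))"           # bit 57: processor context corrupt
    echo "CE_COUNT=$(( (hi >> 6) & 0x7fff ))"  # bits 52..38: corrected error count
    printf 'MODEL_CODE=0x%04x\n' "$(( (lo >> 16) & 0xffff ))"
    mca=$(( lo & 0xffff ))
    printf 'MCA_CODE=0x%04x\n' "$mca"

    # Memory-controller errors use the pattern 0000 0000 1MMM CCCC.
    if [ "$(( mca & 0xff80 ))" -eq 128 ]; then
        case $(( (mca >> 4) & 7 )) in
            0) kind=generic ;;
            1) kind=read ;;
            2) kind=write ;;
            3) kind=address-command ;;
            4) kind=scrub ;;
            *) kind=reserved ;;
        esac
        # A channel value of 15 means "channel not specified".
        echo "MEM_ERROR=$kind CHANNEL=$(( mca & 15 ))"
    fi
}

# addr_to_page HEX: split the ADDR value into the page and offset shown by EDAC.
addr_to_page() {
    addr=$((0x$1))
    printf 'page:0x%x offset:0x%x\n' "$(( addr >> 12 ))" "$(( addr & 0xfff ))"
}

decode_mci_status 8c00004000010092
addr_to_page 2f050eb080
```

For the first entry in the log, this reports UC=0, CE_COUNT=1, a memory read error on channel 2, and page:0x2f050eb offset:0x80, matching the EDAC line that follows it.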

Version-Release number of selected component (if applicable):

kernel-3.6.3-1.fc17.x86_64
all updates as of 2012-11-09
Motherboard: http://www.supermicro.com/products/motherboard/xeon/c600/x9dri-ln4f_.cfm

Comment 1 Dave Jones 2013-01-04 20:12:43 UTC

Are you running the latest BIOS?

Comment 2 joshua 2013-01-04 20:18:28 UTC

I think so, but I'll check.  Does BIOS matter at this point for memory access?

Comment 3 Josh Boyer 2013-01-07 21:43:56 UTC

Mauro, any thoughts here?

Comment 4 Fedora End Of Life 2013-07-04 00:18:14 UTC

This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 17 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to change the 
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 6 johan A. van Zanten 2013-07-28 12:25:35 UTC

We are experiencing the same error reports on several machines with the same model of Supermicro motherboard.  A few data points here, in case it helps narrow things down for someone:

- same motherboard: 
        Manufacturer: Supermicro
        Product Name: X9DRi-LN4+/X9DR3-LN4+

- two processors

- memory slots fully populated (24 x 16 GB Samsung M393B2G70BH0-CK0)

- OS for our machines is Debian.

- stock Debian 6 kernel (2.6.32) crashed (initial experiences with these motherboards in Nov/Dec 2012).

- with kernel versions 3.2.41 and 3.9.8, the MCE errors are logged roughly once per hour of uptime.

- memtester (multiple instances running concurrently, testing 95% of RAM) does not seem to trigger the errors

- machines which trigger the errors are all heavily used PostgreSQL servers, with three FusionIO cards each.

- This bug report was the only Google hit for "EDAC MC1: 1 CE memory read error on". Perhaps it is not a coincidence that the reporter and I have the same motherboard?

- BIOS at latest revision (2.0 Release Date: 01/11/2013).  Errors experienced on previous version (1.0c from 05/01/2012) as well.
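
When narrowing this down, it may help to see whether the corrected errors cluster on one DIMM or are spread across slots and sockets. The following is a rough sketch that sums the CE counts per DIMM label from kernel log text on standard input; tally_ce_by_dimm is a hypothetical helper name, and the parsing assumes the sb_edac line format shown in the description above.

```shell
#!/bin/sh
# Sketch only: tally_ce_by_dimm is a hypothetical helper. It assumes EDAC lines
# of the form "EDAC MC1: 1 CE memory read error on <label> (...)".
tally_ce_by_dimm() {
    awk '
        /EDAC MC[0-9]+: [0-9]+ CE / {
            n = 0; label = ""
            for (i = 2; i <= NF; i++) {
                if ($i == "CE") n = $(i - 1)                # error count precedes "CE"
                if ($i == "on") { label = $(i + 1); break } # DIMM label follows "on"
            }
            if (label != "") count[label] += n
        }
        END { for (l in count) print count[l], l }
    ' | sort -k2
}

# Typical use (reading the kernel log may require root):
#   dmesg | tally_ce_by_dimm
```

Each output line is a total followed by the DIMM label. Errors spread over many labels, as in the log excerpt above, are harder to blame on a single module.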

Comment 7 Fedora End Of Life 2013-08-01 03:26:28 UTC

Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 8 johan A. van Zanten 2013-08-20 13:17:03 UTC

After a lot of diagnostics and working with vendor support, it appears this is almost certainly a hardware problem with some versions of X9DR3-LN4+ motherboards.

The problem boards report "REV:1.10" as their Version in 'dmidecode -t baseboard'.

At our site, older boards with a Version of "0123456789" have not produced the errors, and we are replacing the faulty boards with newer boards of the same model, Version "REV:1.20A".
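
To check a machine against the revisions mentioned above without running dmidecode, the same string is usually exposed in sysfs. The following is a minimal sketch; classify_board_rev is a hypothetical helper, and the three revision strings and their labels reflect only what was observed at our site, not any vendor statement.

```shell
#!/bin/sh
# Sketch only: classify_board_rev is a hypothetical helper. The mapping below
# is taken from the observations in this comment, nothing more.
classify_board_rev() {
    case $1 in
        "REV:1.10")   echo suspect ;;      # revision that produced the errors
        "REV:1.20A")  echo replacement ;;  # revision used to replace faulty boards
        "0123456789") echo placeholder ;;  # older boards reporting a placeholder string
        *)            echo unknown ;;      # no data for this revision
    esac
}

# Typical use on Linux, where the kernel exposes the DMI baseboard version:
#   classify_board_rev "$(cat /sys/class/dmi/id/board_version)"
```

The result is only a hint for boards of this model; an "unknown" or "placeholder" answer says nothing about whether a board is healthy.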

On the faulty motherboards, the errors seem to manifest mostly with the higher-speed 2.90 GHz E5-2690 processors and full (24 RDIMM) RAM configs, but we have been able to reproduce them with fewer RDIMMs.

FWIW, memtester did not generate the errors; the method I hit upon was just to exercise the buffer cache.  So on a system with 384 GB of RAM, I'd put about 400 GB of data in a local file system mounted at /scratch, and do:

while true ; do tar cf - /scratch | cat - >/dev/null ; done

(In my experiments, writing to /dev/null from tar would not work... the "cat - >/dev/null" was required.)

While this is running, you can check the error counts with this:

cat /sys/devices/system/edac/mc/mc?/ce*count

The observed error rate was usually at least one MCE error per hour.
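
To turn the per-controller counters into a single number that can be sampled over time, something like the following could be used. It is a minimal sketch: total_ce_count is a hypothetical helper, it reads only the ce_count file of each controller (the ce*count glob above also matches ce_noinfo_count), and the optional directory argument exists purely so the function can be tried against a test directory.

```shell
#!/bin/sh
# Sketch only: total_ce_count is a hypothetical helper.
# Usage: total_ce_count [EDAC_ROOT]   (default: /sys/devices/system/edac/mc)
total_ce_count() {
    root=${1:-/sys/devices/system/edac/mc}
    total=0
    for f in "$root"/mc*/ce_count; do
        [ -r "$f" ] || continue      # glob matched nothing, or file unreadable
        n=$(cat "$f")
        total=$((total + n))
    done
    echo "$total"
}

# Typical use while the tar loop is running in another terminal:
#   while sleep 60; do echo "$(date +%T) $(total_ce_count)"; done
```

Logging the total once a minute makes it easy to see whether the rate of roughly one error per hour holds on a given board.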

Comment 9 Micheal 2016-12-12 08:50:25 UTC

(In reply to johan A. van Zanten from comment #8)

Have you solved this error?
The same problem occurred on my server, which has the same motherboard as yours, with memory slots fully populated (Kingston, 384 GB in total).

Comment 10 valentin.kuepper 2017-04-26 18:29:46 UTC

We have the same issue on four X9DRi-LN4+ REV:1.10 units, with different RAM modules and Linux kernels. We're using dual Xeon E5-2670 processors.

Comment 11 Tom H 2018-05-03 01:31:52 UTC

Seeing this intermittently on an X9DAi with dual E5-2620 processors; the board version is reported as "0123456789".
The board recently had more memory added to it, which might have triggered an existing problem.
