Bug 875194
Summary: | sbridge: HANDLING MCE MEMORY ERROR | | |
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | joshua |
Component: | kernel | Assignee: | Linda Wang <lwang> |
Status: | CLOSED WONTFIX | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | | |
Version: | 17 | CC: | 729190509, gansalmon, itamar, johanvz653, jonathan, kernel-maint, madhu.chinakonda, tom, valentin.kuepper |
Target Milestone: | --- | | |
Target Release: | --- | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2013-08-01 03:26:22 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
joshua
2012-11-09 18:55:21 UTC
running the latest BIOS ? I think so, but I'll check. Does BIOS matter at this point for memory access? Mauro, any thoughts here?

This message is a reminder that Fedora 17 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 17. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 17 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

We are experiencing the same error reports on several machines with the same model of Supermicro motherboard. A few data points here, in case it helps narrow things down for someone:

- same motherboard: Manufacturer: Supermicro, Product Name: X9DRi-LN4+/X9DR3-LN4+
- two processors
- memory slots fully populated (24 x 16 GB Samsung M393B2G70BH0-CK0)
- OS for our machines is Debian.
- stock Debian 6 kernel (2.6.32) crashed (initial experiences with these motherboards in Nov/Dec 2012).
- with kernel versions 3.2.41 and 3.9.8, the MCE errors are logged roughly once per hour of uptime.
- memtester (multiple instances running concurrently, testing 95% of RAM) does not seem to trigger the errors
- machines which trigger the errors are all heavily used PostgreSQL servers, with three FusionIO cards (each).
- This bug report was the only Google hit for "EDAC MC1: 1 CE memory read error on". Perhaps not a coincidence that the reporter and i have the same motherboard?
- BIOS at latest revision (2.0, Release Date: 01/11/2013). Errors experienced on previous version (1.0c from 05/01/2012) as well.

Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

After a lot of diagnostics and working with vendor support, it appears this is almost certainly a hardware problem with some versions of X9DR3-LN4+ motherboards.

The problem boards report "REV:1.10" as their Version in 'dmidecode -t baseboard'.

At our site, older boards with a Version of "0123456789" have not produced the errors, and we are replacing the faulty boards with newer boards of the same model, Version "REV:1.20A".

On the faulty motherboards, the errors seem to manifest mostly with the higher speed 2.90 GHz E5-2690 processors and full (24 RDIMM) RAM configs, but we have been able to reproduce it with fewer RDIMMs.

FWIW, memtester did not generate the errors; the method i hit upon was just to exercise the buffer cache. So on a system with 384 GB of RAM, i'd put about 400 GB of data in a local file system mounted at /scratch, and do:

    while true ; do tar cf - /scratch | cat - >/dev/null ; done

(In my experiments, writing to /dev/null from tar would not work... the "cat - >/dev/null" was required.)
While this is running, you can check the error counts with this:

    cat /sys/devices/system/edac/mc/mc?/ce*count

The observed error rate was usually at least one MCE error per hour.

(In reply to johan A. van Zanten from comment #8)
> After a lot of diagnostics and working with vendor support, it appears this
> is almost certainly a hardware problem with some versions of X9DR3-LN4+
> motherboards.
>
> The problem boards report "REV:1.10" as their Version in 'dmidecode -t
> baseboard'.
>
> At our site, older boards with a Version of "0123456789" have not produced
> the errors, and we are replacing the faulty boards with newer boards of the
> same model, Version "REV:1.20A".
>
> On the faulty motherboards, the errors seem to manifest mostly with the
> higher speed 2.90 GHz E5-2690 processors and full (24 RDIMM) RAM configs,
> but we have been able to reproduce it with fewer RDIMMs.
>
> FWIW, memtester did not generate the errors; the method i hit upon was just
> to exercise the buffer cache. So on a system with 384 GB of RAM, i'd put
> about 400 GB of data in a local file system mounted at /scratch, and do:
>
> while true ; do tar cf - /scratch | cat - >/dev/null ; done
>
> (In my experiments, writing to /dev/null from tar would not work... the "cat
> - >/dev/null" was required.)
>
> While this is running, you can check the error counts with this:
>
> cat /sys/devices/system/edac/mc/mc?/ce*count
>
> The observed error rate was usually at least one MCE error per hour.

Have you solved this error? The same problem occurred on my server, which has the same motherboard as yours, with memory slots fully populated (Kingston, 384 GB in total).

We have the same issue with 4 units of X9DRi-LN4+ REV:1.10 with different RAM modules and Linux kernels. We're using dual Xeon E5-2670.

Seeing this intermittently on a X9DAi; version is reported as "0123456789", with dual E5-2620. The board recently had more memory added to it, which might have triggered an existing problem.
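
For anyone repeating the reproduction from comment #8, the one-off counter check there can be wrapped in a small polling script, so that increases are logged with a timestamp while the tar loop runs. This is only a minimal sketch: the function names and the EDAC_ROOT variable are made up for illustration, and it reads just the ce_count file of each memory controller (the glob in the comment also matches ce_noinfo_count). The sysfs path is the one quoted in the thread.

```shell
#!/bin/sh
# Sketch: watch the EDAC corrected-error (CE) counters for increases.
# EDAC_ROOT, edac_ce_total and edac_ce_watch are illustrative names,
# not part of any existing tool.

EDAC_ROOT="${EDAC_ROOT:-/sys/devices/system/edac/mc}"

# Print the sum of mc*/ce_count under EDAC_ROOT (prints 0 if none exist).
edac_ce_total() {
    total=0
    for f in "$EDAC_ROOT"/mc*/ce_count; do
        [ -r "$f" ] || continue      # glob did not match, or file unreadable
        n=$(cat "$f")
        total=$((total + n))
    done
    echo "$total"
}

# Poll every INTERVAL seconds (default 60) and print a line each time the
# summed CE count has grown since the previous poll. Runs until interrupted.
edac_ce_watch() {
    interval="${1:-60}"
    last=$(edac_ce_total)
    while sleep "$interval"; do
        now=$(edac_ce_total)
        if [ "$now" -gt "$last" ]; then
            echo "$(date '+%Y-%m-%d %H:%M:%S') CE count rose from $last to $now"
            last=$now
        fi
    done
}
```

After sourcing the file, something like `edac_ce_watch 300` in a second terminal would log each increase while the buffer-cache load runs; `edac_ce_total` on its own gives a single reading.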