Bug 390601 - [RHEL5] EDAC k8 MC0: extended error code: GART error
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: x86_64 Linux
Priority: medium
Severity: medium
Assigned To: Aristeu Rozanski
QA Contact: Martin Jenner
Reported: 2007-11-19 11:30 EST by Aristeu Rozanski
Modified: 2009-01-20 15:07 EST

Doc Type: Bug Fix
Last Closed: 2009-01-20 15:07:43 EST


Attachments:
test patch (2.06 KB, patch), 2008-01-31 14:25 EST, Aristeu Rozanski

Description Aristeu Rozanski 2007-11-19 11:30:50 EST
+++ This bug was initially created as a clone of Bug #232488 +++

Description of problem:
 While running standard tests on host wildhorse, the system logs EDAC errors.

Version-Release number of selected component (if applicable):
 RHEL4-U5-re20070301.1
 2.6.9-50.EL

How reproducible:
 Always

Steps to Reproduce:
1. Run test /kernel/stress/racer or /kernel/standards/byte
2. Look at the dmesg output

Actual results:
The log has the following errors:
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
[the two lines above repeat 26 more times]
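
A log this repetitive is easier to read as a count per extended error code. The following is a minimal sketch, not part of the original report: it assumes the message format shown above ("EDAC k8 MCn: extended error code: ..."), and the function name summarize_edac is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "EDAC k8 MC0: extended error code: GART error".
# The pattern is derived only from the log lines quoted in this report.
EXT_CODE = re.compile(r"EDAC k8 (MC\d+): extended error code: (.+?)\s*$")


def summarize_edac(lines):
    """Count extended error codes per memory controller.

    `lines` is any iterable of log lines (a list, an open file, sys.stdin).
    Lines that do not carry an extended error code are ignored.
    """
    counts = Counter()
    for line in lines:
        match = EXT_CODE.search(line)
        if match:
            controller, code = match.group(1), match.group(2)
            counts[(controller, code)] += 1
    return counts
```

For example, saving the dmesg output to a file and passing the open file to summarize_edac() yields a mapping from (controller, code) pairs to counts, which makes it easy to tell GART errors apart from ECC errors on the same box.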

-- Additional comment from konradr@redhat.com on 2007-05-29 16:16 EST --
Chris,

Does your wildhorse do the same thing? This is what dmidecode says about this
box (the REV3 sounds like pre-GA hardware?)

Manufacturer: IBM
Product Name: -[621733Z]-
Version: REV3
Serial Number: 23A0008

BIOS:
Version: PXE136A
Release Date: 09/26/2006


-- Additional comment from lcm@us.ibm.com on 2007-06-06 15:19 EST --
James will try to reproduce.

-- Additional comment from jmtt@us.ibm.com on 2007-07-03 12:28 EST --
(In reply to comment #2)
> James will try to reproduce.

FYI, I am waiting for Red Hat to supply info on how to get copies of the
/kernel/stress/racer or /kernel/standards/byte tests to attempt this repro.

-- Additional comment from arozansk@redhat.com on 2007-07-03 13:41 EST --
Created an attachment (id=158461)
byte


-- Additional comment from arozansk@redhat.com on 2007-07-03 13:41 EST --
and the racer testcase:
ftp://ftp.lustre.org/pub/benchmarks/racer-lustre.tar.gz


-- Additional comment from jmtt@us.ibm.com on 2007-07-03 18:56 EST --
FYI, I get *an* EDAC error, but not an exact match to the error message in this
bug report:

EDAC k8 MC0: general bus error: participating processor(local node response),
time-out(no timeout) memory transaction type(generic read), mem or i/o(mem
access), cache level(generic)
MC0: INTERNAL ERROR: channel out of range (1 >= 1)
MC0: CE - no information available: INTERNAL ERROR
EDAC k8 MC0: extended error code: ECC error
EDAC k8 MC0: general bus error: participating processor(local node response),
time-out(no timeout) memory transaction type(generic read), mem or i/o(mem
access), cache level(generic)
MC0: INTERNAL ERROR: channel out of range (1 >= 1)
MC0: CE - no information available: INTERNAL ERROR
EDAC k8 MC0: extended error code: ECC error
EDAC k8 MC0: general bus error: participating processor(local node response),
time-out(no timeout) memory transaction type(generic read), mem or i/o(mem
access), cache level(generic)
MC0: INTERNAL ERROR: channel out of range (1 >= 1)
MC0: CE - no information available: INTERNAL ERROR
EDAC k8 MC0: extended error code: ECC error

I got a few of these while running the 'bm/Run' test; and a lot more while
running 'racer.sh'.

We don't have an image of the 20070301 snapshot anymore, so I ran the RHEL4.5 GA
bits (I think it was dated 20070421?): Linux elm3a72 2.6.9-55.ELsmp #1 SMP Fri
Apr 20 16:36:54 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

I am running the same BIOS level as the submitter:
        BIOS Information
                Vendor: IBM
                Version: PXE136A
                Release Date: 09/26/2006

One other thing... I'm pretty sure I've seen this same message with RHEL5 also.
 At the time, I was wondering if my hardware was flaky.  If it would be of help,
I can try the same test with some non-RedHat distro to see if the problem goes away.


-- Additional comment from alan@redhat.com on 2007-07-04 08:57 EST --
ECC error does sound like this box has possible RAM problems.


-- Additional comment from jburke@redhat.com on 2007-11-13 14:24 EST --
This issue is still there :/

EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error

So far nothing like this has been seen while running the same test on RHEL5.

http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=1166465

Thanks,
Jeff 

-- Additional comment from pm-rhel@redhat.com on 2007-11-13 14:25 EST --
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

-- Additional comment from arozansk@redhat.com on 2007-11-15 15:42 EST --
Couldn't reproduce the problem when booting with the iommu=noaperture argument.
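
When repeating this check, it helps to confirm that the option was actually on the kernel command line of the running system. A minimal sketch, assuming the command line is available as a string (on Linux it can be read from /proc/cmdline) and that iommu= takes comma-separated values; the helper name iommu_options is hypothetical.

```python
def iommu_options(cmdline):
    """Return the values of every iommu= argument on a kernel command line.

    Values are split on commas, so "iommu=noaperture,noagp" yields
    ["noaperture", "noagp"]. Arguments other than iommu= are ignored.
    """
    options = []
    for arg in cmdline.split():
        if arg.startswith("iommu="):
            options.extend(arg[len("iommu="):].split(","))
    return options
```

On the test machine, checking whether "noaperture" is in iommu_options(open("/proc/cmdline").read()) shows whether the run was really made with the option in effect.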



-- Additional comment from arozansk@redhat.com on 2007-11-15 17:19 EST --
Just reproduced the problem using racer stress test with kernel 2.6.18-53.el5
Comment 1 Aristeu Rozanski 2007-11-26 14:23:08 EST
After more investigation, I found that:
   - the errors in this case aren't the result of GART misconfiguration
   - it's possible to trigger the GART table walk errors incorrectly (probably
     it's the case here)
   - the GART table walk error reporting is intended for graphics driver
     developers, and AMD recommends that it be off by default. MCE reporting
     of those errors is disabled in RHEL4 and RHEL5 (mce_cpu_quirks())
   - the errors are harmless according to AMD documentation
So I've submitted upstream a patch to only enable these messages upon a module
option.
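
The idea of reporting these messages only on request can be modelled as follows. This is an illustrative sketch, not the submitted patch (which is not shown in this report): the NorthbridgeError type, the should_report function, and the report_gart_errors flag are all hypothetical stand-ins for the driver's internals and for the module option.

```python
from dataclasses import dataclass


@dataclass
class NorthbridgeError:
    controller: str     # e.g. "MC0"
    extended_code: str  # e.g. "GART error" or "ECC error", as printed in the log


def should_report(error, report_gart_errors=False):
    """Decide whether an error should be logged.

    report_gart_errors stands in for the module option described above;
    it defaults to off, so GART errors stay silent unless requested.
    Every other extended error code is always reported.
    """
    if error.extended_code == "GART error":
        return report_gart_errors
    return True
```

With the default setting, a NorthbridgeError("MC0", "GART error") is suppressed while an ECC error on the same controller is still reported, so real memory problems remain visible.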

Comment 2 Aristeu Rozanski 2008-01-31 14:25:47 EST
Created attachment 293630
test patch
Comment 3 Aristeu Rozanski 2008-02-04 10:22:39 EST
Patch tested successfully on wildhorse.
Comment 5 RHEL Product and Program Management 2008-06-06 11:44:52 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 6 Don Zickus 2008-07-18 16:06:39 EDT
Included in kernel-2.6.18-98.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 11 errata-xmlrpc 2009-01-20 15:07:43 EST
An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html
