Bug 390601

Summary: [RHEL5] EDAC k8 MC0: extended error code: GART error
Product: Red Hat Enterprise Linux 5
Reporter: Aristeu Rozanski <arozansk>
Component: kernel
Assignee: Aristeu Rozanski <arozansk>
Status: CLOSED ERRATA
QA Contact: Martin Jenner <mjenner>
Severity: medium
Priority: medium
Version: 5.1
CC: alan, duck, dzickus, jmtt, konradr, lcm, rlerch
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2009-01-20 20:07:43 UTC
Attachments:
Description: test patch; Flags: none

Description Aristeu Rozanski 2007-11-19 16:30:50 UTC
+++ This bug was initially created as a clone of Bug #232488 +++

Description of problem:
 While running standard tests on host wildhorse, the system logs errors.

Version-Release number of selected component (if applicable):
 RHEL4-U5-re20070301.1
 2.6.9-50.EL

How reproducible:
 Always

Steps to Reproduce:
1. Run test /kernel/stress/racer or /kernel/standards/byte
2. look at dmesg

Actual results:
Log has the following errors
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
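
A log like the one above is easier to compare between runs when it is reduced to counts. The following is a minimal sketch, not part of the original report, that tallies the "extended error code" lines per memory controller. The function name summarize_edac and the regular expression are assumptions based only on the message format shown in this bug.

```python
import re
from collections import Counter

# Matches lines such as "EDAC k8 MC0: extended error code: GART error".
# The pattern is inferred from the log excerpts in this report only.
EXT_CODE_RE = re.compile(r"EDAC k8 (MC\d+): extended error code: (.+)$")


def summarize_edac(lines):
    """Count extended error codes per memory controller.

    `lines` is any iterable of log lines (for example, dmesg output
    split on newlines). Returns a Counter keyed by (controller, code).
    """
    counts = Counter()
    for line in lines:
        match = EXT_CODE_RE.search(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

One possible use is to save the dmesg output to a file and call summarize_edac(open(path)) on it, which makes it quick to tell a run dominated by GART errors from one showing ECC errors.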

-- Additional comment from konradr on 2007-05-29 16:16 EST --
Chris,

Does your wildhorse do the same thing? This is what dmidecode says about this
box (REV3 sounds like pre-GA hardware?)

Manufacturer: IBM
Product Name: -[621733Z]-
Version: REV3
Serial Number: 23A0008

BIOS:
Version: PXE136A
Release Date: 09/26/2006


-- Additional comment from lcm.com on 2007-06-06 15:19 EST --
James will try to reproduce.

-- Additional comment from jmtt.com on 2007-07-03 12:28 EST --
(In reply to comment #2)
> James will try to reproduce.

FYI, I am waiting for Red Hat to supply info on how to get copies of the
/kernel/stress/racer or /kernel/standards/byte tests to attempt this repro.

-- Additional comment from arozansk on 2007-07-03 13:41 EST --
Created an attachment (id=158461)
byte


-- Additional comment from arozansk on 2007-07-03 13:41 EST --
and the racer testcase:
ftp://ftp.lustre.org/pub/benchmarks/racer-lustre.tar.gz


-- Additional comment from jmtt.com on 2007-07-03 18:56 EST --
FYI, I get *an* EDAC error, but not an exact match to the error message in this
bug report:

EDAC k8 MC0: general bus error: participating processor(local node response),
time-out(no timeout) memory transaction type(generic read), mem or i/o(mem
access), cache level(generic)
MC0: INTERNAL ERROR: channel out of range (1 >= 1)
MC0: CE - no information available: INTERNAL ERROR
EDAC k8 MC0: extended error code: ECC error
EDAC k8 MC0: general bus error: participating processor(local node response),
time-out(no timeout) memory transaction type(generic read), mem or i/o(mem
access), cache level(generic)
MC0: INTERNAL ERROR: channel out of range (1 >= 1)
MC0: CE - no information available: INTERNAL ERROR
EDAC k8 MC0: extended error code: ECC error
EDAC k8 MC0: general bus error: participating processor(local node response),
time-out(no timeout) memory transaction type(generic read), mem or i/o(mem
access), cache level(generic)
MC0: INTERNAL ERROR: channel out of range (1 >= 1)
MC0: CE - no information available: INTERNAL ERROR
EDAC k8 MC0: extended error code: ECC error

I got a few of these while running the 'bm/Run' test; and a lot more while
running 'racer.sh'.

We don't have an image of the 20070301 snapshot anymore, so I ran the RHEL4.5 GA
bits (I think it was dated 20070421?): Linux elm3a72 2.6.9-55.ELsmp #1 SMP Fri
Apr 20 16:36:54 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

I am running the same BIOS level as the submitter:
        BIOS Information
                Vendor: IBM
                Version: PXE136A
                Release Date: 09/26/2006

One other thing... I'm pretty sure I've seen this same message with RHEL5 also.
 At the time, I was wondering if my hardware was flaky.  If it would be of help,
I can try the same test with some non-RedHat distro to see if the problem goes away.


-- Additional comment from alan on 2007-07-04 08:57 EST --
ECC error does sound like this box has possible RAM problems.


-- Additional comment from jburke on 2007-11-13 14:24 EST --
This issue is still there :/

EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error

So far nothing like this has been seen while running the same test on RHEL5.

http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=1166465

Thanks,
Jeff 

-- Additional comment from pm-rhel on 2007-11-13 14:25 EST --
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

-- Additional comment from arozansk on 2007-11-15 15:42 EST --
Couldn't reproduce the problem when booting with the iommu=noaperture argument.
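
When comparing runs with and without this argument, it helps to confirm what the running kernel was actually booted with. The sketch below is an illustration only: it parses a kernel command line string (such as the contents of /proc/cmdline) and reports whether noaperture was passed to iommu=. The helper names iommu_options and has_noaperture are hypothetical, and the comma-separated handling is an assumption about how several iommu= values may be combined.

```python
def iommu_options(cmdline):
    """Return the values passed through iommu= on a kernel command line.

    Assumes several values may be joined with commas, e.g.
    "iommu=noagp,noaperture".
    """
    options = []
    for token in cmdline.split():
        if token.startswith("iommu="):
            values = token[len("iommu="):].split(",")
            options.extend(v for v in values if v)
    return options


def has_noaperture(cmdline):
    """True if the command line asks for iommu=noaperture."""
    return "noaperture" in iommu_options(cmdline)
```

On a live system this could be called as has_noaperture(open("/proc/cmdline").read()).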



-- Additional comment from arozansk on 2007-11-15 17:19 EST --
Just reproduced the problem using the racer stress test with kernel 2.6.18-53.el5.

Comment 1 Aristeu Rozanski 2007-11-26 19:23:08 UTC
After more investigation, I found that:
   - the errors in this case aren't the result of GART misconfiguration
   - it's possible for GART table walk errors to be triggered incorrectly
     (which is probably the case here)
   - the GART table walk error reporting is intended for graphics driver
     developers, and AMD recommends that it be off by default. MCE reporting
     of those errors is disabled in RHEL4 and RHEL5 (mce_cpu_quirks())
   - the errors are harmless according to AMD documentation
So I've submitted a patch upstream to enable these messages only when a module
option is set.
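
The gating idea described above can be sketched as follows. This is not the submitted patch, whose contents are not shown in this report. It is a small Python model of the decision, with the option name report_gart_errors and the function should_report chosen purely for illustration.

```python
# Label used in the log messages in this report for the extended error code.
GART_ERROR = "GART error"


def should_report(ext_code, report_gart_errors=False):
    """Decide whether an EDAC event should be logged.

    GART errors are suppressed unless the (hypothetical) option
    `report_gart_errors` is set, mirroring the "off by default" behaviour
    described in the comment. All other error codes are always reported.
    """
    if ext_code == GART_ERROR and not report_gart_errors:
        return False
    return True
```

With the default setting, should_report("GART error") is False while should_report("ECC error") is True, so genuine memory errors would still reach the log.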



Comment 2 Aristeu Rozanski 2008-01-31 19:25:47 UTC
Created attachment 293630
test patch

Comment 3 Aristeu Rozanski 2008-02-04 15:22:39 UTC
Patch tested successfully on wildhorse.


Comment 5 RHEL Program Management 2008-06-06 15:44:52 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 Don Zickus 2008-07-18 20:06:39 UTC
in kernel-2.6.18-98.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 11 errata-xmlrpc 2009-01-20 20:07:43 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html