+++ This bug was initially created as a clone of Bug #232488 +++

Description of problem:
While running the standard tests on host wildhorse, the system gets errors.

Version-Release number of selected component (if applicable):
RHEL4-U5-re20070301.1
2.6.9-50.EL

How reproducible:
Always

Steps to Reproduce:
1. Run test /kernel/stress/racer or /kernel/standards/byte
2. Look at dmesg

Actual results:
Log has the following errors (the same pair of messages repeats many times):
EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error

-- Additional comment from konradr on 2007-05-29 16:16 EST --
Chris,
Does your wildhorse do the same thing?
This is what dmidecode says about this box (the Rev3 sounds like pre-GA hardware?)
Manufacturer: IBM
Product Name: -[621733Z]-
Version: REV3
Serial Number: 23A0008
BIOS:
Version: PXE136A
Release Date: 09/26/2006

-- Additional comment from lcm.com on 2007-06-06 15:19 EST --
James will try to reproduce.

-- Additional comment from jmtt.com on 2007-07-03 12:28 EST --
(In reply to comment #2)
> James will try to reproduce.

FYI, I am waiting for Red Hat to supply info on how to get copies of the /kernel/stress/racer or /kernel/standards/byte tests to attempt this repro.

-- Additional comment from arozansk on 2007-07-03 13:41 EST --
Created an attachment (id=158461)
byte

-- Additional comment from arozansk on 2007-07-03 13:41 EST --
and the racer testcase:
ftp://ftp.lustre.org/pub/benchmarks/racer-lustre.tar.gz

-- Additional comment from jmtt.com on 2007-07-03 18:56 EST --
FYI, I get *an* EDAC error, but not an exact match to the error message in this bug report:

EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
MC0: INTERNAL ERROR: channel out of range (1 >= 1)
MC0: CE - no information available: INTERNAL ERROR
EDAC k8 MC0: extended error code: ECC error
EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
MC0: INTERNAL ERROR: channel out of range (1 >= 1)
MC0: CE - no information available: INTERNAL ERROR
EDAC k8 MC0: extended error code: ECC error
EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
MC0: INTERNAL ERROR: channel out of range (1 >= 1)
MC0: CE - no information available: INTERNAL ERROR
EDAC k8 MC0: extended error code: ECC error

I got a few of these while running the 'bm/Run' test, and a lot more while running 'racer.sh'.
We don't have an image of the 20070301 snapshot anymore, so I ran the RHEL4.5 GA bits (I think it was dated 20070421?):
Linux elm3a72 2.6.9-55.ELsmp #1 SMP Fri Apr 20 16:36:54 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

I am running the same BIOS level as the submitter:
BIOS Information
Vendor: IBM
Version: PXE136A
Release Date: 09/26/2006

One other thing... I'm pretty sure I've seen this same message with RHEL5 also. At the time, I was wondering if my hardware was flaky. If it would be of help, I can try the same test with some non-RedHat distro to see if the problem goes away.

-- Additional comment from alan on 2007-07-04 08:57 EST --
An ECC error does sound like this box has possible RAM problems.

-- Additional comment from jburke on 2007-11-13 14:24 EST --
This issue is still there :/

EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
(the same pair of messages repeats several times)

So far nothing like this has been seen while running the same test on RHEL5.
http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=1166465

Thanks,
Jeff

-- Additional comment from pm-rhel on 2007-11-13 14:25 EST --
This bugzilla has Keywords: Regression.

Since no regressions are allowed between releases, it is also being proposed as a blocker for this release.

Please resolve ASAP.
-- Additional comment from arozansk on 2007-11-15 15:42 EST --
Couldn't reproduce the problem when booting with the iommu=noaperture argument.

-- Additional comment from arozansk on 2007-11-15 17:19 EST --
Just reproduced the problem using the racer stress test with kernel 2.6.18-53.el5.
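For anyone retesting, step 2 ("look at dmesg") can be made less manual. The following is only a sketch: the function name `count_extended_errors` is made up for illustration, and the regular expression assumes the message layout quoted in this report ("EDAC k8 MC<n>: extended error code: <text>"). It tallies the extended error codes per memory controller so that GART errors and ECC errors can be told apart at a glance.

```python
import re
from collections import Counter

# Assumed layout, taken from the messages quoted in this report:
#   EDAC k8 MC0: extended error code: GART error
EXT_CODE = re.compile(r"EDAC k8 MC(\d+): extended error code: (.+?)\s*$")


def count_extended_errors(log_text):
    """Count 'extended error code' messages per (controller, code)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EXT_CODE.search(line)  # search() tolerates a timestamp prefix
        if match:
            controller = int(match.group(1))
            code = match.group(2)
            counts[(controller, code)] += 1
    return counts
```

To use it, save the dmesg output to a file after the test run, read the file into a string, and pass that string to `count_extended_errors`. A non-zero count for `(0, "GART error")` would correspond to the messages in this report.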
After more investigation, I found that:
- the errors in this case are not the result of a GART misconfiguration
- it is possible for GART table walk errors to be triggered incorrectly (which is probably the case here)
- GART table walk error reporting is intended for graphics driver developers, and AMD recommends that it be off by default. MCE reporting of those errors is disabled in RHEL4 and RHEL5 (mce_cpu_quirks())
- the errors are harmless according to the AMD documentation

So I've submitted a patch upstream that enables these messages only when a module option is set.
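As a rough user-space model of what the patch does (not the patch itself, which is kernel C), the gating logic amounts to the following. The names `should_report` and `report_gart_errors` are assumptions made for this sketch and are not taken from the attached patch.

```python
# Label as it appears in the log messages quoted in this report.
GART_ERROR = "GART error"


def should_report(extended_error, report_gart_errors=False):
    """Decide whether an error of the given class should be logged.

    report_gart_errors stands in for the module option (hypothetical
    name); it defaults to off, in line with the AMD recommendation
    described above.
    """
    if extended_error == GART_ERROR:
        # GART table walk errors are logged only on explicit request.
        return report_gart_errors
    # Every other error class is reported as before.
    return True
```

With the option left at its default, the GART messages are suppressed, while other classes such as ECC errors still reach the log.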
Created attachment 293630
test patch
Patch tested successfully on wildhorse.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
The patch is in kernel-2.6.18-98.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html