Description of problem:
AMD64 based system (8131 chipset) quad AMD Opteron(tm) Processor 850 will produce the gart errors below, typically when the system is under IO load. This problem occurs with RedHat Enterprise Kernel, but never with mainline 2.4.27 kernel. Problem also goes away when Enterprise kernel is recompiled without AGP support.

Aug 24 04:24:34 StarDragon kernel: CPU 0: Silent Northbridge MCE
Aug 24 04:24:34 StarDragon kernel: Northbridge status a60000010005001b
Aug 24 04:24:34 StarDragon kernel: GART TLB error generic level generic
Aug 24 04:24:34 StarDragon kernel: extended error gart error
Aug 24 04:24:34 StarDragon kernel: link number 0
Aug 24 04:24:34 StarDragon kernel: err cpu1
Aug 24 04:24:34 StarDragon kernel: processor context corrupt
Aug 24 04:24:34 StarDragon kernel: error address valid
Aug 24 04:24:34 StarDragon kernel: error uncorrected
Aug 24 04:24:34 StarDragon kernel: previous error lost
Aug 24 04:24:34 StarDragon kernel: error address 0000000037f28a28
Aug 24 04:27:05 StarDragon kernel: CPU 2: Silent Northbridge MCE
Aug 24 04:27:05 StarDragon kernel: Northbridge status a60000010005001b
Aug 24 04:27:05 StarDragon kernel: GART TLB error generic level generic
Aug 24 04:27:05 StarDragon kernel: extended error gart error
Aug 24 04:27:05 StarDragon kernel: link number 0
Aug 24 04:27:05 StarDragon kernel: err cpu1
Aug 24 04:27:05 StarDragon kernel: processor context corrupt
Aug 24 04:27:05 StarDragon kernel: error address valid
Aug 24 04:27:05 StarDragon kernel: error uncorrected
Aug 24 04:27:05 StarDragon kernel: previous error lost
Aug 24 04:27:05 StarDragon kernel: error address 0000000037f28a20
Aug 24 04:31:36 StarDragon kernel: CPU 0: Silent Northbridge MCE
Aug 24 04:31:36 StarDragon kernel: CPU 3: Silent Northbridge MCE
Aug 24 04:31:36 StarDragon kernel: Northbridge status a60000010005001b
Aug 24 04:31:36 StarDragon kernel: CPU 1: Silent Northbridge MCE
Aug 24 04:31:36 StarDragon kernel: Northbridge status a60000010005001b
Aug 24 04:31:36 StarDragon kernel: GART TLB error generic level generic
Aug 24 04:31:36 StarDragon kernel: extended error gart error
Aug 24 04:31:36 StarDragon kernel: link number 0
Aug 24 04:31:36 StarDragon kernel: err cpu1
Aug 24 04:31:36 StarDragon kernel: processor context corrupt
Aug 24 04:31:36 StarDragon kernel: error address valid
Aug 24 04:31:36 StarDragon kernel: error uncorrected
Aug 24 04:31:36 StarDragon kernel: previous error lost
Aug 24 04:31:46 StarDragon kernel: error address 0000000037f29b30
Aug 24 04:31:47 StarDragon kernel: GART TLB error generic level generic
Aug 24 04:31:47 StarDragon kernel: extended error gart error
Aug 24 04:31:48 StarDragon kernel: link number 0
Aug 24 04:31:49 StarDragon kernel: err cpu1
Aug 24 04:31:50 StarDragon kernel: processor context corrupt
Aug 24 04:31:53 StarDragon kernel: error address valid
Aug 24 04:31:53 StarDragon kernel: error uncorrected
Aug 24 04:31:53 StarDragon kernel: previous error lost
Aug 24 04:31:54 StarDragon kernel: error address 0000000037f29b30
Aug 24 04:31:54 StarDragon kernel: Northbridge status a60000010005001b
Aug 24 04:31:57 StarDragon kernel: GART TLB error generic level generic
Aug 24 04:31:57 StarDragon kernel: extended error gart error
Aug 24 04:31:58 StarDragon kernel: link number 0
Aug 24 04:31:58 StarDragon kernel: err cpu1
Aug 24 04:31:58 StarDragon kernel: processor context corrupt
Aug 24 04:32:00 StarDragon kernel: error address valid
Aug 24 04:32:01 StarDragon kernel: error uncorrected
Aug 24 04:32:02 StarDragon kernel: previous error lost
Aug 24 04:32:02 StarDragon kernel: error address 0000000037f27598

Version-Release number of selected component (if applicable):
Linux AGP driver 0.99

How reproducible:
Every time

Steps to Reproduce:
1. boot machine
2. run an exerciser that performs heavy CPU and IO operations; in some cases you only need to let the machine sit idle for a while.
3. watch for gart errors.

Actual results:
GART errors will be produced at any given time, but will most likely occur under heavy IO load.

Expected results:
No gart errors should occur.

Additional info:
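For step 3, watching for GART errors by eye gets tedious on an overnight run. The following is a minimal sketch, not part of the original report, of how the syslog output above could be tallied. The function name `summarize_gart_errors` and the assumed syslog line layout are illustrative only, and the sample lines are copied from the first event in the log above.

```python
import re
from collections import Counter

# Assumed syslog layout: "<Mon> <day> <HH:MM:SS> <host> kernel: <message>"
KERNEL_LINE = re.compile(r"^\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+\S+\s+kernel:\s+(.*)$")
# Matches "error address <16 hex digits>" but not "error address valid"
ERROR_ADDRESS = re.compile(r"^error address ([0-9a-f]{16})$")


def summarize_gart_errors(lines):
    """Return (number of GART TLB error reports, Counter of error addresses)."""
    reports = 0
    addresses = Counter()
    for line in lines:
        match = KERNEL_LINE.match(line.strip())
        if match is None:
            continue  # not a kernel syslog line in the assumed layout
        message = match.group(1).strip()
        if message.startswith("GART TLB error"):
            reports += 1
            continue
        addr = ERROR_ADDRESS.match(message)
        if addr:
            addresses[addr.group(1)] += 1
    return reports, addresses


# Sample taken from the first event in the report above.
SAMPLE = """\
Aug 24 04:24:34 StarDragon kernel: CPU 0: Silent Northbridge MCE
Aug 24 04:24:34 StarDragon kernel: Northbridge status a60000010005001b
Aug 24 04:24:34 StarDragon kernel: GART TLB error generic level generic
Aug 24 04:24:34 StarDragon kernel: extended error gart error
Aug 24 04:24:34 StarDragon kernel: error address valid
Aug 24 04:24:34 StarDragon kernel: error address 0000000037f28a28
""".splitlines()

reports, addresses = summarize_gart_errors(SAMPLE)
print(reports, dict(addresses))
```

To run it against a real log, the lines of the system log file (commonly /var/log/messages on this kind of system, which is an assumption here) could be passed in place of SAMPLE.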
This appears to be a known issue with IOMMU prefetch on this chipset. A fix for this issue has been made in 2.4.21-19.EL. Please obtain this kernel or later and try again.
This kernel, 2.4.21-19.EL, does not appear to be on rhn.redhat.com. Is it released yet? Thanks.
Hello, Arthur. The -19.EL and -20.EL kernels were respins of the U3 beta kernel, and were never put into the RHN beta channel. We anticipate releasing U3 sometime next week, at which point the -20.EL kernel will be officially available in the main RHEL3 RHN channel. In the meantime, I have put the UP and SMP kernel RPMs on my "people" page, which you can download through the Web using the following URLs: http://people.redhat.com/~petrides/.celestica/kernel-2.4.21-20.EL.x86_64.rpm http://people.redhat.com/~petrides/.celestica/kernel-smp-2.4.21-20.EL.x86_64.rpm Please let me know when you've retrieved whichever one you need to verify that Jim's fix addresses the problem you reported (so that I can free up the space on that site). In the meantime, I'm changing the state of this bugzilla to MODIFIED, since we believe the change that was committed in 2.4.21-19.EL fixes the problem.
Thanks Ernie, I have tested the smp release on the server and ran the test overnight. Unfortunately, I have to report the presence of a GART error. Again, this does not appear with mainstream 2.4.27 kernel, even with agp built in. There is also additional DMA debug info that is in the log:

Aug 30 13:38:24 StarDragon kernel: Mem-info:
Aug 30 13:38:24 StarDragon kernel: Zone:DMA freepages: 0 min: 0 low: 0 high: 0
Aug 30 13:38:24 StarDragon kernel: Zone:Normal freepages: 1275 min: 1279 low: 17406 high: 25597
Aug 30 13:38:24 StarDragon kernel: Zone:HighMem freepages: 0 min: 0 low: 0 high: 0
Aug 30 13:38:24 StarDragon kernel: Zone:DMA freepages: 0 min: 0 low: 0 high: 0
Aug 30 13:38:24 StarDragon kernel: Zone:Normal freepages: 1307 min: 1279 low: 17406 high: 25597
Aug 30 13:38:24 StarDragon kernel: Zone:HighMem freepages: 0 min: 0 low: 0 high: 0
Aug 30 13:38:24 StarDragon kernel: Zone:DMA freepages: 0 min: 0 low: 0 high: 0
Aug 30 13:38:24 StarDragon kernel: Zone:Normal freepages: 1329 min: 1279 low: 17406 high: 25597
Aug 30 13:38:24 StarDragon kernel: Zone:HighMem freepages: 0 min: 0 low: 0 high: 0
Aug 30 13:38:24 StarDragon kernel: Zone:DMA freepages: 1055 min: 1056 low: 1088 high: 1120
Aug 30 13:38:24 StarDragon kernel: Zone:Normal freepages: 2036 min: 1279 low: 17342 high: 25501
Aug 30 13:38:29 StarDragon kernel: Zone:HighMem freepages: 0 min: 0 low: 0 high: 0
Aug 30 13:38:29 StarDragon kernel: Free pages: 7020 ( 0 HighMem)
Aug 30 13:38:30 StarDragon kernel: ( Active: 3440126/224002, inactive_laundry: 33959, inactive_clean: 34011, free: 7035 )
Aug 30 13:38:30 StarDragon kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:0
Aug 30 13:38:30 StarDragon kernel: aa:984576 ac:5282 id:0 il:0 ic:0 fr:1275
Aug 30 13:38:30 StarDragon kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:0
Aug 30 13:38:31 StarDragon kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:0
Aug 30 13:38:32 StarDragon kernel: aa:900313 ac:13334 id:72369 il:11219 ic:11161 fr:1340
Aug 30 13:38:32 StarDragon kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:0
Aug 30 13:38:33 StarDragon kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:0
Aug 30 13:38:34 StarDragon kernel: aa:970654 ac:9641 id:23912 il:3618 ic:3655 fr:1329
Aug 30 13:38:34 StarDragon kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:0
Aug 30 13:38:34 StarDragon kernel: aa:1543 ac:0 id:0 il:0 ic:0 fr:1055
Aug 30 13:38:35 StarDragon kernel: aa:505483 ac:49300 id:127721 il:19122 ic:19195 fr:2036
Aug 30 13:38:35 StarDragon kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:0
Aug 30 13:38:36 StarDragon kernel: 33*4kB 1*8kB 0*16kB 1*32kB 1*64kB 2*128kB 2*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 5100kB)
Aug 30 13:38:36 StarDragon kernel: Swap cache: add 2499902, delete 1034741, find 2689/3495, race 0+0
Aug 30 13:38:37 StarDragon kernel: 28144 pages of slabcache
Aug 30 13:38:37 StarDragon kernel: 214 pages of kernel stacks
Aug 30 13:38:37 StarDragon kernel: 1261 lowmem pagetables, 6180 highmem pagetables
Aug 30 13:38:38 StarDragon kernel: Free swap: 0kB
Aug 30 13:38:39 StarDragon kernel: 4194300 pages of RAM
Aug 30 13:38:39 StarDragon kernel: 14579 free pages
Aug 30 13:38:40 StarDragon kernel: 410692 reserved pages
Aug 30 13:38:41 StarDragon kernel: 134023 pages shared
Aug 30 13:38:41 StarDragon kernel: 1465161 pages swap cached
Aug 30 13:38:41 StarDragon kernel: Buffer memory: 1816kB
Aug 30 13:38:42 StarDragon kernel: Cache memory: 6674564kB
Aug 30 13:38:43 StarDragon kernel: CLEAN: 13869 buffers, 55476 kbyte, 41 used (last=273), 0 locked, 0 dirty 0 delay
Aug 30 13:38:44 StarDragon kernel: LOCKED: 3 buffers, 12 kbyte, 3 used (last=3), 0 locked, 0 dirty 0 delay
Aug 30 13:38:44 StarDragon kernel: DIRTY: 139 buffers, 556 kbyte, 139 used (last=139), 0 locked, 131 dirty 0 delay
Aug 30 13:38:44 StarDragon kernel: Out of Memory: Killed process 4523 (memtest).
Aug 30 14:39:16 StarDragon kernel: CPU 1: Silent Northbridge MCE
Aug 30 14:39:16 StarDragon kernel: CPU 2: Silent Northbridge MCE
Aug 30 14:39:16 StarDragon kernel: Northbridge status a60000010005001b
Aug 30 14:39:16 StarDragon kernel: GART TLB error generic level generic
Aug 30 14:39:16 StarDragon kernel: extended error gart error
Aug 30 14:39:16 StarDragon kernel: link number 0
Aug 30 14:39:16 StarDragon kernel: err cpu1
Aug 30 14:39:16 StarDragon kernel: processor context corrupt
Aug 30 14:39:16 StarDragon kernel: error address valid
Aug 30 14:39:16 StarDragon kernel: error uncorrected
Aug 30 14:39:16 StarDragon kernel: previous error lost
Aug 30 14:39:16 StarDragon kernel: error address 0000000037ff0020
Aug 30 14:39:16 StarDragon kernel: Northbridge status a60000010005001b
Aug 30 14:39:17 StarDragon kernel: GART TLB error generic level generic
Aug 30 14:39:18 StarDragon kernel: extended error gart error
Aug 30 14:39:19 StarDragon kernel: link number 0
Aug 30 14:39:19 StarDragon kernel: err cpu1
Aug 30 14:39:19 StarDragon kernel: processor context corrupt
Aug 30 14:39:20 StarDragon kernel: error address valid
Aug 30 14:39:20 StarDragon kernel: error uncorrected
Aug 30 14:39:20 StarDragon kernel: previous error lost
Aug 30 14:39:20 StarDragon kernel: error address 0000000037ff0008

I am aware of other issues that arose during test, such as running out of disk space and memory, however, I did not expect to still see the presence of a GART error. Thanks for your help, and hope my testing has been of assistance.
I was just wondering if I can get the necessary patches to bring -15 up to -20, or if I can get the -20 source, so I can help debug this problem. The GART error resolution is a todo action item on my list, so I would like to assist however I can. Thanks
Arthur, I have just removed the 2 RPMs that I had placed under my "people page" for you, and have added the kernel-source RPM there. Please use the following URL to download it: http://people.redhat.com/~petrides/.celestica/kernel-source-2.4.21-20.EL.x86_64.rpm Thanks. -ernie
While you're looking at the source, can you be more precise about your test so we can try to reproduce the problem here? Does it only happen on one model of system?
Have the source, thanks. It was reported to me that we are seeing this problem on both a 4 way Opteron model, as well as our 2 way model. They both use the 8111 and 8131 chipsets. Neither system model has an AGP bus. They do have different BIOS vendors, but neither is immune to this problem. It also seems to become more pronounced, depending on the TYPE of processor used. A 2.4GHz CG processor is likely to get GART errors more frequently than a 1.8GHz C0. Neither will get them when using the mainstream 2.4.27 kernel. Also, I have discovered that the latest SuSE kernel has fixed this problem as well. They used to have the same problem, but the latest release of their Enterprise kernel does not have this problem anymore. The test that I have been using is an in-house test, called CSU-ST. I am probably not able to release it yet, so I unfortunately am unable to send a copy until I get permission. However, I believe all that is done to produce these errors is CPU floating point tests, your standard memory exerciser tests, and a lot of heavy IO by reading/writing to both the onboard SCSI disk(s) as well as the CDROM simultaneously. It is a pretty heavy test, as even a 4 way system will come to a crawl when this is running. Let me know if you need more information. Thanks!
I just got info that you can use the "Cerberus" test found on Sourceforge. This should do pretty much the same thing that our test is doing now, and should help produce GART errors.
Arthur, this bug number was listed in the U3 kernel erratum (RHBA-2004:433), which was frozen on Monday before it was realized that the problem has not been resolved. Since U3 is already in the process of being released (pushed live on RHN), I am not able to correct the bug list. Thus, this bug is going to be closed automatically by our Errata System in the next 24 hours. I will reopen the bug after this occurs. -ernie
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-433.html
Hi Ernie, I think it's possible that the test I was performing overnight may have been invalidated by a parameter passed on the kernel command line that should not have been there. I am going to re-run the overnight test and let you know how it goes. I have tested the new -20 kernel for the past few hours now and it looks good so far. You may have the correct fix. I'll let you know by tomorrow if it goes successfully here.
Ok Arthur, thanks for the update. I'll leave the bug closed for now. If you find that the -20.EL kernel still exhibits the original problem, please reopen the bug. Thanks. -ernie
I have run the overnight tests and conclude that the fixes to IOMMU prefetch (and all other related cleanup) that you guys have done related to GART have fixed this problem. (-20 kernel does have fix) There have been 0 occurrences of GART errors after an overnight run. Good work! Best regards and thanks, Art Perry
That's a relief! Thanks, Arthur. -ernie
Jim, Please review this Bugzilla and the associated Issue Tracker and determine if we need to leave this report as reopened, or open a new Bugzilla.
FYI, We saw the bug at our site using kernel 2.4.21-27.ELsmp. We toggled the TLB reload in the BIOS and the problem seems to have been solved. --Court Cannick--
courtc, That's an interesting tip; I'll check it out. What model of system is this? Also, how easily can you reproduce?
We're running on a dual Opteron Sun v40z server.

======================================================================
This is a known bug described in the v40z release notes.

Translation Look-Aside Buffer (TLB) Reload Causes Errors With Certain Linux Software

In the BIOS Advanced menu, there is an option named "No Spec. TLB Reload." By default, this setting is disabled, which allows TLB reload. With this default setting, errors similar to the following have been observed on systems running any 64-bit version of Red Hat Linux and also SUSE Linux with Service Pack 1.

Northbridge status a60000010005001b
GART error 11
Lost an northbridge error
NB status: unrecoverable
NB error address 0000000037ff07f8
Error uncorrected

To avoid these errors, you must disallow TLB reloading. To do this:
1. Reboot the server and press F2 to enter BIOS setup.
2. Navigate to the Advanced > Chipset Configuration BIOS menu.
3. Use the arrow keys to scroll down to the option "No Spec. TLB reload" and change its setting from Disabled to Enabled.

This will disallow TLB reloading and avoid the error message.
======================================================================
--Court Cannick--
Hi, I'm seeing this error on my Quad processor Sun V40z as well, running RHEL 3AS. I set No Spec TLB to Enabled some months ago. I've noticed this error occurring more as the system loads, and mainly noticed it after installing 2.4.21-27.0.4. It seemed to happen less on 2.4.21-27.0.2, though that may simply be because we're using the machine more. This system maintains a load average between 1 and 2.5 around the clock.

CPU 0: Silent Northbridge MCE
Northbridge status a60000010005001b
GART TLB error generic level generic
extended error gart error
link number 0
err cpu1
processor context corrupt
error address valid
error uncorrected
previous error lost
error address 000000007ffe0000
Adding a "me too" to this problem. IBM 326 dual CPU AMD 64 (Memory: 9981156k/10485760k) running 2.4.21-27.0.4.ELsmp under loads from iozone (during SAN testing).

CPU 1: Silent Northbridge MCE
Northbridge status a60000010005001b
GART TLB error generic level generic
extended error gart error
link number 0
err cpu1
processor context corrupt
error address valid
error uncorrected
previous error lost
error address 000000013ffe0000

I'm calling IBM to see what they say (there are no references to TLB in the BIOS)
It turns out that this is a misreported error. arch/x86_64/bluesmoke.c was decoding the machine check status wrong. I'll attach a patch that brings RHEL3 in line with upstream.
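As an illustration of what decoding the machine check status involves, here is a minimal sketch that splits the `a60000010005001b` value seen throughout this report into fields. It is not the attached patch and not the bluesmoke.c code. The bit positions are assumptions based on the architectural machine check status register layout (bits 63 down to 57) and the northbridge error-code fields as documented for this processor family, and all function and field names are hypothetical.

```python
# Assumed bit positions (architectural MCi_STATUS layout); these are NOT
# taken from the attached patch or from bluesmoke.c.
STATUS_FLAGS = {
    "valid": 63,
    "overflow": 62,         # set when a previous error was lost
    "uncorrected": 61,
    "enabled": 60,          # error reporting was enabled for this error
    "misc_valid": 59,
    "addr_valid": 58,
    "context_corrupt": 57,
}


def decode_nb_status(status):
    """Split a 64-bit northbridge status value into flag and code fields."""
    fields = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    fields["ext_error_code"] = (status >> 16) & 0xF  # assumed: bits 19:16
    fields["error_code"] = status & 0xFFFF           # assumed: bits 15:0
    return fields


def decode_tlb_error_code(code):
    """Decode a TLB-format error code (0000 0000 0001 TTLL), else return None."""
    if (code & 0xFFF0) != 0x0010:
        return None
    return {"transaction_type": (code >> 2) & 0x3, "cache_level": code & 0x3}


fields = decode_nb_status(0xA60000010005001B)
tlb = decode_tlb_error_code(fields["error_code"])
for name, value in fields.items():
    print(name, value)
print(tlb)
```

Under these assumptions the overflow flag in this status value comes out clear, which would not match the "previous error lost" line in the logs above. The attached patch, not this sketch, is the authority on what was actually being decoded wrongly.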
Created attachment 115183 [details] Patch to arch/x86_64/kernel/bluesmoke.c
*** Bug 138192 has been marked as a duplicate of this bug. ***
patch posted for review 6/9/2005
devel ACK for U6
Created attachment 115407 [details] patch posted to rhkernel (ACKed)
Created attachment 115408 [details] RHEL3 patch posted to rhkernel this patch was acked and can be used for testing
Patch on schedule for delivery in RHEL3 U6.
A fix for this problem has just been committed to the RHEL3 U6 patch pool this evening (in kernel version 2.4.21-32.10.EL).
So far we have not been able to reproduce the GART error in house and it's unclear which patch the vendor actually tested. I have been using "iozone" to reproduce this issue but so far have had no luck (using a recent RHEL3 kernel). This problem may be hardware specific and not a generic issue that occurs on all AMD64 SMP systems.
Disabling all AGP support in the kernel will avoid this bug.
>Problem also goes away when Enterprise kernel is recompiled without AGP support. Oh. That's true. I got that error after a long time with ctcs.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-663.html