Description of problem:
After booting with kernel-2.6.18-186.el5, I get error messages about an uncorrectable memory problem, about two per second. Memtest86+ 4.00 shows no errors after 8 hours. The problem did not occur with kernel-2.6.18-174.el5.gtest.79xen.
Version-Release number of selected component (if applicable):
RHEL5.5 Beta
How reproducible:
Every time on my machine.
MB: Asus P5BV-C/4L, BIOS 0311 (latest)
Chipset: i3210
CPU: L3110 @ 3.00GHz
RAM: 4*2GB Kingston with 64-bit ECC
Steps to Reproduce:
1. boot
Actual results:
edac-util reports hundreds of memory errors.
/var/log/messages:
Feb 11 23:35:48 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 0, labels ":": i3200 UE
Feb 11 23:35:48 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 4, labels ":": i3200 UE
Feb 11 23:35:49 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 0, labels ":": i3200 UE
Feb 11 23:35:49 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 4, labels ":": i3200 UE
Feb 11 23:35:50 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 0, labels ":": i3200 UE
Feb 11 23:35:50 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 5, labels ":": i3200 UE
Feb 11 23:35:51 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 1, labels ":": i3200 UE
Feb 11 23:35:51 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 5, labels ":": i3200 UE
Feb 11 23:35:52 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 0, labels ":": i3200 UE
Feb 11 23:35:52 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 5, labels ":": i3200 UE
Feb 11 23:35:53 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 0, labels ":": i3200 UE
Feb 11 23:35:53 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 4, labels ":": i3200 UE
Feb 11 23:35:54 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 0, labels ":": i3200 UE
Feb 11 23:35:54 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 5, labels ":": i3200 UE
Feb 11 23:35:55 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 0, labels ":": i3200 UE
Feb 11 23:35:55 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 4, labels ":": i3200 UE
Feb 11 23:35:56 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 0, labels ":": i3200 UE
Feb 11 23:35:56 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 5, labels ":": i3200 UE
Expected results:
No error messages.
Additional info:
This may be related to BZ469976, but that bug is open against RHEL5.2, which is why I am filing a new one.
Created attachment 399247 [details] Check if MCH is enabled
Support for i3200 EDAC was added after RHEL 5.2, which is why those error lines didn't appear with an older kernel. After reviewing the driver code, I suspect that your computer's BIOS may not have properly enabled the MCHBAR registers that are required to read memory errors. Unfortunately, I couldn't reproduce your bug in the labs. As the driver doesn't explicitly check whether MCHBAR is enabled, this may explain your bug. The enclosed patch adds an explicit check for it. Could you please try it? To make testing easier, I've generated a test kernel (x86_64) at: http://people.redhat.com/mchehab/.bz564274/
I have the same errors on my Gentoo system after upgrading from gentoo-sources 2.6.31-r10 to 2.6.32-r7. The newly added i3200 support in EDAC is not yet working properly on all hardware. I also have an Asus P5BV-C/4L board. I tested the patch but it did not change anything (i.e. the driver still loads without printing the MCHBAR error message), so it seems MCHBAR is enabled.
I'm seeing the same error on an Asus P5BV-C/4L with kernel-2.6.18-194.el5. The problem appeared after upgrading from 5.4 (kernel-2.6.18-164.15.1) to 5.5 (kernel-2.6.18-194.el5). It is really annoying to get these errors sent to every console.
Intel Xeon 3330, 4x 2GB 64-bit ECC, Kingston 800MHz
[root@bigbang ~]# lsmod|grep edac
i3200_edac 38865 0
edac_mc 60193 1 i3200_edac
The test kernel doesn't help in my case either.
It would be interesting to study the differences between the systems where this fails, and the lab system that Mauro has where this works. Could people post dmidecode output from these machines?
Created attachment 411775 [details] dmidecode output of my Asus P5BV-C/4L MB
On Tony Luck's request.
Created attachment 412270 [details] dmidecode output - UUID 'X-ed' out, Asus P5BV-C/4L
If relevant, modules mentioned in comment #3 are currently blacklisted (not loaded).
I have the same problem on a Supermicro system. dmidecode output is attached
Created attachment 413276 [details] dmidecode output for Supermicro 5025B-T
Thanks for the dmidecode files. All three systems showing the problem are fully populated (4 x 2G), the maximum that the systems support. Mauro: can you track down the lab system on which this code worked and get the dmidecode information from it?
Created attachment 413512 [details] dmidecode output of gaia UUID 'X-ed' out, Asus P5BV-C/4L
My dmidecode output is attached. I have one other observation. I had no problems for around 2 months after doing a BIOS reset. Yesterday I made a small change in the BIOS (do not wait on BIOS errors such as a missing keyboard) and now the problem is back again. ;-(
Supermicro X7SBi+ with fully populated RAM slots (4 banks, 4x2GB). dmidecode output is attached.
Created attachment 414407 [details] dmidecode from Supermicro X7SBi+ with 4x2GB RAM
I have 2 ASUS RS120-E5/PA4 servers (board: P5BV-R), one with 2x2GB RAM and the other with 4x2GB RAM. The first server (2x2GB RAM) works with the 2.6.18-194.el5 kernel without problems. The second server (4x2GB RAM) reports the errors.
I've just updated to the 2.6.18-194.3.1.el5 kernel, but the problem still exists.
It seems to me that some BIOSes are doing something wrong, causing this trouble. The temporary solution is to add i3200_edac to /etc/modprobe.d/blacklist:
blacklist i3200_edac
This prevents the EDAC module from loading, so the errors no longer appear.
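The blacklist entry suggested above can be added without creating duplicates if the step is scripted. This is only a minimal sketch: the helper name `blacklist_module` is hypothetical, and the file path is the one given in this comment (other distributions may use a different file under /etc/modprobe.d/).

```shell
#!/bin/sh
# blacklist_module FILE MODULE
# Hypothetical helper: append "blacklist MODULE" to FILE unless that exact
# line is already present, so running it twice does not duplicate the entry.
blacklist_module() {
    file=$1
    module=$2
    if ! grep -qxF "blacklist $module" "$file" 2>/dev/null; then
        echo "blacklist $module" >> "$file"
    fi
}
```

As root, `blacklist_module /etc/modprobe.d/blacklist i3200_edac` followed by a reboot (or `rmmod i3200_edac`) should have the same effect as editing the file by hand.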
Mauro: I already did this, but I would like to detect memory errors before they ruin data. Is there a way to fix this without changes to the BIOS? Also, you have to deactivate /etc/cron.daily/edac, otherwise you will get an error every day.
Mauro Carvalho Chehab: I have this problem on all 17 of my ASUS RS120-E5/PA4 servers, but only on those with more than 4GB of memory. Unfortunately I don't have any non-ASUS boards with an i3200-series chipset right now, so I can't test on one.
It seems that several Supermicro users have spoken up in this thread and that the problem might have to do with BIOS specifics. Since we've had good experience with Supermicro tech support, I've just tried contacting them to see if they can confirm that the BIOS is involved.
Created attachment 415205 [details] dmidecode output from an IBM System x3250 M2
EDAC seems to be properly running on this machine, and it is not generating any errors. The machine has 2GB of RAM.
$ dmesg|grep -i edac
EDAC MC: Ver: 2.0.1 Mar 16 2010
EDAC MC0: Giving out device to i3200_edac i3200: DEV 0000:00:00.0
EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 6, labels ":": i3200 UE
$ cd /sys/devices/system/edac/mc/mc0
$ cat csrow6/ce_count
0
$ cat csrow6/ch0_ce_count
0
$ cat csrow6/ch0_dimm_label
$ cat csrow6/ch1_ce_count
0
$ cat csrow6/ch1_dimm_label
$ cat csrow6/dev_type
Unknown
$ cat csrow6/edac_mode
Unknown
$ cat csrow6/mem_type
DDR2
$ cat csrow6/size_mb
1024
$ cat csrow6/ue_count
1
$ cat csrow7/ce_count
0
$ cat csrow7/ch0_ce_count
0
$ cat csrow7/ch0_dimm_label
$ cat csrow7/ch1_ce_count
0
$ cat csrow7/ch1_dimm_label
$ cat csrow7/dev_type
Unknown
$ cat csrow7/edac_mode
Unknown
$ cat csrow7/mem_type
DDR2
$ cat csrow7/size_mb
1024
$ cat csrow7/ue_count
0
(In reply to comment #19)
> Mauro: I already did this, but I would like to detect memory-errors, before
> they ruin data. Is there a possibility to fix that without changes in the BIOS?
>
> Also you have to deactivate /etc/cron.daily/edac, otherwise you will get an
> error every day.
If it is a BIOS problem, the solution would be to get a fixed BIOS. AFAIK, there's nothing that the driver can do to solve it. Does the EDAC driver detect the correct device info? You can check it by running something like:
cd /sys/devices/system/edac/mc/mc0/; for i in csrow*/*; do echo "$ cat $i"; cat "$i"; done
An alternative to the "blame the BIOS" theory is the >4GB theory mentioned above in comment #21. Perhaps one of the users who has problems and 8G of memory can pull out a couple of DIMMs for a test to see if the errors are still reported. And/or someone with a working system can borrow enough 2G DIMMs to try their machine with 4x2GB to see if it still works.
(In reply to comment #25)
> An alternative to the "blame the BIOS" theory is the >4GB theory mentioned
> above in comment #21. Perhaps one of the users who has problems and 8G of
> memory can pull out a couple of DIMMs for a test to see if the errors are still
> reported. And/or someone with a working system can borrow enough 2G DIMMs to
> try their machine with 4x2GB to see if it still works.
I'm investigating this one. I've found a problem in the driver that occurs if the machine has more than one i3200 chipset. I'm not sure whether such an architecture is possible, but I'm already working on a patch to remove the static variables from the driver.
Created attachment 415217 [details] Don't use static local vars at i3200 probe logic
This patch is currently untested, but it may fix the bug if it is related to having more than one memory controller in the machine. I'll build a test kernel and run it on some machines to check that this patch doesn't break PCI probing.
The new test kernel with the two patches applied is available at: http://people.redhat.com/~mchehab/.bz564274/ I repeated the tests on the IBM x3250 M2 and it still works properly. Please test it.
I tested it and the problem still exists, but I have additional information now. I tried removing 4GB of memory from the 8GB server, but still got errors. So I tried adding 4GB of memory to the 4GB server and then removing it. The problem appeared. My BIOS has a "Memory Remap Feature" setting (ENABLE: allow remapping of overlapped PCI memory above the total physical memory. DISABLE: do not allow remapping of memory) and it is set to "Enable". When I set it to "Disable", the problem disappeared, but the board sees only 4GB of memory :(
(In reply to comment #24)
> Does the EDAC driver detect the correct device info? You can check it by
> running something like:
>
> cd /sys/devices/system/edac/mc/mc0/; for i in csrow*/*; do echo "$ cat $i"; cat
> $i; done
$ cat csrow0/ce_count
What should I see?
# cat mc_name
i3200
# ls | grep csrow
csrow0
csrow1
csrow4
csrow5
# cd /sys/devices/system/edac/mc/mc0/csrow0
# cat dev_type
Unknown
# cat mem_type
DDR2
# cat edac_mode
Unknown
Created attachment 415507 [details] dmidecode output from a HP DL320G5 with 8Gb
The >4GB theory goes to /dev/null. A test on an HP DL320G5 with 8GB worked as expected. It properly detected the 8 1GB DIMMs, and no errors are generated. So the bug really seems to be in how the BIOS handles memory. I suspect that, when remap mode is enabled, the BIOS is somehow overlapping the PCI device's memory-mapped region with something else, causing the trouble. These are the direct EDAC readings on the HP machine:
# cd /sys/devices/system/edac/mc/mc0/; for i in csrow*/*; do echo "# cat $i"; cat $i; done
# cat csrow0/ce_count
0
# cat csrow0/ch0_ce_count
0
# cat csrow0/ch0_dimm_label
# cat csrow0/ch1_ce_count
0
# cat csrow0/ch1_dimm_label
# cat csrow0/dev_type
Unknown
# cat csrow0/edac_mode
Unknown
# cat csrow0/mem_type
DDR2
# cat csrow0/size_mb
1024
# cat csrow0/ue_count
0
# cat csrow1/ce_count
0
# cat csrow1/ch0_ce_count
0
# cat csrow1/ch0_dimm_label
# cat csrow1/ch1_ce_count
0
# cat csrow1/ch1_dimm_label
# cat csrow1/dev_type
Unknown
# cat csrow1/edac_mode
Unknown
# cat csrow1/mem_type
DDR2
# cat csrow1/size_mb
1024
# cat csrow1/ue_count
0
# cat csrow2/ce_count
0
# cat csrow2/ch0_ce_count
0
# cat csrow2/ch0_dimm_label
# cat csrow2/ch1_ce_count
0
# cat csrow2/ch1_dimm_label
# cat csrow2/dev_type
Unknown
# cat csrow2/edac_mode
Unknown
# cat csrow2/mem_type
DDR2
# cat csrow2/size_mb
1024
# cat csrow2/ue_count
0
# cat csrow3/ce_count
0
# cat csrow3/ch0_ce_count
0
# cat csrow3/ch0_dimm_label
# cat csrow3/ch1_ce_count
0
# cat csrow3/ch1_dimm_label
# cat csrow3/dev_type
Unknown
# cat csrow3/edac_mode
Unknown
# cat csrow3/mem_type
DDR2
# cat csrow3/size_mb
1024
# cat csrow3/ue_count
0
# cat csrow4/ce_count
0
# cat csrow4/ch0_ce_count
0
# cat csrow4/ch0_dimm_label
# cat csrow4/ch1_ce_count
0
# cat csrow4/ch1_dimm_label
# cat csrow4/dev_type
Unknown
# cat csrow4/edac_mode
Unknown
# cat csrow4/mem_type
DDR2
# cat csrow4/size_mb
1024
# cat csrow4/ue_count
0
# cat csrow5/ce_count
0
# cat csrow5/ch0_ce_count
0
# cat csrow5/ch0_dimm_label
# cat csrow5/ch1_ce_count
0
# cat csrow5/ch1_dimm_label
# cat csrow5/dev_type
Unknown
# cat csrow5/edac_mode
Unknown
# cat csrow5/mem_type
DDR2
# cat csrow5/size_mb
1024
# cat csrow5/ue_count
0
# cat csrow6/ce_count
0
# cat csrow6/ch0_ce_count
0
# cat csrow6/ch0_dimm_label
# cat csrow6/ch1_ce_count
0
# cat csrow6/ch1_dimm_label
# cat csrow6/dev_type
Unknown
# cat csrow6/edac_mode
Unknown
# cat csrow6/mem_type
DDR2
# cat csrow6/size_mb
1024
# cat csrow6/ue_count
0
# cat csrow7/ce_count
0
# cat csrow7/ch0_ce_count
0
# cat csrow7/ch0_dimm_label
# cat csrow7/ch1_ce_count
0
# cat csrow7/ch1_dimm_label
# cat csrow7/dev_type
Unknown
# cat csrow7/edac_mode
Unknown
# cat csrow7/mem_type
DDR2
# cat csrow7/size_mb
1024
# cat csrow7/ue_count
0
Created attachment 416170 [details] HP DL320G5 BIOS configuration
The HP ProLiant BIOS doesn't have any "enable remap feature" setting. I'm enclosing the current BIOS configuration as a reference.
I had this problem on our ASUS RS120-E5/PA4 R. We were able to successfully work around it by adding "blacklist i3200_edac" to /etc/modprobe.d/blacklist followed by a reboot.
Base Board Information
Manufacturer: ASUSTeK Computer INC.
Product Name: P5BV-M
Version: Rev 1.xxG
output from command:
cd /sys/devices/system/edac/mc/mc0/; for i in csrow*/*; do echo "# cat $i";cat $i; done
# cat csrow0/ce_count
0
# cat csrow0/ch0_ce_count
0
# cat csrow0/ch0_dimm_label
# cat csrow0/ch1_ce_count
0
# cat csrow0/ch1_dimm_label
# cat csrow0/dev_type
Unknown
# cat csrow0/edac_mode
Unknown
# cat csrow0/mem_type
DDR2
# cat csrow0/size_mb
1024
# cat csrow0/ue_count
35
# cat csrow1/ce_count
0
# cat csrow1/ch0_ce_count
0
# cat csrow1/ch0_dimm_label
# cat csrow1/ch1_ce_count
0
# cat csrow1/ch1_dimm_label
# cat csrow1/dev_type
Unknown
# cat csrow1/edac_mode
Unknown
# cat csrow1/mem_type
DDR2
# cat csrow1/size_mb
1024
# cat csrow1/ue_count
18
# cat csrow2/ce_count
0
# cat csrow2/ch0_ce_count
0
# cat csrow2/ch0_dimm_label
# cat csrow2/ch1_ce_count
0
# cat csrow2/ch1_dimm_label
# cat csrow2/dev_type
Unknown
# cat csrow2/edac_mode
Unknown
# cat csrow2/mem_type
DDR2
# cat csrow2/size_mb
1024
# cat csrow2/ue_count
54
# cat csrow3/ce_count
0
# cat csrow3/ch0_ce_count
0
# cat csrow3/ch0_dimm_label
# cat csrow3/ch1_ce_count
0
# cat csrow3/ch1_dimm_label
# cat csrow3/dev_type
Unknown
# cat csrow3/edac_mode
Unknown
# cat csrow3/mem_type
DDR2
# cat csrow3/size_mb
1024
# cat csrow3/ue_count
181
# cat csrow4/ce_count
0
# cat csrow4/ch0_ce_count
0
# cat csrow4/ch0_dimm_label
# cat csrow4/ch1_ce_count
0
# cat csrow4/ch1_dimm_label
# cat csrow4/dev_type
Unknown
# cat csrow4/edac_mode
Unknown
# cat csrow4/mem_type
DDR2
# cat csrow4/size_mb
1024
# cat csrow4/ue_count
50
# cat csrow5/ce_count
0
# cat csrow5/ch0_ce_count
0
# cat csrow5/ch0_dimm_label
# cat csrow5/ch1_ce_count
0
# cat csrow5/ch1_dimm_label
# cat csrow5/dev_type
Unknown
# cat csrow5/edac_mode
Unknown
# cat csrow5/mem_type
DDR2
# cat csrow5/size_mb
1024
# cat csrow5/ue_count
47
# cat csrow6/ce_count
0
# cat csrow6/ch0_ce_count
0
# cat csrow6/ch0_dimm_label
# cat csrow6/ch1_ce_count
0
# cat csrow6/ch1_dimm_label
# cat csrow6/dev_type
Unknown
# cat csrow6/edac_mode
Unknown
# cat csrow6/mem_type
DDR2
# cat csrow6/size_mb
1024
# cat csrow6/ue_count
79
# cat csrow7/ce_count
0
# cat csrow7/ch0_ce_count
0
# cat csrow7/ch0_dimm_label
# cat csrow7/ch1_ce_count
0
# cat csrow7/ch1_dimm_label
# cat csrow7/dev_type
Unknown
# cat csrow7/edac_mode
Unknown
# cat csrow7/mem_type
DDR2
# cat csrow7/size_mb
1024
# cat csrow7/ue_count
113
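When only the uncorrectable-error counters matter, the full dump can be reduced to one line per csrow plus a total. The following is a rough sketch, not part of the driver or of edac-utils: the function name `sum_ue_counts` is made up for illustration, and it assumes the sysfs layout shown in this thread (csrowN/ue_count under the mc0 directory).

```shell
#!/bin/sh
# sum_ue_counts [DIR]
# Hypothetical helper: print the ue_count of every csrow under DIR and the
# total. DIR defaults to the sysfs path used in this thread.
sum_ue_counts() {
    mc_dir=${1:-/sys/devices/system/edac/mc/mc0}
    total=0
    for f in "$mc_dir"/csrow*/ue_count; do
        [ -r "$f" ] || continue              # glob matched nothing readable
        n=$(cat "$f")
        row=$(basename "$(dirname "$f")")    # e.g. "csrow3"
        echo "$row: $n"
        total=$((total + n))
    done
    echo "total: $total"
}
```

Run as `sum_ue_counts` on the affected machine. With the counters posted in this comment the total would be 577; on a healthy machine every line should read 0.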
The bug seems to happen only with the "Memory Remap Feature" enabled. Question: does the bug happen with x86_64 kernels, or only with i386 kernels with PAE enabled?
I'm using an x86_64 kernel. I also tested non-ASUS servers; the bug doesn't appear on them.
x86_64 kernel for my case too.
I'm experiencing this same bug on a 2.6.18-194.8.1.el5 x86_64 install. The system is an Asus RS100-E5/PI2 barebone with a P5BV-M/RS100-E5 motherboard and BIOS 0211 (latest). I have an E5300 Wolfdale CPU and two 2GB 240-Pin DDR2 SDRAM ECC DDR2 800 (PC2 6400) Intel Certified Server Memory modules, model KVR800D2E5/2GI. I experience the problem with both DIMMs installed, and with either DIMM installed in the lowest slot.
# modprobe i3200_edac ; sleep 5; rmmod i3200_edac
kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 2, labels ":": i3200 UE
As the message repeats, the row changes randomly between 2, 3, 6, and 7. The grain always stays at exactly 2^30. I tried with the BIOS's "Memory Remap Feature" enabled and disabled. It had no effect. I will start the 32-bit download... This system is not in production yet, if someone wants me to try a patch.
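To see how the reported rows are actually distributed, rather than watching the console, the UE messages can be tallied per row. A minimal sketch, assuming the message format quoted in this thread ("... row N, labels ..."); the function name `count_ue_by_row` is hypothetical.

```shell
#!/bin/sh
# count_ue_by_row [FILE...]
# Hypothetical helper: count "EDAC MCn: UE" messages per csrow. Reads the
# given log files, or standard input when no file is named.
count_ue_by_row() {
    grep 'EDAC MC[0-9]*: UE' "$@" |
        sed -n 's/.*, row \([0-9][0-9]*\),.*/\1/p' |
        sort -n |
        uniq -c |
        awk '{ printf "row %s: %s\n", $2, $1 }'
}
```

Typical use would be `dmesg | count_ue_by_row` or `count_ue_by_row /var/log/messages`. A roughly even spread over a fixed set of rows, as described here, points away from a single failing DIMM.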
x86_64 here as well
I am also having this problem on a server I'm renting. It was delivered with CentOS 5.4 and had no issues; after upgrading to CentOS 5.5 the error started to appear. I then did a stock CentOS install using PXE + VNC boot, and after accessing the new installation over SSH the issue was still there. The server has an Asus RS100-E5 motherboard, an Intel Core 2 Quad Q8300 CPU, and 4GB of DDR2 RAM. The kernel is 2.6.18-194.8.1.el5.
Just a heads up: should you wish to keep the edac module running and tracking these errors, but not flood your console with messages, you can use the following commands:
echo "0" > /sys/devices/system/edac/mc/log_ue
echo "0" > /sys/devices/system/edac/mc/log_ce
To keep these settings after a restart, simply add these two lines to the bottom of /etc/rc.local.
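Since the log_ue and log_ce files only exist once the EDAC core is loaded, an rc.local entry may be safer with a guard around the writes. This is a sketch under that assumption; `quiet_edac_logging` is a hypothetical name, and the default directory is the one given in this comment.

```shell
#!/bin/sh
# quiet_edac_logging [DIR]
# Hypothetical helper: write 0 to log_ue and log_ce under DIR, skipping any
# file that is missing or not writable instead of failing.
quiet_edac_logging() {
    edac_dir=${1:-/sys/devices/system/edac/mc}
    for knob in log_ue log_ce; do
        if [ -w "$edac_dir/$knob" ]; then
            echo 0 > "$edac_dir/$knob"
        else
            echo "skipping $edac_dir/$knob (missing or not writable)" >&2
        fi
    done
}
```

Calling `quiet_edac_logging` with no argument at the end of /etc/rc.local would then replace the two echo lines.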
Did anyone test the newer kernels (2.6.18-194.17.1, for instance)? Has the bug been fixed yet?
(In reply to comment #42)
> Did anyone test the new kernels ( 2.6.18-194.17.1 for instance) ? Was the bug
> fixed or not yet ?
I can confirm that the deluge of EDAC messages that existed for me on 2.6.18-194.11.4 is still present in 194.17.1.
[bcrook@hostname ~]$ dmesg | tail -n 4 ; uname -a
EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 2, labels ":": i3200 UE
EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 6, labels ":": i3200 UE
EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 3, labels ":": i3200 UE
EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 7, labels ":": i3200 UE
Linux hostname.redacted.com 2.6.18-194.17.1.el5 #1 SMP Wed Sep 29 12:50:31 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
(In reply to comment #42)
> Did anyone test the new kernels ( 2.6.18-194.17.1 for instance) ? Was the bug
> fixed or not yet ?
I couldn't reproduce this bug. It is probably due to BIOS trouble on some machines. Unfortunately, some BIOSes do bad things with some memory controllers. The recommendation is to disable i3200_edac on those machines.
It seems that on some motherboards the memory-controller error handling is correctly initialized *only* if quick-boot mode is disabled. One of our boxes, previously affected by this problem, is now running with i3200_edac in polling mode and has not reported any UE/CE-events so far.
I just found further information in the EDAC wiki: http://buttersideup.com/edacwiki/Uninitialized_ECC_bits The Asus P5BV is mentioned specifically as having this kind of problem. If we have confirmation that this solves the problem, I think we can close this bug.
Hi, I can confirm that this problem is gone after changing the BIOS setting and doing a power cycle/hardware reset (a normal reboot does not help). I therefore think this bug should be closed as NOTABUG. Very special thanks to Hannes Sowa. CU Jens
I can also confirm that the workaround solves the issue. I disabled quickboot in the BIOS and got no more false errors. MB: ASUS P5BV-C/4L, Gentoo with a 2.6.36 kernel
So, based on all the information we have, this is a BIOS bug. There's nothing we can do to solve it, except to document that, on some i3210 boards, the BIOS doesn't properly enable the error correction checks, and that disabling quickboot may solve the issue. I'll close this bug with a Technical note.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Some i3210 BIOSes have problems enabling the hardware checks at the MCU. On such hardware, customers should try disabling Quickboot and/or the "Memory Remap Feature", or disable the EDAC drivers. More details can be found at: https://bugzilla.redhat.com/show_bug.cgi?id=564274