Bug 564274 - fake EDAC errors on i3210
Summary: fake EDAC errors on i3210
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Mauro Carvalho Chehab
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-02-12 07:55 UTC by Jens Kuehnel
Modified: 2018-10-27 13:25 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Some i3210 BIOSes have problems enabling the hardware checks at the MCU. On such hardware, customers should try disabling Quickboot and/or the "Memory Remap Feature", or disabling the EDAC drivers. More details can be found at: https://bugzilla.redhat.com/show_bug.cgi?id=564274
Clone Of:
Environment:
Last Closed: 2011-01-28 20:38:41 UTC
Target Upstream Version:
Embargoed:


Attachments
Check if MCH is enabled (1.09 KB, patch)
2010-03-11 04:46 UTC, Mauro Carvalho Chehab
dmidecode output of my Asus P5BV-C/4L MB (17.48 KB, text/plain)
2010-05-06 01:25 UTC, Matthias Prager
dmidecode output - UUID 'X-ed' out, Asus P5BV-C/4L (16.53 KB, text/plain)
2010-05-07 09:09 UTC, David Kovalsky
dmidecode output for Supermicro 5025B-T (11.69 KB, application/octet-stream)
2010-05-11 22:14 UTC, manuel wolfshant
dmidecode output of gaia UUID 'X-ed' out, Asus P5BV-C/4L (16.51 KB, application/octet-stream)
2010-05-12 18:15 UTC, Jens Kuehnel
dmidecode from Supermicro X7SBi+ with 4x2GB RAM (10.46 KB, text/plain)
2010-05-16 20:27 UTC, Stefan Neufeind
dmidecode output from an IBM System x3250 M2 (13.30 KB, application/octet-stream)
2010-05-19 17:20 UTC, Mauro Carvalho Chehab
Don't use static local vars at i3200 probe logic (4.33 KB, patch)
2010-05-19 18:03 UTC, Mauro Carvalho Chehab
dmidecode output from a HP DL320G5 with 8Gb (13.10 KB, application/octet-stream)
2010-05-20 20:45 UTC, Mauro Carvalho Chehab
HP DL320G5 BIOS configuration (2.32 KB, application/octet-stream)
2010-05-24 16:31 UTC, Mauro Carvalho Chehab

Description Jens Kuehnel 2010-02-12 07:55:17 UTC
Description of problem:
After booting with kernel-2.6.18-186.el5, I get error messages about an uncorrectable memory problem, at a rate of 2 per second.

Memtest86+ 4.00 shows no errors after 8 hours.

The problem did not occur with kernel-2.6.18-174.el5.gtest.79xen.


Version-Release number of selected component (if applicable):
RHEL5.5 Beta

How reproducible:
Every time on my machine.
MB: Asus P5BV-C/4L Bios 0311 (latest)
Chipset: i3210
CPU:  L3110 @ 3.00GHz
RAM: 4*2GB Kingston with 64-bit ECC

Steps to Reproduce:
1. boot
  
Actual results:
edac-util shows hundreds of memory errors.

/var/log/messages:
Feb 11 23:35:48 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 0, labels ":": i3200 UE
Feb 11 23:35:48 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 4, labels ":": i3200 UE
Feb 11 23:35:49 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 0, labels ":": i3200 UE
Feb 11 23:35:49 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 4, labels ":": i3200 UE
Feb 11 23:35:50 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 0, labels ":": i3200 UE
Feb 11 23:35:50 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 5, labels ":": i3200 UE
Feb 11 23:35:51 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 1, labels ":": i3200 UE
Feb 11 23:35:51 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 5, labels ":": i3200 UE
Feb 11 23:35:52 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 0, labels ":": i3200 UE
Feb 11 23:35:52 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 5, labels ":": i3200 UE
Feb 11 23:35:53 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 0, labels ":": i3200 UE
Feb 11 23:35:53 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 4, labels ":": i3200 UE
Feb 11 23:35:54 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 0, labels ":": i3200 UE
Feb 11 23:35:54 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 5, labels ":": i3200 UE
Feb 11 23:35:55 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 0, labels ":": i3200 UE
Feb 11 23:35:55 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 4, labels ":": i3200 UE
Feb 11 23:35:56 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 0, labels ":": i3200 UE
Feb 11 23:35:56 gaia kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 5, labels ":": i3200 UE
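
To see how messages like the ones above are distributed across rows, a minimal sketch along these lines could be used. The function name count_ue_by_row is made up for illustration, and the sketch assumes the log lines keep exactly the format shown above.

```shell
# Count EDAC "UE" log lines per row; reads log text on stdin.
# count_ue_by_row is a hypothetical helper name, not part of edac-utils.
count_ue_by_row() {
    grep 'EDAC MC[0-9]*: UE ' |
        sed -n 's/.*row \([0-9][0-9]*\),.*/\1/p' |   # keep only the row number
        sort -n | uniq -c |
        awk '{ printf "row %s: %s\n", $2, $1 }'
}

# Example: count_ue_by_row < /var/log/messages
```

The output has one line per row, which makes it easier to compare the affected rows between machines.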

Expected results:
no error messages.

Additional info:
This may be a side effect of BZ469976, but that bug is open against RHEL 5.2, which is why I am filing this new bug.

Comment 1 Mauro Carvalho Chehab 2010-03-11 04:46:13 UTC
Created attachment 399247 [details]
Check if MCH is enabled

Support for i3200 EDAC was added after RHEL 5.2. That's why those error lines
didn't appear with an older kernel.

After reviewing the driver code, I suspect that your computer's BIOS may not
have properly enabled the MCHBAR registers that are required to report memory
errors. Unfortunately, I couldn't reproduce your bug in the labs.

As the driver doesn't explicitly check if MCHBAR is enabled, this may explain your bug. The enclosed patch adds an explicit check for it.

Could you please try the enclosed patch? 

To make testing easier, I've built a test kernel (x86_64), available at:
    http://people.redhat.com/mchehab/.bz564274/

Comment 2 Matthias Prager 2010-04-27 16:52:33 UTC
I have the same errors on my Gentoo system after upgrading from gentoo-sources 2.6.31-r10 to 2.6.32-r7. The newly added i3200 support in EDAC is not yet working properly on all hardware. I also have an Asus P5BV-C/4L board.

I tested the patch, but it did not change anything (i.e. the driver still loads without printing the MCHBAR error message), so it seems MCHBAR is enabled.

Comment 3 David Kovalsky 2010-05-02 17:19:21 UTC
I'm seeing the same error on Asus P5BV-C/4L, kernel-2.6.18-194.el5.

The problem appeared after upgrading from 5.4 (kernel-2.6.18-164.15.1) to 5.5 (kernel-2.6.18-194.el5). It is really annoying to get these errors sent to every console.

Intel Xeon 3330, 4x 2GB 64-bit ECC, Kingston 800MHz

[root@bigbang ~]# lsmod|grep edac
i3200_edac             38865  0 
edac_mc                60193  1 i3200_edac

The test kernel doesn't help in my case either.

Comment 4 Tony Luck 2010-05-04 19:37:04 UTC
It would be interesting to study the differences between the systems where this fails, and the lab system that Mauro has where this works.  Could people post dmidecode output from these machines?

Comment 5 Matthias Prager 2010-05-06 01:25:42 UTC
Created attachment 411775 [details]
dmidecode output of my Asus P5BV-C/4L MB

On Tony Luck's request.

Comment 6 David Kovalsky 2010-05-07 09:09:59 UTC
Created attachment 412270 [details]
dmidecode output - UUID 'X-ed' out, Asus P5BV-C/4L

If relevant, modules mentioned in comment #3 are currently blacklisted (not loaded).

Comment 7 manuel wolfshant 2010-05-11 22:11:53 UTC
I have the same problem on a Supermicro system. dmidecode output is attached

Comment 8 manuel wolfshant 2010-05-11 22:14:10 UTC
Created attachment 413276 [details]
dmidecode output for Supermicro 5025B-T

Comment 9 Tony Luck 2010-05-12 17:03:21 UTC
Thanks for the dmidecode files.  All three systems showing the problem have fully populated (4 x 2G) memory for the max that the systems support.

Mauro: Can you track down the lab system on which this code worked and get the dmidecode information from it?

Comment 10 Jens Kuehnel 2010-05-12 18:15:48 UTC
Created attachment 413512 [details]
dmidecode output of gaia  UUID 'X-ed' out, Asus P5BV-C/4L

Comment 11 Jens Kuehnel 2010-05-12 18:18:31 UTC
My dmidecode output is attached.

I have one other observation. After doing a BIOS reset, I had no problems for around 2 months. Yesterday I made a small change in the BIOS (do not wait on BIOS errors, such as a missing keyboard) and now the problem is back again. ;-(

Comment 12 Stefan Neufeind 2010-05-16 20:27:11 UTC
Supermicro X7SBi+ with fully populated RAM slots (4 banks, 4x2GB)
dmidecode-output is attached.

Comment 13 Stefan Neufeind 2010-05-16 20:27:47 UTC
Created attachment 414407 [details]
dmidecode from Supermicro X7SBi+ with 4x2GB RAM

Comment 14 unit 2010-05-19 09:36:00 UTC
I have 2 ASUS RS120-E5/PA4 (Board: P5BV-R).

One with 2x2GB RAM and second with 4x2GB RAM.

The first server (2x2GB RAM) works with the 2.6.18-194.el5 kernel without problems.
The second server (4x2GB RAM) shows the errors.

Comment 16 unit 2010-05-19 11:38:31 UTC
I've just updated to the 2.6.18-194.3.1.el5 kernel, but the problem still exists.

Comment 17 Mauro Carvalho Chehab 2010-05-19 12:36:33 UTC
It seems to me that some BIOSes are doing something wrong, causing those troubles. The temporary solution is to add i3200_edac to /etc/modprobe.d/blacklist:

blacklist i3200_edac

This prevents the EDAC module from loading, so the error messages no longer appear.
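
As a rough sketch of the workaround above, the blacklist entry can be added in a way that is safe to run more than once. The helper name blacklist_module and its optional second argument are assumptions for illustration; the default file path is the one named in this comment.

```shell
# Append "blacklist <module>" to a blacklist file unless the line is already there.
# blacklist_module is a hypothetical helper; the default path is taken from the comment above.
blacklist_module() {
    module=$1
    file=${2:-/etc/modprobe.d/blacklist}
    if ! grep -qx "blacklist $module" "$file" 2>/dev/null; then
        echo "blacklist $module" >> "$file"
    fi
}

# Example (as root): blacklist_module i3200_edac && rmmod i3200_edac
```

The rmmod in the example only unloads a module that is already loaded; the blacklist entry takes effect on the next boot.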

Comment 19 Jens Kuehnel 2010-05-19 12:50:12 UTC
Mauro: I already did this, but I would like to detect memory-errors, before they ruin data. Is there a possibility to fix that without changes in the BIOS?

Also you have to deactivate /etc/cron.daily/edac, otherwise you will get an error every day.

Comment 21 unit 2010-05-19 12:57:21 UTC
Mauro Carvalho Chehab: I have this problem on all 17 of my ASUS RS120-E5/PA4 servers, but only
where there is more than 4GB of memory.

Unfortunately, I don't have a non-ASUS board with an i3200-series chipset right now, so I can't
test on one.

Comment 22 Stefan Neufeind 2010-05-19 13:02:36 UTC
It seems that several Supermicro users have spoken up in this thread and that the problem might be related to some BIOS peculiarities. Since we've had good experience with Supermicro tech support, I've just tried contacting them to see if they can confirm that it might be related to the BIOS.

Comment 23 Mauro Carvalho Chehab 2010-05-19 17:20:43 UTC
Created attachment 415205 [details]
dmidecode output from an IBM System x3250 M2

EDAC seems to be properly running on this machine, and it is not
generating any errors.

The machine has 2GB of RAM.

$ dmesg|grep -i edac
EDAC MC: Ver: 2.0.1 Mar 16 2010
EDAC MC0: Giving out device to i3200_edac i3200: DEV 0000:00:00.0
EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 6, labels ":": i3200 UE

$ cd /sys/devices/system/edac/mc/mc0
$ cat csrow6/ce_count
0
$ cat csrow6/ch0_ce_count
0
$ cat csrow6/ch0_dimm_label
$ cat csrow6/ch1_ce_count
0
$ cat csrow6/ch1_dimm_label
$ cat csrow6/dev_type
Unknown
$ cat csrow6/edac_mode
Unknown
$ cat csrow6/mem_type
DDR2
$ cat csrow6/size_mb
1024
$ cat csrow6/ue_count
1
$ cat csrow7/ce_count
0
$ cat csrow7/ch0_ce_count
0
$ cat csrow7/ch0_dimm_label
$ cat csrow7/ch1_ce_count
0
$ cat csrow7/ch1_dimm_label
$ cat csrow7/dev_type
Unknown
$ cat csrow7/edac_mode
Unknown
$ cat csrow7/mem_type
DDR2
$ cat csrow7/size_mb
1024
$ cat csrow7/ue_count
0

Comment 24 Mauro Carvalho Chehab 2010-05-19 17:34:24 UTC
(In reply to comment #19)
> Mauro: I already did this, but I would like to detect memory-errors, before
> they ruin data. Is there a possibility to fix that without changes in the BIOS?
> 
> Also you have to deactivate /etc/cron.daily/edac, otherwise you will get an
> error every day.    

If it is a BIOS problem, the solution would be to get a fixed BIOS. AFAIK, there's nothing that the driver can do to solve it.

Does the EDAC driver detect the correct device info? You can check it by running something like:

cd /sys/devices/system/edac/mc/mc0/; for i in csrow*/*; do echo "$ cat $i"; cat $i; done
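
A slightly more defensive variant of the loop above, printing one "attribute: value" line per file, might look like the sketch below. The function name dump_csrows and its optional directory argument are illustrative assumptions; the default path is the one used in this thread.

```shell
# Print every csrow attribute under one memory controller directory.
# dump_csrows is a hypothetical helper; pass another directory to override the default.
dump_csrows() {
    mc_dir=${1:-/sys/devices/system/edac/mc/mc0}
    for f in "$mc_dir"/csrow*/*; do
        [ -f "$f" ] || continue                      # skip if the glob matched nothing
        printf '%s: %s\n' "${f#"$mc_dir"/}" "$(cat "$f")"
    done
}

# Example: dump_csrows | grep ue_count
```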

Comment 25 Tony Luck 2010-05-19 17:44:24 UTC
An alternative to the "blame the BIOS" theory is the >4GB theory mentioned above in comment #21.  Perhaps one of the users who has problems and 8G of memory can pull out a couple of DIMMs for a test to see if the errors are still reported. And/or someone with a working system can borrow enough 2G DIMMs to try their machine with 4x2GB to see if it still works.

Comment 26 Mauro Carvalho Chehab 2010-05-19 17:55:56 UTC
(In reply to comment #25)
> An alternative to the "blame the BIOS" theory is the >4GB theory mentioned
> above in comment #21.  Perhaps one of the users who has problems and 8G of
> memory can pull out a couple of DIMMs for a test to see if the errors are still
> reported. And/or someone with a working system can borrow enough 2G DIMMs to
> try their machine with 4x2GB to see if it still works.    

I'm investigating this one. I've found a problem in the driver that occurs if the machine has more than one i3200 chipset. I'm not sure whether such an architecture is possible, but I'm already working on a patch to remove the static vars from the driver.

Comment 27 Mauro Carvalho Chehab 2010-05-19 18:03:03 UTC
Created attachment 415217 [details]
Don't use static local vars at i3200 probe logic

This patch is currently untested, but it may fix the bug if it is related to having more than one memory controller in the machine. I'll build a test kernel and run it on some machines to make sure this patch doesn't break PCI probing.

Comment 28 Mauro Carvalho Chehab 2010-05-19 21:58:04 UTC
The new test kernel with the two patches applied is available at:

http://people.redhat.com/~mchehab/.bz564274/

I repeated the tests on the IBM x3250 M2 and it keeps working properly.

Please test it.

Comment 29 unit 2010-05-20 09:40:59 UTC
Oh. I tested it and the problem still exists.

But I have additional info now.

I tried removing 4GB of memory from the 8GB server, but still got the errors.

Then I tried adding 4GB of memory to the 4GB server and removing it again. The problem appeared.

My BIOS has a "Memory Remap Feature" setting (ENABLE: allow remapping of overlapped PCI memory above the total physical memory. DISABLE: Do not allow remapping of memory), and it was set to "Enable". When I set it to "Disable", the problem disappeared, but the board sees only 4GB of memory :(

Comment 30 unit 2010-05-20 10:00:15 UTC
(In reply to comment #24)
> 
> Does the EDAC driver detect the correct device info? You can check it by
> running something like:
> 
> cd /sys/devices/system/edac/mc/mc0/; for i in csrow*/*; do echo "$ cat $i"; cat
> $i; done    

$ cat csrow0/ce_count


What should i see?

# cat mc_name
i3200

# ls | grep csrow
csrow0
csrow1
csrow4
csrow5

# cd /sys/devices/system/edac/mc/mc0/csrow0

# cat dev_type
Unknown

# cat mem_type
DDR2

# cat edac_mode
Unknown

Comment 31 Mauro Carvalho Chehab 2010-05-20 20:45:21 UTC
Created attachment 415507 [details]
dmidecode output from a HP DL320G5 with 8Gb

The >4GB theory goes to /dev/null. A test on an HP DL320G5 with 8GB worked as expected. It properly detected the 8 1GB DIMMs, and no errors were generated.

So, the bug really seems to be in how the BIOS handles memory.

I suspect that, when the remap mode is enabled, the BIOS is somehow overlapping the PCI device's memory-mapped region with something else, causing the trouble.

These are the direct EDAC readings on the HP machine:

# cd /sys/devices/system/edac/mc/mc0/; for i in csrow*/*; do echo "# cat $i"; cat $i; done
# cat csrow0/ce_count
0
# cat csrow0/ch0_ce_count
0
# cat csrow0/ch0_dimm_label
# cat csrow0/ch1_ce_count
0
# cat csrow0/ch1_dimm_label
# cat csrow0/dev_type
Unknown
# cat csrow0/edac_mode
Unknown
# cat csrow0/mem_type
DDR2
# cat csrow0/size_mb
1024
# cat csrow0/ue_count
0
# cat csrow1/ce_count
0
# cat csrow1/ch0_ce_count
0
# cat csrow1/ch0_dimm_label
# cat csrow1/ch1_ce_count
0
# cat csrow1/ch1_dimm_label
# cat csrow1/dev_type
Unknown
# cat csrow1/edac_mode
Unknown
# cat csrow1/mem_type
DDR2
# cat csrow1/size_mb
1024
# cat csrow1/ue_count
0
# cat csrow2/ce_count
0
# cat csrow2/ch0_ce_count
0
# cat csrow2/ch0_dimm_label
# cat csrow2/ch1_ce_count
0
# cat csrow2/ch1_dimm_label
# cat csrow2/dev_type
Unknown
# cat csrow2/edac_mode
Unknown
# cat csrow2/mem_type
DDR2
# cat csrow2/size_mb
1024
# cat csrow2/ue_count
0
# cat csrow3/ce_count
0
# cat csrow3/ch0_ce_count
0
# cat csrow3/ch0_dimm_label
# cat csrow3/ch1_ce_count
0
# cat csrow3/ch1_dimm_label
# cat csrow3/dev_type
Unknown
# cat csrow3/edac_mode
Unknown
# cat csrow3/mem_type
DDR2
# cat csrow3/size_mb
1024
# cat csrow3/ue_count
0
# cat csrow4/ce_count
0
# cat csrow4/ch0_ce_count
0
# cat csrow4/ch0_dimm_label
# cat csrow4/ch1_ce_count
0
# cat csrow4/ch1_dimm_label
# cat csrow4/dev_type
Unknown
# cat csrow4/edac_mode
Unknown
# cat csrow4/mem_type
DDR2
# cat csrow4/size_mb
1024
# cat csrow4/ue_count
0
# cat csrow5/ce_count
0
# cat csrow5/ch0_ce_count
0
# cat csrow5/ch0_dimm_label
# cat csrow5/ch1_ce_count
0
# cat csrow5/ch1_dimm_label
# cat csrow5/dev_type
Unknown
# cat csrow5/edac_mode
Unknown
# cat csrow5/mem_type
DDR2
# cat csrow5/size_mb
1024
# cat csrow5/ue_count
0
# cat csrow6/ce_count
0
# cat csrow6/ch0_ce_count
0
# cat csrow6/ch0_dimm_label
# cat csrow6/ch1_ce_count
0
# cat csrow6/ch1_dimm_label
# cat csrow6/dev_type
Unknown
# cat csrow6/edac_mode
Unknown
# cat csrow6/mem_type
DDR2
# cat csrow6/size_mb
1024
# cat csrow6/ue_count
0
# cat csrow7/ce_count
0
# cat csrow7/ch0_ce_count
0
# cat csrow7/ch0_dimm_label
# cat csrow7/ch1_ce_count
0
# cat csrow7/ch1_dimm_label
# cat csrow7/dev_type
Unknown
# cat csrow7/edac_mode
Unknown
# cat csrow7/mem_type
DDR2
# cat csrow7/size_mb
1024
# cat csrow7/ue_count
0

Comment 32 Mauro Carvalho Chehab 2010-05-24 16:31:34 UTC
Created attachment 416170 [details]
HP DL320G5 BIOS configuration

The HP ProLiant BIOS doesn't have any "enable remap feature" setting. I'm enclosing the current BIOS configuration as a reference.

Comment 33 David Laube 2010-06-21 18:05:33 UTC
I had this problem on our ASUS RS120-E5/PA4 R. We were able to successfully work around it by adding "blacklist i3200_edac"  to /etc/modprobe.d/blacklist followed by a reboot.

Comment 34 Jon Thomas 2010-06-22 14:57:50 UTC
Base Board Information
       Manufacturer: ASUSTeK Computer INC.
       Product Name: P5BV-M
       Version: Rev 1.xxG

output from command:

cd /sys/devices/system/edac/mc/mc0/; for i in csrow*/*; do echo "# cat $i";cat $i; done

# cat csrow0/ce_count
0
# cat csrow0/ch0_ce_count
0
# cat csrow0/ch0_dimm_label
# cat csrow0/ch1_ce_count
0
# cat csrow0/ch1_dimm_label
# cat csrow0/dev_type
Unknown
# cat csrow0/edac_mode
Unknown
# cat csrow0/mem_type
DDR2
# cat csrow0/size_mb
1024
# cat csrow0/ue_count
35
# cat csrow1/ce_count
0
# cat csrow1/ch0_ce_count
0
# cat csrow1/ch0_dimm_label
# cat csrow1/ch1_ce_count
0
# cat csrow1/ch1_dimm_label
# cat csrow1/dev_type
Unknown
# cat csrow1/edac_mode
Unknown
# cat csrow1/mem_type
DDR2
# cat csrow1/size_mb
1024
# cat csrow1/ue_count
18
# cat csrow2/ce_count
0
# cat csrow2/ch0_ce_count
0
# cat csrow2/ch0_dimm_label
# cat csrow2/ch1_ce_count
0
# cat csrow2/ch1_dimm_label
# cat csrow2/dev_type
Unknown
# cat csrow2/edac_mode
Unknown
# cat csrow2/mem_type
DDR2
# cat csrow2/size_mb
1024
# cat csrow2/ue_count
54
# cat csrow3/ce_count
0
# cat csrow3/ch0_ce_count
0
# cat csrow3/ch0_dimm_label
# cat csrow3/ch1_ce_count
0
# cat csrow3/ch1_dimm_label
# cat csrow3/dev_type
Unknown
# cat csrow3/edac_mode
Unknown
# cat csrow3/mem_type
DDR2
# cat csrow3/size_mb
1024
# cat csrow3/ue_count
181
# cat csrow4/ce_count
0
# cat csrow4/ch0_ce_count
0
# cat csrow4/ch0_dimm_label
# cat csrow4/ch1_ce_count
0
# cat csrow4/ch1_dimm_label
# cat csrow4/dev_type
Unknown
# cat csrow4/edac_mode
Unknown
# cat csrow4/mem_type
DDR2
# cat csrow4/size_mb
1024
# cat csrow4/ue_count
50
# cat csrow5/ce_count
0
# cat csrow5/ch0_ce_count
0
# cat csrow5/ch0_dimm_label
# cat csrow5/ch1_ce_count
0
# cat csrow5/ch1_dimm_label
# cat csrow5/dev_type
Unknown
# cat csrow5/edac_mode
Unknown
# cat csrow5/mem_type
DDR2
# cat csrow5/size_mb
1024
# cat csrow5/ue_count
47
# cat csrow6/ce_count
0
# cat csrow6/ch0_ce_count
0
# cat csrow6/ch0_dimm_label
# cat csrow6/ch1_ce_count
0
# cat csrow6/ch1_dimm_label
# cat csrow6/dev_type
Unknown
# cat csrow6/edac_mode
Unknown
# cat csrow6/mem_type
DDR2
# cat csrow6/size_mb
1024
# cat csrow6/ue_count
79
# cat csrow7/ce_count
0
# cat csrow7/ch0_ce_count
0
# cat csrow7/ch0_dimm_label
# cat csrow7/ch1_ce_count
0
# cat csrow7/ch1_dimm_label
# cat csrow7/dev_type
Unknown
# cat csrow7/edac_mode
Unknown
# cat csrow7/mem_type
DDR2
# cat csrow7/size_mb
1024
# cat csrow7/ue_count
113

Comment 35 Mauro Carvalho Chehab 2010-07-01 18:46:02 UTC
The bug seems to happen only with the "Memory Remap Feature" enabled.

Question: is the bug happening with x86_64 kernels, or only with i386 kernels with PAE enabled?

Comment 36 unit 2010-07-01 20:03:23 UTC
I'm using an x86_64 kernel.

I also tested non-ASUS servers. The bug doesn't appear there.

Comment 37 David Kovalsky 2010-07-01 21:27:35 UTC
x86_64 kernel in my case too.

Comment 38 BCrook 2010-07-13 20:40:56 UTC
I'm experiencing this same bug on 2.6.18-194.8.1.el5 x86_64 install.  The motherboard is an Asus RS100-E5/PI2 Barebones with P5BV-M/RS100-E5 motherboard with BIOS 0211 (latest).  I have an E5300 Wolfdale CPU, and two 2GB 240-Pin DDR2 SDRAM ECC DDR2 800 (PC2 6400) Intel Certified Server Memory Model KVR800D2E5/2GI.

I experience the problem with both DIMMs installed, and with either DIMM installed in the lowest slot.

# modprobe i3200_edac ; sleep 5; rmmod i3200_edac
kernel: EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 2, labels ":": i3200 UE

As the message repeats, the row will change randomly between 2, 3, 6, and 7.  The grain always stays at exactly 2^30.  I tried with the BIOS' "Memory Remap Feature" enabled and disabled.  It had no effect.  I will start the 32bit download...
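
The 2^30 figure can be checked with shell arithmetic. The sketch below is only an illustration with made-up function names; the conversion to MiB assumes the grain is a byte count, in which case it works out to the same 1024 that the size_mb files show in the sysfs dumps elsewhere in this thread.

```shell
# Both helper names are hypothetical; they only do integer arithmetic.

# Print 1 if the given grain equals 2^30, otherwise 0.
is_two_pow_30() {
    echo $(( $1 == (1 << 30) ))
}

# Convert a grain value to MiB, assuming the grain is expressed in bytes.
grain_mib() {
    echo $(( $1 / 1024 / 1024 ))
}

# Example: is_two_pow_30 1073741824; grain_mib 1073741824
```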

This system is not in production yet if someone wants me to try a patch.

Comment 39 Stefan Neufeind 2010-07-13 21:49:19 UTC
x86_64 here as well

Comment 40 Iain Kay 2010-07-20 15:41:30 UTC
I am also having this problem on a server I'm renting. It was delivered with CentOS 5.4 and had no issues; after upgrading to CentOS 5.5, the error started to appear.
I then did a stock install of CentOS using PXE + VNC boot, and after accessing the new installation over SSH the issue still existed.

The server has an Asus Rs100-e5 motherboard, Intel Core 2 Quad Q8300 CPU, 4GB of DDR2 Ram. Kernel is at 2.6.18-194.8.1.el5.

Comment 41 Iain Kay 2010-07-21 00:02:56 UTC
Just a heads up: if you wish to keep the EDAC module loaded but don't want your console flooded with these messages, you can use the following commands:

echo "0" > /sys/devices/system/edac/mc/log_ue
echo "0" > /sys/devices/system/edac/mc/log_ce

To keep these settings after a restart, simply add these two lines to the bottom of /etc/rc.local.
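
The same two writes can be wrapped in a small function that also reports when the files are missing or not writable, for example when the EDAC module is not loaded. This is only a sketch: the function name set_edac_logging and its optional directory argument are assumptions, while the log_ue and log_ce paths are the ones given above.

```shell
# Write the same value (0 or 1) to both EDAC logging switches.
# set_edac_logging is a hypothetical helper; the default directory is the one used above.
set_edac_logging() {
    value=$1
    mc_root=${2:-/sys/devices/system/edac/mc}
    for knob in log_ue log_ce; do
        if [ -w "$mc_root/$knob" ]; then
            echo "$value" > "$mc_root/$knob"
        else
            echo "cannot write $mc_root/$knob" >&2
            return 1
        fi
    done
}

# Example (as root): set_edac_logging 0
```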

Comment 42 manuel wolfshant 2010-10-13 12:19:26 UTC
Did anyone test the new kernels ( 2.6.18-194.17.1 for instance) ? Was the bug fixed or not yet ?

Comment 43 BCrook 2010-10-13 19:06:21 UTC
(In reply to comment #42)
> Did anyone test the new kernels ( 2.6.18-194.17.1 for instance) ? Was the bug
> fixed or not yet ?

I can confirm that the deluge of EDAC errors that I saw on 2.6.18-194.11.4 is still present in 194.17.1.

[bcrook@hostname ~]$ dmesg | tail -n 4 ; uname -a
EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 2, labels ":": i3200 UE
EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 6, labels ":": i3200 UE
EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 3, labels ":": i3200 UE
EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 7, labels ":": i3200 UE
Linux hostname.redacted.com 2.6.18-194.17.1.el5 #1 SMP Wed Sep 29 12:50:31 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

Comment 44 Mauro Carvalho Chehab 2010-11-16 17:29:08 UTC
(In reply to comment #42)
> Did anyone test the new kernels ( 2.6.18-194.17.1 for instance) ? Was the bug
> fixed or not yet ?

I couldn't reproduce this bug. It is probably due to BIOS problems on some machines. Unfortunately, some BIOSes do bad things with some memory controllers. The recommendation is to disable i3200_edac on those machines.

Comment 45 Hannes Sowa 2010-12-13 16:53:47 UTC
It seems that on some motherboards the memory-controller error handling is correctly initialized *only* if quick-boot mode is disabled. One of our boxes, previously affected by this problem, is now running with i3200_edac in polling mode and has not reported any UE/CE-events so far.
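
One way to check whether a machine is still reporting UE events after such a BIOS change is to sum the ue_count files twice and compare the totals. The sketch below assumes the csrow layout shown in the sysfs dumps earlier in this thread; the function name total_ue and the 60-second interval in the example are arbitrary choices.

```shell
# Sum ue_count over all csrows of one memory controller.
# total_ue is a hypothetical helper; pass another directory to override the default.
total_ue() {
    mc_dir=${1:-/sys/devices/system/edac/mc/mc0}
    total=0
    for f in "$mc_dir"/csrow*/ue_count; do
        [ -f "$f" ] || continue                      # skip if the glob matched nothing
        total=$(( total + $(cat "$f") ))
    done
    echo "$total"
}

# Example: before=$(total_ue); sleep 60; after=$(total_ue); echo "new UEs: $((after - before))"
```

A difference of zero over a reasonable interval would be consistent with the fix, although it does not prove that the error checks are working.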

Comment 46 Hannes Sowa 2010-12-13 19:38:52 UTC
I just found further information in the edac-wiki:
http://buttersideup.com/edacwiki/Uninitialized_ECC_bits

The Asus P5BV is specifically mentioned as having this kind of problem. If we have confirmation that this solves the problem, I think we can close this bug.

Comment 47 Jens Kuehnel 2010-12-14 00:01:59 UTC
Hi,

I can confirm that this problem is gone, after changing the BIOS setting and doing a power-cycle/hw reset. (normal reboot does not help)

I therefore think this bug should be closed as NOTABUG.

Very special Thanks to Hannes Sowa.

CU
Jens

Comment 48 Matthias Prager 2011-01-16 14:07:35 UTC
I can also confirm that the workaround solves the issue.
I disabled quickboot in the BIOS and got no more false errors.

MB: ASUS P5BV-C/4L
gentoo with 2.6.36 kernel

Comment 49 Mauro Carvalho Chehab 2011-01-28 20:38:41 UTC
So, based on all the information we have, this is a BIOS bug. There's nothing we can do to solve it, except to document that, on some i3210 boards, the BIOS doesn't properly enable the error correction checks, and that disabling quickboot may solve the issue.

I'll close this bug with a Technical note.

Comment 50 Mauro Carvalho Chehab 2011-01-28 20:38:41 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Some i3210 BIOSes have problems enabling the hardware checks at the MCU. On such hardware, customers should try disabling Quickboot and/or the "Memory Remap Feature", or disabling the EDAC drivers. More details can be found at:

https://bugzilla.redhat.com/show_bug.cgi?id=564274

