
Bug 501906

Summary: EDAC of Redhat impact system BIOS and BMC function
Product: Red Hat Enterprise Linux 5
Component: edac-utils
Version: 5.2
Hardware: i386
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: low
Target Milestone: rc
Reporter: king <king.wong>
Assignee: Mauro Carvalho Chehab <mchehab>
QA Contact: qe-baseos-daemons
CC: denis, gcase, lwang, mchehab
Doc Type: Bug Fix
Last Closed: 2010-07-01 17:46:05 UTC

Description king 2009-05-21 09:42:58 UTC
Description of problem:
Our customer has found a problem with BMC event logging while the
Red Hat OS is running. When we generated some single memory ECC errors,
we found that the BMC did not log these events, but the EDAC module did
receive them. It looks like the EDAC module interferes with the system
BIOS/BMC event-logging function: the EDAC module appears to mask the
system SMI, read the error message, and clear the system error status.
We do not think this is acceptable.
;------------
CPU:AMD Quad-Core 2389/ Dual-Core
OS: RHEL5.2 32bit/64bit(kernel: 2.6.18-53.el5xen)
        RHEL4.6 32bit/64bit(Kernel: 2.6.9-67.ELsmp)

But we found that the ECC errors are logged by Linux itself.

[root@nsgsh-dhcp-163 log]# more /var/log/messages | grep edac
Apr  9 04:53:00 nsgsh-dhcp-163 kernel: MC: drivers/edac/edac_mc.c version MC $Revision: 1.3 $
Apr  9 04:53:00 nsgsh-dhcp-163 kernel: MC0: Giving out device to k8_edac Athlon64/Opteron: PCI 0000:00:18.2 (0000:00:18.2)
Apr  9 04:53:00 nsgsh-dhcp-163 kernel: MC1: Giving out device to k8_edac Athlon64/Opteron: PCI 0000:00:19.2 (0000:00:19.2)
Apr  9 04:58:01 nsgsh-dhcp-163 kernel: MC1: CE page 0x7dc0c, offset 0x278, grain 8, syndrome 0x63e1, row 6, channel 1, label "": k8_edac
Apr  9 04:58:02 nsgsh-dhcp-163 kernel: MC1: CE page 0x37e92, offset 0xa18, grain 8, syndrome 0x63e1, row 6, channel 1, label "": k8_edac
Apr  9 04:58:02 nsgsh-dhcp-163 kernel: MC1: CE - no information available: k8_edac Error Overflow set
Apr  9 04:58:03 nsgsh-dhcp-163 kernel: MC1: CE page 0x66e08, offset 0x280, grain 8, syndrome 0x63e1, row 6, channel 1, label "": k8_edac
Apr  9 04:58:03 nsgsh-dhcp-163 kernel: MC1: CE - no information available: k8_edac Error Overflow set

[root@nsgsh-dhcp-163 log]# dmesg | grep edac
MC: drivers/edac/edac_mc.c version MC $Revision: 1.3 $
MC0: Giving out device to k8_edac Athlon64/Opteron: PCI 0000:00:18.2 (0000:00:18.2)
MC1: Giving out device to k8_edac Athlon64/Opteron: PCI 0000:00:19.2 (0000:00:19.2)
MC1: CE page 0x7dc0c, offset 0x278, grain 8, syndrome 0x63e1, row 6, channel 1, label "": k8_edac
MC1: CE page 0x37e92, offset 0xa18, grain 8, syndrome 0x63e1, row 6, channel 1, label "": k8_edac
MC1: CE - no information available: k8_edac Error Overflow set
MC1: CE page 0x66e08, offset 0x280, grain 8, syndrome 0x63e1, row 6, channel 1, label "": k8_edac
MC1: CE - no information available: k8_edac Error Overflow set
MC1: CE page 0x37ffb, offset 0x118, grain 8, syndrome 0x63e1, row 6, channel 1, label "": k8_edac


Version-Release number of selected component (if applicable):


How reproducible:
Boot into the OS, create an ECC error, and check the BMC event log and the EDAC event log.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:
The EDAC module should not intercept any hardware error before the BIOS or BMC handles it.

Additional info:

Comment 1 Denis Leroy 2009-05-27 07:12:56 UTC
This may not actually be EDAC's fault. I'm convinced this is caused by the x86_64 MCE checker (arch/x86_64/kernel/mce.c) (i.e. the module that manages /dev/mcelog). This module is compiled by default into the RHEL5 kernel, and its init code clears up the Opteron MCE status registers.

It's possible to disable the init code with the kernel option 'nomce' or 'mce=off'.

As a side note, can this cause a race condition? On one hand, the MCE checker reads (and clears) the Opteron MCE status registers every 5 minutes. On the other hand, the EDAC k8_edac module tries to access the same data (by default every second, but only after the k8_edac.ko module is loaded). I would ping Bill Nottingham on this.

Comment 2 Aristeu Rozanski 2009-05-27 21:38:32 UTC
For AMD64, CEs (corrected errors) should be polled. The MCE code does that, but
on a long cycle. EDAC polls more often (once a second). The risk here is that EDAC
misses a CE, which is then reported by mcelog instead. Since CEs are usually reported
many times, this isn't a big concern.
For UEs (uncorrected errors), the error will be reported as an MCE and the MCE code
will handle it. Using mce=off, as Denis pointed out, should work.

Also, there's the possibility of blacklisting k8_edac in /etc/modprobe.d/ and
using mce=off to disable the MCE code completely.
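
A minimal sketch of how one might check which of the two mechanisms discussed above is active on a given machine. It assumes the text of /proc/cmdline, /proc/modules and a modprobe configuration file is passed in as strings; the function names are hypothetical and not part of edac-utils. It only looks for the 'nomce' / 'mce=off' options and the k8_edac module named in this thread, and it does not detect a kernel built without MCE support.

```python
def mce_disabled(cmdline: str) -> bool:
    """True if the kernel command line carries 'nomce' or 'mce=off'."""
    return any(tok in ("nomce", "mce=off") for tok in cmdline.split())


def module_loaded(proc_modules: str, name: str = "k8_edac") -> bool:
    """True if `name` is the first field of any line of /proc/modules text."""
    for line in proc_modules.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def blacklisted(modprobe_conf: str, name: str = "k8_edac") -> bool:
    """True if the config text has an uncommented 'blacklist <name>' line."""
    for line in modprobe_conf.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        if fields == ["blacklist", name]:
            return True
    return False


def error_sources(cmdline: str, proc_modules: str) -> list:
    """List which of the two error consumers from this thread are active."""
    sources = []
    if not mce_disabled(cmdline):
        sources.append("mce")
    if module_loaded(proc_modules):
        sources.append("k8_edac")
    return sources
```

On a live system the inputs would come from reading /proc/cmdline and /proc/modules; if error_sources() returns both "mce" and "k8_edac", both pollers are reading the same status registers.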

Comment 3 Mauro Carvalho Chehab 2010-07-01 17:46:05 UTC
This seems to be related to a problem with the BMC agent running on the machine.

There are two possible ways to collect those hardware errors. The first is via MCE, and I suspect that the BMC relies on it. By design, MCE detects and polls errors every 5 minutes, returning only the last 32 errors. If the machine gets more than 32 errors in that interval, some will be lost.
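
As a rough worked example of the limit described above, the following sketch takes the two figures from this comment (a 5-minute poll and 32 retained errors) at face value and estimates how many errors would go unreported at a given steady error rate. The function names are illustrative only, and a constant rate is an assumption.

```python
POLL_INTERVAL_S = 5 * 60  # MCE poll period quoted in this comment
BUFFER_SLOTS = 32         # errors retained per poll, per this comment


def lost_per_interval(errors_per_second: float) -> int:
    """Errors dropped in one poll interval at a steady error rate."""
    arrived = int(errors_per_second * POLL_INTERVAL_S)
    return max(0, arrived - BUFFER_SLOTS)


def max_lossless_rate() -> float:
    """Highest steady rate (errors/second) that still fits in the buffer."""
    return BUFFER_SLOTS / POLL_INTERVAL_S
```

The log in the description shows roughly one CE per second. At that rate 300 errors arrive per interval, so lost_per_interval(1.0) gives 268, while anything up to about one error every 9.4 seconds fits within the 32 slots.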

The second is to collect memory errors via EDAC. In this case, errors are polled every second, reducing the risk of losing errors. When the EDAC module is enabled, an error may not be seen via MCE.
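
To illustrate why an error may not be seen via MCE once EDAC is loaded, here is a toy simulation. It assumes a single read-and-clear status store shared by both pollers, one-second time steps, and EDAC polling before MCE when both fall on the same tick; none of this reflects the actual kernel code, and it ignores the 32-entry limit.

```python
def simulate(error_times, duration_s, edac_period_s=1, mce_period_s=300,
             edac_loaded=True):
    """Count how many errors each poller observes.

    error_times: seconds at which a corrected error occurs.
    Whichever poller reads first gets the pending errors and clears them.
    """
    pending = 0
    seen = {"edac": 0, "mce": 0}
    errors = sorted(error_times)
    idx = 0
    for t in range(1, duration_s + 1):
        # Errors that occurred up to time t become pending.
        while idx < len(errors) and errors[idx] <= t:
            pending += 1
            idx += 1
        # Assumed ordering: EDAC polls before MCE on a shared tick.
        if edac_loaded and t % edac_period_s == 0:
            seen["edac"] += pending
            pending = 0
        if t % mce_period_s == 0:
            seen["mce"] += pending
            pending = 0
    return seen
```

With errors at 10 s, 20 s and 299 s over a 600 s run, the default settings leave all three with EDAC and none with MCE; passing edac_loaded=False hands all three to the MCE poller instead.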

If the customer is interested only in memory errors, they can disable MCE, as suggested, and rely on EDAC. If, on the other hand, they get fewer than 32 errors per 5-minute interval and a 5-minute poll is enough for them, they can blacklist k8_edac and rely only on mcelog.

If none of the above is enough, then the proper solution is to ask the BMC vendor to add support for EDAC to their event-log agents. As far as I know, they already have the capability of parsing the syslog, so it seems to be only a matter of parsing the EDAC messages in their products.
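
A minimal sketch of what parsing the EDAC messages from syslog could look like. The regular expression is derived only from the k8_edac CE lines pasted in the description of this bug; other kernel versions or EDAC drivers may format their messages differently, and the function name is hypothetical.

```python
import re

# Pattern built from the k8_edac CE lines quoted in this report.
CE_RE = re.compile(
    r'MC(?P<mc>\d+): CE page 0x(?P<page>[0-9a-fA-F]+), '
    r'offset 0x(?P<offset>[0-9a-fA-F]+), grain (?P<grain>\d+), '
    r'syndrome 0x(?P<syndrome>[0-9a-fA-F]+), row (?P<row>\d+), '
    r'channel (?P<channel>\d+), label "(?P<label>[^"]*)"'
)


def parse_ce(line):
    """Return a dict describing a corrected-error line, or None if no match."""
    m = CE_RE.search(line)
    if m is None:
        return None
    d = m.groupdict()
    return {
        "mc": int(d["mc"]),
        "page": int(d["page"], 16),
        "offset": int(d["offset"], 16),
        "grain": int(d["grain"]),
        "syndrome": int(d["syndrome"], 16),
        "row": int(d["row"]),
        "channel": int(d["channel"]),
        "label": d["label"],
    }
```

Lines such as "MC1: CE - no information available: k8_edac Error Overflow set" carry no location fields and return None here, so an agent would need to handle them separately.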

In any case, I don't see anything that we could do in edac-utils to solve the customer's issue, so I'm closing this BZ as NOTABUG.