Bug 501906
| Summary: | EDAC of Redhat impact system BIOS and BMC function | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | king <king.wong> |
| Component: | edac-utils | Assignee: | Mauro Carvalho Chehab <mchehab> |
| Status: | CLOSED NOTABUG | QA Contact: | qe-baseos-daemons |
| Severity: | high | Docs Contact: | |
| Priority: | low | | |
| Version: | 5.2 | CC: | denis, gcase, lwang, mchehab |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | i386 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2010-07-01 17:46:05 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
This may not actually be EDAC's fault. I'm convinced this is caused by the x86_64 MCE checker (arch/x86_64/kernel/mce.c), i.e. the module that manages /dev/mcelog. This module is compiled into the RHEL5 kernel by default, and its init code clears the Opteron MCE status registers. The init code can be disabled with the kernel option 'nomce' or 'mce=off'.

As a side note, can this cause a race condition? On one hand, the MCE checker reads (and clears) the Opteron MCE status registers every 5 minutes. On the other hand, the EDAC k8_edac module tries to access the same data (by default every second, but only after the k8_edac.ko module is loaded). I would ping Bill Nottingham on this.

For AMD64, CEs (corrected errors) should be polled. The MCE code does that, but on a long cycle. EDAC polls more often (once a second). The risk here is that EDAC misses a CE and it is reported by mcelog instead. Since CEs are usually reported many times, this isn't a big concern. For UEs (uncorrected errors), the error will be reported as an MCE and the MCE code will handle it.

Using mce=off, as Denis pointed out, should work. There is also the possibility of blacklisting k8_edac in /etc/modprobe.d/ and using mce=off to disable the MCE code completely.

This seems to be related to some trouble with the BMC agent running on the machine. There are two possible ways to get those hardware errors.

The first is via MCE. I suspect that the BMC is relying on it. By design, MCE detects and polls errors every 5 minutes, returning only the last 32 errors. If the machine gets more than 32 errors, some errors will be lost.

The second is to get memory errors via EDAC. In this case, the errors are polled every second, reducing the risk of losing errors. When the EDAC module is enabled, the error may not be seen via MCE.

If the customer is interested only in memory errors, they can disable MCE, as suggested, and rely on EDAC. If, on the other hand, they get fewer than 32 errors per 5-minute interval, and a 5-minute poll is enough for them, they can blacklist k8_edac and rely only on mcelog.

If none of the above is enough, then the proper solution is to request that BMC add support for EDAC to their event log agents. As far as I know, they already have the capability of parsing the syslog, so it seems to be only a matter of parsing the EDAC messages in their products.

In any case, I don't see anything that we could do in edac-utils to solve the customer's issues. So, I'm closing this BZ as NOTABUG.
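The trade-off between the two polling paths described in the comments (the last 32 errors returned on a 5-minute MCE poll) can be illustrated with simple arithmetic. The following is a rough sketch only, not kernel code: the function name is hypothetical, and it assumes, as a simplification of the comment, that at most 32 errors survive each poll interval and that errors arrive at a steady rate.

```python
# Rough model of the MCE path as described in the comments:
# errors are polled every 5 minutes and only the last 32 are returned.
MCE_BUFFER_ENTRIES = 32       # per the comment; treated here as a hard cap
MCE_POLL_SECONDS = 5 * 60     # 5-minute poll interval, per the comment


def lost_per_interval(errors_per_second: float) -> int:
    """Estimate how many errors are dropped in one MCE poll interval.

    Hypothetical helper: assumes a constant error rate and that anything
    beyond MCE_BUFFER_ENTRIES within one interval is lost.
    """
    total = int(round(errors_per_second * MCE_POLL_SECONDS))
    return max(0, total - MCE_BUFFER_ENTRIES)


if __name__ == "__main__":
    for rate in (0.05, 0.1, 1.0):
        print(f"{rate} errors/s -> {lost_per_interval(rate)} lost per interval")
```

Under this model, a machine that stays below 32 errors per 5-minute window loses nothing on the mcelog path, which matches the condition given above for relying on mcelog alone.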
Description of problem:
Our customer has found some problems with BMC event logging while running Red Hat OS. When we tried to create some single memory ECC errors, we found that the BMC does not log these events, but the EDAC module does get them. It looks like the EDAC module impacts the system BIOS/BMC event logging function. The EDAC module will also mask the system SMI, get the error message, and clear the system error status. We don't think this is OK.

;------------
CPU: AMD Quad-Core 2389/ Dual-Core
OS: RHEL5.2 32bit/64bit (kernel: 2.6.18-53.el5xen)
RHEL4.6 32bit/64bit (Kernel: 2.6.9-67.ELsmp)

But we found that the ECC errors are logged by Linux itself.

[root@nsgsh-dhcp-163 log]# more /var/log/messages | grep edac
Apr 9 04:53:00 nsgsh-dhcp-163 kernel: MC: drivers/edac/edac_mc.c version MC $Revision: 1.3 $
Apr 9 04:53:00 nsgsh-dhcp-163 kernel: MC0: Giving out device to k8_edac Athlon64/Opteron: PCI 0000:00:18.2 (0000:00:18.2)
Apr 9 04:53:00 nsgsh-dhcp-163 kernel: MC1: Giving out device to k8_edac Athlon64/Opteron: PCI 0000:00:19.2 (0000:00:19.2)
Apr 9 04:58:01 nsgsh-dhcp-163 kernel: MC1: CE page 0x7dc0c, offset 0x278, grain 8, syndrome 0x63e1, row 6, channel 1, label "": k8_edac
Apr 9 04:58:02 nsgsh-dhcp-163 kernel: MC1: CE page 0x37e92, offset 0xa18, grain 8, syndrome 0x63e1, row 6, channel 1, label "": k8_edac
Apr 9 04:58:02 nsgsh-dhcp-163 kernel: MC1: CE - no information available: k8_edac Error Overflow set
Apr 9 04:58:03 nsgsh-dhcp-163 kernel: MC1: CE page 0x66e08, offset 0x280, grain 8, syndrome 0x63e1, row 6, channel 1, label "": k8_edac
Apr 9 04:58:03 nsgsh-dhcp-163 kernel: MC1: CE - no information available: k8_edac Error Overflow set

[root@nsgsh-dhcp-163 log]# dmesg | grep edac
MC: drivers/edac/edac_mc.c version MC $Revision: 1.3 $
MC0: Giving out device to k8_edac Athlon64/Opteron: PCI 0000:00:18.2 (0000:00:18.2)
MC1: Giving out device to k8_edac Athlon64/Opteron: PCI 0000:00:19.2 (0000:00:19.2)
MC1: CE page 0x7dc0c, offset 0x278, grain 8, syndrome 0x63e1, row 6, channel 1, label "": k8_edac
MC1: CE page 0x37e92, offset 0xa18, grain 8, syndrome 0x63e1, row 6, channel 1, label "": k8_edac
MC1: CE - no information available: k8_edac Error Overflow set
MC1: CE page 0x66e08, offset 0x280, grain 8, syndrome 0x63e1, row 6, channel 1, label "": k8_edac
MC1: CE - no information available: k8_edac Error Overflow set
MC1: CE page 0x37ffb, offset 0x118, grain 8, syndrome 0x63e1, row 6, channel 1, label "": k8_edac

Version-Release number of selected component (if applicable):

How reproducible:
Boot to the OS, create the ECC error, and check the BMC event log/EDAC event log.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
The EDAC module should not intercept any HW errors before the BIOS or BMC takes them over.

Additional info:
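The closing comment on this bug suggests that an event log agent could pick up these errors by parsing the EDAC messages in the syslog. A minimal sketch of that idea follows, assuming the exact k8_edac CE line layout quoted in this report; the regular expression and the function name are illustrative and are not part of edac-utils or any BMC agent. Other kernel versions may format these messages differently.

```python
import re
from collections import Counter

# Pattern modelled on the CE lines quoted in this report (assumption:
# the layout is stable for this kernel; other kernels may differ).
CE_LINE = re.compile(
    r"MC(?P<mc>\d+): CE page 0x(?P<page>[0-9a-f]+), "
    r"offset 0x(?P<offset>[0-9a-f]+), grain (?P<grain>\d+), "
    r"syndrome 0x(?P<syndrome>[0-9a-f]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)


def count_corrected_errors(lines):
    """Count detailed CE reports per (memory controller, row, channel).

    Hypothetical helper: lines without location details, such as
    'CE - no information available', are skipped.
    """
    counts = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match:
            key = (int(match["mc"]), int(match["row"]), int(match["channel"]))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = [
        'MC1: CE page 0x7dc0c, offset 0x278, grain 8, syndrome 0x63e1, '
        'row 6, channel 1, label "": k8_edac',
        "MC1: CE - no information available: k8_edac Error Overflow set",
    ]
    for (mc, row, channel), n in count_corrected_errors(sample).items():
        print(f"MC{mc} row {row} channel {channel}: {n} corrected error(s)")
```

In practice the input would be the lines of /var/log/messages or the output of dmesg, filtered the same way as in the commands above.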