Bug 517166
Summary: | amd64_edac module not available while installing 2.6.31-rc5.2.el5rt in LS21
---|---
Product: | Red Hat Enterprise MRG
Reporter: | IBM Bug Proxy <bugproxy>
Component: | realtime-kernel
Assignee: | John Kacur <jkacur>
Status: | CLOSED CURRENTRELEASE
QA Contact: | David Sommerseth <davids>
Severity: | high
Priority: | low
Version: | 1.2
CC: | bhu, dvhltc, jkacur, lgoncalv, ovasik
Target Milestone: | 1.3
Hardware: | x86_64
OS: | All
Doc Type: | Bug Fix
Last Closed: | 2010-09-09 14:33:23 UTC
Description
IBM Bug Proxy
2009-08-12 19:30:49 UTC
------- Comment From dvhltc.com 2009-08-14 00:28 EDT-------
Please always keep Keith CC'd on edac and SMI related bugs. Keith please ensure you agree with the direction here.

------- Comment From sripathik.com 2009-08-26 04:26 EDT-------
I checked config files in 2.6.31-rc6.rt6 and I still don't see CONFIG_EDAC_AMD64 set:

grep CONFIG_EDAC_AMD64 *
kernel-2.6.31-x86_64-rt.config:# CONFIG_EDAC_AMD64 is not set
kernel-2.6.31-x86_64-rtdebug.config:# CONFIG_EDAC_AMD64 is not set
kernel-2.6.31-x86_64-rttrace.config:# CONFIG_EDAC_AMD64 is not set

------- Comment From gowrishankar.m.com 2009-08-26 05:03 EDT-------
(In reply to comment #8)
> I checked config files in 2.6.31-rc6.rt6 and I still don't see
> CONFIG_EDAC_AMD64 set:

I see it is CONFIG_EDAC_AMD64_OPTERON in 2.6.29.5-26.el5rt kernel config.

> grep CONFIG_EDAC_AMD64 *
> kernel-2.6.31-x86_64-rt.config:# CONFIG_EDAC_AMD64 is not set
> kernel-2.6.31-x86_64-rtdebug.config:# CONFIG_EDAC_AMD64 is not set
> kernel-2.6.31-x86_64-rttrace.config:# CONFIG_EDAC_AMD64 is not set
>

------- Comment From sripathik.com 2009-08-26 07:40 EDT-------
2.6.29.5-26.el5rt has CONFIG_EDAC_AMD64_OPTERON=m and CONFIG_EDAC_K8=m
These two config options don't seem to be present in 2.6.31-rc5.3.el5rt kernel configuration.

------- Comment From kmannth.com 2009-08-26 11:07 EDT-------
We do not need k8_edac set.

Sripathi can you rebuild with CONFIG_EDAC_AMD64=m and see how the driver works?

------- Comment From sripathik.com 2009-08-26 11:49 EDT-------
(In reply to comment #11)
> We do not need k8_edac set.
>
> Sripathi can you rebuild with CONFIG_EDAC_AMD64=m and see how the driver works?
>

Nope, that doesn't fix it. I still get the same errors.
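The comments above turn on whether `CONFIG_EDAC_AMD64` is built as a module, explicitly disabled, or missing from the config altogether (as opposed to similarly named options such as `CONFIG_EDAC_AMD64_OPTERON`). A minimal sketch of that check, assuming the usual kernel `.config` text format; the function name `kconfig_state` is hypothetical and not part of any tool mentioned in this bug:

```python
import re

def kconfig_state(config_text: str, option: str = "CONFIG_EDAC_AMD64") -> str:
    """Return the option's value ('m', 'y', ...), 'not set', or 'absent'.

    Matches the option name exactly, so CONFIG_EDAC_AMD64_OPTERON or
    CONFIG_EDAC_AMD64_ERROR_INJECTION are not mistaken for CONFIG_EDAC_AMD64.
    """
    enabled = re.compile(re.escape(option) + r"=(\S+)")
    disabled = f"# {option} is not set"
    for raw in config_text.splitlines():
        line = raw.strip()
        match = enabled.fullmatch(line)
        if match:
            return match.group(1)      # e.g. "m" for a loadable module
        if line == disabled:
            return "not set"           # option known to Kconfig but disabled
    return "absent"                    # option does not appear at all
```

To check an installed kernel, the text could be read from a file such as `/boot/config-2.6.31.4-rt14.27.el5rt` and passed to the function.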
------- Comment From kmannth.com 2009-08-26 12:13 EDT-------
Ok, the driver is trying to load. From dmesg:

EDAC amd64_edac: Ver: 3.2.0 Aug 26 2009
EDAC amd64: ECC is enabled by BIOS, Proceeding with EDAC module initialization
amd64_edac: probe of 0000:00:18.2 failed with error -22
EDAC amd64: ECC is enabled by BIOS, Proceeding with EDAC module initialization
amd64_edac: probe of 0000:00:19.2 failed with error -22

I will look into what error -22 is.

------- Comment From kmannth.com 2009-09-15 18:53 EDT-------
Update: Using linux-rt-2.6.31-rc6.rt6.11 and enabling CONFIG_EDAC_AMD64=m in the .config, the driver is working. (I am still doing some testing.)

The module is now amd64_edac_mod (this is like mainline). User space changes are needed. I am working on that.

"
EDAC MC: Ver: 2.1.0 Sep 15 2009
EDAC amd64_edac: Ver: 3.2.0 Sep 15 2009
EDAC amd64: ECC is enabled by BIOS, Proceeding with EDAC module initialization
EDAC amd64: ECC is enabled by BIOS, Proceeding with EDAC module initialization
EDAC MC: Rev F or later detected
EDAC MC0: Giving out device to 'amd64_edac' 'RevF': DEV 0000:00:18.2
EDAC MC: Rev F or later detected
EDAC MC1: Giving out device to 'amd64_edac' 'RevF': DEV 0000:00:19.2
EDAC PCI0: Giving out device to module 'amd64_edac' controller 'EDAC PCI controller': DEV '0000:00:18.2' (POLLED)
"

------- Comment From kmannth.com 2009-09-15 20:35 EDT-------
Using debug DIMMs and triggering errors:

EDAC amd64 MC1: CE ERROR_ADDRESS= 0x233a62dc0
EDAC amd64 MC1: failed to map error address 0x233a62dc0 to a node
EDAC MC1: CE - no information available: amd64_edac
EDAC MC1: CE - no information available: amd64_edacError Overflow set
EDAC amd64 MC1: ExtErr=(0x8) F10-ECC/K8-Chipkill error
EDAC amd64 MC1: BUS ERROR: time-out(no timeout) mem or i/o(mem access) participating processor(local node originated (SRC)) memory transaction type(generic read) cache level(L3/generic) Error Found by: Normal Operation
EDAC amd64 MC1: CE ERROR_ADDRESS= 0x32d94d780
EDAC amd64 MC1: failed to map error address 0x32d94d780 to a node
EDAC MC1: CE - no information available: amd64_edac
EDAC MC1: CE - no information available: amd64_edacError Overflow set
EDAC amd64 MC1: ExtErr=(0x8) F10-ECC/K8-Chipkill error
EDAC amd64 MC1: BUS ERROR: time-out(no timeout) mem or i/o(mem access) participating processor(local node originated (SRC)) memory transaction type(generic read) cache level(L3/generic) Error Found by: Normal Operation

EDAC sees the errors but will not map them to a given csrow (DIMM). I am looking into this. I see the same behavior with mainline.

------- Comment From kmannth.com 2009-09-17 22:13 EDT-------
Ok, I have fixed this issue in mainline today. I would like to wait for review and/or acceptance of the 2 patches I just sent out to mainline and the maintainers of amd64_edac.

I have identified two small changes that are needed (one of them is non-trivial). Once the maintainer takes the patches I will post them here.
Created attachment 364076 [details]
trivial fix to decode edac errors into csrows
------- Comment on attachment From gowrishankar.m.com 2009-10-08 06:08 EDT-------
This patch maps EDAC errors into csrows.
Tested with an ECC debug DIMM on a RevF CPU based system.
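For readers following the dmesg output quoted earlier in this bug: kernel drivers conventionally return negative errno values, so a probe failure of -22 corresponds to EINVAL (the report does not say which check in the driver produced it). The sketch below, with the hypothetical helper name `summarize_edac_log`, pulls the two symptoms discussed here out of a log: failed probes with their errno names, and corrected-error addresses that could not be mapped to a node. The regular expressions are based only on the message wording quoted in this bug and may not match other kernel versions.

```python
import errno
import re

# Patterns follow the message wording quoted in this bug report (assumption:
# other kernel versions may word these messages differently).
PROBE_RE = re.compile(r"probe of (\S+) failed with error (-?\d+)")
UNMAPPED_RE = re.compile(r"failed to\s+map error address (0x[0-9a-fA-F]+) to a node")

def summarize_edac_log(text):
    """Return (probe_failures, unmapped_addresses) found in dmesg-style text.

    probe_failures: list of (pci_device, errno_name) tuples.
    unmapped_addresses: list of integer error addresses that were not mapped.
    """
    probe_failures = []
    for device, code in PROBE_RE.findall(text):
        number = abs(int(code))                      # kernel reports -errno
        name = errno.errorcode.get(number, "unknown")
        probe_failures.append((device, name))
    unmapped = [int(addr, 16) for addr in UNMAPPED_RE.findall(text)]
    return probe_failures, unmapped
```

Feeding it the output of `dmesg` before and after applying the attached patches would be one way to confirm that unmapped addresses no longer appear.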
Created attachment 364077 [details]
map errors on channel 1 as well into csrows
------- Comment on attachment From gowrishankar.m.com 2009-10-08 06:17 EDT-------
Patch to also map errors on channel 1 into csrows.
Tested with an ECC debug DIMM on a RevF CPU based system.
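Once errors are mapped into csrows, the per-csrow corrected-error counters become the natural place to confirm it. The following is a rough sketch only: it assumes the legacy EDAC sysfs layout of `mc*/csrow*/ce_count` under `/sys/devices/system/edac/mc` (the same tree the `mc_name` files in this bug come from), which may differ between kernel versions, and the function name `csrow_ce_counts` is hypothetical.

```python
from pathlib import Path

def csrow_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Collect corrected-error counts per (memory controller, csrow).

    Assumes files named <edac_root>/mcN/csrowM/ce_count that each contain a
    single integer. Unreadable or malformed entries are skipped.
    """
    counts = {}
    for ce_file in sorted(Path(edac_root).glob("mc*/csrow*/ce_count")):
        csrow = ce_file.parent.name
        controller = ce_file.parent.parent.name
        try:
            counts[(controller, csrow)] = int(ce_file.read_text().strip())
        except (OSError, ValueError):
            continue
    return counts
```

A non-zero count on a specific csrow after triggering errors with a debug DIMM would indicate that the mapping is working; an empty result means the assumed layout is not present on that system.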
------- Comment From gowrishankar.m.com 2009-10-08 06:36 EDT-------
Tested the patches (id 48656 and 48658) on top of 2.6.31-rt11.20. Please consider the above patches for the defect. Thanks.

------- Comment From gowrishankar.m.com 2009-10-08 07:23 EDT-------
For more info on output with fix:

[root@llm53 ~]# /etc/init.d/ibm-prtm start
IBM Real-Time HW Daemon: System Management Interrupts have been disabled to
IBM Real-Time HW Daemon: allow this system to run in Real-Time Mode.
[root@llm53 ~]# /etc/init.d/ibm-prtm status
ibm-prtmd (pid 3180) is running...
[root@llm53 ~]# lsmod | grep edac
amd64_edac_mod 26464 0
edac_core 49204 4 amd64_edac_mod
[root@llm53 ~]# cat /sys/devices/system/edac/mc/mc0/mc_name
RevF
[root@llm53 ~]# cat /sys/devices/system/edac/mc/mc1/mc_name
RevF

For patches in attachments 48656 and 48658:
Signed-off-by: Darren Hart <dvhltc.com>
Created attachment 364180 [details]
patch-amd-csfix.patch
This patch is the same as 364077 but it has the header with the proper Signed-off-by.
------- Comment From sripathik.com 2009-10-16 12:32 EDT-------
We verified that the two patches are present in kernel-rt-2.6.31.2-rt13.23.el5rt.src.rpm. However, config param CONFIG_EDAC_AMD64 is not yet enabled.

[root@llm53 boot]# grep CONFIG_EDAC_AMD64 config-2.6.31.2-rt13.23.el5rt
# CONFIG_EDAC_AMD64 is not set
[root@llm53 boot]# dmidecode | grep LS2
Product Name: BladeCenter LS21 -[79716AA]-

------- Comment From gowrishankar.m.com 2009-10-27 08:13 EDT-------
Verified with 2.6.31.4-rt14.27.el5rt that ibm-prtmd service comes up after installing the kernel.

[root@elm9m89 ~]# /etc/init.d/ibm-prtm status
ibm-prtmd (pid 3787) is running...
[root@elm9m89 ~]# lsmod | grep edac
amd64_edac_mod 24768 0
edac_core 49172 4 amd64_edac_mod
[root@elm9m89 ~]# grep CONFIG_EDAC_AMD64 /boot/config-2.6.31.4-rt14.27.el5rt
CONFIG_EDAC_AMD64=m
# CONFIG_EDAC_AMD64_ERROR_INJECTION is not set
[root@elm9m89 ~]# uname -a
Linux elm9m89 2.6.31.4-rt14.27.el5rt #1 SMP PREEMPT RT Thu Oct 22 16:45:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
[root@elm9m89 ~]# dmidecode | grep LS2
Product Name: BladeCenter LS21 -[7971AC1]-

------- Comment From vernux.com 2010-03-29 18:39 EDT-------
I have verified that the amd64_edac_mod driver is in the 2.6.33.1-rt11.9.el5rt kernel as well. The driver loads, but the current version of ibm-prtmd is too old. We need to get ibm-prtm updated to version 1.6 (from sourceforge).

That said, I have not yet tested the DIMM mapping on the 2.6.33-rt kernel on the LS21. It looks like the memory topology gets reported differently: only one memory controller instead of two.

Either way, I think this bug can be closed, since the driver is there and that's what this bug is about.

This fix is upstream in at least 2.6.33, so closing this.
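The verification steps in the comments above rely on `lsmod | grep edac` showing `amd64_edac_mod` with `edac_core` in use by it. A small sketch of the same check done programmatically, assuming the column layout seen in the quoted `lsmod` output (name, size, use count, optional comma-separated users); the helper name `loaded_modules` is hypothetical:

```python
def loaded_modules(lsmod_text):
    """Map module name -> (size, use_count, users) from lsmod-style text.

    The header line ("Module Size Used by") and any line whose size or
    use-count columns are not integers are ignored.
    """
    modules = {}
    for line in lsmod_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        try:
            size, use_count = int(fields[1]), int(fields[2])
        except ValueError:
            continue  # header or unexpected line
        users = [u for u in fields[3].split(",") if u] if len(fields) > 3 else []
        modules[fields[0]] = (size, use_count, users)
    return modules
```

With the output captured from `lsmod`, checking `"amd64_edac_mod" in loaded_modules(text)` distinguishes the renamed module from the older `amd64_edac` name that the user-space daemon may still expect.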