Problem Description: The ibm-prtmd service does not come up on an LS21 with the 2.6.31-rc5.2.el5rt kernel.
[root@elm9m99 ~]# /etc/init.d/ibm-prtm status
ibm-prtmd is stopped
[root@elm9m99 ~]# /etc/init.d/ibm-prtm start
IBM Real-Time HW Daemon: BIOS Real-Time module loaded.
Starting ibm-prtmd: grep: /sys/devices/system/edac/mc/mc0/mc_name: No such file or directory
grep: /sys/devices/system/edac/mc/mc0/mc_name: No such file or directory
grep: /sys/devices/system/edac/mc/mc0/mc_name: No such file or directory
No valid EDAC class found for this machine
IBM Real-Time HW Daemon: An error has occurred! [FAILED]
IBM Real-Time HW Daemon: Your system may experience System Management Interrupts.
IBM Real-Time HW Daemon: Please check your system configuration.
It seems that the required amd64_edac module is not installed in /lib/modules/2.6.31-rc5.2.el5rt/kernel/drivers/edac/.
[root@elm9m99 ~]# find /lib/modules/2.6.31-rc5.2.el5rt -name *edac*
/lib/modules/2.6.31-rc5.2.el5rt/kernel/drivers/edac
/lib/modules/2.6.31-rc5.2.el5rt/kernel/drivers/edac/e752x_edac.ko
/lib/modules/2.6.31-rc5.2.el5rt/kernel/drivers/edac/edac_core.ko
/lib/modules/2.6.31-rc5.2.el5rt/kernel/drivers/edac/i3000_edac.ko
/lib/modules/2.6.31-rc5.2.el5rt/kernel/drivers/edac/i5000_edac.ko
/lib/modules/2.6.31-rc5.2.el5rt/kernel/drivers/edac/i5100_edac.ko
/lib/modules/2.6.31-rc5.2.el5rt/kernel/drivers/edac/i5400_edac.ko
/lib/modules/2.6.31-rc5.2.el5rt/kernel/drivers/edac/i82975x_edac.ko
/lib/modules/2.6.31-rc5.2.el5rt/kernel/drivers/edac/x38_edac.ko
Hardware affected: LS21
------- Comment From dvhltc.com 2009-08-14 00:28 EDT------- Please always keep Keith CC'd on EDAC and SMI related bugs. Keith, please confirm that you agree with the direction here.
------- Comment From sripathik.com 2009-08-26 04:26 EDT------- I checked the config files in 2.6.31-rc6.rt6 and I still don't see CONFIG_EDAC_AMD64 set:
grep CONFIG_EDAC_AMD64 *
kernel-2.6.31-x86_64-rt.config:# CONFIG_EDAC_AMD64 is not set
kernel-2.6.31-x86_64-rtdebug.config:# CONFIG_EDAC_AMD64 is not set
kernel-2.6.31-x86_64-rttrace.config:# CONFIG_EDAC_AMD64 is not set
------- Comment From gowrishankar.m.com 2009-08-26 05:03 EDT------- (In reply to comment #8)
> I checked config files in 2.6.31-rc6.rt6 and I still don't see
> CONFIG_EDAC_AMD64 set:
I see it is CONFIG_EDAC_AMD64_OPTERON in the 2.6.29.5-26.el5rt kernel config.
> grep CONFIG_EDAC_AMD64 *
> kernel-2.6.31-x86_64-rt.config:# CONFIG_EDAC_AMD64 is not set
> kernel-2.6.31-x86_64-rtdebug.config:# CONFIG_EDAC_AMD64 is not set
> kernel-2.6.31-x86_64-rttrace.config:# CONFIG_EDAC_AMD64 is not set
------- Comment From sripathik.com 2009-08-26 07:40 EDT------- 2.6.29.5-26.el5rt has CONFIG_EDAC_AMD64_OPTERON=m and CONFIG_EDAC_K8=m. These two config options don't seem to be present in the 2.6.31-rc5.3.el5rt kernel configuration.
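The config comparison above can be scripted. A minimal sketch, assuming the usual kernel config file format (`OPTION=value` or `# OPTION is not set`); `config_state` is a hypothetical helper name, not part of any package mentioned in this report:

```shell
#!/bin/sh
# Hypothetical helper: report how a kernel config option is set in a file.
# Prints the value (e.g. "m" or "y"), "not set", or "absent".
# Usage: config_state OPTION FILE
config_state() {
    opt=$1
    file=$2
    if grep -q "^${opt}=" "$file"; then
        # Option is enabled: print whatever follows the '='.
        sed -n "s/^${opt}=//p" "$file" | head -n 1
    elif grep -q "^# ${opt} is not set\$" "$file"; then
        echo "not set"
    else
        # The option name does not appear in this config at all.
        echo "absent"
    fi
}

# Example (path is an assumption; adjust to the installed kernel):
#   config_state CONFIG_EDAC_AMD64 "/boot/config-$(uname -r)"
```

Distinguishing "not set" from "absent" matters here, because an option renamed between kernel versions shows up as absent rather than disabled.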
------- Comment From kmannth.com 2009-08-26 11:07 EDT------- We do not need k8_edac set. Sripathi, can you rebuild with CONFIG_EDAC_AMD64=m and see how the driver works?
------- Comment From sripathik.com 2009-08-26 11:49 EDT------- (In reply to comment #11)
> We do not need k8_edac set.
>
> Sripathi can you rebuild with CONFIG_EDAC_AMD64=m and see how the driver works?
Nope, that doesn't fix it. I still get the same errors.
------- Comment From kmannth.com 2009-08-26 12:13 EDT------- OK, the driver is trying to load. From dmesg:
EDAC amd64_edac: Ver: 3.2.0 Aug 26 2009
EDAC amd64: ECC is enabled by BIOS, Proceeding with EDAC module initialization
amd64_edac: probe of 0000:00:18.2 failed with error -22
EDAC amd64: ECC is enabled by BIOS, Proceeding with EDAC module initialization
amd64_edac: probe of 0000:00:19.2 failed with error -22
I will look into what error -22 is.
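For reference, errno 22 on Linux is EINVAL ("Invalid argument"), so each probe is returning -EINVAL. A small sketch for pulling failed probes out of a kernel log; `failed_probes` is a hypothetical helper name, and the pattern assumes the message format shown in the dmesg output above:

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print
# "<pci-device> <error-code>" for every failed amd64_edac probe.
# Any prefix before "amd64_edac:" (such as a timestamp) is ignored.
failed_probes() {
    sed -n 's/^.*amd64_edac: probe of \([^ ]*\) failed with error \(-\{0,1\}[0-9][0-9]*\).*$/\1 \2/p'
}

# Example:
#   dmesg | failed_probes
```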
------- Comment From kmannth.com 2009-09-15 18:53 EDT------- Update: With linux-rt-2.6.31-rc6.rt6.11 and CONFIG_EDAC_AMD64=m enabled in the .config, the driver is working (I am still doing some testing). The module is now named amd64_edac_mod, as in mainline. User space changes are needed; I am working on that.
"
EDAC MC: Ver: 2.1.0 Sep 15 2009
EDAC amd64_edac: Ver: 3.2.0 Sep 15 2009
EDAC amd64: ECC is enabled by BIOS, Proceeding with EDAC module initialization
EDAC amd64: ECC is enabled by BIOS, Proceeding with EDAC module initialization
EDAC MC: Rev F or later detected
EDAC MC0: Giving out device to 'amd64_edac' 'RevF': DEV 0000:00:18.2
EDAC MC: Rev F or later detected
EDAC MC1: Giving out device to 'amd64_edac' 'RevF': DEV 0000:00:19.2
EDAC PCI0: Giving out device to module 'amd64_edac' controller 'EDAC PCI controller': DEV '0000:00:18.2' (POLLED)
"
------- Comment From kmannth.com 2009-09-15 20:35 EDT------- Using debug DIMMs and triggering errors:
EDAC amd64 MC1: CE ERROR_ADDRESS= 0x233a62dc0
EDAC amd64 MC1: failed to map error address 0x233a62dc0 to a node
EDAC MC1: CE - no information available: amd64_edac
EDAC MC1: CE - no information available: amd64_edacError Overflow set
EDAC amd64 MC1: ExtErr=(0x8) F10-ECC/K8-Chipkill error
EDAC amd64 MC1: BUS ERROR: time-out(no timeout) mem or i/o(mem access) participating processor(local node originated (SRC)) memory transaction type(generic read) cache level(L3/generic) Error Found by: Normal Operation
EDAC amd64 MC1: CE ERROR_ADDRESS= 0x32d94d780
EDAC amd64 MC1: failed to map error address 0x32d94d780 to a node
EDAC MC1: CE - no information available: amd64_edac
EDAC MC1: CE - no information available: amd64_edacError Overflow set
EDAC amd64 MC1: ExtErr=(0x8) F10-ECC/K8-Chipkill error
EDAC amd64 MC1: BUS ERROR: time-out(no timeout) mem or i/o(mem access) participating processor(local node originated (SRC)) memory transaction type(generic read) cache level(L3/generic) Error Found by: Normal Operation
EDAC sees the errors but does not map them to a given csrow (DIMM). I am looking into this. I see the same behavior with mainline.
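To see how often the mapping fails while testing, the unmapped addresses can be filtered from the kernel log. A minimal sketch, assuming the "failed to map error address" message format shown above; `unmapped_errors` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print
# "<controller> <address>" for each corrected error that the driver
# could not map to a node.
unmapped_errors() {
    sed -n 's/^.*EDAC amd64 \(MC[0-9][0-9]*\): failed to map error address \(0x[0-9a-fA-F]*\) to a node.*$/\1 \2/p'
}

# Example:
#   dmesg | unmapped_errors | sort | uniq -c
```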
------- Comment From kmannth.com 2009-09-17 22:13 EDT------- OK, I fixed this issue in mainline today. I would like to wait for review and/or acceptance of the 2 patches I just sent to mainline and the amd64_edac maintainers. I have identified two small changes that are needed (one of them is non-trivial). Once the maintainer takes the patches, I will post them here.
Created attachment 364076: trivial fix to decode edac errors into csrows
------- Comment on attachment From gowrishankar.m.com 2009-10-08 06:08 EDT------- This patch maps EDAC errors into csrows. Tested with an ECC debug DIMM on a RevF CPU based system.
Created attachment 364077: map errors on channel 1 as well into csrows
------- Comment on attachment From gowrishankar.m.com 2009-10-08 06:17 EDT------- Patch to map errors on channel 1 into csrows as well. Tested with an ECC debug DIMM on a RevF CPU based system.
------- Comment From gowrishankar.m.com 2009-10-08 06:36 EDT------- Tested the patches (id 48656 and 48658) on top of 2.6.31-rt11.20. Please consider the above patches for the defect. Thanks.
------- Comment From gowrishankar.m.com 2009-10-08 07:23 EDT------- For more info, here is the output with the fix:
[root@llm53 ~]# /etc/init.d/ibm-prtm start
IBM Real-Time HW Daemon: System Management Interrupts have been disabled to
IBM Real-Time HW Daemon: allow this system to run in Real-Time Mode.
[root@llm53 ~]# /etc/init.d/ibm-prtm status
ibm-prtmd (pid 3180) is running...
[root@llm53 ~]# lsmod | grep edac
amd64_edac_mod 26464 0
edac_core 49204 4 amd64_edac_mod
[root@llm53 ~]# cat /sys/devices/system/edac/mc/mc0/mc_name
RevF
[root@llm53 ~]# cat /sys/devices/system/edac/mc/mc1/mc_name
RevF
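The mc_name check above can be repeated for every memory controller in one pass. A minimal sketch, assuming the sysfs layout seen in this report (/sys/devices/system/edac/mc/mcN/mc_name); `list_mc_names` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: list EDAC memory controllers and their mc_name values.
# The base directory defaults to the sysfs path seen in this report and can
# be overridden with the first argument.
# Returns 0 if at least one controller was found, 1 otherwise.
list_mc_names() {
    base=${1:-/sys/devices/system/edac/mc}
    found=1
    for dir in "$base"/mc[0-9]*; do
        # Skip the unexpanded glob and controllers without a readable mc_name.
        [ -r "$dir/mc_name" ] || continue
        printf '%s: %s\n' "$(basename "$dir")" "$(cat "$dir/mc_name")"
        found=0
    done
    return "$found"
}

# Example:
#   list_mc_names || echo "no EDAC memory controllers registered"
```

A non-zero return corresponds to the "No such file or directory" failure in the original problem description, where no mc_name file existed.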
For patches in attachments 48656 and 48658: Signed-off-by: Darren Hart <dvhltc.com>
Created attachment 364180: patch-amd-csfix.patch
This patch is the same as 364077, but it has the header with the proper Signed-off-by.
------- Comment From sripathik.com 2009-10-16 12:32 EDT------- We verified that the two patches are present in kernel-rt-2.6.31.2-rt13.23.el5rt.src.rpm. However, the config parameter CONFIG_EDAC_AMD64 is not yet enabled.
[root@llm53 boot]# grep CONFIG_EDAC_AMD64 config-2.6.31.2-rt13.23.el5rt
# CONFIG_EDAC_AMD64 is not set
[root@llm53 boot]# dmidecode | grep LS2
Product Name: BladeCenter LS21 -[79716AA]-
------- Comment From gowrishankar.m.com 2009-10-27 08:13 EDT------- Verified with 2.6.31.4-rt14.27.el5rt that the ibm-prtmd service comes up after installing the kernel.
[root@elm9m89 ~]# /etc/init.d/ibm-prtm status
ibm-prtmd (pid 3787) is running...
[root@elm9m89 ~]# lsmod | grep edac
amd64_edac_mod 24768 0
edac_core 49172 4 amd64_edac_mod
[root@elm9m89 ~]# grep CONFIG_EDAC_AMD64 /boot/config-2.6.31.4-rt14.27.el5rt
CONFIG_EDAC_AMD64=m
# CONFIG_EDAC_AMD64_ERROR_INJECTION is not set
[root@elm9m89 ~]# uname -a
Linux elm9m89 2.6.31.4-rt14.27.el5rt #1 SMP PREEMPT RT Thu Oct 22 16:45:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
[root@elm9m89 ~]# dmidecode | grep LS2
Product Name: BladeCenter LS21 -[7971AC1]-
------- Comment From vernux.com 2010-03-29 18:39 EDT------- I have verified that the amd64_edac_mod driver is in the 2.6.33.1-rt11.9.el5rt kernel as well. The driver loads, but the current version of ibm-prtmd is too old. We need to get ibm-prtm updated to version 1.6 (from sourceforge). That said, I have not yet tested the DIMM mapping on the 2.6.33-rt kernel on the LS21. It looks like the memory topology gets reported differently: only one memory controller instead of two. Either way, I think this bug can be closed, since the driver is there and that's what this bug is about.
This fix is upstream in at least 2.6.33, so closing this.