Bug 517166

Summary: amd64_edac module not available while installing 2.6.31-rc5.2.el5rt in LS21
Product: Red Hat Enterprise MRG Reporter: IBM Bug Proxy <bugproxy>
Component: realtime-kernel Assignee: John Kacur <jkacur>
Status: CLOSED CURRENTRELEASE QA Contact: David Sommerseth <davids>
Severity: high Docs Contact:
Priority: low    
Version: 1.2 CC: bhu, dvhltc, jkacur, lgoncalv, ovasik
Target Milestone: 1.3   
Target Release: ---   
Hardware: x86_64   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-09-09 14:33:23 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description                                      Flags
trivial fix to decode edac errors into csrows    none
map errors on channel 1 as well into csrows      none
patch-amd-csfix.patch                            none

Description IBM Bug Proxy 2009-08-12 19:30:49 UTC
Problem Description:

The ibm-prtmd service does not come up on the LS21 with the 2.6.31-rc5.2.el5rt kernel.

[root@elm9m99 ~]# /etc/init.d/ibm-prtm status
ibm-prtmd is stopped

[root@elm9m99 ~]# /etc/init.d/ibm-prtm start
IBM Real-Time HW Daemon: BIOS Real-Time module loaded.
Starting ibm-prtmd: grep: /sys/devices/system/edac/mc/mc0/mc_name: No such file or directory
grep: /sys/devices/system/edac/mc/mc0/mc_name: No such file or directory
grep: /sys/devices/system/edac/mc/mc0/mc_name: No such file or directory
No valid EDAC class found for this machine
IBM Real-Time HW Daemon: An error has occurred!            [FAILED]
IBM Real-Time HW Daemon: Your system may experience System Management Interrupts.
IBM Real-Time HW Daemon: Please check your system configuration.


It seems that the required amd64_edac module is not installed in
/lib/modules/2.6.31-rc5.2.el5rt/kernel/drivers/edac/.

[root@elm9m99 ~]# find /lib/modules/2.6.31-rc5.2.el5rt -name *edac*
/lib/modules/2.6.31-rc5.2.el5rt/kernel/drivers/edac
/lib/modules/2.6.31-rc5.2.el5rt/kernel/drivers/edac/e752x_edac.ko
/lib/modules/2.6.31-rc5.2.el5rt/kernel/drivers/edac/edac_core.ko
/lib/modules/2.6.31-rc5.2.el5rt/kernel/drivers/edac/i3000_edac.ko
/lib/modules/2.6.31-rc5.2.el5rt/kernel/drivers/edac/i5000_edac.ko
/lib/modules/2.6.31-rc5.2.el5rt/kernel/drivers/edac/i5100_edac.ko
/lib/modules/2.6.31-rc5.2.el5rt/kernel/drivers/edac/i5400_edac.ko
/lib/modules/2.6.31-rc5.2.el5rt/kernel/drivers/edac/i82975x_edac.ko
/lib/modules/2.6.31-rc5.2.el5rt/kernel/drivers/edac/x38_edac.ko
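
The same check can be scripted. This is a minimal sketch, assuming the module tree layout shown in the listing above; the helper names `list_edac_modules` and `has_amd64_edac` are made up for illustration and are not part of ibm-prtm:

```python
from pathlib import Path


def list_edac_modules(modules_root):
    """Return the sorted names of EDAC modules found under modules_root.

    modules_root is a kernel module tree such as
    /lib/modules/2.6.31-rc5.2.el5rt (hypothetical helper, for illustration).
    """
    edac_dir = Path(modules_root) / "kernel" / "drivers" / "edac"
    if not edac_dir.is_dir():
        return []
    # Strip the .ko suffix to get the bare module name.
    return sorted(p.stem for p in edac_dir.glob("*.ko"))


def has_amd64_edac(modules_root):
    """True if any module whose name starts with amd64_edac is present."""
    return any(
        name.startswith("amd64_edac") for name in list_edac_modules(modules_root)
    )
```

For the tree listed above, `has_amd64_edac("/lib/modules/2.6.31-rc5.2.el5rt")` would be expected to return False, since only the Intel chipset drivers and edac_core are present.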


Hardware affected:
LS21

Comment 1 IBM Bug Proxy 2009-08-14 04:30:27 UTC
------- Comment From dvhltc.com 2009-08-14 00:28 EDT-------
Please always keep Keith CC'd on EDAC- and SMI-related bugs. Keith, please ensure you agree with the direction here.

Comment 2 IBM Bug Proxy 2009-08-26 08:30:27 UTC
------- Comment From sripathik.com 2009-08-26 04:26 EDT-------
I checked config files in 2.6.31-rc6.rt6 and I still don't see CONFIG_EDAC_AMD64 set:
grep CONFIG_EDAC_AMD64 *
kernel-2.6.31-x86_64-rt.config:# CONFIG_EDAC_AMD64 is not set
kernel-2.6.31-x86_64-rtdebug.config:# CONFIG_EDAC_AMD64 is not set
kernel-2.6.31-x86_64-rttrace.config:# CONFIG_EDAC_AMD64 is not set

Comment 3 IBM Bug Proxy 2009-08-26 09:11:27 UTC
------- Comment From gowrishankar.m.com 2009-08-26 05:03 EDT-------
(In reply to comment #8)
> I checked config files in 2.6.31-rc6.rt6 and I still don't see
> CONFIG_EDAC_AMD64 set:

I see it is CONFIG_EDAC_AMD64_OPTERON in the 2.6.29.5-26.el5rt kernel config.

> grep CONFIG_EDAC_AMD64 *
> kernel-2.6.31-x86_64-rt.config:# CONFIG_EDAC_AMD64 is not set
> kernel-2.6.31-x86_64-rtdebug.config:# CONFIG_EDAC_AMD64 is not set
> kernel-2.6.31-x86_64-rttrace.config:# CONFIG_EDAC_AMD64 is not set
>

Comment 4 IBM Bug Proxy 2009-08-26 11:40:40 UTC
------- Comment From sripathik.com 2009-08-26 07:40 EDT-------
2.6.29.5-26.el5rt has
CONFIG_EDAC_AMD64_OPTERON=m
and
CONFIG_EDAC_K8=m

These two config options don't seem to be present in the 2.6.31-rc5.3.el5rt kernel configuration.
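
A prefix grep cannot tell an option that is explicitly disabled from one that is absent from the config file altogether. The following is a minimal sketch of an exact-match lookup, assuming the usual kernel config line formats (`OPTION=value` and `# OPTION is not set`); the helper name `config_option_state` is hypothetical:

```python
import re


def config_option_state(config_text, option):
    """Return the option's value ('y', 'm', ...), 'not set', or None if absent."""
    set_re = re.compile(r"^%s=(.*)$" % re.escape(option))
    unset_line = "# %s is not set" % option
    for line in config_text.splitlines():
        line = line.strip()
        match = set_re.match(line)
        if match:
            return match.group(1)
        if line == unset_line:
            return "not set"
    # The option name does not appear in this config at all.
    return None
```

Run against the two kernel configs, a result of None for an option in the newer config indicates it is missing there, whereas "not set" means it exists but is disabled.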

Comment 5 IBM Bug Proxy 2009-08-26 15:11:06 UTC
------- Comment From kmannth.com 2009-08-26 11:07 EDT-------
We do not need k8_edac set.

Sripathi can you rebuild with CONFIG_EDAC_AMD64=m and see how the driver works?

Comment 6 IBM Bug Proxy 2009-08-26 15:51:47 UTC
------- Comment From sripathik.com 2009-08-26 11:49 EDT-------
(In reply to comment #11)
> We do not need k8_edac set.
>
> Sripathi can you rebuild with CONFIG_EDAC_AMD64=m and see how the driver works?
>

Nope, that doesn't fix it. I still get the same errors.

Comment 7 IBM Bug Proxy 2009-08-26 16:18:06 UTC
------- Comment From kmannth.com 2009-08-26 12:13 EDT-------
OK, the driver is trying to load. From dmesg:

EDAC amd64_edac:  Ver: 3.2.0 Aug 26 2009
EDAC amd64: ECC is enabled by BIOS, Proceeding with EDAC module initialization
amd64_edac: probe of 0000:00:18.2 failed with error -22
EDAC amd64: ECC is enabled by BIOS, Proceeding with EDAC module initialization
amd64_edac: probe of 0000:00:19.2 failed with error -22

I will look into what error -22 is.
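
Kernel code conventionally returns negated errno values, so -22 corresponds to errno 22, which is EINVAL ("Invalid argument") on Linux. A small sketch for decoding such codes from dmesg output follows; the helper name `describe_kernel_error` is hypothetical, and the decoding only names the error, it does not say why the probe rejected the device:

```python
import errno
import os


def describe_kernel_error(code):
    """Map a negative kernel-style return code (e.g. -22) to (errno name, message)."""
    number = abs(code)
    # errno.errorcode maps numeric values to symbolic names on this platform.
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)
```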

Comment 8 IBM Bug Proxy 2009-09-15 23:00:24 UTC
------- Comment From kmannth.com 2009-09-15 18:53 EDT-------
Update:

Using linux-rt-2.6.31-rc6.rt6.11 and enabling

CONFIG_EDAC_AMD64=m

in the .config, the driver is working. (I am still doing some testing.)

The module is now amd64_edac_mod (this is like mainline).

User space changes are needed.  I am working on that.

"
EDAC MC: Ver: 2.1.0 Sep 15 2009
EDAC amd64_edac:  Ver: 3.2.0 Sep 15 2009
EDAC amd64: ECC is enabled by BIOS, Proceeding with EDAC module initialization
EDAC amd64: ECC is enabled by BIOS, Proceeding with EDAC module initialization
EDAC MC: Rev F or later detected
EDAC MC0: Giving out device to 'amd64_edac' 'RevF': DEV 0000:00:18.2
EDAC MC: Rev F or later detected
EDAC MC1: Giving out device to 'amd64_edac' 'RevF': DEV 0000:00:19.2
EDAC PCI0: Giving out device to module 'amd64_edac' controller 'EDAC PCI controller': DEV '0000:00:18.2' (POLLED)
"

Comment 9 IBM Bug Proxy 2009-09-16 00:40:23 UTC
------- Comment From kmannth.com 2009-09-15 20:35 EDT-------
Using debug DIMMs and triggering errors:

EDAC amd64 MC1: CE ERROR_ADDRESS= 0x233a62dc0
EDAC amd64 MC1: failed to map error address 0x233a62dc0 to a node
EDAC MC1: CE - no information available: amd64_edac
EDAC MC1: CE - no information available: amd64_edacError Overflow set
EDAC amd64 MC1: ExtErr=(0x8) F10-ECC/K8-Chipkill error
EDAC amd64 MC1: BUS ERROR:
time-out(no timeout) mem or i/o(mem access)
participating processor(local node originated (SRC))
memory transaction type(generic read)
cache level(L3/generic) Error Found by: Normal Operation
EDAC amd64 MC1: CE ERROR_ADDRESS= 0x32d94d780
EDAC amd64 MC1: failed to map error address 0x32d94d780 to a node
EDAC MC1: CE - no information available: amd64_edac
EDAC MC1: CE - no information available: amd64_edacError Overflow set
EDAC amd64 MC1: ExtErr=(0x8) F10-ECC/K8-Chipkill error
EDAC amd64 MC1: BUS ERROR:
time-out(no timeout) mem or i/o(mem access)
participating processor(local node originated (SRC))
memory transaction type(generic read)
cache level(L3/generic) Error Found by: Normal Operation

EDAC sees the errors but does not map them to a given csrow (DIMM). I am looking into this. I see the same behavior with mainline.
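
To quantify how many corrected errors fail to map, a log like the one above can be summarized with a small parser. This is a minimal sketch, assuming the two message formats exactly as they appear in this dmesg excerpt (other kernel versions may word them differently); `summarize_ce_errors` is a hypothetical helper name:

```python
import re

CE_ADDRESS_RE = re.compile(
    r"EDAC amd64 (MC\d+): CE ERROR_ADDRESS= (0x[0-9a-fA-F]+)"
)
UNMAPPED_RE = re.compile(
    r"EDAC amd64 (MC\d+): failed to map error address (0x[0-9a-fA-F]+)"
)


def summarize_ce_errors(log_text):
    """Return {controller: {"reported": [...], "unmapped": [...]}}, addresses as ints."""
    summary = {}
    for line in log_text.splitlines():
        for regex, key in ((CE_ADDRESS_RE, "reported"), (UNMAPPED_RE, "unmapped")):
            match = regex.search(line)
            if match:
                entry = summary.setdefault(
                    match.group(1), {"reported": [], "unmapped": []}
                )
                entry[key].append(int(match.group(2), 16))
    return summary
```

If the "unmapped" list for a controller matches its "reported" list, none of the corrected errors on that controller were resolved to a csrow.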

Comment 10 IBM Bug Proxy 2009-09-18 02:30:39 UTC
------- Comment From kmannth.com 2009-09-17 22:13 EDT-------
OK, I have fixed this issue in mainline today. I would like to wait for review and/or acceptance of the two patches I just sent out to mainline and the amd64_edac maintainers.

I have identified two small changes that are needed (one of them is non-trivial). Once the maintainer takes the patches I will post them here.

Comment 11 IBM Bug Proxy 2009-10-08 10:12:34 UTC
Created attachment 364076 [details]
trivial fix to decode edac errors into csrows


------- Comment on attachment From gowrishankar.m.com 2009-10-08 06:08 EDT-------


This patch makes EDAC errors map into csrows.
Tested with an ECC debug DIMM on a RevF CPU-based system.

Comment 12 IBM Bug Proxy 2009-10-08 10:22:01 UTC
Created attachment 364077 [details]
map errors on channel 1 as well into csrows


------- Comment on attachment From gowrishankar.m.com 2009-10-08 06:17 EDT-------


Patch to map errors on channel 1 into csrows as well.
Tested with an ECC debug DIMM on a RevF CPU-based system.

Comment 13 IBM Bug Proxy 2009-10-08 10:40:31 UTC
------- Comment From gowrishankar.m.com 2009-10-08 06:36 EDT-------
Tested the patches (id 48656 and 48658) on top of 2.6.31-rt11.20.

Please consider the above patches for the defect.

Thanks.

Comment 14 IBM Bug Proxy 2009-10-08 11:31:28 UTC
------- Comment From gowrishankar.m.com 2009-10-08 07:23 EDT-------
For more info on output with fix:

[root@llm53 ~]# /etc/init.d/ibm-prtm start
IBM Real-Time HW Daemon: System Management Interrupts have been disabled to
IBM Real-Time HW Daemon: allow this system to run in Real-Time Mode.

[root@llm53 ~]# /etc/init.d/ibm-prtm status
ibm-prtmd (pid  3180) is running...

[root@llm53 ~]# lsmod | grep edac
amd64_edac_mod         26464  0
edac_core              49204  4 amd64_edac_mod

[root@llm53 ~]# cat /sys/devices/system/edac/mc/mc0/mc_name
RevF

[root@llm53 ~]# cat /sys/devices/system/edac/mc/mc1/mc_name
RevF
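
The mc_name checks above can be gathered in one pass. This is a minimal sketch, assuming the sysfs layout shown in this comment (one mcN directory per memory controller, each with an mc_name file); `read_mc_names` is a hypothetical helper, and how ibm-prtm itself validates the EDAC class is not shown in this report beyond its grep on mc_name:

```python
from pathlib import Path


def read_mc_names(edac_mc_root="/sys/devices/system/edac/mc"):
    """Return {"mc0": "RevF", ...} for every controller that exposes mc_name."""
    names = {}
    root = Path(edac_mc_root)
    if not root.is_dir():
        return names
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        name_file = mc_dir / "mc_name"
        if name_file.is_file():
            names[mc_dir.name] = name_file.read_text().strip()
    return names
```

An empty result corresponds to the "No such file or directory" failure seen in the original report, where no EDAC driver had registered a memory controller.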

Comment 15 Darren Hart 2009-10-08 20:39:34 UTC
For patches in attachments 48656 and 48658:

Signed-off-by: Darren Hart <dvhltc.com>

Comment 16 John Kacur 2009-10-08 20:49:00 UTC
Created attachment 364180 [details]
patch-amd-csfix.patch

This patch is the same as 364077, but it has the header with the proper
Signed-off-by.

Comment 17 IBM Bug Proxy 2009-10-16 16:41:02 UTC
------- Comment From sripathik.com 2009-10-16 12:32 EDT-------
We verified that the two patches are present in kernel-rt-2.6.31.2-rt13.23.el5rt.src.rpm. However, the config parameter CONFIG_EDAC_AMD64 is not yet enabled.

[root@llm53 boot]# grep CONFIG_EDAC_AMD64 config-2.6.31.2-rt13.23.el5rt
# CONFIG_EDAC_AMD64 is not set

[root@llm53 boot]# dmidecode | grep LS2
Product Name: BladeCenter LS21 -[79716AA]-

Comment 18 IBM Bug Proxy 2009-10-27 12:21:40 UTC
------- Comment From gowrishankar.m.com 2009-10-27 08:13 EDT-------
Verified with 2.6.31.4-rt14.27.el5rt that the ibm-prtmd service comes
up after installing the kernel.

[root@elm9m89 ~]# /etc/init.d/ibm-prtm status
ibm-prtmd (pid  3787) is running...

[root@elm9m89 ~]# lsmod | grep edac
amd64_edac_mod         24768  0
edac_core              49172  4 amd64_edac_mod

[root@elm9m89 ~]# grep CONFIG_EDAC_AMD64 /boot/config-2.6.31.4-rt14.27.el5rt
CONFIG_EDAC_AMD64=m
# CONFIG_EDAC_AMD64_ERROR_INJECTION is not set

[root@elm9m89 ~]# uname -a
Linux elm9m89 2.6.31.4-rt14.27.el5rt #1 SMP PREEMPT RT Thu Oct 22 16:45:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

[root@elm9m89 ~]# dmidecode | grep LS2
Product Name: BladeCenter LS21 -[7971AC1]-

Comment 19 IBM Bug Proxy 2010-03-29 22:40:43 UTC
------- Comment From vernux.com 2010-03-29 18:39 EDT-------
I have verified that the amd64_edac_mod driver is in the 2.6.33.1-rt11.9.el5rt kernel as well.  The driver loads, but the current version of ibm-prtmd is too old.  We need to get ibm-prtm updated to version 1.6 (from sourceforge).

That said, I have not yet tested the DIMM mapping on the 2.6.33-rt kernel on the LS21. It looks like the memory topology gets reported differently -- only one memory controller instead of two. Either way, I think this bug can be closed, since the driver is there and that's what this bug is about.

Comment 20 John Kacur 2010-09-09 14:33:23 UTC
This fix is upstream in at least 2.6.33, so closing this.