Bug 458133

Summary: edac_mc reporting errors on Intel 5000 based systems
Product: Red Hat Enterprise Linux 5
Reporter: Brian C. Lane <bcl>
Component: kernel
Assignee: Aristeu Rozanski <arozansk>
Status: CLOSED DUPLICATE
QA Contact: Martin Jenner <mjenner>
Severity: medium
Priority: medium
Version: 5.2
CC: acme, fhirtz, jarod, kernel-mgr, tao
Target Milestone: rc
Target Release: ---
Hardware: i686
OS: Linux
Doc Type: Bug Fix
Last Closed: 2008-11-21 13:31:06 EST
Attachments:
Description    Flags
patch 1/2      none
patch 2/2      none

Description Brian C. Lane 2008-08-06 12:21:05 EDT
Description of problem:
EDAC kernel module is reporting errors:


EDAC i5000 MC0: FATAL ERRORS Found!!! 1st FATAL Err Reg= 0x4
EDAC i5000 MC0: >Tmid Thermal event with intelligent throttling disabled
EDAC MC0: UE row 2, channel-a= 0 channel-b= 1 labels "-": (Branch=0 DRAM-Bank=3 RDWR=Read RAS=156 CAS=0 FATAL Err=0x4)
EDAC i5000 MC0: NON-FATAL ERRORS Found!!! 1st NON-FATAL Err Reg= 0x20000
EDAC i5000:     NORTHBOUND CRC  Error, bits= 0x20000
EDAC i5000 MC0: FATAL ERRORS Found!!! 1st FATAL Err Reg= 0x4
EDAC i5000 MC0: >Tmid Thermal event with intelligent throttling disabled
EDAC MC0: UE row 3, channel-a= 1 channel-b= 2 labels "-": (Branch=0 DRAM-Bank=0 RDWR=Read RAS=1204 CAS=0 FATAL Err=0x4)
EDAC i5000 MC0: NON-FATAL ERRORS Found!!! 1st NON-FATAL Err Reg= 0x20000
EDAC i5000:     NORTHBOUND CRC  Error, bits= 0x20000
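
To help judge how often each kind of event occurs, the summary lines above can be tallied with a short script. This is only a sketch: the regular expression is inferred from the log excerpt in this report, not from the driver source, and the names `SUMMARY_RE` and `summarize_edac` are made up for illustration.

```python
import re
from collections import Counter

# Matches the i5000 summary lines as they appear in this report, e.g.
# "EDAC i5000 MC0: FATAL ERRORS Found!!! 1st FATAL Err Reg= 0x4".
# Assumption: the format is taken from the excerpt above only.
SUMMARY_RE = re.compile(
    r"EDAC i5000 (?P<mc>MC\d+): (?P<kind>NON-FATAL|FATAL) ERRORS Found!!! "
    r"1st (?:NON-FATAL|FATAL) Err Reg= (?P<reg>0x[0-9a-fA-F]+)"
)


def summarize_edac(lines):
    """Count summary lines per (controller, severity, register value)."""
    counts = Counter()
    for line in lines:
        match = SUMMARY_RE.search(line)
        if match:
            key = (match.group("mc"), match.group("kind"),
                   int(match.group("reg"), 16))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): dmesg | python summarize_edac.py
    for (mc, kind, reg), n in sorted(summarize_edac(sys.stdin).items()):
        print(f"{mc} {kind} reg={reg:#x} count={n}")
```

Fed the excerpt above, this would group the two `0x4` fatal entries and the two `0x20000` non-fatal entries on MC0 separately, which makes it easier to see whether anything other than the thermal event and the northbound CRC error is being logged.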

Version-Release number of selected component (if applicable):

Linux sp-49.etelos.com 2.6.18-92.1.6.el5 #1 SMP Wed Jun 25 13:49:24 EDT 2008 i686 i686 i386 GNU/Linux

How reproducible:

Intermittent.

Comment 1 Jarod Wilson 2008-08-06 13:28:26 EDT
Summary from irc conversation with Aris:

This is a BIOS thing, it's not critical. Technical explanation: there's memory throttling to keep the memory from getting too hot. You can do it yourself or let the chipset do it. This message appears because the BIOS initialized it to do it itself and the temperature just got past the middle threshold (the "Tmid" in the log). It's a known issue, being addressed upstream.

Per Aris, an sosreport would be nice to have, but per Brian, the reporter (from another IRC conversation), it might not be possible to get the whole thing out, due to policy and whatnot... If there's specific info needed that can be easily sanitised, we can probably get that though.
Comment 5 Aristeu Rozanski 2008-10-23 11:02:12 EDT
Created attachment 321300 [details]
patch 1/2
Comment 6 Aristeu Rozanski 2008-10-23 11:02:41 EDT
Created attachment 321301 [details]
patch 2/2
Comment 7 Aristeu Rozanski 2008-10-23 16:53:16 EDT
test packages available at
http://people.redhat.com/arozansk/bz458133/
Comment 8 Aristeu Rozanski 2008-10-23 16:53:44 EDT
Please test and tell me how it goes.
Comment 9 Brian C. Lane 2008-10-23 17:23:39 EDT
I no longer have access to the affected systems. Hopefully someone else can give this a try. Thanks!
Comment 10 Aristeu Rozanski 2008-11-21 13:31:06 EST

*** This bug has been marked as a duplicate of bug 471933 ***
Comment 11 Aristeu Rozanski 2008-12-09 10:43:51 EST
*** Bug 450737 has been marked as a duplicate of this bug. ***