Bug 471933

Summary: [Brocade/Dell 5.3 bug] hts failing memory test with EDAC i5000 Non-Fatal error
Product: Red Hat Enterprise Linux 5
Reporter: Janice Vatcher <jvatcher>
Component: kernel
Assignee: Aristeu Rozanski <arozansk>
Status: CLOSED ERRATA
QA Contact: Martin Jenner <mjenner>
Severity: urgent
Priority: high
Version: 5.2
CC: andriusb, coughlan, cward, dzickus, gnichols, lwang, martinez, mbroz, mgahagan, mmcallis, rlandry, syeghiay, tao, william
Target Milestone: rc
Keywords: OtherQA
Target Release: ---
Hardware: i686
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-01-20 14:45:32 EST
Bug Blocks: 494734
Attachments: Here is the message file. (attachment 323790, flags: none)

Description Janice Vatcher 2008-11-17 13:50:11 EST
Description of problem:
Failing memory test in HTS. 

Error in messages
Nov 16 04:44:02 hba081036 kernel: EDAC i5000:   NON-Retry  Errors, bits= 0x800

Version-Release number of selected component (if applicable):

How reproducible:
Run hts against the system

Steps to Reproduce:
1. run hts discover
2. run hts certify
Actual results: 
memory - FAIL

Expected results:
memory - PASS

Additional info:
Comment 1 Andrius Benokraitis 2008-11-17 14:32:29 EST
Janice - do you have any logs we can look at?
Comment 2 Rob Landry 2008-11-17 14:40:57 EST
The error in the description looks to be coming from the kernel, either the VM or the driver.  While I could see how hts might aggravate whatever the problem is, I would not suspect it is the root cause.  Reassigning to kernel for their assessment (hts or system logs should assist in their review as well).
Comment 3 Janice Vatcher 2008-11-17 14:46:26 EST
Created attachment 323790
Here is the message file.

Here is the messages file from the server.
Comment 4 Andrius Benokraitis 2008-11-17 15:03:29 EST
Janice - can you post the HTS logs? It can be captured in the INFO test.
Comment 5 Milan Broz 2008-11-20 10:22:43 EST
I see this on my Dell 490 too. I'll try to fiddle with the memory banks, but I expect it is just an overly verbose message...

Console is flooded with
EDAC i5000 MC0: NON-FATAL ERROR Found!!! 1st NON-FATAL Err Reg= 0x800
EDAC i5000: NON-Retry  Errors, bits=  0x800

Linux 2.6.18-123.el5xen #1 SMP Mon Nov 10 18:45:33 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

Seems similar to bug #458133?
Comment 6 Aristeu Rozanski 2008-11-21 08:39:54 EST
Yes, please test the kernel package available at:
Comment 7 Milan Broz 2008-11-21 10:04:04 EST
Yes, 2.6.18-120.el5.458133xe is ok here, no messages.

The message mentioned in comment #5 repeats every second on a non-patched kernel; it makes the physical console mostly unusable, or at least floods the logs.

Please consider this a blocker for RHEL5.3...

My HW is a standard Dell Precision 690 workstation.
Comment 12 Aristeu Rozanski 2008-11-21 13:31:06 EST
*** Bug 458133 has been marked as a duplicate of this bug. ***
Comment 14 Don Zickus 2008-12-02 17:20:22 EST
in kernel-2.6.18-125.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 16 Chris Ward 2008-12-04 05:22:56 EST
Brocade, what is the current status of this bug fix? The fix should be present in the latest RHEL5.3 Snapshot. Please test and send feedback ASAP.
Comment 17 Chris Ward 2008-12-04 10:46:52 EST
Apologies, this fix should be present in Snapshot 5, which is scheduled for release next week.
Comment 18 Chris Ward 2008-12-08 06:53:28 EST
~~ Snapshot 5 is now available @ partners.redhat.com ~~ 

Partners, RHEL 5.3 Snapshot 5 is now available for testing. Please send us your testing feedback on this important bug fix / feature request AS SOON AS POSSIBLE. If you are unable to test, indicate this in a comment or escalate to your Partner Manager. If we do not receive your test feedback, this bug will be AT RISK of being dropped from the release.

If you have VERIFIED the fix, please add PartnerVerified to the Bugzilla
Keywords field, along with a description of the test results. 

If you encounter a new bug, CLONE this bug and ask your Partner
Manager to review it. We are no longer accepting new bugs into the release, bar
critical regressions.
Comment 19 Chris Ward 2008-12-11 13:01:11 EST
Brocade, any update?
Comment 21 Chris Ward 2008-12-16 11:29:30 EST
~~~ Attention Partners ~~~ The *last* RHEL 5.3 Snapshot 6 is now available at partners.redhat.com. A fix for this bug should be present. Please test and update this bug with test results as soon as possible.  If the fix present in Snap6 meets all the expected requirements for this bug, please add the keyword PartnerVerified. If any new bugs are discovered, please CLONE this bug and describe the issues encountered there.
Comment 22 Janice Vatcher 2008-12-19 17:24:32 EST
I loaded the 5.3 kernel on the same system and ran the certification test twice. I did not receive any EDAC errors in dmesg.
Comment 24 errata-xmlrpc 2009-01-20 14:45:32 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

Comment 26 Murray McAllister 2009-05-19 21:55:57 EDT
The following bug description for this issue was too long to include in the errata:

The i5000_edac module reported all types of errors (including errors not
completely documented) and did not use edac_mc functions to report errors.
Not using the edac_mc functions prevented the error messages from being
filtered or silenced. On certain systems, this resulted in the console
being flooded with errors, for example:

EDAC i5000 MC0: NON-FATAL ERROR Found!!! 1st NON-FATAL Err Reg= [hex value]

EDAC i5000: NON-Retry Errors, bits= [hex value]

Removing the i5000_edac module prevented these errors; however, it may have
prevented other important messages from being reported. After installing
an update, the i5000_edac module uses the edac_mc functions to report
errors, which resolves this issue.

Note: After an update, the i5000_edac module will not report errors that are
not completely documented: these will be disabled by default. To re-enable
these messages, use the i5000_edac "misc_messages=1" module parameter.
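
For anyone re-verifying the fix on their own hardware, the check described in comment 22 (no EDAC errors after running the certification test) can be approximated by counting the two message shapes quoted in this report. The following is a minimal sketch, not part of the erratum: the function name count_edac_i5000 and the regular expression are assumptions written for illustration, and the last sample line is made up to show a non-matching entry.

```python
import re
from collections import Counter

# Matches the two message shapes quoted in this bug report, e.g.
#   EDAC i5000 MC0: NON-FATAL ERROR Found!!! 1st NON-FATAL Err Reg= 0x800
#   EDAC i5000:   NON-Retry  Errors, bits= 0x800
# The pattern is an assumption derived from those samples only.
EDAC_I5000 = re.compile(
    r"EDAC i5000(?: MC\d+)?:\s+(NON-FATAL ERROR Found!!!|NON-Retry\s+Errors)"
)


def count_edac_i5000(lines):
    """Count EDAC i5000 messages per kind in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = EDAC_I5000.search(line)
        if match:
            if match.group(1).startswith("NON-FATAL"):
                counts["non-fatal"] += 1
            else:
                counts["non-retry"] += 1
    return counts


# The first three lines are copied from this report; the fourth is a
# hypothetical unrelated entry.
sample = [
    "Nov 16 04:44:02 hba081036 kernel: EDAC i5000:   NON-Retry  Errors, bits= 0x800",
    "EDAC i5000 MC0: NON-FATAL ERROR Found!!! 1st NON-FATAL Err Reg= 0x800",
    "EDAC i5000: NON-Retry  Errors, bits=  0x800",
    "Nov 16 04:44:03 hba081036 kernel: unrelated message",
]

print(count_edac_i5000(sample))
```

To run this against a real system, the same function could be fed the lines of the messages file or saved dmesg output (for example, an open file object). An empty result after a full hts run would match what comment 22 reports for the 5.3 kernel.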