Red Hat Bugzilla – Bug 471933
[Brocade/Dell 5.3 bug] hts failing memory test with EDAC i5000 Non-Fatal error
Last modified: 2012-07-11 21:53:21 EDT
Description of problem:
Failing memory test in HTS.
Error in messages
Nov 16 04:44:02 hba081036 kernel: EDAC i5000: NON-Retry Errors, bits= 0x800
Version-Release number of selected component (if applicable):
Run hts against the system
Steps to Reproduce:
1. run hts discover
2. run hts certify
Actual results:
memory - FAIL
Expected results:
memory - PASS
Janice - do you have any logs we can look at?
The error in the description appears to come from the kernel, either the VM or the driver. While I can see how hts might aggravate whatever the problem is, I would not suspect it is the root cause. Reassigning to kernel for their assessment (hts or system logs should assist in their review as well).
Created attachment 323790
Here is the message file.
Here is the messages file from the server.
Janice - can you post the HTS logs? It can be captured in the INFO test.
I see this on my Dell 490 too. I'll try to fiddle with the memory banks, but I expect it is just an overly verbose message...
Console is flooded with
EDAC i5000 MC0: NON-FATAL ERROR Found!!! 1st NON-FATAL Err Reg= 0x800
EDAC i5000: NON-Retry Errors, bits= 0x800
Linux 2.6.18-123.el5xen #1 SMP Mon Nov 10 18:45:33 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
Seems similar to bug #458133?
Yes, please test the kernel package available at:
Yes, 2.6.18-120.el5.458133xe is ok here, no messages.
The message mentioned in comment #5 repeats every second on a non-patched kernel. It makes the physical console mostly unusable, or at the very least floods the logs.
Please consider this as blocker for RHEL5.3...
My HW is standard Dell Precision 690 workstation.
*** Bug 458133 has been marked as a duplicate of this bug. ***
You can download this test kernel from http://people.redhat.com/dzickus/el5
Brocade, what is the current status of this bug fix? The fix should be present in the latest RHEL5.3 Snapshot. Please test and send feedback ASAP.
Apologies, this fix should be present in the Snapshot 5, which is scheduled for release next week.
~~ Snapshot 5 is now available @ partners.redhat.com ~~
Partners, RHEL 5.3 Snapshot 5 is now available for testing. Please send us your testing feedback on this important bug fix / feature request AS SOON AS POSSIBLE. If you are unable to test, indicate this in a comment or escalate to your Partner Manager. If we do not receive your test feedback, this bug will be AT RISK of being dropped from the release.
If you have VERIFIED the fix, please add PartnerVerified to the Bugzilla
Keywords field, along with a description of the test results.
If you encounter a new bug, CLONE this bug and request from your Partner
manager to review. We are no longer accepting new bugs into the release, bar
Brocade, any update?
~~~ Attention Partners ~~~ The *last* RHEL 5.3 Snapshot 6 is now available at partners.redhat.com. A fix for this bug should be present. Please test and update this bug with test results as soon as possible. If the fix present in Snap6 meets all the expected requirements for this bug, please add the keyword PartnerVerified. If any new bugs are discovered, please CLONE this bug and describe the issues encountered there.
I loaded the 5.3 kernel on the same system and ran the certification test twice. I did not receive any EDAC errors in dmesg.
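A check like the one described in the comment above could be scripted rather than done by eye. The following is a minimal Python sketch, not part of the original report: it counts the two i5000 EDAC message forms quoted earlier in this bug in any iterable of log lines. The function name `count_i5000_messages` and the regular expressions are illustrative assumptions modeled only on the messages shown here; other kernels may format these lines differently.

```python
import re
from collections import Counter

# Patterns modeled on the two message forms quoted in this bug report.
# They are assumptions based on those samples, not taken from the driver source.
NONFATAL_RE = re.compile(
    r"EDAC i5000 MC(\d+): NON-FATAL ERROR Found!!! "
    r"1st NON-FATAL Err Reg= (0x[0-9a-fA-F]+)"
)
NONRETRY_RE = re.compile(
    r"EDAC i5000: NON-Retry Errors, bits= (0x[0-9a-fA-F]+)"
)


def count_i5000_messages(lines):
    """Count i5000 EDAC messages, keyed by (kind, register value)."""
    counts = Counter()
    for line in lines:
        match = NONFATAL_RE.search(line)
        if match:
            counts[("non-fatal", int(match.group(2), 16))] += 1
            continue
        match = NONRETRY_RE.search(line)
        if match:
            counts[("non-retry", int(match.group(1), 16))] += 1
    return counts
```

To use it, pass an open log file (for example a saved copy of the dmesg output or the messages file) to `count_i5000_messages`. An empty result corresponds to the "no EDAC errors" outcome reported above.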
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
The following bug description for this issue was too long to include in the errata:
The i5000_edac module reported all types of errors (including errors not
completely documented) and did not use edac_mc functions to report errors.
Not using the edac_mc functions prevented the error messages from being
filtered or silenced. On certain systems, this resulted in the console
being flooded with errors, for example:
EDAC i5000 MC0: NON-FATAL ERROR Found!!! 1st NON-FATAL Err Reg= [hex value]
EDAC i5000: NON-Retry Errors, bits= [hex value]
Removing the i5000_edac module prevented these errors; however, it may have
prevented other important messages from being reported. After installing
an update, the i5000_edac module uses the edac_mc functions to report
errors, which resolves this issue.
Note: After an update, the i5000_edac module will not report errors that are
not completely documented: these will be disabled by default. To re-enable
these messages, use the i5000_edac "misc_messages=1" module parameter.
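Module parameters such as this one are normally set at load time, typically through an "options i5000_edac misc_messages=1" line in the modprobe configuration. To confirm the current setting on a running system, something like the following Python sketch may help. It assumes the parameter is exposed read-only under /sys/module/i5000_edac/parameters/, which this report does not confirm; the path constant and the function name `misc_messages_enabled` are hypothetical.

```python
from pathlib import Path

# Assumed sysfs location. It exists only if the module is loaded and
# exports the parameter as readable; the report does not confirm this.
PARAM_PATH = Path("/sys/module/i5000_edac/parameters/misc_messages")


def misc_messages_enabled(path=PARAM_PATH):
    """Return True or False for the parameter value, or None if unreadable."""
    try:
        text = Path(path).read_text().strip()
    except OSError:
        # Module not loaded, parameter not exported, or no permission.
        return None
    # Treat "0", "N" and an empty file as disabled.
    return text not in ("0", "N", "")
```

A return value of None means the state could not be determined, which is deliberately kept distinct from "disabled".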