Description of problem:
Failing memory test in HTS. Error in messages:
Nov 16 04:44:02 hba081036 kernel: EDAC i5000: NON-Retry Errors, bits= 0x800

Version-Release number of selected component (if applicable):

How reproducible:
Run hts against the system

Steps to Reproduce:
1. run hts discover
2. run hts certify
3.

Actual results:
memory - FAIL

Expected results:
memory - PASS

Additional info:
Janice - do you have any logs we can look at?
The error in the description looks to be coming from the kernel, either the VM or the driver. While I could see how hts might aggravate whatever the problem is, I would not suspect it to be the root cause. Reassigning to kernel for their assessment (hts or system logs should assist in their review as well).
Created attachment 323790
Here is the message file.
Andrius, here is the messages file from the server. Janice
Janice - can you post the HTS logs? They can be captured in the INFO test.
I see this on my Dell 490 too. I'll try to fiddle with the memory banks, but I expect it is just an overly verbose message. The console is flooded with:
EDAC i5000 MC0: NON-FATAL ERROR Found!!! 1st NON-FATAL Err Reg= 0x800
EDAC i5000: NON-Retry Errors, bits= 0x800
Linux 2.6.18-123.el5xen #1 SMP Mon Nov 10 18:45:33 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
This seems similar to bug #458133?
Yes, please test the kernel package available at: http://people.redhat.com/arozansk/bz458133/
Yes, 2.6.18-120.el5.458133xe is OK here, no messages. The message mentioned in comment #5 repeats every second on the non-patched kernel, which makes the physical console mostly unusable, or at the very least floods the logs. Please consider this a blocker for RHEL5.3. My HW is a standard Dell Precision 690 workstation.
*** Bug 458133 has been marked as a duplicate of this bug. ***
in kernel-2.6.18-125.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Brocade, what is the current status of this bug fix? The fix should be present in the latest RHEL5.3 Snapshot. Please test and send feedback ASAP.
Apologies, this fix should be present in Snapshot 5, which is scheduled for release next week.
~~ Snapshot 5 is now available @ partners.redhat.com ~~
Partners, RHEL 5.3 Snapshot 5 is now available for testing. Please send us your testing feedback on this important bug fix / feature request AS SOON AS POSSIBLE. If you are unable to test, indicate this in a comment or escalate to your Partner Manager. If we do not receive your test feedback, this bug will be AT RISK of being dropped from the release.
If you have VERIFIED the fix, please add PartnerVerified to the Bugzilla Keywords field, along with a description of the test results.
If you encounter a new bug, CLONE this bug and ask your Partner Manager to review it. We are no longer accepting new bugs into the release, bar critical regressions.
Brocade, any update?
~~~ Attention Partners ~~~ The *last* RHEL 5.3 Snapshot 6 is now available at partners.redhat.com. A fix for this bug should be present. Please test and update this bug with test results as soon as possible. If the fix present in Snap6 meets all the expected requirements for this bug, please add the keyword PartnerVerified. If any new bugs are discovered, please CLONE this bug and describe the issues encountered there.
I loaded the 5.3 kernel on the same system and ran the certification test twice. I did not receive any EDAC errors in dmesg.
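For anyone repeating this check, a minimal sketch of scanning dmesg or messages output for the two message forms quoted in this report. The function name and the pattern labels are hypothetical, and the regular expressions are derived only from the lines pasted in the comments above, so other i5000 message variants would not be counted.

```python
import re
from collections import Counter
from typing import Iterable

# Patterns built from the two message forms quoted in this bug report.
# The labels "non_fatal" and "non_retry" are illustrative names only.
EDAC_PATTERNS = {
    "non_fatal": re.compile(r"EDAC i5000 MC\d+: NON-FATAL ERROR Found!!!"),
    "non_retry": re.compile(r"EDAC i5000: NON-Retry Errors, bits= 0x[0-9a-fA-F]+"),
}


def count_edac_messages(lines: Iterable[str]) -> Counter:
    """Count how many lines match each known EDAC i5000 message form."""
    counts: Counter = Counter()
    for line in lines:
        for name, pattern in EDAC_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

It could be run against a saved log, for example `count_edac_messages(open("/var/log/messages"))`; an empty result means neither message form was found.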
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2009-0225.html
The following bug description for this issue was too long to include in the errata:
The i5000_edac module reported all types of errors (including errors not completely documented) and did not use edac_mc functions to report errors. Not using the edac_mc functions prevented the error messages from being filtered or silenced. On certain systems, this resulted in the console being flooded with errors, for example:
EDAC i5000 MC0: NON-FATAL ERROR Found!!! 1st NON-FATAL Err Reg= [hex value]
EDAC i5000: NON-Retry Errors, bits= [hex value]
Removing the i5000_edac module prevented these errors; however, it may have prevented other important messages from being reported. After installing an update, the i5000_edac module uses the edac_mc functions to report errors, which resolves this issue.
Note: After an update, the i5000_edac module will not report errors that are not completely documented: these will be disabled by default. To re-enable these messages, use the i5000_edac "misc_messages=1" module parameter.
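To check whether the extra messages are currently enabled on a running system, something like the sketch below may help. It assumes the misc_messages parameter is readable under /sys/module/i5000_edac/parameters/, which the errata text does not confirm and which depends on how the module declares the parameter. The function name is hypothetical.

```python
from pathlib import Path
from typing import Optional

# Assumed sysfs location of the parameter; not confirmed by the errata text.
PARAM_PATH = Path("/sys/module/i5000_edac/parameters/misc_messages")


def misc_messages_enabled(path: Path = PARAM_PATH) -> Optional[bool]:
    """Return True if the parameter is non-zero, False if it is zero,
    or None if the file is missing, unreadable, or not an integer."""
    try:
        value = int(path.read_text().strip())
    except (OSError, ValueError):
        return None
    return value != 0
```

A result of None most likely means the module is not loaded or the parameter is not exposed on that kernel.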