Bug 471933 - [Brocade/Dell 5.3 bug] hts failing memory test with EDAC i5000 Non-Fatal error
Summary: [Brocade/Dell 5.3 bug] hts failing memory test with EDAC i5000 Non-Fatal error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: i686
OS: Linux
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Aristeu Rozanski
QA Contact: Martin Jenner
URL:
Whiteboard:
Duplicates: 458133
Depends On:
Blocks: 494734
 
Reported: 2008-11-17 18:50 UTC by Janice Vatcher
Modified: 2018-10-19 23:49 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 494734
Environment:
Last Closed: 2009-01-20 19:45:32 UTC
Target Upstream Version:
Embargoed:


Attachments
Here is the message file. (68.97 KB, text/plain)
2008-11-17 19:46 UTC, Janice Vatcher


Links
System: Red Hat Product Errata
ID: RHSA-2009:0225
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update
Last Updated: 2009-01-20 16:06:24 UTC

Description Janice Vatcher 2008-11-17 18:50:11 UTC
Description of problem:
The memory test in HTS fails.

Error in the messages file:
Nov 16 04:44:02 hba081036 kernel: EDAC i5000:   NON-Retry  Errors, bits= 0x800


Version-Release number of selected component (if applicable):


How reproducible:
Run hts against the system

Steps to Reproduce:
1. run hts discover
2. run hts certify
  
Actual results: 
memory - FAIL


Expected results:
memory - PASS

Additional info:

Comment 1 Andrius Benokraitis 2008-11-17 19:32:29 UTC
Janice - do you have any logs we can look at?

Comment 2 Rob Landry 2008-11-17 19:40:57 UTC
The error in the description looks to be coming from the kernel, either the VM or the driver. While I could see how hts might aggravate whatever the problem is, I would not suspect it is the root cause. Reassigning to kernel for their assessment (hts or system logs should assist in their review as well).

Comment 3 Janice Vatcher 2008-11-17 19:46:26 UTC
Created attachment 323790
Here is the message file.

Andrius
Here is the messages file from the server.
Janice

Comment 4 Andrius Benokraitis 2008-11-17 20:03:29 UTC
Janice - can you post the HTS logs? It can be captured in the INFO test.

Comment 5 Milan Broz 2008-11-20 15:22:43 UTC
I see this on my Dell 490 too. I'll try to fiddle with the memory banks, but I expect it is just an overly verbose message...

Console is flooded with
EDAC i5000 MC0: NON-FATAL ERROR Found!!! 1st NON-FATAL Err Reg= 0x800
EDAC i5000: NON-Retry  Errors, bits=  0x800

Linux 2.6.18-123.el5xen #1 SMP Mon Nov 10 18:45:33 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

Seems similar to bug #458133?

Comment 6 Aristeu Rozanski 2008-11-21 13:39:54 UTC
Yes, please test the kernel package available at:
http://people.redhat.com/arozansk/bz458133/

Comment 7 Milan Broz 2008-11-21 15:04:04 UTC
Yes, 2.6.18-120.el5.458133xe is ok here, no messages.

The message mentioned in comment #5 repeats every second on the unpatched kernel. It makes the physical console mostly unusable, or at least floods the logs.

Please consider this a blocker for RHEL 5.3...

My HW is a standard Dell Precision 690 workstation.

Comment 12 Aristeu Rozanski 2008-11-21 18:31:06 UTC
*** Bug 458133 has been marked as a duplicate of this bug. ***

Comment 14 Don Zickus 2008-12-02 22:20:22 UTC
The fix is in kernel-2.6.18-125.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 16 Chris Ward 2008-12-04 10:22:56 UTC
Brocade, what is the current status of this bug fix? The fix should be present in the latest RHEL5.3 Snapshot. Please test and send feedback ASAP.

Comment 17 Chris Ward 2008-12-04 15:46:52 UTC
Apologies, this fix should be present in Snapshot 5, which is scheduled for release next week.

Comment 18 Chris Ward 2008-12-08 11:53:28 UTC
~~ Snapshot 5 is now available @ partners.redhat.com ~~ 

Partners, RHEL 5.3 Snapshot 5 is now available for testing. Please send us your testing feedback on this important bug fix / feature request AS SOON AS POSSIBLE. If you are unable to test, indicate this in a comment or escalate to your Partner Manager. If we do not receive your test feedback, this bug will be AT RISK of being dropped from the release.

If you have VERIFIED the fix, please add PartnerVerified to the Bugzilla
Keywords field, along with a description of the test results. 

If you encounter a new bug, CLONE this bug and ask your Partner
Manager to review it. We are no longer accepting new bugs into the release, bar
critical regressions.

Comment 19 Chris Ward 2008-12-11 18:01:11 UTC
Brocade, any update?

Comment 21 Chris Ward 2008-12-16 16:29:30 UTC
~~~ Attention Partners ~~~ The *last* RHEL 5.3 Snapshot 6 is now available at partners.redhat.com. A fix for this bug should be present. Please test and update this bug with test results as soon as possible.  If the fix present in Snap6 meets all the expected requirements for this bug, please add the keyword PartnerVerified. If any new bugs are discovered, please CLONE this bug and describe the issues encountered there.

Comment 22 Janice Vatcher 2008-12-19 22:24:32 UTC
I loaded the 5.3 kernel on the same system and ran the certification test twice. I did not receive any EDAC errors in dmesg.
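
A check like the one in this comment can be scripted. The following is a minimal sketch, not part of the original report: it scans saved dmesg or messages-file text for the two i5000 EDAC message forms quoted in this bug and reports the hex value together with the bit positions that are set. The function names and the regular expression are illustrative assumptions, and the sketch does not interpret what each bit means.

```python
import re

# Matches both message forms quoted in this report, e.g.
#   EDAC i5000 MC0: NON-FATAL ERROR Found!!! 1st NON-FATAL Err Reg= 0x800
#   EDAC i5000: NON-Retry  Errors, bits=  0x800
# (the pattern itself is an assumption based on those two samples)
EDAC_LINE = re.compile(r"EDAC i5000\b.*?=\s*(0x[0-9a-fA-F]+)")


def find_edac_errors(log_text):
    """Return (line, value) pairs for every i5000 EDAC line carrying a hex value."""
    hits = []
    for line in log_text.splitlines():
        match = EDAC_LINE.search(line)
        if match:
            hits.append((line.strip(), int(match.group(1), 16)))
    return hits


def set_bits(value):
    """List the positions of the bits set in value, lowest first."""
    return [pos for pos in range(value.bit_length()) if (value >> pos) & 1]


if __name__ == "__main__":
    # Sample lines copied from the description and comment 5 of this bug.
    sample = (
        "Nov 16 04:44:02 hba081036 kernel: EDAC i5000:   NON-Retry  Errors, bits= 0x800\n"
        "EDAC i5000 MC0: NON-FATAL ERROR Found!!! 1st NON-FATAL Err Reg= 0x800\n"
    )
    for line, value in find_edac_errors(sample):
        print(hex(value), set_bits(value))
```

An empty result from find_edac_errors on the captured log text corresponds to the "no EDAC errors" outcome reported here.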

Comment 24 errata-xmlrpc 2009-01-20 19:45:32 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

Comment 26 Murray McAllister 2009-05-20 01:55:57 UTC
The following bug description for this issue was too long to include in the errata:

The i5000_edac module reported all types of errors (including errors not
completely documented) and did not use edac_mc functions to report errors.
Not using the edac_mc functions prevented the error messages from being
filtered or silenced. On certain systems, this resulted in the console
being flooded with errors, for example:

EDAC i5000 MC0: NON-FATAL ERROR Found!!! 1st NON-FATAL Err Reg= [hex value]

EDAC i5000: NON-Retry Errors, bits= [hex value]

Removing the i5000_edac module prevented these errors; however, it may have
prevented other important messages from being reported. After installing
an update, the i5000_edac module uses the edac_mc functions to report
errors, which resolves this issue.

Note: After an update, the i5000_edac module will not report errors that are
not completely documented: these will be disabled by default. To re-enable
these messages, use the i5000_edac "misc_messages=1" module parameter.
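
As an illustration of the note above, here is a minimal sketch of how the parameter might be applied. Only the module name and "misc_messages=1" come from this report. The helper names are hypothetical, and the idea of persisting the setting through an "options" line in a modprobe configuration file (commonly /etc/modprobe.conf on RHEL 5) is an assumption about the target system.

```python
import subprocess

MODULE = "i5000_edac"
PARAM = "misc_messages=1"  # parameter named in the note above


def reload_commands():
    """Commands to unload the module and load it again with the parameter set."""
    return [["modprobe", "-r", MODULE], ["modprobe", MODULE, PARAM]]


def modprobe_conf_line():
    """An 'options' line for a modprobe configuration file (location is an assumption)."""
    return "options %s %s" % (MODULE, PARAM)


def reload_module(dry_run=True):
    """Print the commands; run them only when dry_run is False (requires root)."""
    for cmd in reload_commands():
        print(" ".join(cmd))
        if not dry_run:
            subprocess.check_call(cmd)
```

Calling reload_module() with the default dry_run=True only prints the two modprobe commands, so they can be reviewed before anything is run as root.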

