Bug 533391

Summary:

Kernel panic: EDAC MC0: INTERNAL ERROR: channel-b out of range

Product:

Red Hat Enterprise Linux 5

Reporter:

Tamas Vincze <tom>

Component:

kernel

Assignee:

Mauro Carvalho Chehab <mchehab>

Status:

CLOSED ERRATA

QA Contact:

chen yuwen <yuchen>

Severity:

high

Priority:

low

Version:

5.4

CC:

clalance, czhang, lwang, pbonzini, qcai, syeghiay, xen-maint

Hardware:

x86_64

OS:

Linux

Doc Type:

Bug Fix

Clones:

570833

Last Closed:

2011-01-13 20:54:49 UTC

Bug Blocks:

570833

Attachments:

Description	Flags
Full console log	none
Fix i5000 error when reporting the first fatal errors (FERR_FAT_FBD)	none
i5000p EDAC dmesg	none

Description Tamas Vincze 2009-11-06 15:10:26 UTC

Created attachment 367831
Full console log

EDAC MC0: INTERNAL ERROR: channel-b out of range (4 >= 4)
Kernel panic - not syncing: EDAC MC0: Uncorrected Error
 (XEN) Domain 0 crashed: 'noreboot' set - not rebooting.

Kernel is 2.6.18-164.el5xen
Server board is a Supermicro X7DBi+ (Intel 5000P chipset) with 16GB RAM and two quad-core Xeon CPUs.

This is the first time it happened; server was rebooted about a week ago.

Comment 1 Tamas Vincze 2009-11-06 17:05:29 UTC

The bug is probably in the i5000_process_nonfatal_error_info() function in i5000_edac.c.

The i5000X data sheet (page 211 of the PDF) says that FERR_NF_FBD bit 28 has no significance for M4Err through M12Err (the FERR_NF_UNCORRECTABLE bits).
The channel number is taken from bits 29:28 by this macro:
#define EXTRACT_FBDCHAN_INDX(x) (((x)>>28) & 0x3)

ue_errors = allErrors & FERR_NF_UNCORRECTABLE;
if (ue_errors) {
    debugf0("\tUncorrected bits= 0x%x\n", ue_errors);
    branch = EXTRACT_FBDCHAN_INDX(info->ferr_nf_fbd);
    channel = branch;
    [...]
    /* Call the helper to output message */
    edac_mc_handle_fbd_ue(mci, rank, channel, channel + 1, msg);
}

If both of bits 29:28 are set then channel+1 in the last line above yields 4, which is out of range.

The fix would be to replace the line:
    channel = branch;
with:
    channel = branch & 2;
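
To illustrate the arithmetic, here is a minimal standalone C sketch. The macro is the one quoted above; the two helper functions and the sample register value 0x30000000 are made up for illustration and are not driver code.

```c
#include <stdint.h>

/* Macro as quoted above: FBD channel index taken from bits 29:28. */
#define EXTRACT_FBDCHAN_INDX(x) (((x) >> 28) & 0x3)

/* Hypothetical helper: first channel of the reported pair, computed
 * the way the unpatched code does it (channel = branch). */
static int first_channel_unpatched(uint32_t ferr_nf_fbd)
{
    return (int)EXTRACT_FBDCHAN_INDX(ferr_nf_fbd);
}

/* Hypothetical helper: first channel with the proposed mask applied
 * (channel = branch & 2), which discards the meaningless bit 28. */
static int first_channel_masked(uint32_t ferr_nf_fbd)
{
    return (int)(EXTRACT_FBDCHAN_INDX(ferr_nf_fbd) & 2);
}
```

With both bits set (0x30000000), first_channel_unpatched() returns 3, so channel + 1 is 4. first_channel_masked() returns 2 for the same value, so the reported pair is 2/3.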

Comment 2 Paolo Bonzini 2009-11-09 09:03:15 UTC

This doesn't seem to be related to kernel-xen.  Moving component to kernel.

Comment 3 Tamas Vincze 2009-11-09 15:40:10 UTC

Also submitted as: http://bugzilla.kernel.org/show_bug.cgi?id=14568

Comment 4 Mauro Carvalho Chehab 2009-12-09 11:33:30 UTC

Created attachment 377139
Fix i5000 error when reporting the first fatal errors (FERR_FAT_FBD)

As reported, bit 28 of FERR_FAT_FBD is not reported properly by the chipset. As a result, both bits 28 and 29 may be set, giving an out-of-range value.

This patch fixes it on the RHEL5 kernel.

The bug is also present upstream. I've sent a patch upstream for it.

Comment 5 Mauro Carvalho Chehab 2009-12-09 11:56:28 UTC

(In reply to comment #4)
> Created an attachment (id=377139)
> Fix i5000 error when reporting the first fatal errors (FERR_FAT_FBD)
> 
> As reported, bit 28 of FERR_FAT_FBD is not reported properly by the chipset.

Correction:
The issue is in FERR_NF_FBD (the first non-fatal error). The patch is correct;
I just made a typo in comment #4.

Let's wait for upstream to commit it before adding it to RHEL5.

Comment 6 Mauro Carvalho Chehab 2010-01-18 14:14:41 UTC

The patch was committed upstream on Jan 15 as changeset 118f3e1afd5534c15f9701f33514186cfc841a27.

Patch posted to the ML.

Comment 8 RHEL Program Management 2010-05-20 12:41:58 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 10 Jarod Wilson 2010-05-25 21:10:35 UTC

in kernel-2.6.18-200.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 12 chen yuwen 2010-12-15 03:51:54 UTC

Created attachment 468761
i5000p EDAC dmesg

Comment 13 chen yuwen 2010-12-15 04:00:54 UTC

Can not reproduce on dell-pem600-01.rhts.eng.bos.redhat.com with i5000 chipset. No panic.
# dmesg
...
EDAC MC: Ver: 2.0.1 Aug 18 2009
EDAC MC0: Giving out device to i5000_edac.c I5000: DEV 0000:00:10.0
...

Any additional operations to reproduce it? Or need specified machine?

Comment 14 Tamas Vincze 2010-12-15 15:59:03 UTC

I reported this bug originally. I've been running Jarod's patched kernel from Comment #10 since he published it and the box hasn't panicked since. Previously it locked up every month or so. It looks to me that the patch solved this issue and should be included in a maintenance release.

Comment 15 Mauro Carvalho Chehab 2010-12-15 16:39:35 UTC

Hi Chen,

(In reply to comment #13)
> Can not reproduce on dell-pem600-01.rhts.eng.bos.redhat.com with i5000 chipset.
> No panic.
> # dmesg
> ...
> EDAC MC: Ver: 2.0.1 Aug 18 2009
> EDAC MC0: Giving out device to i5000_edac.c I5000: DEV 0000:00:10.0
> ...
> 
> Any additional operations to reproduce it? Or need specified machine?

This bug happens only when the EDAC driver detects an uncorrected error on ECC memory.

As Tamas pointed out, a memory error is a rare event. It may take a long time for this bug to trigger (several weeks or even more). The probability of a memory error is a function of the presence of solar storms, the memory temperature
and other environmental factors. If you have physical access to the machine, you
may be able to force memory errors by heating the memory chips with a hair dryer, but be careful not to damage them permanently.

Still, the bug is pretty obvious: the error is reported as happening on two channels (since the hardware can't actually distinguish between the two channels in the error report):
	edac_mc_handle_fbd_ue(mci, rank, channel, channel + 1, msg);

So, an error can be either on channels 0/1 or on channels 2/3.

If we don't do channel & 2, the driver may try to report an error on a non-existent
channel (this device doesn't have a channel 4).
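
The pairing described above can be sketched in standalone C. This is only an illustration, not the driver code: NR_CHANNELS, struct channel_pair and both functions are hypothetical names, and the limit of 4 channels is an assumption taken from the "(4 >= 4)" console message.

```c
#include <stdbool.h>

/* Assumption: channels are numbered 0..3, as implied by "(4 >= 4)". */
#define NR_CHANNELS 4

/* Hypothetical type: the two channels named in one uncorrected-error report. */
struct channel_pair {
    int a;
    int b;
};

/* Map the 2-bit field from FERR_NF_FBD bits 29:28 to a channel pair.
 * Bit 28 is dropped, so the result is always 0/1 or 2/3. */
static struct channel_pair pair_from_field(unsigned int field)
{
    struct channel_pair p;

    p.a = (int)(field & 2u);
    p.b = p.a + 1;
    return p;
}

/* Range check modeled on the "channel-b out of range" message:
 * both channels must be below NR_CHANNELS. */
static bool pair_in_range(struct channel_pair p)
{
    return p.a >= 0 && p.b < NR_CHANNELS;
}
```

For every possible field value (0 to 3) the masked pair passes the range check, while the unmasked pair 3/4 does not.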

Comment 16 chen yuwen 2010-12-16 03:37:58 UTC

According to customer feedback in comment #14, confirmed patch in kernel 2.6.18-200.el5.
Setting SanityOnly.

Comment 19 errata-xmlrpc 2011-01-13 20:54:49 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html