Bug 533391
| Summary: | Kernel panic: EDAC MC0: INTERNAL ERROR: channel-b out of range |
|---|---|
| Product: | Red Hat Enterprise Linux 5 |
| Component: | kernel |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | low |
| Version: | 5.4 |
| Target Milestone: | rc |
| Hardware: | x86_64 |
| OS: | Linux |
| Reporter: | Tamas Vincze <tom> |
| Assignee: | Mauro Carvalho Chehab <mchehab> |
| QA Contact: | chen yuwen <yuchen> |
| CC: | clalance, czhang, lwang, pbonzini, qcai, syeghiay, xen-maint |
| Doc Type: | Bug Fix |
| Last Closed: | 2011-01-13 20:54:49 UTC |
| Bug Blocks: | 570833 |

The bug is probably in the edac_mc.c function: i5000_process_nonfatal_error_info()
The i5000X data sheet (page 211 of the PDF) says that bit 28 of FERR_NF_FBD has no significance for M4Err through M12Err (the FERR_NF_UNCORRECTABLE bits).
The channel number is taken from bits 29:28 by this macro:
    EXTRACT_FBDCHAN_INDX(x) (((x)>>28) & 0x3)

    ue_errors = allErrors & FERR_NF_UNCORRECTABLE;
    if (ue_errors) {
        debugf0("\tUncorrected bits= 0x%x\n", ue_errors);
        branch = EXTRACT_FBDCHAN_INDX(info->ferr_nf_fbd);
        channel = branch;
        [...]
        /* Call the helper to output message */
        edac_mc_handle_fbd_ue(mci, rank, channel, channel + 1, msg);
    }
If both of bits 29:28 are set then channel+1 in the last line above yields 4, which is out of range.
The fix would be to replace the line:
    channel = branch;
with:
    channel = branch & 2;
This doesn't seem to be related to kernel-xen. Moving component to kernel.

Also submitted as: http://bugzilla.kernel.org/show_bug.cgi?id=14568

Created attachment 377139
Fix i5000 error when reporting the first fatal errors (FERR_FAT_FBD)
As reported, bit 28 of FERR_FAT_FBD is not reported properly by the chipset. As a result, bits 28 and 29 may both be set, giving an out-of-range value.
This patch fixes it in the RHEL5 kernel.
The bug is also present upstream. I've sent a patch upstream for it.

(In reply to comment #4)
> Created an attachment (id=377139)
> Fix i5000 error when reporting the first fatal errors (FERR_FAT_FBD)
>
> As reported, bit 28 of FERR_FAT_FBD is not reported properly by the chipset.
Correction: the issue is in FERR_NF_FBD (the first non-fatal error). The patch is correct; I just made a typo in comment #4.

Let's wait for upstream to commit it before adding it to RHEL5.

The patch was committed upstream on Jan 15 as changeset 118f3e1afd5534c15f9701f33514186cfc841a27.

Patch posted to the ML.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

in kernel-2.6.18-200.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Created attachment 468761
i5000p EDAC dmesg

Cannot reproduce on dell-pem600-01.rhts.eng.bos.redhat.com with the i5000 chipset. No panic.
# dmesg
...
EDAC MC: Ver: 2.0.1 Aug 18 2009
EDAC MC0: Giving out device to i5000_edac.c I5000: DEV 0000:00:10.0
...
Are any additional operations needed to reproduce it? Or is a specific machine needed?

I reported this bug originally. I've been running Jarod's patched kernel from comment #10 since he published it, and the box hasn't panicked since. Previously it locked up every month or so. It looks to me like the patch solved this issue and should be included in a maintenance release.

Hi Chen,
(In reply to comment #13)
> Can not reproduce on dell-pem600-01.rhts.eng.bos.redhat.com with i5000 chipset.
> No panic.
> # dmesg
> ...
> EDAC MC: Ver: 2.0.1 Aug 18 2009
> EDAC MC0: Giving out device to i5000_edac.c I5000: DEV 0000:00:10.0
> ...
>
> Any additional operations to reproduce it? Or need specified machine?
This bug happens only when the EDAC driver detects a corrected error on ECC memory. As Tamas pointed out, a memory error is a rare event, so it may take a long time for this bug to trigger (several weeks or even more). The probability of a memory error depends on solar storms, memory temperature, and other environmental factors. If you have physical access to the machine, you may be able to force memory errors by heating the memory chips with a hair dryer, but be careful not to damage them permanently.
Still, the bug is pretty obvious: the error is reported as happening on two channels (since the hardware can't actually distinguish between the two channels in the error report):
    edac_mc_handle_fbd_ue(mci, rank, channel, channel + 1, msg);
So an error can be either on channels 0/1 or on channels 2/3. If we don't do channel & 2, the driver may try to report an error on a nonexistent channel (this device doesn't have a channel 4).

According to the customer feedback in comment #14, the patch is confirmed in kernel 2.6.18-200.el5. Setting SanityOnly.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2011-0017.html

Created attachment 367831
Full console log

EDAC MC0: INTERNAL ERROR: channel-b out of range (4 >= 4)
Kernel panic - not syncing: EDAC MC0: Uncorrected Error
(XEN) Domain 0 crashed: 'noreboot' set - not rebooting.

Kernel is 2.6.18-164.el5xen
Server board is a Supermicro X7DBi+ (Intel 5000P chipset) with 16GB RAM and two quad-core Xeon CPUs.
This is the first time it happened; server was rebooted about a week ago.