Bug 606687 - HARDWARE ERROR on intel-sunriseridge-01 when unloading igb
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: All Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Stefan Assmann
QA Contact: Network QE
Duplicates: 624602
Depends On:
Blocks: 580574
Reported: 2010-06-22 04:56 EDT by Stefan Assmann
Modified: 2010-11-11 11:16 EST
CC List: 12 users

Doc Type: Bug Fix
Last Closed: 2010-11-11 11:16:03 EST


Attachments
patch with workaround (based on kernel 2.6.32) (3.97 KB, patch)
2010-07-08 12:12 EDT, maciej.sosnowski
Description Stefan Assmann 2010-06-22 04:56:47 EDT
Description of problem:

After unloading igb twice the machine shows "HARDWARE ERROR"

root@intel-sunriseridge-01.lab.bos.redhat.com:~> modprobe -r igb ; sleep 3 ; modprobe igb
root@intel-sunriseridge-01.lab.bos.redhat.com:~> modprobe -r igb ; sleep 3 ; modprobe igb

HARDWARE ERROR
CPU 38: Machine Check Exception:                5 Bank 0: fa00000000400e0f
RIP !INEXACT! 10:<ffffffff8101bbd1> {mwait_idle+0x71/0xd0}
TSC f272938c47 MISC 1
PROCESSOR 0:206e6 TIME 1277196714 SOCKET 1 APIC 23
No human readable MCE decoding support on this CPU type.
Run the message through 'mcelog --ascii' to decode.
CPU 38: Machine Check Exception:                5 Bank 1: fa00000000400e0f
RIP !INEXACT! 10:<ffffffff8101bbd1> {mwait_idle+0x71/0xd0}
TSC f272938c47 MISC 1
PROCESSOR 0:206e6 TIME 1277196714 SOCKET 1 APIC 23
No human readable MCE decoding support on this CPU type.
Run the message through 'mcelog --ascii' to decode.
CPU 30: Machine Check Exception:                5 Bank 0: fa00000000400e0f
RIP !INEXACT! 10:<ffffffff8101bbd1> {mwait_idle+0x71/0xd0}
TSC f272938b81 MISC 1
PROCESSOR 0:206e6 TIME 1277196714 SOCKET 1 APIC 36


Version-Release number of selected component (if applicable):
2.6.32-36.el6.x86_64

How reproducible:
always
Comment 2 RHEL Product and Program Management 2010-06-22 05:23:15 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.
Comment 3 Stefan Assmann 2010-06-22 06:57:11 EDT
reproduced with
- vanilla 2.6.33
- vanilla 2.6.34

PROCESSOR 0:206e6 TIME 1277204113 SOCKET 0 APIC 0
No human readable MCE decoding support on this CPU type.
Run the message through 'mcelog --ascii' to decode.
This is not a software problem!
Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal Machine check
Pid: 0, comm: swapper Tainted: G   M       2.6.34 #3
Call Trace:
 <#MC>  [<ffffffff8149bb2d>] panic+0x7d/0xfe
 [<ffffffff8101e092>] mce_panic+0x1e2/0x210
 [<ffffffff8101f8a8>] do_machine_check+0xa28/0xa70
 [<ffffffff8101423f>] ? mwait_idle+0x6f/0xd0
 [<ffffffff8149ee5c>] machine_check+0x1c/0x30
 [<ffffffff8101423f>] ? mwait_idle+0x6f/0xd0
 <<EOE>>  [<ffffffff81009dc6>] cpu_idle+0xb6/0x110
 [<ffffffff81495c1b>] start_secondary+0x25d/0x2a0
Rebooting in 30 seconds..

I forgot to mention the trace above in the initial report.
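
For reference, the bank status value fa00000000400e0f from the description can be decoded by hand. The sketch below assumes the standard IA32_MCi_STATUS bit layout (VAL 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57, MCA error code in bits 15:0, model-specific error code in bits 31:16); the function name is made up for illustration and this is no substitute for 'mcelog --ascii'.

```shell
# Decode the architectural flag bits of a 64-bit IA32_MCi_STATUS value
# given as 16 hex digits. The value is split into two 32-bit halves so the
# arithmetic stays in range on any POSIX shell.
# decode_mci_status is a hypothetical helper name, not part of mcelog.
decode_mci_status() {
    mci_hi=$(( 0x${1%????????} ))   # bits 63..32
    mci_lo=$(( 0x${1#????????} ))   # bits 31..0
    for mci_pair in VAL:63 OVER:62 UC:61 EN:60 MISCV:59 ADDRV:58 PCC:57; do
        mci_name=${mci_pair%%:*}
        mci_bit=${mci_pair##*:}
        printf '%s=%d\n' "$mci_name" $(( (mci_hi >> (mci_bit - 32)) & 1 ))
    done
    printf 'mca_error_code=0x%04x\n' $(( mci_lo & 0xffff ))
    printf 'model_specific_code=0x%04x\n' $(( (mci_lo >> 16) & 0xffff ))
}

decode_mci_status fa00000000400e0f
```

With this value the PCC bit comes out set, which is consistent with the "Processor context corrupt" line in the panic output, and ADDRV comes out clear, so no address was logged.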
Comment 4 Stefan Assmann 2010-06-22 09:31:05 EDT
Just confirmed that it works on RHEL5 by running the following 10 times:
modprobe -r igb ; sleep 3 ; modprobe igb
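
A minimal sketch of how that repeated reload could be scripted. The function name and the RUN variable are assumptions made for illustration; only the modprobe/sleep sequence comes from this report.

```shell
# Reload a kernel module repeatedly and stop at the first failing command.
# RUN is a hypothetical hook: set RUN=echo to print the commands instead of
# executing them (handy when trying the script without root).
reload_module() {
    rl_module=$1
    rl_count=${2:-10}
    rl_i=1
    while [ "$rl_i" -le "$rl_count" ]; do
        ${RUN:-} modprobe -r "$rl_module" || return 1
        ${RUN:-} sleep 3 || return 1
        ${RUN:-} modprobe "$rl_module" || return 1
        echo "reload $rl_i/$rl_count of $rl_module ok"
        rl_i=$((rl_i + 1))
    done
}

# Example (as root): reload_module igb 10
```

On an affected machine the loop would not get to report a failure, since the MCE panics the box; the progress lines only show how many reloads completed before that.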

Ccing Alex
Comment 5 Stefan Assmann 2010-06-23 11:28:32 EDT
Looks like another DCA issue: module (un)loading succeeds when I blacklist ioatdma.
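
The comment does not say how ioatdma was blacklisted. One common way, sketched here as an assumption, is a modprobe.d entry; the helper name and default file name are made up for illustration.

```shell
# Write a modprobe.d blacklist entry for a module.
# blacklist_module and the default file name are hypothetical; modprobe
# reads any *.conf file under /etc/modprobe.d/.
blacklist_module() {
    bl_module=$1
    bl_file=${2:-/etc/modprobe.d/blacklist-$bl_module.conf}
    printf 'blacklist %s\n' "$bl_module" > "$bl_file"
}

# Example (as root): blacklist_module ioatdma
```

A blacklist entry only stops alias-based autoloading. An explicit "modprobe ioatdma" still loads the module, and a copy already loaded stays loaded until it is removed or the machine is rebooted.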
Comment 6 Stefan Assmann 2010-06-24 05:16:47 EDT
Maciej,

looks like another DCA problem, could you look into it?
Comment 7 maciej.sosnowski 2010-06-25 11:29:52 EDT
Yes. We will try to reproduce it locally.
In the meantime I have informed the Sunrise Ridge team about this issue and filed a bug in their database.

As I understand, this issue is observed on Sunrise Ridge, not Emerald Ridge - could you confirm? Thanks.
Comment 8 Stefan Assmann 2010-06-26 05:38:52 EDT
(In reply to comment #7)
> As I understand, this issue is observed on Sunrise Ridge, not Emerald Ridge -
> could you confirm? Thanks.

confirmed!
Comment 9 maciej.sosnowski 2010-07-08 12:12:00 EDT
Created attachment 430414 [details]
patch with workaround (based on kernel 2.6.32)

I am attaching a patch with proposed workaround for the issue.
The patch is based on kernel 2.6.32.
The workaround should work for both:
- Bug 572732: Unloading igb module causes system reset,
- Bug 606687: HARDWARE ERROR on intel-sunriseridge-01 when unloading igb.
Please let me know if it works on your side. Thanks.
Comment 10 Stefan Assmann 2010-07-09 05:49:06 EDT
Hi Maciej,

works great on intel-sunriseridge-01 so far! reloaded igb 20x without a problem. Care to explain what was going wrong?
Comment 11 maciej.sosnowski 2010-07-09 09:35:08 EDT
Good news. Thanks.
The patch is actually a workaround. To avoid the platform reset/MCE, the dca module blocks DCA providers if an Emerald Ridge / Sunrise Ridge platform is detected.
Comment 12 Stefan Assmann 2010-07-09 11:29:55 EDT
Is this the code we're going to see upstream? It would be good to have an upstream reference to get it included into RHEL6.0.
Comment 14 maciej.sosnowski 2010-07-19 09:56:38 EDT
The problem has not been fully root-caused yet. Please include the workaround patch in your kernel to avoid the problem with RHEL6.
Once we have root-caused this issue we will provide an appropriate solution, if one is needed, to Red Hat and the upstream kernel.
Comment 15 Aristeu Rozanski 2010-07-26 10:59:45 EDT
Patch(es) available on kernel-2.6.32-52.el6
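
A rough way to check whether a given RHEL 6 kernel is at or past that build, assuming release strings of the form 2.6.32-N.el6... as seen in this report. The function name is hypothetical, and the check only compares build numbers; it does not inspect the kernel for the patch itself.

```shell
# Return 0 if a RHEL 6 kernel release string such as "2.6.32-63.el6.x86_64"
# has a build number at or above the given minimum (default 52, the build
# named in this report). kernel_has_fix is a hypothetical helper name.
kernel_has_fix() {
    kf_release=$1
    kf_min=${2:-52}
    case $kf_release in
        2.6.32-*.el6*) ;;
        *) return 2 ;;              # not a RHEL 6 2.6.32 kernel: cannot tell
    esac
    kf_build=${kf_release#2.6.32-}  # e.g. "63.el6.x86_64"
    kf_build=${kf_build%%.*}        # e.g. "63"
    [ "$kf_build" -ge "$kf_min" ]
}

# Example: kernel_has_fix "$(uname -r)" && echo "workaround build or later"
```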
Comment 18 Hushan Jia 2010-08-16 04:27:58 EDT
Reproduced on the -44 kernel: HARDWARE ERROR when unloading igb. Verified on -63: no crash, igb works fine.
Comment 19 Stefan Assmann 2010-08-22 14:14:35 EDT
*** Bug 624602 has been marked as a duplicate of this bug. ***
Comment 20 releng-rhel@redhat.com 2010-11-11 11:16:03 EST
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.
