Bug 606687

Summary: HARDWARE ERROR on intel-sunriseridge-01 when unloading igb
Product: Red Hat Enterprise Linux 6 Reporter: Stefan Assmann <sassmann>
Component: kernel Assignee: Stefan Assmann <sassmann>
Status: CLOSED CURRENTRELEASE QA Contact: Network QE <network-qe>
Severity: medium Docs Contact:
Priority: low    
Version: 6.0 CC: agospoda, alexander.h.duyck, hjia, jane.lv, jlv, john.ronciak, jvillalo, keve.a.gabbert, kzhang, luyu, maciej.sosnowski, prarit
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-11-11 16:16:03 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 580574    
Attachments:
Description Flags
patch with workaround (based on kernel 2.6.32) none

Description Stefan Assmann 2010-06-22 08:56:47 UTC
Description of problem:

After unloading igb twice, the machine shows "HARDWARE ERROR":

root.bos.redhat.com:~> modprobe -r igb ; sleep 3 ; modprobe igb
root.bos.redhat.com:~> modprobe -r igb ; sleep 3 ; modprobe igb

HARDWARE ERROR
CPU 38: Machine Check Exception:                5 Bank 0: fa00000000400e0f
RIP !INEXACT! 10:<ffffffff8101bbd1> {mwait_idle+0x71/0xd0}
TSC f272938c47 MISC 1
PROCESSOR 0:206e6 TIME 1277196714 SOCKET 1 APIC 23
No human readable MCE decoding support on this CPU type.
Run the message through 'mcelog --ascii' to decode.
CPU 38: Machine Check Exception:                5 Bank 1: fa00000000400e0f
RIP !INEXACT! 10:<ffffffff8101bbd1> {mwait_idle+0x71/0xd0}
TSC f272938c47 MISC 1
PROCESSOR 0:206e6 TIME 1277196714 SOCKET 1 APIC 23
No human readable MCE decoding support on this CPU type.
Run the message through 'mcelog --ascii' to decode.
CPU 30: Machine Check Exception:                5 Bank 0: fa00000000400e0f
RIP !INEXACT! 10:<ffffffff8101bbd1> {mwait_idle+0x71/0xd0}
TSC f272938b81 MISC 1
PROCESSOR 0:206e6 TIME 1277196714 SOCKET 1 APIC 36
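
As a rough aid for reading the status value in the log above, the sketch below names the architectural flag bits (bits 63 to 57 of IA32_MCi_STATUS) that are set in a 16-digit hex status word. It is not a replacement for mcelog; mce_flags is a hypothetical helper, and the model-specific fields are not decoded.

```shell
#!/usr/bin/env bash
# Sketch only: list the architectural flags set in an IA32_MCi_STATUS value.
# mce_flags is a hypothetical helper, not part of mcelog.
mce_flags() {
    # Expect exactly 16 hex digits, e.g. fa00000000400e0f (no 0x prefix).
    [[ $1 =~ ^[0-9a-fA-F]{16}$ ]] || return 2
    # The top byte holds status bits 63..56; work on it alone to stay
    # clear of signed 64-bit arithmetic.
    local hi=$(( 16#${1:0:2} ))
    local out=() spec
    # bit-in-top-byte:name (bit 7 of the byte is status bit 63, and so on)
    for spec in 7:VAL 6:OVER 5:UC 4:EN 3:MISCV 2:ADDRV 1:PCC; do
        if (( (hi >> ${spec%%:*}) & 1 )); then
            out+=("${spec#*:}")
        fi
    done
    echo "${out[*]}"
}

mce_flags fa00000000400e0f
```

For the value reported here this prints VAL OVER UC EN MISCV PCC: a valid, uncorrected error with an earlier error overwritten and the processor context flagged as corrupt.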


Version-Release number of selected component (if applicable):
2.6.32-36.el6.x86_64

How reproducible:
always
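
The reload cycle from the description can be wrapped in a small loop for repeated testing. This is a minimal sketch: reload_module and the DRY_RUN switch are hypothetical names introduced here, and dry-run is the default because the real cycle can take an affected machine down.

```shell
#!/usr/bin/env bash
# Sketch only: repeat the unload/reload cycle from the description.
# reload_module and DRY_RUN are hypothetical names for illustration.
# With DRY_RUN=1 (the default) the commands are printed, not executed.
reload_module() {
    local module=$1 cycles=$2 pause=${3:-3} i
    for (( i = 1; i <= cycles; i++ )); do
        echo "cycle $i: modprobe -r $module ; sleep $pause ; modprobe $module"
        if [ "${DRY_RUN:-1}" != 1 ]; then
            modprobe -r "$module" || return 1
            sleep "$pause"
            modprobe "$module" || return 1
        fi
    done
}

reload_module igb 2
```

Running it as root with DRY_RUN=0 performs the actual unload and reload; on an affected kernel the second cycle is expected to trigger the machine check shown above.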

Comment 2 RHEL Program Management 2010-06-22 09:23:15 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 3 Stefan Assmann 2010-06-22 10:57:11 UTC
reproduced with
- vanilla 2.6.33
- vanilla 2.6.34

PROCESSOR 0:206e6 TIME 1277204113 SOCKET 0 APIC 0
No human readable MCE decoding support on this CPU type.
Run the message through 'mcelog --ascii' to decode.
This is not a software problem!
Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal Machine check
Pid: 0, comm: swapper Tainted: G   M       2.6.34 #3
Call Trace:
 <#MC>  [<ffffffff8149bb2d>] panic+0x7d/0xfe
 [<ffffffff8101e092>] mce_panic+0x1e2/0x210
 [<ffffffff8101f8a8>] do_machine_check+0xa28/0xa70
 [<ffffffff8101423f>] ? mwait_idle+0x6f/0xd0
 [<ffffffff8149ee5c>] machine_check+0x1c/0x30
 [<ffffffff8101423f>] ? mwait_idle+0x6f/0xd0
 <<EOE>>  [<ffffffff81009dc6>] cpu_idle+0xb6/0x110
 [<ffffffff81495c1b>] start_secondary+0x25d/0x2a0
Rebooting in 30 seconds..

I forgot to include the trace above in the initial report.

Comment 4 Stefan Assmann 2010-06-22 13:31:05 UTC
Just confirmed that it works on RHEL5 by running the following 10x:
modprobe -r igb ; sleep 3 ; modprobe igb

CCing Alex.

Comment 5 Stefan Assmann 2010-06-23 15:28:32 UTC
Looks like another DCA issue: the module (un)loading succeeds when I blacklist ioatdma.
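
The comment does not say how ioatdma was blacklisted. One common way is a modprobe.d fragment, sketched below; blacklist_fragment and the file name in the usage comment are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: print a modprobe.d fragment that blacklists a module.
# blacklist_fragment is a hypothetical helper.
blacklist_fragment() {
    printf '# keep %s from being auto-loaded\nblacklist %s\n' "$1" "$1"
}

blacklist_fragment ioatdma
# To apply (as root; the file name is an assumption):
#   blacklist_fragment ioatdma > /etc/modprobe.d/ioatdma-blacklist.conf
```

Note that a blacklist entry only stops alias-based auto-loading; an explicit modprobe ioatdma still loads the module.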

Comment 6 Stefan Assmann 2010-06-24 09:16:47 UTC
Maciej,

This looks like another DCA problem. Could you look into it?

Comment 7 maciej.sosnowski 2010-06-25 15:29:52 UTC
Yes. We will try to reproduce it locally.
In the meantime I have informed the Sunrise Ridge team about this issue and filed a bug in their database.

As I understand, this issue is observed on Sunrise Ridge, not Emerald Ridge - could you confirm? Thanks.

Comment 8 Stefan Assmann 2010-06-26 09:38:52 UTC
(In reply to comment #7)
> As I understand, this issue is observed on Sunrise Ridge, not Emerald Ridge -
> could you confirm? Thanks.

confirmed!

Comment 9 maciej.sosnowski 2010-07-08 16:12:00 UTC
Created attachment 430414 [details]
patch with workaround (based on kernel 2.6.32)

I am attaching a patch with proposed workaround for the issue.
The patch is based on kernel 2.6.32.
The workaround should work for both:
- Bug 572732: Unloading igb module causes system reset,
- Bug 606687: HARDWARE ERROR on intel-sunriseridge-01 when unloading igb.
Please let me know if it works on your side. Thanks.

Comment 10 Stefan Assmann 2010-07-09 09:49:06 UTC
Hi Maciej,

Works great on intel-sunriseridge-01 so far! Reloaded igb 20x without a problem. Care to explain what was going wrong?

Comment 11 maciej.sosnowski 2010-07-09 13:35:08 UTC
Good news. Thanks.
The patch is actually a workaround. To avoid the platform reset/MCE, the dca module blocks DCA providers if an Emerald Ridge / Sunrise Ridge platform is detected.

Comment 12 Stefan Assmann 2010-07-09 15:29:55 UTC
Is this the code we're going to see upstream? It would be good to have an upstream reference to get it included in RHEL6.0.

Comment 14 maciej.sosnowski 2010-07-19 13:56:38 UTC
The problem has not been fully root-caused yet. Please include the workaround patch in your kernel to avoid the problem with RHEL6.
Once we have root-caused this issue, we will provide an appropriate solution, if any is needed, to Red Hat and the upstream kernel.

Comment 15 Aristeu Rozanski 2010-07-26 14:59:45 UTC
Patch(es) available on kernel-2.6.32-52.el6

Comment 18 Hushan Jia 2010-08-16 08:27:58 UTC
Reproduced on the -44 kernel: HARDWARE ERROR when unloading igb. Verified on -63: no crash, igb works fine.
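
Since the patch landed in kernel-2.6.32-52.el6, a quick check of a release string against that build number can tell whether a given RHEL 6 kernel should carry the workaround. The sketch below assumes the 2.6.32-N.el6 naming used in this report; has_dca_workaround is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch only: is the build number in a 2.6.32-N.el6 release string >= 52?
# has_dca_workaround is a hypothetical helper; it returns 2 on unparsable input.
has_dca_workaround() {
    local release=$1 build
    build=${release#*-}      # e.g. "36.el6.x86_64"
    build=${build%%.*}       # e.g. "36"
    [[ $build =~ ^[0-9]+$ ]] || return 2
    (( build >= 52 ))
}

for r in 2.6.32-36.el6.x86_64 2.6.32-44.el6.x86_64 2.6.32-63.el6.x86_64; do
    if has_dca_workaround "$r"; then
        echo "$r: build includes the workaround"
    else
        echo "$r: build predates the workaround"
    fi
done
```

On a live system the argument would typically be "$(uname -r)".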

Comment 19 Stefan Assmann 2010-08-22 18:14:35 UTC
*** Bug 624602 has been marked as a duplicate of this bug. ***

Comment 20 releng-rhel@redhat.com 2010-11-11 16:16:03 UTC
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.