Bug 580425 - Host kernel panic when boot with "intel_iommu=on" in the kernel line
Summary: Host kernel panic when boot with "intel_iommu=on" in the kernel line
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: rc
Assignee: Don Dugger
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 514490
 
Reported: 2010-04-08 08:58 UTC by Qunfang Zhang
Modified: 2014-07-25 03:22 UTC (History)
CC: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-08-11 01:07:33 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Call trace (27.73 KB, text/plain)
2010-04-08 08:58 UTC, Qunfang Zhang
lspci -vvv using the given lspci binary (74.16 KB, text/plain)
2010-04-13 04:46 UTC, Qunfang Zhang
with_debug_in_cmdline_20101209 (33.22 KB, text/plain)
2010-12-09 09:11 UTC, Qunfang Zhang

Description Qunfang Zhang 2010-04-08 08:58:56 UTC
Created attachment 405235 [details]
Call trace

Description of problem:
Call trace on the host when booting with "intel_iommu=on" on the kernel line.
I am using an HP Z800 host.
Install tree:
nfs.englab.nay.redhat.com --dir=/pub/rhel/released/RHEL-5-Server/U5/x86_64/os

Version-Release number of selected component (if applicable):
2.6.18-194.el5

How reproducible:
100%

Steps to Reproduce:
1. Install RHEL 5.5 on the host using the install tree above.
2. After installation, add "intel_iommu=on" to the host kernel line.
3. Reboot the host.
  
Actual results:
Call trace on the host. (Attachment will be uploaded.)

Expected results:
The host boots up successfully.

Additional info:
kernel command line:
ro root=/dev/VolGroup00/LogVol00 intel_iommu=on
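As a sketch of how the parameter gets onto the kernel line (the GRUB stanza and paths below are illustrative, not copied from this host — RHEL 5 uses legacy GRUB, so the edit goes in /boot/grub/grub.conf; run this against a copy rather than the live file):

```shell
# Hypothetical sketch: append intel_iommu=on to the kernel line of a
# legacy-GRUB config. Work on a temporary copy, not /boot/grub/grub.conf.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
title Red Hat Enterprise Linux Server (2.6.18-194.el5)
	root (hd0,0)
	kernel /vmlinuz-2.6.18-194.el5 ro root=/dev/VolGroup00/LogVol00
	initrd /initrd-2.6.18-194.el5.img
EOF
# Append the parameter to every kernel line that does not already have it
sed -i '/^[[:space:]]*kernel /{/intel_iommu=on/!s/$/ intel_iommu=on/}' "$cfg"
grep 'kernel ' "$cfg"
```

After the edit, the kernel line carries the extra parameter, matching the command line shown above.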


# cat /proc/cpuinfo (only the last CPU is listed here)
processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping	: 5
cpu MHz		: 1596.000
cache size	: 8192 KB
physical id	: 0
siblings	: 4
core id		: 3
cpu cores	: 4
apicid		: 6
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips	: 5333.40
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management: [8]

Comment 1 Qunfang Zhang 2010-04-08 09:00:48 UTC
And I have enabled VT-d and VT-x in the BIOS.

# dmidecode -t 0
# dmidecode 2.10
SMBIOS 2.6 present.

Handle 0x0001, DMI type 0, 24 bytes
BIOS Information
	Vendor: Hewlett-Packard
	Version: 786G5 v01.17
	Release Date: 08/19/2009
	Address: 0xE0000
	Runtime Size: 128 kB
	ROM Size: 2048 kB
	Characteristics:
		PCI is supported
		PNP is supported
		BIOS is upgradeable
		BIOS shadowing is allowed
		Boot from CD is supported
		Selectable boot is supported
		EDD is supported
		Japanese floppy for Toshiba 1.2 MB is supported (int 13h)
		3.5"/720 kB floppy services are supported (int 13h)
		Print screen service is supported (int 5h)
		8042 keyboard services are supported (int 9h)
		Serial services are supported (int 14h)
		Printer services are supported (int 17h)
		ACPI is supported
		USB legacy is supported
		LS-120 boot is supported
		ATAPI Zip drive boot is supported
		BIOS boot specification is supported
		Function key-initiated network boot is supported
		Targeted content distribution is supported
	BIOS Revision: 1.17

Comment 2 Qunfang Zhang 2010-04-08 09:46:24 UTC
After downgrading the kernel to 2.6.18-189.el5, this issue still exists.

Comment 3 Qunfang Zhang 2010-04-09 05:12:34 UTC
After removing the Intel 82576 NIC card and booting the host with "intel_iommu=on" again, the host boots up successfully.
I will try another 82576 card.

Comment 4 Qunfang Zhang 2010-04-12 10:23:55 UTC
I re-tested the bug under the following conditions and paste the results here.
Two NIC cards are plugged into the PCI slots: one is an 82576 (2 ports) and the other is an 82572.

# lspci | grep Ether
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)
1c:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
1c:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
28:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)

01:00.0 and 02:00.0 are onboard devices.

1. Remove the 82572EI from its PCIe slot, boot the host with intel_iommu=on: PASS.
2. Plug the 82572EI into another PCIe slot, boot the host with intel_iommu=on: PASS.
3. Plug the 82572EI back into the original PCIe slot, boot the host with intel_iommu=on: ==> *kernel oops*.
4. Without intel_iommu=on on the kernel line, all the operations above work well: PASS.

#lspci -vvv -s 28:00.0
28:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)
	Subsystem: Intel Corporation PRO/1000 PT Desktop Adapter
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 59
	Region 0: Memory at e3100000 (32-bit, non-prefetchable) [size=128K]
	Region 1: Memory at e3120000 (32-bit, non-prefetchable) [size=128K]
	Region 2: I/O ports at d000 [size=32]
	[virtual] Expansion ROM at e3f00000 [disabled] [size=128K]
	Capabilities: [c8] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [d0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
		Address: 00000000fee00000  Data: 403b
	Capabilities: [e0] Express Endpoint IRQ 0
		Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag-
		Device: Latency L0s <512ns, L1 <64us
		Device: AtnBtn- AtnInd- PwrInd-
		Device: Errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
		Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
		Device: MaxPayload 256 bytes, MaxReadReq 512 bytes
		Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 0
		Link: Latency L0s <4us, L1 <64us
		Link: ASPM Disabled RCB 64 bytes CommClk+ ExtSynch-
		Link: Speed 2.5Gb/s, Width x1
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number 67-26-5c-ff-ff-17-15-00

So, what else could I do for this bug?

Comment 5 Qunfang Zhang 2010-04-12 10:26:56 UTC
(In reply to comment #4)

> 1. Remove 82572EI from PCIe slot, boot host with intel_iommu=on, PASS.
> 2. Plug the 82572EI to another PCIe slot, boot host with intel_iommu=on. PASS.
> 3. Plug the 82572EI to the original PCIe slot, boot host with intel_iommu=on.
> ==> *kernel oops*.
But it is strange: we have used this PCIe slot for the NIC card all along.
That is, we had not moved the PCI devices in previous testing, and this is the first time I have hit the issue.

> 4. Without intel_iommu=on in the kernel line, all operations above work well.
> PASS.

Comment 6 Chris Wright 2010-04-12 18:00:47 UTC
Can you also include the contents of /proc/iomem?

I've seen a similar back trace once before, I'll need to refresh my memory on what the cause was.

Comment 7 Chris Wright 2010-04-12 18:03:19 UTC
Also, does intel_iommu=on iommu=pt make the problem go away?  And a full lspci -vvv would be useful.  Please use lspci binary from here:

http://et.redhat.com/~chrisw/rhel5/5.4/bin/lspci

Comment 8 Qunfang Zhang 2010-04-13 04:30:18 UTC
(In reply to comment #6)
> Can you also include the contents of /proc/iomem?
> 
> I've seen a similar back trace once before, I'll need to refresh my memory on
> what the cause was.    

# cat /proc/iomem 
00010000-000957ff : System RAM
00095800-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000ce3ff : Video ROM
000f0000-000fffff : System ROM
00100000-cefa57ff : System RAM
  00200000-0048472c : Kernel code
  0048472d-005cac67 : Kernel data
cefa5800-cfffffff : reserved
d0000000-dfffffff : PCI Bus #0f
  d0000000-dfffffff : 0000:0f:00.0
e0000000-e2ffffff : PCI Bus #0f
  e0000000-e1ffffff : 0000:0f:00.0
  e2000000-e2ffffff : 0000:0f:00.0
e3000000-e30fffff : PCI Bus #37
  e3000000-e3000fff : 0000:37:09.0
e3100000-e31fffff : PCI Bus #28
  e3100000-e311ffff : 0000:28:00.0
    e3100000-e311ffff : e1000e
  e3120000-e313ffff : 0000:28:00.0
    e3120000-e313ffff : e1000e
e3200000-e3bfffff : PCI Bus #1c
  e3200000-e321ffff : 0000:1c:00.0
    e3200000-e321ffff : igb
  e3220000-e323ffff : 0000:1c:00.1
    e3220000-e323ffff : igb
  e3240000-e3243fff : 0000:1c:00.0
    e3240000-e3243fff : igb
  e3244000-e3247fff : 0000:1c:00.1
    e3244000-e3247fff : igb
  e3400000-e37fffff : 0000:1c:00.0
    e3400000-e37fffff : igb
  e3800000-e3bfffff : 0000:1c:00.1
    e3800000-e3bfffff : igb
e3c00000-e3cfffff : PCI Bus #02
  e3c00000-e3c0ffff : 0000:02:00.0
    e3c00000-e3c0ffff : tg3
e3d00000-e3dfffff : PCI Bus #01
  e3d00000-e3d0ffff : 0000:01:00.0
    e3d00000-e3d0ffff : tg3
e3e00000-e3e03fff : 0000:00:1b.0
  e3e00000-e3e03fff : ICH HD audio
e3e04000-e3e047ff : 0000:00:1f.2
  e3e04000-e3e047ff : ahci
e3e04800-e3e04bff : 0000:00:1a.7
  e3e04800-e3e04bff : ehci_hcd
e3e04c00-e3e04fff : 0000:00:1d.7
  e3e04c00-e3e04fff : ehci_hcd
e3f00000-e3ffffff : PCI Bus #28
  e3f00000-e3f1ffff : 0000:28:00.0
e4000000-e40fffff : PCI Bus #41
  e4000000-e400ffff : 0000:41:00.0
    e4000000-e400ffff : mpt
  e4010000-e4013fff : 0000:41:00.0
    e4010000-e4013fff : mpt
e4200000-e43fffff : PCI Bus #41
  e4200000-e43fffff : 0000:41:00.0
e4400000-e4bfffff : PCI Bus #1c
  e4400000-e47fffff : 0000:1c:00.0
  e4800000-e4bfffff : 0000:1c:00.1
f0000000-f7ffffff : reserved
fec00000-fed3ffff : reserved
fed45000-ffffffff : reserved
100000000-32fffffff : System RAM
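As an aside, the regions belonging to the suspect 82572EI (0000:28:00.0) can be pulled out of a listing like the one above with a simple grep. This sketch runs against a copied snippet; on the live host one would read /proc/iomem directly:

```shell
# Extract the lines for PCI device 0000:28:00.0 from a /proc/iomem-style
# listing. The here-doc reproduces a few lines from the listing above.
iomem=$(mktemp)
cat > "$iomem" <<'EOF'
e3100000-e31fffff : PCI Bus #28
  e3100000-e311ffff : 0000:28:00.0
    e3100000-e311ffff : e1000e
  e3120000-e313ffff : 0000:28:00.0
    e3120000-e313ffff : e1000e
e3f00000-e3ffffff : PCI Bus #28
  e3f00000-e3f1ffff : 0000:28:00.0
EOF
# Three region lines belong to the device: the two memory BARs claimed by
# e1000e plus the expansion-ROM window under the second bus range.
grep -F '0000:28:00.0' "$iomem"
```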

Comment 9 Qunfang Zhang 2010-04-13 04:45:51 UTC
(In reply to comment #7)
> Also, does intel_iommu=on iommu=pt make the problem go away? 
No, "intel_iommu=on iommu=pt" also produces a call trace.

> And a full lspci -vvv would be useful.  Please use the lspci binary from here:
> 
> http://et.redhat.com/~chrisw/rhel5/5.4/bin/lspci    

Attachment will be uploaded.

Comment 10 Qunfang Zhang 2010-04-13 04:46:43 UTC
Created attachment 406142 [details]
lspci -vvv using the given lspci binary

Comment 11 Chris Wright 2010-04-15 16:08:49 UTC
Does intel_iommu=on,strict make any difference?

Alternatively, I can disable queued invalidation (this will require a patch) so we can verify whether it works with register-based invalidation.

The other thing that would help is the DMAR table.  To capture it, reboot with 'debug' added to the kernel command line.

Comment 12 Andrew Jones 2010-06-30 13:34:53 UTC
Qunfang,

Ping, can you try Chris' suggestions?

Drew

Comment 13 Don Dugger 2010-09-02 18:08:51 UTC
Chris - I'm assigning this bug to you since you seem to be on top of it.  You can always give it back if I'm not supposed to share the love this way :-)

Comment 14 Bill Burns 2010-11-24 16:46:06 UTC
Back at you Don...

Comment 15 Don Dugger 2010-12-07 19:31:26 UTC
Back to Qunfang, have you tried Chris' suggestions?

Comment 16 Qunfang Zhang 2010-12-09 01:53:30 UTC
(In reply to comment #15)
> Back to Qunfang, have you tried Chris' suggestions?

Hi, Don and Andrew, sorry for the delay.
I will reserve that machine and try it ASAP.

Comment 17 Qunfang Zhang 2010-12-09 08:35:58 UTC
Hi, Don and Andrew,

I got an HP Z800 host and the two specified NIC cards, then re-installed the released RHEL 5.5 OS.
But I cannot reproduce the issue this time, even after trying the 82572 and 82576 NIC cards in different slots.

The host boots up successfully without any errors in dmesg.

[root@dhcp-91-60 ~]# uname -r
2.6.18-194.el5

[root@dhcp-91-60 ~]# lspci  | grep Ether
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)
1c:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
1c:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
28:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)

[root@dhcp-91-60 ~]# dmidecode -t 0
# dmidecode 2.10
SMBIOS 2.6 present.

Handle 0x0001, DMI type 0, 24 bytes
BIOS Information
	Vendor: Hewlett-Packard
	Version: 786G5 v01.17
	Release Date: 08/19/2009
	Address: 0xE0000
	Runtime Size: 128 kB
	ROM Size: 2048 kB
	Characteristics:
		PCI is supported
		PNP is supported
		BIOS is upgradeable
		BIOS shadowing is allowed
		Boot from CD is supported
		Selectable boot is supported
		EDD is supported
		Japanese floppy for Toshiba 1.2 MB is supported (int 13h)
		3.5"/720 kB floppy services are supported (int 13h)
		Print screen service is supported (int 5h)
		8042 keyboard services are supported (int 9h)
		Serial services are supported (int 14h)
		Printer services are supported (int 17h)
		ACPI is supported
		USB legacy is supported
		LS-120 boot is supported
		ATAPI Zip drive boot is supported
		BIOS boot specification is supported
		Function key-initiated network boot is supported
		Targeted content distribution is supported
	BIOS Revision: 1.17

Comment 18 Qunfang Zhang 2010-12-09 09:10:08 UTC
Hi, Don,

Please ignore my last comment.
I kept trying, plugging the NIC cards into other slots, and this time I can reproduce the issue.

I also added "debug" to the kernel command line; the issue still reproduces.

Attachment will be uploaded.

Comment 19 Qunfang Zhang 2010-12-09 09:11:08 UTC
Created attachment 467707 [details]
with_debug_in_cmdline_20101209

Comment 23 Laszlo Ersek 2011-04-29 08:48:22 UTC
Hello Qunfang,

does this reproduce with 5.6? If so, can you loan us the machine, leaving the cards in the problematic slots? Is the machine connected to a kvm appliance?

Thanks!

Comment 26 Laszlo Ersek 2011-07-29 10:15:23 UTC
Asking again, exactly three months later:

(In reply to comment #23)
> Hello Qunfang,
> 
> does this reproduce with 5.6? If so, can you loan us the machine, leaving the
> cards in the problematic slots? Is the machine connected to a kvm appliance?
> 
> Thanks!

except: can you please try with 5.7 now? Thanks!

Comment 27 Qunfang Zhang 2011-08-01 09:08:12 UTC
(In reply to comment #26)
> Asking again, exactly three months later:
> 
> (In reply to comment #23)
> > Hello Qunfang,
> > 
> > does this reproduce with 5.6? If so, can you loan us the machine, leaving the
> > cards in the problematic slots? Is the machine connected to a kvm appliance?
> > 
> > Thanks!
> 
> except: can you please try with 5.7 now? Thanks!

Hi, Laszlo
Sorry, I just came back from the weekend. I will reserve the specific host and try it.

Comment 35 Don Dugger 2011-08-10 17:55:18 UTC
Odd, I got an email about the `needinfo` but I can't see that message in the BZ (and I'm the bug assignee but can't see all the comments - go figure).

Anyway, the `needinfo` was about machine access in a RH lab, I'm afraid I can't help you there (I'm with Intel :-)

Comment 36 Qunfang Zhang 2011-08-11 01:07:33 UTC
(In reply to comment #35)
> Odd, I got an email about the `needinfo' but I can't see that message in the BZ
> (and I'm the bug assignee but can't see all the comments - go figure).
> 
> Anyway, the `needinfo` was about machine access in a RH lab, I'm afraid I can't
> help you there (I'm with Intel :-)

Hi, Don
Thank you for the reply anyway. :) The server room administrator says the slot that had the problem is intended for a display card, and the other slots work well with the NIC card plugged in. So I will close this issue as NOTABUG.


Thanks
Qunfang

