Created attachment 405235 [details]
Call trace

Description of problem:
The host prints a call trace when it is booted with "intel_iommu=on" on the kernel line. The host is an HP Z800.
Install tree: nfs.englab.nay.redhat.com --dir=/pub/rhel/released/RHEL-5-Server/U5/x86_64/os

Version-Release number of selected component (if applicable):
2.6.18-194.el5

How reproducible:
100%

Steps to Reproduce:
1. Install the RHEL5.5 OS on the host using the install tree above.
2. After installation, add "intel_iommu=on" to the host kernel line.
3. Reboot the host.

Actual results:
Call trace on the host. (Attachment will be uploaded.)

Expected results:
The host boots successfully.

Additional info:
kernel command line: ro root=/dev/VolGroup00/LogVol00 intel_iommu=on

# cat /proc/cpuinfo (only the last CPU is listed here)
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
stepping : 5
cpu MHz : 1596.000
cache size : 8192 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
apicid : 6
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5333.40
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
I have also enabled VT-d and VT-x in the BIOS.

# dmidecode -t 0
# dmidecode 2.10
SMBIOS 2.6 present.

Handle 0x0001, DMI type 0, 24 bytes
BIOS Information
        Vendor: Hewlett-Packard
        Version: 786G5 v01.17
        Release Date: 08/19/2009
        Address: 0xE0000
        Runtime Size: 128 kB
        ROM Size: 2048 kB
        Characteristics:
                PCI is supported
                PNP is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                Boot from CD is supported
                Selectable boot is supported
                EDD is supported
                Japanese floppy for Toshiba 1.2 MB is supported (int 13h)
                3.5"/720 kB floppy services are supported (int 13h)
                Print screen service is supported (int 5h)
                8042 keyboard services are supported (int 9h)
                Serial services are supported (int 14h)
                Printer services are supported (int 17h)
                ACPI is supported
                USB legacy is supported
                LS-120 boot is supported
                ATAPI Zip drive boot is supported
                BIOS boot specification is supported
                Function key-initiated network boot is supported
                Targeted content distribution is supported
        BIOS Revision: 1.17
After downgrading the kernel to 2.6.18-189.el5, the issue still exists.
After removing the Intel 82576 NIC and booting the host with "intel_iommu=on" again, the host boots successfully. I will swap in another 82576 card and try again.
I re-tested the bug under the following conditions and am pasting the results here.

Two NICs are plugged into the PCIe slots: one is an 82576 (2 ports) and the other is an 82572.

# lspci | grep Ethernet
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)
1c:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
1c:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
28:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)

01:00.0 and 02:00.0 are devices on the mainboard.

1. Remove 82572EI from PCIe slot, boot host with intel_iommu=on, PASS.
2. Plug the 82572EI to another PCIe slot, boot host with intel_iommu=on. PASS.
3. Plug the 82572EI to the original PCIe slot, boot host with intel_iommu=on. ==> *kernel oops*.
4. Without intel_iommu=on in the kernel line, all operations above work well. PASS.
# lspci -vvv -s 28:00.0
28:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)
        Subsystem: Intel Corporation PRO/1000 PT Desktop Adapter
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 59
        Region 0: Memory at e3100000 (32-bit, non-prefetchable) [size=128K]
        Region 1: Memory at e3120000 (32-bit, non-prefetchable) [size=128K]
        Region 2: I/O ports at d000 [size=32]
        [virtual] Expansion ROM at e3f00000 [disabled] [size=128K]
        Capabilities: [c8] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [d0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
                Address: 00000000fee00000 Data: 403b
        Capabilities: [e0] Express Endpoint IRQ 0
                Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag-
                Device: Latency L0s <512ns, L1 <64us
                Device: AtnBtn- AtnInd- PwrInd-
                Device: Errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                Device: MaxPayload 256 bytes, MaxReadReq 512 bytes
                Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 0
                Link: Latency L0s <4us, L1 <64us
                Link: ASPM Disabled RCB 64 bytes CommClk+ ExtSynch-
                Link: Speed 2.5Gb/s, Width x1
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 67-26-5c-ff-ff-17-15-00

So, what else can I do for this bug?
(In reply to comment #4)
> 1. Remove 82572EI from PCIe slot, boot host with intel_iommu=on, PASS.
> 2. Plug the 82572EI to another PCIe slot, boot host with intel_iommu=on. PASS.
> 3. Plug the 82572EI to the original PCIe slot, boot host with intel_iommu=on.
> ==> *kernel oops*.

But it is strange, because we have used this PCIe slot for the NIC all the time before. That is to say, we had not moved the PCI devices for earlier testing, and this is the first time I have hit the issue.

> 4. Without intel_iommu=on in the kernel line, all operations above work well.
> PASS.
Can you also include the contents of /proc/iomem? I've seen a similar back trace once before, I'll need to refresh my memory on what the cause was.
Also, does intel_iommu=on iommu=pt make the problem go away? And a full lspci -vvv would be useful. Please use lspci binary from here: http://et.redhat.com/~chrisw/rhel5/5.4/bin/lspci
(In reply to comment #6)
> Can you also include the contents of /proc/iomem?
>
> I've seen a similar back trace once before, I'll need to refresh my memory on
> what the cause was.

# cat /proc/iomem
00010000-000957ff : System RAM
00095800-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000ce3ff : Video ROM
000f0000-000fffff : System ROM
00100000-cefa57ff : System RAM
  00200000-0048472c : Kernel code
  0048472d-005cac67 : Kernel data
cefa5800-cfffffff : reserved
d0000000-dfffffff : PCI Bus #0f
  d0000000-dfffffff : 0000:0f:00.0
e0000000-e2ffffff : PCI Bus #0f
  e0000000-e1ffffff : 0000:0f:00.0
  e2000000-e2ffffff : 0000:0f:00.0
e3000000-e30fffff : PCI Bus #37
  e3000000-e3000fff : 0000:37:09.0
e3100000-e31fffff : PCI Bus #28
  e3100000-e311ffff : 0000:28:00.0
    e3100000-e311ffff : e1000e
  e3120000-e313ffff : 0000:28:00.0
    e3120000-e313ffff : e1000e
e3200000-e3bfffff : PCI Bus #1c
  e3200000-e321ffff : 0000:1c:00.0
    e3200000-e321ffff : igb
  e3220000-e323ffff : 0000:1c:00.1
    e3220000-e323ffff : igb
  e3240000-e3243fff : 0000:1c:00.0
    e3240000-e3243fff : igb
  e3244000-e3247fff : 0000:1c:00.1
    e3244000-e3247fff : igb
  e3400000-e37fffff : 0000:1c:00.0
    e3400000-e37fffff : igb
  e3800000-e3bfffff : 0000:1c:00.1
    e3800000-e3bfffff : igb
e3c00000-e3cfffff : PCI Bus #02
  e3c00000-e3c0ffff : 0000:02:00.0
    e3c00000-e3c0ffff : tg3
e3d00000-e3dfffff : PCI Bus #01
  e3d00000-e3d0ffff : 0000:01:00.0
    e3d00000-e3d0ffff : tg3
e3e00000-e3e03fff : 0000:00:1b.0
  e3e00000-e3e03fff : ICH HD audio
e3e04000-e3e047ff : 0000:00:1f.2
  e3e04000-e3e047ff : ahci
e3e04800-e3e04bff : 0000:00:1a.7
  e3e04800-e3e04bff : ehci_hcd
e3e04c00-e3e04fff : 0000:00:1d.7
  e3e04c00-e3e04fff : ehci_hcd
e3f00000-e3ffffff : PCI Bus #28
  e3f00000-e3f1ffff : 0000:28:00.0
e4000000-e40fffff : PCI Bus #41
  e4000000-e400ffff : 0000:41:00.0
    e4000000-e400ffff : mpt
  e4010000-e4013fff : 0000:41:00.0
    e4010000-e4013fff : mpt
e4200000-e43fffff : PCI Bus #41
  e4200000-e43fffff : 0000:41:00.0
e4400000-e4bfffff : PCI Bus #1c
  e4400000-e47fffff : 0000:1c:00.0
  e4800000-e4bfffff : 0000:1c:00.1
f0000000-f7ffffff : reserved
fec00000-fed3ffff : reserved
fed45000-ffffffff : reserved
100000000-32fffffff : System RAM
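For anyone cross-checking the /proc/iomem dump above against the lspci regions of 28:00.0, here is a minimal Python sketch that pulls out every window claimed by a given name and prints its size. The function names (parse_iomem, windows_for) are made up for illustration, and SAMPLE is only the bus #28 excerpt from the dump, not the whole file; leading indentation is treated as optional.

```python
import re

# One /proc/iomem line: optional indentation, "start-end : name" (hex addresses).
LINE_RE = re.compile(r"^(\s*)([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.+)$")


def parse_iomem(text):
    """Return a list of (depth, start, end, name) tuples, one per parsed line."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip prompts, blank lines, quoted text
        indent, start, end, name = m.groups()
        # /proc/iomem indents nested resources by two spaces per level.
        entries.append((len(indent) // 2, int(start, 16), int(end, 16), name.strip()))
    return entries


def windows_for(entries, name):
    """Return (start, end, size_in_bytes) for every entry with this exact name."""
    return [(s, e, e - s + 1) for _, s, e, n in entries if n == name]


# Excerpt copied from the dump in this comment (bus #28 only).
SAMPLE = """\
e3100000-e31fffff : PCI Bus #28
  e3100000-e311ffff : 0000:28:00.0
    e3100000-e311ffff : e1000e
  e3120000-e313ffff : 0000:28:00.0
    e3120000-e313ffff : e1000e
e3f00000-e3ffffff : PCI Bus #28
  e3f00000-e3f1ffff : 0000:28:00.0
"""

if __name__ == "__main__":
    entries = parse_iomem(SAMPLE)
    for start, end, size in windows_for(entries, "0000:28:00.0"):
        print(f"{start:08x}-{end:08x} {size // 1024} KiB")
```

On the excerpt this reports three 128 KiB windows for 0000:28:00.0, which lines up with Region 0, Region 1 and the expansion ROM in the lspci -vvv output for that device. To run it on a live host, pass the contents of /proc/iomem instead of SAMPLE.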
(In reply to comment #7)
> Also, does intel_iommu=on iommu=pt make the problem go away?

No, "intel_iommu=on iommu=pt" also produces a call trace.

> And a full lspci -vvv would be useful. Please use lspci binary from here:
>
> http://et.redhat.com/~chrisw/rhel5/5.4/bin/lspci

Attachment will be uploaded.
Created attachment 406142 [details] lspci -vvv using the given lspci binary
Does intel_iommu=on,strict make any difference? Alternatively, I can disable queued invalidation (will require a patch) and verify that it works with register based invalidation. The other thing that would help is the DMAR table. To help get that you'll need to reboot and add 'debug' to the kernel command line.
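Since several boot-option combinations have now been asked for (intel_iommu=on, intel_iommu=on iommu=pt, intel_iommu=on,strict, plus debug), it may help to record exactly which ones were active for each test boot. Below is a rough Python sketch for that, assuming the string comes from /proc/cmdline; the helper names parse_cmdline and iommu_summary are hypothetical, and the parser does not handle quoted values that contain spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value, or None for bare flags}."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def iommu_summary(cmdline):
    """Report which of the options discussed in this bug are present."""
    params = parse_cmdline(cmdline)
    # intel_iommu takes a comma-separated list, e.g. "on,strict".
    intel_opts = (params.get("intel_iommu") or "").split(",")
    return {
        "intel_iommu_on": "on" in intel_opts,
        "strict": "strict" in intel_opts,
        "passthrough": params.get("iommu") == "pt",
        "debug": "debug" in params,
    }


if __name__ == "__main__":
    # Illustrative command line built from the options mentioned in this bug.
    line = "ro root=/dev/VolGroup00/LogVol00 intel_iommu=on,strict debug"
    print(iommu_summary(line))
```

Pasting the printed summary next to each PASS or oops result would make it clear which combination a given call trace belongs to.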
Qunfang, Ping, can you try Chris' suggestions? Drew
Chris - I'm assigning this bug to you since you seem to be on top of it. You can always give it back if I'm not supposed to share the love this way :-)
Back at you Don...
Back to Qunfang, have you tried Chris' suggestions?
(In reply to comment #15)
> Back to Qunfang, have you tried Chris' suggestions?

Hi, Don and Andrew,

Sorry for the delay. I will reserve that machine and try it asap.
Hi, Don and Andrew

I got an HP Z800 host and the 2 specified NICs, then re-installed the released RHEL5.5 OS. But I cannot reproduce the issue this time, even though I tried plugging the 82572 and 82576 NICs into different slots. The host boots successfully without any error in dmesg.

[root@dhcp-91-60 ~]# uname -r
2.6.18-194.el5
[root@dhcp-91-60 ~]# lspci | grep Ether
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)
1c:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
1c:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
28:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)
[root@dhcp-91-60 ~]# dmidecode -t 0
# dmidecode 2.10
SMBIOS 2.6 present.

Handle 0x0001, DMI type 0, 24 bytes
BIOS Information
        Vendor: Hewlett-Packard
        Version: 786G5 v01.17
        Release Date: 08/19/2009
        Address: 0xE0000
        Runtime Size: 128 kB
        ROM Size: 2048 kB
        Characteristics:
                PCI is supported
                PNP is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                Boot from CD is supported
                Selectable boot is supported
                EDD is supported
                Japanese floppy for Toshiba 1.2 MB is supported (int 13h)
                3.5"/720 kB floppy services are supported (int 13h)
                Print screen service is supported (int 5h)
                8042 keyboard services are supported (int 9h)
                Serial services are supported (int 14h)
                Printer services are supported (int 17h)
                ACPI is supported
                USB legacy is supported
                LS-120 boot is supported
                ATAPI Zip drive boot is supported
                BIOS boot specification is supported
                Function key-initiated network boot is supported
                Targeted content distribution is supported
        BIOS Revision: 1.17
Hi, Don

Please ignore my last comment. I kept trying and plugged the NICs into other slots, and this time I can reproduce the issue. I added "debug" to the kernel command line, and it still reproduces. Attachment will be uploaded.
Created attachment 467707 [details] with_debug_in_cmdline_20101209
Hello Qunfang, does this reproduce with 5.6? If so, can you loan us the machine, leaving the cards in the problematic slots? Is the machine connected to a kvm appliance? Thanks!
Asking again, exactly three months later:

(In reply to comment #23)
> Hello Qunfang,
>
> does this reproduce with 5.6? If so, can you loan us the machine, leaving the
> cards in the problematic slots? Is the machine connected to a kvm appliance?
>
> Thanks!

except: can you please try with 5.7 now? Thanks!
(In reply to comment #26)
> Asking again, exactly three months later:
>
> (In reply to comment #23)
> > Hello Qunfang,
> >
> > does this reproduce with 5.6? If so, can you loan us the machine, leaving the
> > cards in the problematic slots? Is the machine connected to a kvm appliance?
> >
> > Thanks!
>
> except: can you please try with 5.7 now? Thanks!

Hi, Laszlo

Sorry, I just came back from the weekend. I will reserve the specific host and try it.
Odd, I got an email about the `needinfo' but I can't see that message in the BZ (and I'm the bug assignee but can't see all the comments - go figure). Anyway, the `needinfo` was about machine access in a RH lab, I'm afraid I can't help you there (I'm with Intel :-)
(In reply to comment #35)
> Odd, I got an email about the `needinfo' but I can't see that message in the BZ
> (and I'm the bug assignee but can't see all the comments - go figure).
>
> Anyway, the `needinfo` was about machine access in a RH lab, I'm afraid I can't
> help you there (I'm with Intel :-)

Hi, Don

Thank you for the reply anyway. :)

The server room administrator says the slot that has the problem is meant for a display card, and the other slots work well when the NIC is plugged into them. So I will close this issue as NOTABUG here.

Thanks
Qunfang