Bug 849223
Description
Pasi Karkkainen
2012-08-17 18:10:41 UTC
Hi, is this a regression according to your knowledge? Please try with "pci=nomsi" on the guest kernel command line. Please upload - guest kernel dmesg (with "ignore_loglevel"), - dom0 dmesg (ditto), - hypervisor serial console output ("loglvl=all guest_loglvl=all"), - xend.log from dom0, - "lspci -vvv" from host & guest. The version/component will probably change to RHEL-5 kernel-xen or RHEL-6 kernel. Thanks! Laszlo (In reply to comment #1) > Hi, > Hello! > is this a regression according to your knowledge? > Not sure.. earlier in RHEL <= 5.7 there was other SR-IOV related bugs, so I haven't been able to test this properly earlier. > Please try with "pci=nomsi" on the guest kernel command line. > I actually already tried that earlier but forgot to mention about it. pci=nomsi on the HVM guest kernel cmdline makes it crash and reboot itself when I do "ifconfig eth1 up" inside the guest.. > Please upload > - guest kernel dmesg (with "ignore_loglevel"), > - dom0 dmesg (ditto), > - hypervisor serial console output ("loglvl=all guest_loglvl=all"), > - xend.log from dom0, > - "lspci -vvv" from host & guest. > Ok, will do next week. > The version/component will probably change to RHEL-5 kernel-xen or RHEL-6 > kernel. > > Thanks! > Laszlo Yep, thanks! (In reply to comment #3) > I actually already tried that earlier but forgot to mention about it. > pci=nomsi on the HVM guest kernel cmdline makes it crash and reboot itself > when I do "ifconfig eth1 up" inside the guest.. Maybe a guest regression then... I can see some ixgbevf/SR-IOV related changes between 6.2 and 6.3. (In reply to comment #4) > (In reply to comment #3) > > > I actually already tried that earlier but forgot to mention about it. > > pci=nomsi on the HVM guest kernel cmdline makes it crash and reboot itself > > when I do "ifconfig eth1 up" inside the guest.. > > Maybe a guest regression then... I can see some ixgbevf/SR-IOV related > changes between 6.2 and 6.3. > I quickly tried with 6.2 kernel, and behaviour was the same. no interrupts received, all the interrupt counts are and stay zero for the VF. I didn't forget about the logs, but it'll take a couple of days before I can fix the serial console etc. Created attachment 606043 [details]
rhel58 x64 xen hypervisor serial console log
Created attachment 606044 [details]
rhel58 x64 xen dom0 linux kernel dmesg log
Created attachment 606045 [details]
rhel58 x64 xen dom0 lspci -vvv
Created attachment 606046 [details]
rhel58 x64 xen dom0 xend log
Created attachment 606050 [details]
rhel63 x64 xen hvm guest linux kernel dmesg log
Created attachment 606051 [details]
rhel63 x64 xen hvm guest lspci -vvv
(In reply to comment #1) > > Please upload > - guest kernel dmesg (with "ignore_loglevel"), > - dom0 dmesg (ditto), > - hypervisor serial console output ("loglvl=all guest_loglvl=all"), > - xend.log from dom0, > - "lspci -vvv" from host & guest. > Done. (In reply to comment #11) > Created attachment 606046 [details] > rhel58 x64 xen dom0 xend log This bug seems to be a duplicate of bug 735890. See especially bug 735890 comment 16. (Note that the dup candidate is about PV passthrough, but I think that should make no difference for the MSI(-X) range's availability.) From comment 0, we're passing through 03:10.2. From comment 10 (extract): 03:10.2 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) Region 0: [virtual] Memory at de404000 (64-bit, non-prefetchable) [size=16K] Region 3: [virtual] Memory at de504000 (64-bit, non-prefetchable) [size=16K] Capabilities: [70] MSI-X: Enable- Count=3 Masked- Vector table: BAR=3 offset=00000000 /* #1 */ PBA: BAR=3 offset=00002000 /* #2 */ From the xend.log: (pciquirk:91) NO quirks found for PCI device [8086:10ed:8086:7a11] (pciquirk:131) Permissive mode NOT enabled for PCI device [8086:10ed:8086:7a11] (pciif:378) pci: enabling iomem 0xde404000/0x4000 pfn 0xde404/0x4 (pciif:378) pci: enabling iomem 0xde504000/0x4000 pfn 0xde504/0x4 These correspond to Region 0 and Region 3 above. (pciif:398) pci-msix: remove permission for 0xde504000/0x3000 0xde504/0x3 This is MSI-X range #1 ("Vector table") inside Region 3 ("BAR=3"), offset 0: 0xde504000 == 0xde504000 + 0. (pciif:398) pci-msix: remove permission for 0xde506000/0x1000 0xde506/0x1 This is MSI-X range #2 ("PBA") inside Region 3 ("BAR=3"), offset 0x2000: 0xde506000 == 0xde504000 + 0x2000. (In reply to comment #15) > This bug seems to be a duplicate of bug 735890. See especially bug 735890 > comment 16. (Note that the dup candidate is about PV passthrough, but I > think that should make no difference for the MSI(-X) range's availability.) Actually I may be very wrong about this... the HVM guest would try to access these ranges via the IOMMU. (XEN) [VT-D]iommu.c:1241:d32767 domain_context_mapping:PCIe: bdf = 3:10.2 Let's try to attack it from another side (*) -- when the guest crashes with "pci=nomsi" (comment 3), does it dump the stack to its serial console? Does Xen or dom0 log anything? (I'd like to reproduce this and get a vmcore myself, but I'm still waiting on a Beaker box with such a card.) (*) There may be a "common" IRQ setup problem, and the "normal" PCI interrupt path could be less complicated to debug. Can you please boot a bare-metal kernel on this machine and run acpidump --table DMAR --binary -o DMAR.dump and attach "DMAR.dump"? Thanks. Can you please also retry with "iommu=no-intremap" on the xen.gz command line? Thanks! Created attachment 606365 [details]
Dell R510 DMAR dump from acpidump
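(Side note, referring back to the MSI-X ranges worked out from the xend.log excerpts above: the two "remove permission" addresses are simply Region 3's base plus the vector-table/PBA offsets reported by lspci. A trivial standalone C check, written here only for illustration, with the constants copied from those excerpts:)

    #include <stdio.h>

    int main(void)
    {
        unsigned long bar3  = 0xde504000UL; /* Region 3 of 03:10.2, from lspci */
        unsigned long table = bar3 + 0x0;    /* MSI-X vector table, offset 0    */
        unsigned long pba   = bar3 + 0x2000; /* MSI-X PBA, offset 0x2000        */

        printf("vector table at %#lx\n", table); /* 0xde504000, matches xend.log */
        printf("PBA at          %#lx\n", pba);   /* 0xde506000, matches xend.log */
        return 0;
    }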
(In reply to comment #17)
> Can you please boot a bare-metal kernel on this machine and run
>
> acpidump --table DMAR --binary -o DMAR.dump
>
> and attach "DMAR.dump"? Thanks.
>

Done.

(In reply to comment #18)
> Can you please also retry with "iommu=no-intremap" on the xen.gz command
> line? Thanks!
>

I tried it, but unfortunately it didn't help.. still the same problem.

Created attachment 606370 [details]
rhel63 x64 xen hvm guest linux kernel crash with pci=nomsi
(In reply to comment #16) > (In reply to comment #15) > > > This bug seems to be a duplicate of bug 735890. See especially bug 735890 > > comment 16. (Note that the dup candidate is about PV passthrough, but I > > think that should make no difference for the MSI(-X) range's availability.) > > Actually I may be very wrong about this... the HVM guest would try to access > these ranges via the IOMMU. > > (XEN) [VT-D]iommu.c:1241:d32767 domain_context_mapping:PCIe: bdf = 3:10.2 > > Let's try to attack it from another side (*) -- when the guest crashes with > "pci=nomsi" (comment 3), does it dump the stack to its serial console? Does > Xen or dom0 log anything? (I'd like to reproduce this and get a vmcore > myself, but I'm still waiting on a Beaker box with such a card.) > > (*) There may be a "common" IRQ setup problem, and the "normal" PCI > interrupt path could be less complicated to debug. > Ok, I booted rhel6.3 x64 hvm guest with with pci=nomsi on the guest kernel cmdline, and when I do "ifconfig eth1 up" for the VF in the HVM guest I get the attached crash. Created attachment 606639 [details] decompiled Dell-R510-DMAR.dsl I have no idea what could be going wrong. The DMAR doesn't seem to violate anything described in "Intel(r)_VT_for_Direct_IO.pdf". Both IO-APIC's found in the MADT are listed in the DMAR/DRHD. I can neither prove nor disprove there's a mismatch between hardware & the DMAR. RMRR's indeed point into reserved RAM. This kind of IOMMU bug is hard (see bug 760007, bug 512617 etc...) Whenever a device to be passed through is down one (or more) PCI-to-PCI bridges, we suck at passing it through. (You might want to check that with "lspci -tv" in a bare-metal kernel.) I see (XEN) io_apic.c:2161: (XEN) ioapic_guest_write: apic=0, pin=3, old_irq=3, new_irq=3 (XEN) ioapic_guest_write: old_entry=000000f2, new_entry=000100f2 (XEN) ioapic_guest_write: Attempt to modify IO-APIC pin for in-use IRQ! in the hypervisor log, but this kind of message is printed all the time without adverse effects. Also the DRHD reports FED90000 as register base address; see the attachment plus: (XEN) [VT-D]dmar.c:477: found ACPI_DMAR_DRHD (XEN) [VT-D]dmar.c:336: dmaru->address = fed90000 and dom0 logs pnp: 00:0b: iomem range 0xfed90000-0xfed91fff could not be reserved but this may not mean anything if dom0 is not supposed to access the DMA remapping unit directly. Perhaps try "iommu=passthrough" on the xen.gz command line, but it's just shotgun experimentation now. Does RHEL-63 HVM work under upstream Xen+dom0? (Even if it does, I've looked at upstream IOMMU patches before, and I can either not pick candidates, or the changes are very invasive). (In reply to comment #22) > Created attachment 606370 [details] > rhel63 x64 xen hvm guest linux kernel crash with pci=nomsi BUG: unable to handle kernel NULL pointer dereference at (null) (gdb) file ixgbevf.ko Reading symbols from /home/lacos/tmp/ixgbevf.ko... Reading symbols from /usr/lib/debug/lib/modules/2.6.32-279.el6.x86_64/\ kernel/drivers/net/ixgbevf/ixgbevf.ko.debug... done. done. (gdb) list *(ixgbevf_open+0x475) 0x62c5 is in ixgbevf_open (include/linux/interrupt.h:126). 
121 122 static inline int __must_check 123 request_irq(unsigned int irq, irq_handler_t handler, 124 unsigned long flags, const char *name, void *dev) 125 { 126 return request_threaded_irq(irq, handler, NULL, flags, name, dev); 127 } 128 129 extern void exit_irq_thread(void); 130 #else ixgbevf_probe() ixgbevf_init_interrupt_scheme() ixgbevf_set_interrupt_capability() ixgbevf_acquire_msix_vectors() pci_enable_msix() ixgbevf_open() ixgbevf_request_irq() ixgbevf_request_msix_irqs() request_irq() It might be useful to match the above callgraph against the full guest dmesg, but I believe ixgbevf can't work without MSI-X. (In reply to comment #25) > Whenever > a device to be passed through is down one (or more) PCI-to-PCI bridges, we > suck at passing it through. (You might want to check that with "lspci -tv" > in a bare-metal kernel.) ATSR lists these: 00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 13) (prog-if 00 [Normal decode]) 00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 13) (prog-if 00 [Normal decode]) 00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 13) (prog-if 00 [Normal decode]) 00:09.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 9 (rev 13) (prog-if 00 [Normal decode]) 00:0a.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 10 (rev 13) (prog-if 00 [Normal decode]) "The ATSR structures identifies PCI Express Root-Ports supporting Address Translation Services (ATS) transactions." The dom0 log / lspci include PCI: Transparent bridge - 0000:00:1e.0 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90) (prog-if 01 [Subtractive decode]) The "82599 Ethernet Controller Virtual Function"s have "[virtual] Memory" bases in [de400000..de71c000]; all of which fall into 00:07.0's "Memory behind bridge: de200000-de7fffff". I'm travelling this week, but I'll try the suggestions next week.. > Actually I may be very wrong about this... the HVM guest would try to access > these ranges via the IOMMU. I think the point is that interrupts are processed by the hypervisor and forwarded to the guest. For this reason allowing access to the MSI-X ranges (no matter if via IOMMU or directly) is a no-no. You need instead to emulate those and program the hypervisor appropriately. This is what tools/ioemu/hw/pt-msi.c does. Problem is, our QEMU with the upstream qemu-xen tree are so different that from a quick look I hardly can tell if we have the relevant commits upstream (mostly commit 7551a51, passthrough: use devfn instead of slots as the unit for pass-through, 2009-06-25). It seems like we do (see bug 581655). Pasi, can you: 1) attach the qemu-dm logs too? 2) try passing the whole NIC to the guest, and then bring up the VF? (In reply to comment #29) > 2) try passing the whole NIC to the guest, and then bring up the VF? Hmm right, I recall repeated recommendations from QE to pass through all functions, whenever I played with SR-IOV before. (In reply to comment #25) > > Perhaps try "iommu=passthrough" on the xen.gz command line, but it's just > shotgun experimentation now. > Unfortunately that didn't seem to help. > Does RHEL-63 HVM work under upstream Xen+dom0? (Even if it does, I've looked > at upstream IOMMU patches before, and I can either not pick candidates, or > the changes are very invasive). > I'll try this later. Created attachment 610520 [details]
rhel58 x64 xen qemu-dm log for rhel63 x64 hvm guest
(In reply to comment #29) > > Pasi, can you: > > 1) attach the qemu-dm logs too? > Done. (In reply to comment #30) > (In reply to comment #29) > > > 2) try passing the whole NIC to the guest, and then bring up the VF? > > Hmm right, I recall repeated recommendations from QE to pass through all > functions, whenever I played with SR-IOV before. > Ok, so I blacklisted ixgbe and hid the PF PCI ids in dom0, and passed thru one PF to the HVM guest. Loading ixgbe driver for the PF works in RHEL 6.3 HVM guest, and the PF works OK. I can see interrupt count increasing for the PF while pinging: [root@rhel63x64hvm ~]# grep eth1 /proc/interrupts 48: 348 PCI-MSI-edge eth1 So far all good. Adding "max_vfs=8" option in the HVM guest for the ixgbe module is where the problems begin: [root@c63x64hvm ~]# dmesg | grep ixgbe ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver - version 3.6.7-k ixgbe: Copyright (c) 1999-2012 Intel Corporation. ixgbe 0000:00:06.0: PCI INT A -> GSI 40 (level, low) -> IRQ 40 ixgbe 0000:00:06.0: setting latency timer to 64 ixgbe 0000:00:06.0: (unregistered net_device): Failed to enable PCI sriov: -19 ixgbe 0000:00:06.0: (unregistered net_device): ATR is not supported while multiple queues are disabled. Disabling Flow Director ixgbe 0000:00:06.0: irq 48 for MSI/MSI-X ixgbe 0000:00:06.0: Multiqueue Disabled: Rx Queue count = 1, Tx Queue count = 1 ixgbe 0000:00:06.0: (PCI Express:5.0GT/s:Width x8) 00:2b:31:77:9e:1c ixgbe 0000:00:06.0: MAC: 2, PHY: 8, SFP+: 3, PBA No: E81283-002 ixgbe 0000:00:06.0: Intel(R) 10 Gigabit Network Connection ixgbe 0000:00:06.0: eth1: detected SFP+: 3 ixgbe 0000:00:06.0: eth1: NIC Link is Up 10 Gbps, Flow Control: RX/TX So especially this: ixgbe 0000:00:06.0: (unregistered net_device): Failed to enable PCI sriov: -19 Any ideas? (In reply to comment #34) > So especially this: > ixgbe 0000:00:06.0: (unregistered net_device): Failed to enable PCI sriov: > -19 The direct reason could be ixgbe_enable_sriov() pci_enable_sriov() if (!dev->is_physfn) return -ENODEV; or ixgbe_enable_sriov() pci_enable_sriov() sriov_enable() if (iov->link != dev->devfn) { pdev = pci_get_slot(dev->bus, iov->link); if (!pdev) return -ENODEV; pci_dev_put(pdev); if (!pdev->is_physfn) return -ENODEV; But that doesn't tell me much. What if you don't pass through the physical device (03:00.*), but pass through *all* the VFs instead (03:10.*)? Can you please repeat your original test with pci = [ '03:10.0', '03:10.1', ..., '03:10.7' ] in the vm config file? Thanks. (In reply to comment #32) > Created attachment 610520 [details] > rhel58 x64 xen qemu-dm log for rhel63 x64 hvm guest "pt_pci_read_config: Warning: Return ALL F from libpci read. [00:06.0][Offset:00h][Length:2]" I changed to max_vfs=2,2 for ixgbe in dom0, and then passed thru all the four VFs to the HVM guest: [root@dom0 ~]# grep pci /etc/xen/rhel63x64hvm xen_platform_pci=0 pci = [ '03:10.0', '03:10.1', '03:10.2', '03:10.3' ] The interesting part is that I can only see 2 VFs in the guest! [root@rhel63x64hvm ~]# lspci 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02) 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] 00:01.2 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03) 00:01.3 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01) 00:02.0 VGA compatible controller: Device 1234:1111 00:05.0 SCSI storage controller: XenSource, Inc. 
Xen Platform Device (rev 01) 00:06.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) 00:07.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) [root@rhel63x64hvm ~]# dmesg | grep -i ixgbe ixgbevf: Intel(R) 10 Gigabit PCI Express Virtual Function Network Driver - version 2.2.0-k ixgbevf: Copyright (c) 2009 - 2012 Intel Corporation. ixgbevf 0000:00:06.0: setting latency timer to 64 ixgbevf 0000:00:06.0: irq 48 for MSI/MSI-X ixgbevf 0000:00:06.0: irq 49 for MSI/MSI-X ixgbevf 0000:00:06.0: irq 50 for MSI/MSI-X ixgbevf 0000:00:07.0: setting latency timer to 64 ixgbevf 0000:00:07.0: irq 51 for MSI/MSI-X ixgbevf 0000:00:07.0: irq 52 for MSI/MSI-X ixgbevf 0000:00:07.0: irq 53 for MSI/MSI-X Also the VFs don't work. I configured an IP to them, tried pinging, but it doesn't work. Also the interrupt counters stay at zero in /proc/interrupts. [root@rhel63x64hvm ~]# grep eth1 /proc/interrupts 48: 0 PCI-MSI-edge eth1-rx-0 49: 0 PCI-MSI-edge eth1-tx-0 50: 0 PCI-MSI-edge eth1:mbx Created attachment 611505 [details]
rhel58 x64 xen qemu-dm log for rhel63 x64 hvm guest with multiple VFs
Four (all) VFs were passed through, but only two of them are visible in the HVM guest.
I also tried with max_vfs=4,4 in dom0, so that gives 8 VFs total, and I passed all 8 to the HVM guest. Inside the RHEL6 HVM guest still only 2 were visible in lspci. Please get similar logs for a RHEL5 guest, too. Thanks! Ok, I just tried with RHEL5.8 x64 HVM guest. The first time I did "ifconfig eth1 up" inside the HVM guest the guest crashed with a kernel panic! Unfortunately I didn't have serial console set up then, so I couldn't capture the guest kernel crash. The second time I tried it, the VF actually works ! I tried rebooting the guest again, and it still works. Dunno what was wrong on the first time.. [root@rhel58x64hvm ~]# lspci -vvv 00:06.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) Subsystem: Intel Corporation Device 7a11 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 64 Region 0: Memory at f4020000 (64-bit, non-prefetchable) [size=16K] Region 3: Memory at f4024000 (64-bit, non-prefetchable) [size=16K] Capabilities: [70] MSI-X: Enable+ Count=3 Masked- Vector table: BAR=3 offset=00000000 PBA: BAR=3 offset=00002000 Kernel driver in use: ixgbevf Kernel modules: ixgbevf [root@rhel58x64hvm ~]# cat /proc/interrupts CPU0 0: 341156 IO-APIC-edge timer 1: 9 IO-APIC-edge i8042 6: 2 IO-APIC-edge floppy 8: 1 IO-APIC-edge rtc 9: 0 IO-APIC-level acpi 12: 111 IO-APIC-edge i8042 14: 4043 IO-APIC-edge ide0 15: 47 IO-APIC-edge ide1 169: 41 IO-APIC-level uhci_hcd:usb1 177: 2019 IO-APIC-level eth0 193: 4337 PCI-MSI-X eth1-rx-0 201: 26 PCI-MSI-X eth1-tx-0 209: 24 PCI-MSI-X eth1:mbx 217: 376 IO-APIC-level xen-platform-pci NMI: 0 LOC: 341675 ERR: 0 MIS: 0 [root@rhel58x64hvm ~]# ethtool -i eth1 driver: ixgbevf version: 2.1.0-k firmware-version: N/A bus-info: 0000:00:06.0 I'll attach qemu-dm.log for the rhel5.8 hvm guest. Created attachment 613668 [details]
rhel58 x64 xen qemu-dm log for rhel58 x64 hvm guest
Hmm, for the working rhel5.8 HVM guest the ixgbevf interrupts are PCI-MSI-X, while for the non-working rhel6.3 HVM guest the interrupts are PCI-MSI-edge, is that relevant? Hello, Laszlo I find a machine very close to Pasi's env: R510, intel-e5606 But till now I am not sure whether the bug is reproducible in that machine: The 82599 NIC is not plugged in with fiber network line because we have limited number of fiber network port in our office. So I get following info in the guest: ifconfig eth0 up ixgbevf: Unable to start - perhaps the PF Driver isn't up yet SIOCSIFFLAGS: Network is down lspci 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02) 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] 00:01.2 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03) 00:02.0 VGA compatible controller: Cirrus Logic GD 5446 00:03.0 SCSI storage controller: XenSource, Inc. Xen Platform Device (rev 01) 00:06.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) 00:07.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) cat /proc/interrupts CPU0 CPU1 0: 153 0 IO-APIC-edge timer 1: 685 21 IO-APIC-edge i8042 4: 1313 119 IO-APIC-edge serial 8: 0 1 IO-APIC-edge rtc0 9: 0 0 IO-APIC-fasteoi acpi 12: 693 169 IO-APIC-edge i8042 14: 204 236 IO-APIC-edge ata_piix 15: 0 0 IO-APIC-edge ata_piix 28: 5865 5757 IO-APIC-fasteoi xen-platform-pci 510: 5849 0 xen-dyn-event blkif 511: 75 0 xen-dyn-event xenbus NMI: 0 0 Non-maskable interrupts LOC: 24338 11525 Local timer interrupts SPU: 0 0 Spurious interrupts PMI: 0 0 Performance monitoring interrupts IWI: 0 0 IRQ work interrupts RES: 1421 2804 Rescheduling interrupts CAL: 149 316 Function call interrupts TLB: 306 720 TLB shootdowns TRM: 0 0 Thermal event interrupts THR: 0 0 Threshold APIC interrupts MCE: 0 0 Machine check exceptions MCP: 2 2 Machine check polls ERR: 0 MIS: 11 The eng-ops guy is off duty now, so I need to try this again tomorrow if I can get the fiber link. (In reply to comment #48) > > So I get following info in the guest: > > ifconfig eth0 up > ixgbevf: Unable to start - perhaps the PF Driver isn't up yet > SIOCSIFFLAGS: Network is down > Is the Physical Function (PF) interface "up" in dom0 ? So first you need to "ifconfig ethX up" the PF in dom0, and after that "ifconfig ethX up" the VF in the VM. Btw I'm using DA (SFP+ Direct Attach) NICs and cables, so no fiber, if that makes a difference.. (In reply to comment #48) > > The eng-ops guy is off duty now, so I need to try this again tomorrow if I > can get the fiber link. I think the problem could be reproduced after plug the fiber link into the 82599 card. When pass-through the 82599 card into RHEL6 (kernel-2.6.32-279) guest, it could not get IP address (guest will crash and keep rebooting when with "pci=nomsi" in kernel cmd line). The same test works with RHEL5.9 guest (kernel-2.6.18-339). 
On RHEL6.3 guest: # lspci -D | grep 82599 0000:00:06.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) 0000:00:07.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) # lspci -vvv 00:06.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) Subsystem: Intel Corporation Device 7a11 Physical Slot: 1 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 64 Region 0: Memory at f4000000 (64-bit, non-prefetchable) [size=16K] Region 3: Memory at f4004000 (64-bit, non-prefetchable) [size=16K] Capabilities: [70] MSI-X: Enable+ Count=3 Masked- Vector table: BAR=3 offset=00000000 PBA: BAR=3 offset=00002000 Kernel driver in use: ixgbevf Kernel modules: ixgbevf # grep eth0 /proc/interrupts 48: 0 0 PCI-MSI-edge eth0-rx-0 49: 0 0 PCI-MSI-edge eth0-tx-0 50: 0 0 PCI-MSI-edge eth0:mbx # dmesg | grep -i ixgbe ixgbevf: Intel(R) 10 Gigabit PCI Express Virtual Function Network Driver - version 2.2.0-k ixgbevf: Copyright (c) 2009 - 2012 Intel Corporation. ixgbevf 0000:00:06.0: setting latency timer to 64 ixgbevf 0000:00:06.0: irq 48 for MSI/MSI-X ixgbevf 0000:00:06.0: irq 49 for MSI/MSI-X ixgbevf 0000:00:06.0: irq 50 for MSI/MSI-X ixgbevf 0000:00:07.0: setting latency timer to 64 ixgbevf 0000:00:07.0: PF still in reset state, assigning new address ixgbevf 0000:00:07.0: irq 51 for MSI/MSI-X ixgbevf 0000:00:07.0: irq 52 for MSI/MSI-X ixgbevf 0000:00:07.0: irq 53 for MSI/MSI-X ixgbevf: Unable to start - perhaps the PF Driver isn't up yet (In reply to comment #47) > Hmm, for the working rhel5.8 HVM guest the ixgbevf interrupts are PCI-MSI-X, > while for the non-working rhel6.3 HVM guest the interrupts are PCI-MSI-edge, > is that relevant? Probably... (In reply to comment #50) > I think the problem could be reproduced after plug the fiber link into the > 82599 card. > > When pass-through the 82599 card into RHEL6 (kernel-2.6.32-279) guest, it > could not get IP address (guest will crash and keep rebooting when with > "pci=nomsi" in kernel cmd line). > > The same test works with RHEL5.9 guest (kernel-2.6.18-339). I assume this means the RHEL5.9 guest works *without* "pci=nomsi". (While the RHEL6.3 guest doesn't work without it, and crashes with it.) (In reply to comment #46) > Created attachment 613668 [details] > rhel58 x64 xen qemu-dm log for rhel58 x64 hvm guest Grepping attachment 610520 [details], attachment 611505 [details] and attachment 613668 [details] for "first_map=", the qemu-dm logs for the RHEL-6 guest(s) only contain "first_map=1" entries, whereas the qemu-dm log for the RHEL-5 guest also has "first_map=0" lines. These are logged by pt_iomem_map() [tools/ioemu/hw/pass-through.c] in qemu-dm. The add_msix_mapping() call depends on (first_map==0). 
> void pt_iomem_map(PCIDevice *d, int i, uint32_t e_phys, uint32_t e_size, > int type) > { > struct pt_dev *assigned_device = (struct pt_dev *)d; > uint32_t old_ebase = assigned_device->bases[i].e_physbase; > int first_map = ( assigned_device->bases[i].e_size == 0 ); > int ret = 0; > > assigned_device->bases[i].e_physbase = e_phys; > assigned_device->bases[i].e_size= e_size; > > PT_LOG("e_phys=%08x maddr=%lx type=%d len=%d index=%d first_map=%d\n", > e_phys, (unsigned long)assigned_device->bases[i].access.maddr, > type, e_size, i, first_map); > > if ( e_size == 0 ) > return; > > if ( !first_map && old_ebase != -1 ) > { > add_msix_mapping(assigned_device, i); > /* Remove old mapping */ > ret = xc_domain_memory_mapping(xc_handle, domid, > old_ebase >> XC_PAGE_SHIFT, > assigned_device->bases[i].access.maddr >> XC_PAGE_SHIFT, > (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT, > DPCI_REMOVE_MAPPING); > if ( ret != 0 ) > { > PT_LOG("Error: remove old mapping failed!\n"); > return; > } > } > > /* map only valid guest address */ > if (e_phys != -1) > { This branch should run for each first_map=1 line though, for those e_phys is never UINT_MAX. > /* Create new mapping */ > ret = xc_domain_memory_mapping(xc_handle, domid, > assigned_device->bases[i].e_physbase >> XC_PAGE_SHIFT, > assigned_device->bases[i].access.maddr >> XC_PAGE_SHIFT, > (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT, > DPCI_ADD_MAPPING); > > if ( ret != 0 ) > { > PT_LOG("Error: create new mapping failed!\n"); > } > > ret = remove_msix_mapping(assigned_device, i); I don't understand what & why we remove here, right after the addition. > if ( ret != 0 ) > PT_LOG("Error: remove MSI-X mmio mapping failed!\n"); > > if ( old_ebase != e_phys && old_ebase != -1 ) > pt_msix_update_remap(assigned_device, i); > } Most of this function comes from commit f39cc738 ("xen-3.0.3-86.el5"), but these last lines are from 694b84d3 ("MSI-X mask bit acceleration"). > } Also,
* rhel58-x64-xen-qemu-dm-log-for-rhel58-x64-hvm.txt:
pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h
pt_msix_update_one: now update msix entry 0 with pirq ff gvec c1
pt_msix_update_one: now update msix entry 1 with pirq fe gvec c9
pt_msix_update_one: now update msix entry 2 with pirq fd gvec d1
* rhel58-x64-xen-qemu-dm-log-for-rhel63-x64-hvm.txt:
pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h
(4 times, no "now update msix entry" msgs)
* rhel58-x64-xen-qemu-dm-log-for-rhel63-x64-hvm-02.txt:
pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h
(8 times, no "now update msix entry" msgs)
* "now update msix entry" is printed by pt_msix_update_one(); call sites:
pt_iomem_map()
pt_msix_update_remap()
pt_msix_update()
pt_msix_update_one()
pt_msixctrl_reg_write() <- registered for "MSI-X Capability Structure reg
group" / "Message Control reg" in pt_config_init
pt_msix_update()
pt_msix_update_one()
pci_msix_writel() <- registered for iomem writes in pt_msix_init
pt_msix_update_one()
All three logs describe calls to pt_msixctrl_reg_write(), which is able to
skip the call to pt_msix_update():
> /* write Message Control register for MSI-X */
> static int pt_msixctrl_reg_write(struct pt_dev *ptdev,
> struct pt_reg_tbl *cfg_entry,
> uint16_t *value, uint16_t dev_value, uint16_t valid_mask)
> {
> struct pt_reg_info_tbl *reg = cfg_entry->reg;
> uint16_t writable_mask = 0;
> uint16_t throughable_mask = 0;
> uint16_t old_ctrl = cfg_entry->data;
>
> /* modify emulate register */
> writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> cfg_entry->data = ((*value & writable_mask) |
> (cfg_entry->data & ~writable_mask));
>
> PT_LOG("old_ctrl:%04xh new_ctrl:%04xh\n", old_ctrl, cfg_entry->data);
>
> /* create value for writing to I/O device register */
> throughable_mask = ~reg->emu_mask & valid_mask;
> *value = ((*value & throughable_mask) | (dev_value & ~throughable_mask));
>
> /* update MSI-X */
> if ((*value & PCI_MSIX_ENABLE) && !(*value & PCI_MSIX_MASK))
> pt_msix_update(ptdev);
>
> ptdev->msix->enabled = !!(*value & PCI_MSIX_ENABLE);
>
> return 0;
> }
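To make that check easier to follow when reading the qemu-dm logs attached below, here is a tiny standalone decoder (my own illustration, not qemu-dm code) for the three Message Control values that show up later in this bug (0002h, c002h, 8002h). The bit values PCI_MSIX_ENABLE=0x8000 and PCI_MSIX_MASK=0x4000 are the ones quoted further down in the analysis.

    #include <stdio.h>

    #define PCI_MSIX_ENABLE 0x8000  /* MSI-X Enable bit in Message Control  */
    #define PCI_MSIX_MASK   0x4000  /* MSI-X Function Mask ("mask all") bit */

    /* pt_msixctrl_reg_write() only calls pt_msix_update() -- i.e. flushes
       the entries to the hypervisor -- when ENABLE is set and MASK is clear. */
    static void decode(unsigned int ctrl)
    {
        int flush = (ctrl & PCI_MSIX_ENABLE) && !(ctrl & PCI_MSIX_MASK);
        printf("ctrl=%04x enable=%d maskall=%d -> pt_msix_update() %s\n",
               ctrl, !!(ctrl & PCI_MSIX_ENABLE), !!(ctrl & PCI_MSIX_MASK),
               flush ? "called" : "skipped");
    }

    int main(void)
    {
        decode(0x0002); /* RHEL-6, 1st write: ENABLE clear         -> skipped */
        decode(0xc002); /* RHEL-6, 2nd write: ENABLE + MASKALL set -> skipped */
        decode(0x8002); /* final write, both guests: ENABLE only   -> called  */
        return 0;
    }

Only the last pattern reaches pt_msix_update(); whether an entry actually gets reprogrammed there is then decided by entry->flags, which is where the two guests diverge (see the log comparison below).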
Created attachment 614372 [details]
additional debug messages for qemu-dm
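The attachment above is the qemu-dm debug patch whose output the next two logs contain. It is not reproduced verbatim here, but judging from the messages it prints, the instrumentation added to pt_msixctrl_reg_write() is roughly of the following shape (a reconstruction/guess only -- the attachment is the authoritative version; all identifiers are from the function quoted above):

    /* sketch only, reconstructed from the log messages */
    PT_LOG("emu_mask=%04x ro_mask=%04x valid_mask=%04x writable_mask=%04x\n",
           reg->emu_mask, reg->ro_mask, valid_mask, writable_mask);
    PT_LOG("value=%04x dev_value=%04x\n", *value, dev_value);
    /* ... the existing old_ctrl/new_ctrl PT_LOG() stays where it is ... */
    PT_LOG("throughable_mask=%04x new_value=%04x\n", throughable_mask, *value);

    if ( (*value & PCI_MSIX_ENABLE) && !(*value & PCI_MSIX_MASK) )
    {
        PT_LOG("1\n");            /* shows up as "pt_msixctrl_reg_write: 1" */
        pt_msix_update(ptdev);
    }

    ptdev->msix->enabled = !!(*value & PCI_MSIX_ENABLE);
    PT_LOG("msix_enabled=%d\n", ptdev->msix->enabled);

Similar numbered PT_LOG() markers ("pci_msix_writel: 1".."4", "pt_msix_update: 1"/"2", "pt_msix_update_one: 1".."4") trace which branches are taken in the other functions.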
Created attachment 614373 [details]
qemu-dm log (with debug patch) about successful msi-x initialization in RHEL-5 guest
Created attachment 614374 [details]
qemu-dm log (with debug patch) about failed msi-x initialization in RHEL-6 guest
I think this is the interesting part of the RHEL-5 --> RHEL-6 diff (made between comment 60 and comment 61). Of course I've missed a newline in my debug patch, I'll suplement it here. +pt_msixctrl_reg_write: emu_mask=0000 ro_mask=3fff valid_mask=ffff writable_mask=0000 +pt_msixctrl_reg_write: value=0002 dev_value=0002 +pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h +pt_msixctrl_reg_write: throughable_mask=ffff new_value=0002 +pt_msixctrl_reg_write: msix_enabled=0 +pt_msixctrl_reg_write: emu_mask=0000 ro_mask=3fff valid_mask=ffff writable_mask=0000 +pt_msixctrl_reg_write: value=c002 dev_value=0002 +pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h +pt_msixctrl_reg_write: throughable_mask=ffff new_value=c002 +pt_msixctrl_reg_write: msix_enabled=1 This is done only by RHEL-6; two invocations of pt_msixctrl_reg_write(). The first call ends up with new_value=0002, so "nothing happens", the second call ends up with new_value=c002 (has both PCI_MSIX_ENABLE=0x8000 and PCI_MSIX_MASK=0x4000 set). Referring back to comment 58, this means that msix->enabled will be set at the end of pt_msixctrl_reg_write(), but pt_msix_update() is *not* called. (A message saying "pt_msixctrl_reg_write: 1" should be present between "throughable_mask" and "msix_enabled".) pci_msix_writel: 1 Both RHEL-5 and RHEL-6 trigger pci_msix_writel() at this point. -pci_msix_writel: 2 -pci_msix_writel: 1 -pci_msix_writel: 1 -pci_msix_writel: 2 -pci_msix_writel: 1 -pci_msix_writel: 2 -pci_msix_writel: 1 -pci_msix_writel: 1 -pci_msix_writel: 2 -pci_msix_writel: 1 -pci_msix_writel: 2 RHEL-5 continues to massage this register. Values are not logged, unfortunately, but we can say that for some calls (pci_msix_writel: 2), the following block is triggered: if ( offset != 3 && entry->io_mem[offset] != val ) { PT_LOG("2\n"); entry->flags = 1; } which corresponds to "dev->msix->msix_entry[entry_nr].flags = 1". ("pci_msix_writel: 1" alone just logs entry to the function and changing "entry->io_mem[offset]".) +pci_msix_writel: 3 pci_msix_writel: 1 +pci_msix_writel: 3 pci_msix_writel: 1 -pci_msix_writel: 2 +pci_msix_writel: 3 RHEL-6 *instead* (not in addition) massages the following block: if ( offset == 3 ) { PT_LOG("3\n"); if ( msix->enabled && !(val & 0x1) ) { PT_LOG("4\n"); pt_msix_update_one(dev, entry_nr); } mask_physical_msix_entry(dev, entry_nr, entry->io_mem[3] & 0x1); } Note that "pci_msix_writel: 4" is never logged, thus pt_msix_update_one() is not called. "msix->enabled" must be "true" in RHEL-6, from above, but bit#0 is apparently set in "val". Until now we've set "dev->msix->msix_entry[entry_nr].flags" for the RHEL-5 guest, but have not for the RHEL-6 guest. pt_msixctrl_reg_write: emu_mask=0000 ro_mask=3fff valid_mask=ffff writable_mask=0000 -pt_msixctrl_reg_write: value=8002 dev_value=0002 +pt_msixctrl_reg_write: value=8002 dev_value=c002 pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h pt_msixctrl_reg_write: throughable_mask=ffff new_value=8002 pt_msixctrl_reg_write: 1 Both guest kernels write new_value=8002 to the control register (the previous value in the device is different, according to the guests' different pasts -- RHEL-5 has not touched the control register, RHEL-6 left c002 there). PCI_MSIX_ENABLE is set in the new value, but PCI_MSIX_MASK is clear, therefore pt_msixctrl_reg_write() calls pt_msix_update() for both guests: pt_msix_update: 1 ... Function entered and we're past an xc_physdev_set_device_msixtbl() hypercall... 
pt_msix_update: 2: 0 Starting loop that calls pt_msix_update_one() for each MSI-X entry. pt_msix_update_one: 1 Entry#0: pt_msix_update_one() is entered for both kernels. The first thing this function does is check entry->flags; if it's unset, we return early without doing anything. -pt_msix_update_one: 2 -pt_msix_update_one: 3 -pt_msix_update_one: now update msix entry 0 with pirq ff gvec b9 -pt_msix_update_one: 4 That's exactly what happens for RHEL-6 -- we've set entry->flags only for RHEL-5. pt_msix_update: 2: 1 pt_msix_update_one: 1 -pt_msix_update_one: 2 -pt_msix_update_one: 3 -pt_msix_update_one: now update msix entry 1 with pirq fe gvec c1 -pt_msix_update_one: 4 pt_msix_update: 2: 2 pt_msix_update_one: 1 -pt_msix_update_one: 2 -pt_msix_update_one: 3 -pt_msix_update_one: now update msix entry 2 with pirq fd gvec c9 -pt_msix_update_one: 4 Lather, rinse, repeat for entries #1 and #2. pt_msixctrl_reg_write: msix_enabled=1 After the loop completes in pt_msix_update(), we return to pt_msixctrl_reg_write(), set dev->msix->enabled, and we're done. For RHEL-6, - we fail to set "entry->flags" in pci_msix_writel(), - in the same function, we fail to call pt_msix_update_one() immediately (... even if we did, it wouldn't help: the latter function still depends on entry->flags) I think this is an MSI-X emulation bug in qemu-dm (= xen-userspace). RHEL-6's access pattern differs from that of RHEL-5, and we don't serve it correctly. Thoughts? Thanks. I think the three pt_msixctrl_reg_write() invocations in comment 61 (and comment 62), on behalf of RHEL-6, can be matched against the three PCI_MSIX_FLAGS write accesses in the guest kernel: ixgbevf_acquire_msix_vectors() [drivers/net/ixgbevf/ixgbevf_main.c] pci_enable_msix() [drivers/pci/msi.c] msix_capability_init() > static int msix_capability_init(struct pci_dev *dev, > struct msix_entry *entries, int nvec) > { > int pos, ret; > u16 control; > void __iomem *base; > > pos = pci_find_capability(dev, PCI_CAP_ID_MSIX); > pci_read_config_word(dev, pos + PCI_MSIX_FLAGS, &control); > > /* Ensure MSI-X is disabled while it is set up */ > control &= ~PCI_MSIX_FLAGS_ENABLE; > pci_write_config_word(dev, pos + PCI_MSIX_FLAGS, control); +pt_msixctrl_reg_write: emu_mask=0000 ro_mask=3fff valid_mask=ffff writable_mask=0000 +pt_msixctrl_reg_write: value=0002 dev_value=0002 +pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h +pt_msixctrl_reg_write: throughable_mask=ffff new_value=0002 +pt_msixctrl_reg_write: msix_enabled=0 > > /* Request & Map MSI-X table region */ > base = msix_map_region(dev, pos, multi_msix_capable(control)); > if (!base) > return -ENOMEM; > > ret = msix_setup_entries(dev, pos, base, entries, nvec); > if (ret) > return ret; > > ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX); > if (ret) > goto error; > > /* > * Some devices require MSI-X to be enabled before we can touch the > * MSI-X registers. We need to mask all the vectors to prevent > * interrupts coming in before they're fully set up. 
> */ > control |= PCI_MSIX_FLAGS_MASKALL | PCI_MSIX_FLAGS_ENABLE; > pci_write_config_word(dev, pos + PCI_MSIX_FLAGS, control); +pt_msixctrl_reg_write: emu_mask=0000 ro_mask=3fff valid_mask=ffff writable_mask=0000 +pt_msixctrl_reg_write: value=c002 dev_value=0002 +pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h +pt_msixctrl_reg_write: throughable_mask=ffff new_value=c002 +pt_msixctrl_reg_write: msix_enabled=1 (but masked) > > msix_program_entries(dev, entries); > > ret = populate_msi_sysfs(dev); > if (ret) { > ret = 0; > goto error; > } > > /* Set MSI-X enabled bits and unmask the function */ > pci_intx_for_msi(dev, 0); > dev->msix_enabled = 1; > > control &= ~PCI_MSIX_FLAGS_MASKALL; > pci_write_config_word(dev, pos + PCI_MSIX_FLAGS, control); pt_msixctrl_reg_write: emu_mask=0000 ro_mask=3fff valid_mask=ffff writable_mask=0000 -pt_msixctrl_reg_write: value=8002 dev_value=0002 +pt_msixctrl_reg_write: value=8002 dev_value=c002 pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h pt_msixctrl_reg_write: throughable_mask=ffff new_value=8002 pt_msixctrl_reg_write: 1 > > return 0; > > error: > if (ret < 0) { > /* > * If we had some success, report the number of irqs > * we succeeded in setting up. > */ > struct msi_desc *entry; > int avail = 0; > > list_for_each_entry(entry, &dev->msi_list, list) { > if (entry->irq != 0) > avail++; > } > if (avail != 0) > ret = avail; > } > > free_msi_irqs(dev); > > return ret; > } On the qemu-dm side, the mis-programming happens between the second and third PCI_MSIX_FLAGS config word accesses, so we should look at msix_program_entries() and pci_intx_for_msi() in the RHEL-6 guest kernel. I think I understand why the RHEL-5 guest works, but I don't understand how the RHEL-6 guest can work on the bare metal at all! :) So, RHEL-5 msix_capability_init() has calls like writel(address_lo, base + j * PCI_MSIX_ENTRY_SIZE + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET); writel(address_hi, base + j * PCI_MSIX_ENTRY_SIZE + PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET); writel(data, base + j * PCI_MSIX_ENTRY_SIZE + PCI_MSIX_ENTRY_DATA_OFFSET); PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET == 0 PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET == 4 PCI_MSIX_ENTRY_DATA_OFFSET == 8 Now, after division by 4 (see pci_msix_writel() in qemu-dm) these become 0, 1, 2; therefore updates to the lower address offset, upper address offset, and data offset of any MSI-X entry kick the following in qemu-dm: if ( offset != 3 && entry->io_mem[offset] != val ) { entry->flags = 1; } Ie. such a change will indeed mark the MSI-X entry for update in the emulator, and once the PCI_MSIX_ENABLE flag is set, and the PCI_MSIX_MASK flag is cleared in the control register, a batched update will be flushed to the hypervisor. So what does the RHEL-6 kernel do? msix_capability_init() msix_program_entries() -- loops over all entries computes an offset for reading from PCI_MSIX_ENTRY_VECTOR_CTRL (12) readl() msix_mask_irq(entry, 1) __msix_mask_irq(desc, 1) computes same PCI_MSIX_ENTRY_VECTOR_CTRL offset writel() with lowest value bit set to "1" (from the second param) In the emulator (pci_msix_writel()), this vector control (?) write, 12/4==3, corresponds to if ( offset == 3 ) { if ( msix->enabled && !(val & 0x1) ) pt_msix_update_one(dev, entry_nr); mask_physical_msix_entry(dev, entry_nr, entry->io_mem[3] & 0x1); } Since the LSB is set, we don't do anything except masking. The entry is not marked for later update, and we don't update it right now. (a) I guess I could add an "else" branch to the above "is LSB set?" 
check, and if the LSB is in fact set, just say "entry->flags = 1"; ie. schedule a later update. But! (b) what I don't understand is this: *when* does the RHEL-6 kernel program: - PCI_MSIX_ENTRY_LOWER_ADDR == 0, - PCI_MSIX_ENTRY_UPPER_ADDR == 4, - PCI_MSIX_ENTRY_DATA == 8? *at all*? (Here I used the RHEL-6 macro names.) I grepped the RHEL-6 tree for them, but the only write accesses are in write_msi_msg_desc(). Possible call trees: __pci_restore_msi_state write_msi_msg() [drivers/pci/msi.c] write_msi_msg_desc() __pci_restore_msix_state write_msi_msg() [drivers/pci/msi.c] write_msi_msg_desc() arch_setup_msi_irqs() [arch/x86/kernel/apic/io_apic.c] setup_msi_irq() write_msi_msg() write_msi_msg_desc() I'll ignore the first two (they both come from pci_restore_state() -> pci_restore_msi_state(), which doesn't seem to be relevant). The third (arch_setup_msi_irqs()) is interesting though. ... We've seen it in comment 63, but msix_capability_init() calls it between the *first* and *second* accesses to PCI_MSIX_FLAGS (not between second & third), and qemu-dm logs nothing at all there. I'll have to add debug messages to the RHEL-6 guest kernel. (In reply to comment #12) > Created attachment 606050 [details] > rhel63 x64 xen hvm guest linux kernel dmesg log alloc irq_desc for 48 on node -1 <-----+ alloc kstat_irqs on node -1 <-----|--+ ixgbevf 0000:00:06.0: irq 48 for MSI/MSI-X <-----|--|--+ alloc irq_desc for 49 on node -1 <-----+ | | alloc kstat_irqs on node -1 <-----|--+ | ixgbevf 0000:00:06.0: irq 49 for MSI/MSI-X <-----|--|--+ alloc irq_desc for 50 on node -1 <-----+ | | alloc kstat_irqs on node -1 <-----|--+ | ixgbevf 0000:00:06.0: irq 50 for MSI/MSI-X <-----|--|--+ | | | arch_setup_msi_irqs() [arch/x86/kernel/apic/io_apic.c] | | | foreach MSI-X entry: | | | create_irq_nr() | | | irq_to_desc_alloc_node() ------+ | | init_one_irq_desc() | | init_kstat_irqs() ---------+ | setup_msi_irq() | write_msi_msg() | write_msi_msg_desc() | dev_printk() ------------+ Thus setup_msi_irq() positively calls write_msi_msg(), which calls write_msi_msg_desc(). Created attachment 614596 [details]
add debug messages to write_msi_msg_desc() -- debug patch for kernel-2.6.32-279.5.2.el6
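As an aside, the "else branch" idea (a) from the analysis above would amount to roughly the following in pci_msix_writel() [tools/ioemu/hw/pt-msi.c]. This is an untested sketch of that suggestion only, not a patch that was applied anywhere -- as the rest of this bug shows, the investigation moved on to the guest kernel's power-state check instead:

    if ( offset == 3 )
    {
        if ( msix->enabled && !(val & 0x1) )
            pt_msix_update_one(dev, entry_nr);
        else if ( val & 0x1 )
            /* vector is being masked right now: remember that it was
               touched, so the batched pt_msix_update() flushes it once
               MSI-X is unmasked via the control register */
            entry->flags = 1;
        mask_physical_msix_entry(dev, entry_nr, entry->io_mem[3] & 0x1);
    }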
Pasi, our repro env is gone for ten days; we'll have it back on Sep 30th, until Oct 7th. Until then, can you please - rebuild your RHEL-6.3.z guest kernel (2.6.32-279.5.2.el6) with the patch from comment 69 (*), - using this kernel, repeat your VF test (single VF would be best) and upload the guest dmesg (passing "ignore_loglevel" to the guest), - still with this kernel, repeat your PF test (comment 34) and upload the guest dmesg (passing "ignore_loglevel" to the guest). (*) I'm investigating how I can build source & binary RPMs that I'm at liberty to share with you as a customer. Until then, please add the debug patch to the spec file and rebuild the RPM. Thanks! Laszlo We've been re-granted access to another reproducer machine (dell-per820-02.lab.bos.redhat.com), and I think I managed to track the problem a bit further (configuring and passing through a single VF). After I booted the host, the PF (eth0) was not brought up automatically. I booted the guest in this state, with the debug patch from comment 69. This is what I saw in the guest dmesg: ixgbevf: Intel(R) 10 Gigabit PCI Express Virtual Function Network Driver - version 2.2.0-k ixgbevf: Copyright (c) 2009 - 2012 Intel Corporation. ixgbevf 0000:00:06.0: setting latency timer to 64 ixgbevf 0000:00:06.0: PF still in reset state, assigning new address alloc irq_desc for 48 on node -1 alloc kstat_irqs on node -1 ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now ixgbevf 0000:00:06.0: irq 48 for MSI/MSI-X alloc irq_desc for 49 on node -1 alloc kstat_irqs on node -1 ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now ixgbevf 0000:00:06.0: irq 49 for MSI/MSI-X alloc irq_desc for 50 on node -1 alloc kstat_irqs on node -1 ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now ixgbevf 0000:00:06.0: irq 50 for MSI/MSI-X However, no "PCI-MSI-edge" interrupts appeared in /proc/interrupts, ie. at this point I did not yet reproduce the original problem. So I tried to bring up the VF in the guest, with ifconfig. It was refused (link down), with the following message in the guest dmesg: ixgbevf: Unable to start - perhaps the PF Driver isn't up yet corresponding to what I've written at the beginning of this comment -- the PF had not been brought up in the host. Thus I did just that with ifconfig in the host, which succeeded. Then I retried upping the VF (called "rename2" due to udev magic) inside the guest. That succeeded too, with the following two symptoms: (a) the following appeared in /proc/interrupts: 48: 0 PCI-MSI-edge rename2-rx-0 49: 0 PCI-MSI-edge rename2-tx-0 50: 0 PCI-MSI-edge rename2:mbx (b) The following messages were logged *again* in the guest dmesg (produced by my debug patch in comment 69): ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now At this point I can repeatedly bring down & up the VF in the guest. The "down" operation produces a message like this in the host, for the PF: ixgbe 0000:08:00.0: eth0: VF Reset msg received from vf 0 (The passed-through VF is identified as 0000:08:10.0 in the host.) The "up" operation logs the guest messages under (b) each time I run it, and (a) remains unchanged. ----o---- Verdict: the write_msi_msg_desc() guest kernel function, which is a core PCI-MSI-X configuration function, elects not to touch the lower address / upper address / data registers (ie. 
actually configure MSI-X), because it finds that the VF PCI device is not in power saving state D0. D0 means "Fully-On" (see ACPISpec 5.0, 2.3 Device Power State Definitions): This state is assumed to be the highest level of power consumption. The device is completely active and responsive, and is expected to remember all relevant context continuously. "/sys/devices/pci0000:00/0000:00:03.0/0000:08:10.0/power/state" contains "0" on the host side. In the guest, "/sys/devices/pci0000:00/0000:00:06.0/power/wakeup" is the only file in that directory, and it has no contents. According to "Documentation/power/devices.txt", this means that the VF device and/or driver don't physically support wakeup events. As for the current power saving state of the device, I'm unable to locate it. RHEL-5's msix_capability_init() doesn't seem to care about the device's power state. ----o---- The branch in RHEL-6 write_msi_msg_desc() that makes MSI-X register access dependent on PCI power state comes from RHEL-6 commit 20a80eaa: [pci] MSI: Remove unsafe and unnecessary hardware access which has been made for bug 696511, first built in kernel-2.6.32-182.el6. Neighboring minor RHEL-6 releases: RHEL-6.1 2.6.32-131.el6 RHEL-6.2 2.6.32-220.el6 Therefore it might be considered a regression from RHEL-6.1. (In this BZ we have only checked RHEL-6.2, but that release already has the commit.) (CC'ing Don Zickus :)) ----o---- We have two choices here: - we could implement a guest kernel kludge whereby the PCI_D0 check is skipped for Xen HVM guests, - we could backport or fix PCI power state emulation in xen-userspace. Honestly, the thought of it freaks me out. Created attachment 615391 [details] rhel63 x64 xen hvm guest linux kernel dmesg log 2.6.32-279.5.2.el6.bz849223_debug.x86_64 with VF passthru Created attachment 615392 [details] rhel63 x64 xen hvm guest linux kernel dmesg log 2.6.32-279.5.2.el6.bz849223_debug.x86_64 with PF passthru (In reply to comment #71) > > - rebuild your RHEL-6.3.z guest kernel (2.6.32-279.5.2.el6) with the patch > from > comment 69 (*), > - using this kernel, repeat your VF test (single VF would be best) and upload > the guest dmesg (passing "ignore_loglevel" to the guest), Done. "rhel63 x64 xen hvm guest linux kernel dmesg log 2.6.32-279.5.2.el6.bz849223_debug.x86_64 with VF passthru". > - still with this kernel, repeat your PF test (comment 34) and upload the > guest > dmesg (passing "ignore_loglevel" to the guest). > Done. "rhel63 x64 xen hvm guest linux kernel dmesg log 2.6.32-279.5.2.el6.bz849223_debug.x86_64 with PF passthru". ixgbevf_probe() pci_enable_device() __pci_enable_device_flags() do_pci_enable_device() pci_set_power_state(PCI_D0) -- errors returned by this func are fatal in do_pci_enable_device() *except* -EIO which is ignored __pci_start_power_transition() -- retval ignored pci_platform_power_transition() <------+ platform_pci_set_power_state() | pci_platform_pm -> set_state() | pci_update_current_state() | accesses PCI_PM_CTRL config word | pci_raw_set_power_state() | accesses PCI_PM_CTRL config word | pci_restore_bars() -- possibly | pcie_aspm_pm_state_change() -- possibly | __pci_complete_power_transition() | pci_platform_power_transition() ---- see here ----+ Ugh, this is a mess. In xen-userspace, there's a function called pt_pmcsr_reg_write(), "write Power Management Control/Status register", it could be the culprit. Andy, Stefan, is platform_pci_power_manageable() supposed to return true for ixgbevf? Also, is pci_dev.pm_cap nonzero for ixgbevf? 
("PM capability offset in the configuration space".) Thanks! (In reply to comment #76) > Created attachment 615391 [details] > rhel63 x64 xen hvm guest linux kernel dmesg log > 2.6.32-279.5.2.el6.bz849223_debug.x86_64 with VF passthru Confirms branch 1 in guest for VF: alloc irq_desc for 48 on node -1 alloc kstat_irqs on node -1 ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now ixgbevf 0000:00:06.0: irq 48 for MSI/MSI-X alloc irq_desc for 49 on node -1 alloc kstat_irqs on node -1 ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now ixgbevf 0000:00:06.0: irq 49 for MSI/MSI-X alloc irq_desc for 50 on node -1 alloc kstat_irqs on node -1 ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now ixgbevf 0000:00:06.0: irq 50 for MSI/MSI-X [...] ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now ixgbevf 0000:00:06.0: write_msi_msg_desc: don't touch the hardware now (In reply to comment #77) > Created attachment 615392 [details] > rhel63 x64 xen hvm guest linux kernel dmesg log > 2.6.32-279.5.2.el6.bz849223_debug.x86_64 with PF passthru Confirms branch 3 in guest for PF: alloc irq_desc for 48 on node -1 alloc kstat_irqs on node -1 ixgbe 0000:00:06.0: write_msi_msg_desc: branch 3: pos=80 msi_control_reg=0052 ixgbe 0000:00:06.0: read msgctl=0180 r=0 ixgbe 0000:00:06.0: wrote msgctl=0180 r=0 ixgbe 0000:00:06.0: msi_lower_address_reg=0054 address_lo=fee0100c r=0 ixgbe 0000:00:06.0: is-64: msi_upper_address_reg=0058 address_hi=00000000 msi_data_reg=005c data=00004161 r=0 r2=0 ixgbe 0000:00:06.0: irq 48 for MSI/MSI-X ixgbe 0000:00:06.0: write_msi_msg_desc: branch 3: pos=80 msi_control_reg=0052 ixgbe 0000:00:06.0: read msgctl=0181 r=0 ixgbe 0000:00:06.0: wrote msgctl=0181 r=0 ixgbe 0000:00:06.0: msi_lower_address_reg=0054 address_lo=fee0100c r=0 ixgbe 0000:00:06.0: is-64: msi_upper_address_reg=0058 address_hi=00000000 msi_data_reg=005c data=00004161 r=0 r2=0 ixgbe 0000:00:06.0: write_msi_msg_desc: branch 3: pos=80 msi_control_reg=0052 ixgbe 0000:00:06.0: read msgctl=0180 r=0 ixgbe 0000:00:06.0: wrote msgctl=0180 r=0 ixgbe 0000:00:06.0: msi_lower_address_reg=0054 address_lo=fee0100c r=0 ixgbe 0000:00:06.0: is-64: msi_upper_address_reg=0058 address_hi=00000000 msi_data_reg=005c data=00004169 r=0 r2=0 ixgbe 0000:00:06.0: irq 48 for MSI/MSI-X ixgbe 0000:00:06.0: write_msi_msg_desc: branch 3: pos=80 msi_control_reg=0052 ixgbe 0000:00:06.0: read msgctl=0181 r=0 ixgbe 0000:00:06.0: wrote msgctl=0181 r=0 ixgbe 0000:00:06.0: msi_lower_address_reg=0054 address_lo=fee0100c r=0 ixgbe 0000:00:06.0: is-64: msi_upper_address_reg=0058 address_hi=00000000 msi_data_reg=005c data=00004169 r=0 r2=0 Thanks for testing! (In reply to comment #75) > > The branch in RHEL-6 write_msi_msg_desc() that makes MSI-X register access > dependent on PCI power state comes from RHEL-6 commit 20a80eaa: > > [pci] MSI: Remove unsafe and unnecessary hardware access > > which has been made for bug 696511, first built in kernel-2.6.32-182.el6. > Neighboring minor RHEL-6 releases: > > RHEL-6.1 2.6.32-131.el6 > RHEL-6.2 2.6.32-220.el6 > > Therefore it might be considered a regression from RHEL-6.1. (In this BZ we > have only checked RHEL-6.2, but that release already has the commit.) > > (CC'ing Don Zickus :)) > I just tried with 6.1 kernel (2.6.32-131.0.15.el6.x86_64) but I'm seeing the same problem there. Zero interrupts for the VF IRQs and they're PCI-MSI-edge. 
Based on comment 80, the guest kernel does see PCI_D0 when the passed through device is a physical function; this check fails only for the VF. (The host-side PM control emulation, pt_pmcsr_reg_write(), is the same.) This makes me wonder if the root cause is in fact an ixgbevf or core pci driver issue. What I've seen while mapping the callgraph in comment 79 makes me think that devices/drivers not supporting actual power management should just fake the requested PCI_D0 state. See especially: - pci_platform_power_transition(), - pci_update_current_state(). Consider the following condition for pci_platform_power_transition(): !platform_pci_power_manageable(ixgbevf) && (dev->pm_cap != 0) --> dev->current_state will not be set to PCI_D0. Alternatively, platform_pci_power_manageable(ixgbevf) && platform_pci_set_power_state() < 0 --> pci_update_current_state() is not called, "dev->current_state" is not set. Setting needinfo wrt. the question at the end of comment 79. Thanks! :) (In reply to comment #81) > (In reply to comment #75) > > which has been made for bug 696511, first built in kernel-2.6.32-182.el6. > > Neighboring minor RHEL-6 releases: > > > > RHEL-6.1 2.6.32-131.el6 > > RHEL-6.2 2.6.32-220.el6 > > > > Therefore it might be considered a regression from RHEL-6.1. (In this BZ we > > have only checked RHEL-6.2, but that release already has the commit.) > > > > (CC'ing Don Zickus :)) > > > > I just tried with 6.1 kernel (2.6.32-131.0.15.el6.x86_64) but I'm seeing the > same problem there. Zero interrupts for the VF IRQs and they're PCI-MSI-edge. Not a regression then, but the cause preventing passthru ixgbevf from working in 6.1 might be different. (For example, ixgbevf has been updated to upstream version 2.2.0-k from 6.2 to 6.3; see comment 4.) Woah, see upstream linux commit b51306c6. Hmm, are you sure b51306c6 is the correct one? I can't find anything matching that from linus's linux.git or from google.. (In reply to comment #85) > Hmm, are you sure b51306c6 is the correct one? I can't find anything > matching that from linus's linux.git or from google.. Yes, it looks like a good candidate. http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=b51306c6 I'll backport the patch & build a test kernel for you in a few hours. Created attachment 615482 [details]
[1/1] PCI: Set device power state to PCI_D0 for device without native PM support
Backport upstream Linux commit b51306c63449d7f06ffa689036ba49eb46e898b5,
minus the hunk reverting upstream Linux commit
47e9037ac16637cd7f12b8790ea7ce6680e42168, because we haven't backported
the latter.
---
drivers/pci/pci.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
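For readers who don't open the attachment: as far as I can tell, the three added lines boil down to the following fallback in the platform power-transition path of drivers/pci/pci.c (the attachment is the authoritative version):

    /* Fall back to PCI_D0 if native PM is not supported */
    if (!dev->pm_cap)
        dev->current_state = PCI_D0;

That is, a device that exposes no PCI Power Management capability -- like the 82599 VF, whose lspci output above lists only an MSI-X capability -- is simply assumed to be in D0, so the current_state check in write_msi_msg_desc() no longer prevents the MSI-X table from being programmed.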
Thanks for the link. No idea why I didn't find that.. not enough coffee :) And yes, good find, that definitely sounds like it could fix the problem! The patch in comment 87 fixes the problem for me. I had to pass through all ports to the guest (two in total, one virtual function per port, I think), but (a) now I can see the interrupt counters increasing: 48: 114 PCI-MSI-edge eth1-rx-0 49: 135 PCI-MSI-edge eth1-tx-0 50: 29 PCI-MSI-edge eth1:mbx (b) one port/VF has a live cable, the other not, and ifconfig in the guest can see that (SIOCSIFFLAGS: Network is down). With the patch from comment #87 I'm seeing the following: - Passthru 1 VF: The VF doesn't work, interrupt counters stay at zero. - Passthru 2 VFs: First VF works OK, the second VF doesn't, interrupt counters are zero for it. - Passthru 4 VFs: Only two first VFs are visible in "lspci" in the guest, and both of them fail - interrupt counters are zero for both of the VFs. In dom0 both PF ports are 'UP' and connected to a switch. So there's still something wrong.. (The "Passthru 4 VFs, only 2 visible in the guest" issue is probably another separate bug, should I open a new bug about that?) I tried to do some research; any input and/or corrections are welcome. (1) Per default, RHEL-5 qemu-dm supports at most two hotpluggable (passthrough) PCI devices. See vl.h: /* PCI slot 6~7 support ACPI PCI hot plug */ #define PHP_SLOT_START (6) #define PHP_SLOT_END (8) (2) Passed through functions (that may or may not share a device on the host side) show up as separate single-function devices (00:06.0, 00:07.0) in the guest. This may have been improved upstream (see eg. [1] [2]), but it's very unlikely we would touch this in RHEL-5. (3) Based on references [3], [4] and [5]: when passing through a function from a PCI device, you may have to pass through all functions of that device. An exception might be if the device supports FLR (function level reset). We can derive that - passing through more than two VFs probably won't work per default (2) (1), - configuring more than two VFs for the same NIC port, and then passing through at most two of those (ie. "not all") VFs will not work (3). From this point we should talk specific BDFs (bus-device-function triplets), using attachment 606045 [details] from comment 10. Passing through 03:00.0 (PF) on its own did work. I think it's due to "FLReset+" in the "DevCap" section. Same for 03:00.1. The ports (PFs) of the NIC share the host PCI device, but they support function level reset. The VFs (03:10.[0-7] for PF 03:00.0, and separately, 03:11.[0-7] for PF 03:00.1) do not support FLR. Therefore, if any 03:10.x is passed through, all existing, sibling VFs must be passed through to the same guest. The set of all VFs, for both ports together, controlled by the the max_vfs ixgbe option, must not consist of more than 2 elements per default, because of (1). Please test the following configuration with the comment 87 patch: - Module option for ixgbe: max_vfs=1 This should produce one VF per PF. - Guest passthrough stanza (and corresponding : pci = [ "0000:03:10.0", "0000:03:10.1" ] (or whatever BDF the ixgbe driver assigns to the single VF of each port.) Pciback should hide the same BDFs. - One NIC port should be accessible in the guest under 00:06.0 (see "ethtool -i ethX"), the other under 00:07.0. There's at least one way to lift the default limit of 2 on passed-through VFs. 
Please see bug 835768 comment 14 ("xen_emul_unplug=ide-disks" guest command line parameter, somewhat described in "Documentation/kernel-parameters.txt" too). Unplugging some emulated devices frees up guest BDFs for PCI passthrough. Hence please repeat the above test with "max_vfs=2" (and dependencies updated) in the host, and "xen_emul_unplug=ide-disks" specified in the guest. Thanks! [1] http://www.lca2010.org.nz/programme/schedule/view_talk/50048 [2] http://www.lca2010.org.nz/slides/50048.pdf [3] http://wiki.xen.org/wiki/Xen_PCI_Passthrough#I_get_.22Error:_pci:_0000:02:06.0_must_be_co-assigned_to_the_same_guest_with_0000:02:05.0.22_error_when_trying_to_start_the_guest [4] http://wiki.xen.org/wiki/Xen_PCI_Passthrough#How_can_I_check_if_PCI_device_supports_FLR_.28Function_Level_Reset.29_.3F [5] http://wiki.xen.org/wiki/Xen_PCI_Passthrough#passing_multiple_PCI_devices (In reply to comment #96) > I tried to do some research; any input and/or corrections are welcome. > > (1) Per default, RHEL-5 qemu-dm supports at most two hotpluggable > (passthrough) PCI devices. See vl.h: > > /* PCI slot 6~7 support ACPI PCI hot plug */ > #define PHP_SLOT_START (6) > #define PHP_SLOT_END (8) > Hmm, OK. That explains why I can see only 2 pass through devices in the VM. > (2) Passed through functions (that may or may not share a device on the host > side) show up as separate single-function devices (00:06.0, 00:07.0) in the > guest. This may have been improved upstream (see eg. [1] [2]), but it's very > unlikely we would touch this in RHEL-5. > Too bad :( > (3) Based on references [3], [4] and [5]: when passing through a function > from a PCI device, you may have to pass through all functions of that > device. > The whole point of SR-IOV is to be able to pass through VFs to different/multiple VMs.. I'm pretty certain this works properly in upstream Xen. I need to test/verify that. Also I'll try VFs with multiple RHEL5 PV domUs. > An exception might be if the device supports FLR (function level > reset). > The VFs have "FLReset-" in dom0.. that's weird. > > Please test the following configuration with the comment 87 patch: > Ok, will do. > > There's at least one way to lift the default limit of 2 on passed-through > VFs. Please see bug 835768 comment 14 ("xen_emul_unplug=ide-disks" guest > command line parameter, somewhat described in > "Documentation/kernel-parameters.txt" too). > > Unplugging some emulated devices frees up guest BDFs for PCI passthrough. > Hence please repeat the above test with "max_vfs=2" (and dependencies > updated) in the host, and "xen_emul_unplug=ide-disks" specified in the > guest. > Ok, will try this aswell. (In reply to comment #97) > (In reply to comment #96) > > > (3) Based on references [3], [4] and [5]: when passing through a function > > from a PCI device, you may have to pass through all functions of that > > device. > > > > The whole point of SR-IOV is to be able to pass through VFs to > different/multiple VMs.. I'm pretty certain this works properly in upstream > Xen. I need to test/verify that. > > Also I'll try VFs with multiple RHEL5 PV domUs. > I tried SR-IOV with the same RHEL 5.8 dom0, and the following RHEL 5.8 x64 guests all running simultaneously: - hvm01: 1 VF, works OK. - hvm02: 1 VF, works OK. - hvm03: 2 VFs, both VFs work OK. - pv01: 1 VF, works OK. - pv02: 1 VF, works OK. - pv03: 2 VFs, both VFs work OK. So 8x VFs total, assigned and spread among to 6x RHEL 5.8 VMs, everything working OK! 
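To make the two-slot limit from comment 96 concrete, here is a tiny stand-alone illustration. The PHP_SLOT_START/PHP_SLOT_END values are copied from the vl.h excerpt above; the rest is invented and is not qemu-dm code.

#include <stdio.h>

/* Illustration only: with slots [PHP_SLOT_START, PHP_SLOT_END) reserved for
 * ACPI PCI hotplug, a default RHEL-5 qemu-dm guest has exactly two
 * passthrough slots.  This is why passed-through VFs show up as 00:06.0 and
 * 00:07.0 in the guest, and why a third VF has nowhere to go. */
#define PHP_SLOT_START 6
#define PHP_SLOT_END   8        /* exclusive: slots 6 and 7 */

int main(void)
{
        int slot;

        printf("hotpluggable passthrough slots: %d\n",
               PHP_SLOT_END - PHP_SLOT_START);
        for (slot = PHP_SLOT_START; slot < PHP_SLOT_END; slot++)
                printf("a passed-through function lands at 00:%02x.0\n", slot);
        return 0;
}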
It seems to me that there's a bug in the RHEL6 kernel that causes the following issues: - when 1 VF assigned it doesn't work at all, no interrupts received. - when 2 VFs assigned only one (first) of them works, the other VF doesn't receive any interrupts. Let me know if you want me to do further tests with RHEL6 guests. Thanks! Hi, (In reply to comment #99) > It seems to me that there's a bug in the RHEL6 kernel that causes the > following issues: > - when 1 VF assigned it doesn't work at all, no interrupts received. > - when 2 VFs assigned only one (first) of them works, the other VF doesn't > receive any interrupts. May I ask if these were precisely the two tests described in comment 96? If those tests do not work, I'd like to investigate more. Although I'm not sure how I'm going to debug them, since they worked in my test environment: - the max_vfs=1 case as per comment 92 - I just tested the max_vfs=2 case too, and it works as well. (Setup & results in next comment.) If those precise tests work in your environment, I'd like to post the patch internally and move this BZ to POST state. > Let me know if you want me to do further tests with RHEL6 guests. I think further problems should be reported as separate BZs. I've asked our Quality Engineering team about their VF passthrough test cases, in order to get a picture of what we support exactly. Thank you, Laszlo (In reply to comment #101) > May I ask if these were precisely the two tests described in comment 96? > > If those tests do not work, I'd like to investigate more. Although I'm not > sure how I'm going to debug them, since they worked in my test environment: > - the max_vfs=1 case as per comment 92 > - I just tested the max_vfs=2 case too, and it works as well. (Setup & > results in next comment.) Host (2.6.18-308.el5xen x86_64): grub entry: kernel /xen.gz-2.6.18-308.el5 dom0_mem=2048M iommu=1 loglvl=all \ guest_loglvl=all bootscrub=0 com1=115200,8n1 module /vmlinuz-2.6.18-308.el5xen ro root=/dev/VolGroup00/LogVol00 \ console=ttyS0,115200n81 pci_pt_e820_access=on ignore_loglevel /etc/modprobe.conf: options ixgbe max_vfs=2 options pciback \ hide="(0000:08:10.0)(0000:08:10.1)(0000:08:10.2)(0000:08:10.3)" /etc/modprobe.d/blacklist.conf: blacklist ixgbevf NIC: [root@dell-per820-02 ~]# ethtool -i eth0 driver: ixgbe version: 3.4.8-k firmware-version: 0.9-3 bus-info: 0000:08:00.0 [root@dell-per820-02 ~]# ethtool -i eth1 driver: ixgbe version: 3.4.8-k firmware-version: 0.9-3 bus-info: 0000:08:00.1 [root@dell-per820-02 ~]# ifconfig eth0 up [root@dell-per820-02 ~]# ifconfig eth1 up vm config: disk = [ "file:/var/lib/xen/images/guest.img,hda,w", ",hdc:cdrom,r" ] pci = [ "0000:08:10.0", "0000:08:10.1", "0000:08:10.2", "0000:08:10.3" ] Guest (2.6.32-279.5.2.el6.bz849223_pci_d0_Z x86_64): kernel cmdline: ... 
ignore_loglevel console=tty console=ttyS0,115200n81 \ xen_emul_unplug=ide-disks lspci: 00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet \ Controller Virtual Function (rev 01) 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet \ Controller Virtual Function (rev 01) 00:06.0 Ethernet controller: Intel Corporation 82599 Ethernet \ Controller Virtual Function (rev 01) 00:07.0 Ethernet controller: Intel Corporation 82599 Ethernet \ Controller Virtual Function (rev 01) ethtool reports for eth1..eth4 (in this order): commonly driver: ixgbevf version: 2.2.0-k firmware-version: specifically bus-info: 0000:00:06.0 bus-info: 0000:00:07.0 bus-info: 0000:00:04.0 bus-info: 0000:00:03.0 when bringing them up with ifconfig (in this order), the host reports ixgbe 0000:08:00.0: eth0: VF Reset msg received from vf 0 ixgbe 0000:08:00.1: eth1: VF Reset msg received from vf 0 ixgbe 0000:08:00.1: eth1: VF Reset msg received from vf 1 ixgbe 0000:08:00.0: eth0: VF Reset msg received from vf 1 [root@dhcp47-109 ~]# grep PCI-MSI /proc/interrupts 48: 370 PCI-MSI-edge eth4-rx-0 49: 376 PCI-MSI-edge eth4-tx-0 50: 11 PCI-MSI-edge eth4:mbx 51: 371 PCI-MSI-edge eth3-rx-0 52: 377 PCI-MSI-edge eth3-tx-0 53: 11 PCI-MSI-edge eth3:mbx 54: 374 PCI-MSI-edge eth1-rx-0 55: 379 PCI-MSI-edge eth1-tx-0 56: 11 PCI-MSI-edge eth1:mbx 57: 373 PCI-MSI-edge eth2-rx-0 58: 378 PCI-MSI-edge eth2-tx-0 59: 11 PCI-MSI-edge eth2:mbx ... Then I repeated this same test, with the only change that pci = [ "0000:08:10.0" ] was specified in the vm config. The VF worked, interrupts kept increasing. I shut down the guest, removed pciback in dom0, and checked the four VFs for FLReset. None of them support FLR. I just tried with Fedora 17 HVM guests aswell: - F17 HVM with 1 VF assigned: works OK. - F17 HVM with 2 VFs assigned: both VFs work OK. I'll re-test with RHEL6 HVM guests. ( Side point, in reply to comment #97, > The VFs have "FLReset-" in dom0.. that's weird. I've just found this pearl in the xend source (commit e094c492 "Use PCIe FLR for VF of Intel 82599 10GbE Controller", bug 581655): # Quirk for the VF of Intel 82599 10GbE Controller. # We know it does have PCIe FLR capability even if it doesn't # report that (dev_cap.PCI_EXP_DEVCAP_FLR is 0). # See the 82599 datasheet. ) (In reply to comment #104) > ( > Side point, in reply to comment #97, > > > The VFs have "FLReset-" in dom0.. that's weird. > > I've just found this pearl in the xend source (commit e094c492 "Use PCIe FLR > for VF of Intel 82599 10GbE Controller", bug 581655): > > # Quirk for the VF of Intel 82599 10GbE Controller. > # We know it does have PCIe FLR capability even if it doesn't > # report that (dev_cap.PCI_EXP_DEVCAP_FLR is 0). > # See the 82599 datasheet. > ) That's a good find! It very well explains why I'm able to pass VFs to many/multiple VMs at the same time.. and that's how it's supposed to be :) More info soon.. New tests.. rhel 5.8 x64 dom0 (2.6.18-308.13.1.el5xen): - max_vfs=1 in /etc/modprobe.conf, other settings similar to comment #102. - Reboot the physical server before starting tests. - 2 VFs visible in dom0 lspci & bound to pciback. - "ifconfig ethX up" both PF ports. rhel 6.3 x64 hvm guest (2.6.32-279.5.2.el6.bz849223_pci_d0_Z.x86_64): - start the guest with 1 VF passed through. - Configure an IP to the VF eth-interface: "ifconfig ethX <ip> netmask <netmask> up". - Run "ethtool ethX" and verify "Link detected: yes". - Try pinging the default gateway IP, no replies. - Try running "tcpdump -i ethX -nn", no packets visible. 
- Notice how rx/tx packet counters stay at zero in "ifconfig ethX" output. - Notice how interrupt counters stay at zero in /proc/interrupts. - shutdown the guest and start it again. - repeat the tests and notice it still won't work. - shutdown the guest and start it again. - repeat the tests and notice it still won't work. Ok, so the VF doesn't work in rhel6 hvm guest (which had the pci_d0 patched kernel). Next I tried with rhel 5.8 x64 hvm guest (2.6.18-308.13.1.el5): - Start the guest with the same 1 VF passed through. - Notice the VF interface name is "__tmp254339888" inside the rhel 5.8 hvm guest. # ethtool -i __tmp254339888 driver: ixgbevf version: 2.1.0-k firmware-version: N/A bus-info: 0000:00:06.0 - Run "ifconfig __tmp254339888 <ip> netmask <netmask> up". - The console window of the rhel5 guest disappears and the guest kernel crashes, but the guest is still visible in "xm list" output. - I'll capture the guest kernel crash / stack trace and attach it later. - "xm destroy <rhel5hvm>". - start the rhel5 hvm guest again. - notice the VF interface is now actually called "ethX", like it should. - "ifconfig ethX <ip> netmask <netmask> up". - ping the gateway and notice the VF works OK. - Check the "ifconfig ethX" output and notice how rx/tx counters increase. - Check "/proc/interrupts" and notice how interrupt counters increase for the VF. - shutdown the rhel5 hvm guest and re-start it. - repeat the tests and notice the VF still works OK. - shutdown the rhel5 hvm guest and re-start it once again. - repeat the tests and notice the VF still works OK. So.. after trying to use the non-working rhel6 hvm guest the VF is left in some bad state, and rhel5 hvm guest crashes while trying to use it for the first time. On the second try the VF starts working OK in the rhel5 hvm guest. And then when the VF is working OK in the rhel5 hvm guest, it keeps working OK even if I reboot or shutdown + restart the rhel5 hvm guest multiple times. When the VF is in a working state (after running the rhel5.8 hvm guest) I tried booting into the rhel6 guest again, but the VF still fails there.. interrupt counters stay at zero, and the VF won't work. Next I passed the same VF to Fedora 17 HVM guest, and it works OK there. I also tried rebooting the physical server again, and starting the rhel5.8 hvm guest as the *first* guest - then the VF works immediately and I don't get any guest crashes. So the rhel5 guest kernel crash I'm seeing is related to rhel6 hvm guest leaving the VF to some bad state. Summary: - 1 VF works OK in rhel5 PV guest. - 1 VF works OK in rhel5 HVM guest. - 1 VF works OK in F17 HVM guest. - 1 VF doesn't work in RHEL6 HVM guest, interrupt counters stay at zero. Created attachment 616740 [details]
rhel58 x64 xen hvm guest ixgbevf_msix_clean_tx crash log stack trace
This is the rhel 5.8 hvm guest kernel crash that I get when trying to use an SR-IOV VF that has been put into some kind of "bad state" by the non-working rhel6.3 hvm guest.
(In reply to comment #101) > > If those precise tests work in your environment, I'd like to post the patch > internally and move this BZ to POST state. > The patch helps (and thus is probably needed), but unfortunately it doesn't fix all the problems in my system. The remaining problems with the rhel 6.3 hvm guest are: - passthrough 1 VF: the VF doesn't work, interrupt counters stay at zero. - passthrough 2 VFs: Only the first VF works, the second VF doesn't - the interrupt counters stay at zero. And like already mentioned these problems are RHEL6.3 guest specific - RHEL5.8 PV, RHEL5.8 HVM and F17 HVM guests do work OK for both of those test cases. Thanks a lot for all the help! I'll build a guest kernel with both attachment 615482 [details] and attachment 614596 [details]. Created attachment 617217 [details] rhel63 x64 xen hvm guest with 1 vf does not work 2.6.32-279.5.2.el6.bz849223_pci_d0_dbg Created attachment 617218 [details] rhel63 x64 xen hvm guest with 2 vfs only first vf works 2.6.32-279.5.2.el6.bz849223_pci_d0_dbg (In reply to comment #109) > I'll build a guest kernel with both attachment 615482 [details] and > attachment 614596 [details]. Thanks. New guest dmesg logs attached with kernel 2.6.32-279.5.2.el6.bz849223_pci_d0_dbg. - 1 VF passthrough: the VF doesn't work, interrupt counters stay at zero. - 2 VF passthrough: the first VF works OK. The second VF doesn't work, interrupt counters stay at zero for the second VF. Thanks! Thanks. It's interesting that for PFs, branch 3 is invoked (comment 80), while for VFs, branch 2. Created attachment 617260 [details]
rhel58 x64 xen qemu-dm log for rhel63 x64 hvm guest with 1 vf
Created attachment 617261 [details]
rhel58 x64 xen qemu-dm log for rhel63 x64 hvm guest with 2 vfs
I attached qemu-dm logs for both 1vf and 2vfs testcases. If you take a look at the 2vfs case (where the first vf works, and the second vf doesn't), you can see these differences: First VF: pt_pci_read_config: Warning: Return ALL F from libpci read. [00:06.0][Offset:00h][Length:4] pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h pt_msix_update_one: now update msix entry 0 with pirq ff gvec 59 pt_msix_update_one: now update msix entry 1 with pirq fe gvec 61 pt_msix_update_one: now update msix entry 2 with pirq fd gvec 69 Second VF: pt_pci_read_config: Warning: Return ALL F from libpci read. [00:07.0][Offset:00h][Length:4] pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h pci_msix_writel: can not update msix entry 0 since MSI-X is already function now. pci_msix_writel: can not update msix entry 0 since MSI-X is already function now. pci_msix_writel: can not update msix entry 0 since MSI-X is already function now. pci_msix_writel: can not update msix entry 1 since MSI-X is already function now. pci_msix_writel: can not update msix entry 1 since MSI-X is already function now. pci_msix_writel: can not update msix entry 1 since MSI-X is already function now. pci_msix_writel: can not update msix entry 2 since MSI-X is already function now. pci_msix_writel: can not update msix entry 2 since MSI-X is already function now. pci_msix_writel: can not update msix entry 2 since MSI-X is already function now. .. and the test case with 1 vf passed through doesn't have *any* mention of either "pci_msix_writel" or "pt_msix_update_one" .. (In reply to comment #120) > .. and the test case with 1 vf passed through doesn't have *any* mention of > either "pci_msix_writel" or "pt_msix_update_one" .. Yes, let's focus on this one first. The corresponding kernel log (comment 113) testifies about three batches of MSI-X updates. The batches are identical. I can't see any reason to repeat the batch. The ixgbevf driver doesn't seem to do it in a loop. I have a theory involving module removal, based on what I've read on the net. udev is going crazy renaming these interfaces. In the mailing list thread someone claimed that udev modifies its persistent net rules and then reloads the driver at rename time. This would certainly explain the multiple batches of MSI-X initialization. If module removal does not tear down MSI-X to qemu-dm's liking, it could be an explanation. I shall extend the kernel debug patch with WARN invocations (in order to get stackdumps). I'll also upload a set of xen packages with more logging (to the tune of attachment 614372 [details]). (In reply to comment #119) > First VF: > [pt_msixctrl_reg_write x 3] > > Second VF: > [pt_msixctrl_reg_write x 3] > [pci_msix_writel x 9] The distribution of these log entries between the two VFs seems a bit different actually. The guest log in comment 114 has 15 entries (5 batches) of MSI-X setup, matching qemu-dm.log as follows: msix_capability_init() call for VF1 (see comment 63 & comment 65): > pt_pci_read_config: Warning: Return ALL F from libpci read. > [00:06.0][Offset:00h][Length:4] > pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h 1st "write_msi_msg_desc: branch 2" batch for 00:06.0 (VF 1) happens here. 
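As a rough model of the guard that the "can not update msix entry ... since MSI-X is already function now" lines point to: the assumption is that qemu-dm refuses to retarget an MSI-X table entry once the function's MSI-X is enabled and the vector is unmasked, and only buffers the write. All names and structures below are invented; this is not the real pass-through code.

#include <stdio.h>

/* Toy model only.  Real qemu-dm keeps per-entry state for each passed-through
 * function; a single function with three vectors is enough to show why the
 * guest's table writes are rejected once MSI-X is already active. */
struct toy_msix_entry { int masked; unsigned long addr, data; };
struct toy_msix { int enabled; struct toy_msix_entry entry[3]; };

static void toy_msix_writel(struct toy_msix *msix, int nr,
                            unsigned long addr, unsigned long data)
{
        if (msix->enabled && !msix->entry[nr].masked) {
                printf("can not update msix entry %d: MSI-X already active\n",
                       nr);
                return;                  /* the write is not applied */
        }
        msix->entry[nr].addr = addr;     /* otherwise buffer the update */
        msix->entry[nr].data = data;
}

int main(void)
{
        struct toy_msix m = { .enabled = 1 };   /* vectors left unmasked */

        toy_msix_writel(&m, 0, 0xfee00000UL, 0x4059UL);
        toy_msix_writel(&m, 1, 0xfee00000UL, 0x4061UL);
        return 0;
}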
> pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h > pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h > pt_msix_update_one: now update msix entry 0 with pirq ff gvec 59 > pt_msix_update_one: now update msix entry 1 with pirq fe gvec 61 > pt_msix_update_one: now update msix entry 2 with pirq fd gvec 69 msix_capability_init() call for VF2: > pt_pci_read_config: Warning: Return ALL F from libpci read. > [00:07.0][Offset:00h][Length:4] > pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h 1st "write_msi_msg_desc: branch 2" batch for 00:07.0 (VF 2) happens here, with no effect. > pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h > pt_msixctrl_reg_write: old_ctrl:0002h new_ctrl:0002h The following updates don't come from msix_capability_init() -- no control register access: > pci_msix_writel: can not update msix entry 0 since MSI-X is already > function now. > pci_msix_writel: can not update msix entry 0 since MSI-X is already > function now. > pci_msix_writel: can not update msix entry 0 since MSI-X is already > function now. > pci_msix_writel: can not update msix entry 1 since MSI-X is already > function now. > pci_msix_writel: can not update msix entry 1 since MSI-X is already > function now. > pci_msix_writel: can not update msix entry 1 since MSI-X is already > function now. > pci_msix_writel: can not update msix entry 2 since MSI-X is already > function now. > pci_msix_writel: can not update msix entry 2 since MSI-X is already > function now. > pci_msix_writel: can not update msix entry 2 since MSI-X is already > function now. This is the 2nd batch for VF1, and the 2nd and 3rd batches for VF2, all interleaved. They are triggered by write_msi_msg_desc(), which could be called from __pci_restore_msix_state() (see comment 65). Created attachment 617528 [details]
write_msi_msg_desc() debug messages (v2), now with WARN
Created attachment 617541 [details]
qemu-dm debug messages (v2), with PCI BDFs
for xen-3.0.3-135.el5_8.5
trailing whitespace was stripped from "tools/ioemu/hw/pass-through.c" first
Pasi, can you please do the following tests?

+-------------------------+-------------+---------------------------+
| host (pciback in sync)  | max_vfs=1   | max_vfs=2                 |
+-------------------------+-------------+-------------+-------------+
| # of passed-through VFs | 1           | 1           | 2           |
+-------------------------+------+------+------+------+------+------+
| guest                   | 5.8  | 6.3  | 5.8  | 6.3  | 5.8  | 6.3  |
+-------------------------+------+------+------+------+------+------+
| results thus far (with  | pass | FAIL | pass | FAIL | pass | FAIL |
| comment reference)      | c106 | c106 | c108 | c113 | c108 | c114 |
+-------------------------+------+------+------+------+------+------+
| requesting qemu-dm log  | c124 | c124 | c124 | c124 | c124 | c124 |
+-------------------------+------+------+------+------+------+------+
| requesting guest log    | base | c123 | base | c123 | base | c123 |
+-------------------------+------+------+------+------+------+------+

6 tests, 12 log files; please feel free to upload the logs in a tarball.

Please
- reboot the host between the tests,
- specify ignore_loglevel everywhere,
- guests are all HVM.

Thanks!

... and please describe the end result of each test, ie. whether all, or some, or no VFs work. Thanks!

Created attachment 617718 [details]
sr-iov vf passthru test results to el5.8 and el6.3 hvm guests
+-------------------------+-------------+---------------------------+
| host (pciback in sync)  | max_vfs=1   | max_vfs=2                 |
+-------------------------+-------------+-------------+-------------+
| # of passed-through VFs | 1           | 1           | 2           |
+-------------------------+------+------+------+------+------+------+
| guest                   | 5.8  | 6.3  | 5.8  | 6.3  | 5.8  | 6.3  |
+-------------------------+------+------+------+------+------+------+
| results of the test     | pass | FAIL | pass | FAIL | pass | PART |
+-------------------------+------+------+------+------+------+------+

Notes about the test results:
- "FAIL": the VF doesn't work, no interrupts received for the VF in /proc/interrupts.
- "PART": partial success; the first VF works OK, the second VF fails and doesn't get any interrupts in /proc/interrupts.
- "pass": the VF interface has a name like "__tmp1960421532" in the el5.8 guest, and doing "ifconfig __tmp1960421532 up" crashes the guest kernel. If the guest has 2 VFs passed through, this happens only for the first VF. After restarting the el5.8 guest it works OK without problems. The crash log/stack trace is in comment #107. So in all el5.8 guest tests I had to reboot/restart the guest once before doing the actual test and capturing the dmesg log.

And I forgot to mention about "pass" for the el5.8 guest with 2 VFs passed through: both of the VFs worked OK (after restarting the guest once).

And I rebooted the physical server 6 times; once before every test.

I think I might have found a clue.

$ grep squash *
max_vfs1-1vf-el63-qemu-dm-log.txt:squash iomem [f4024000, f4024030).
max_vfs2-1vf-el63-qemu-dm-log.txt:squash iomem [f4024000, f4024030).
max_vfs2-2vf-el63-qemu-dm-log.txt:squash iomem [f402c000, f402c030).

The squashed iomem region is exactly the one that is used for programming the *last VF* passed through. This is the reason why there are no pci_msix_writel() messages for the last VF passed through, ie. why a buffered update is not prepared and then flushed -- even though the guest issues those writes by now (due to the PCI_D0 patch), qemu-dm simply doesn't have a handler for the range.

In the max_vfs=2, rhel63 guest, 1vf->2vf transition, the squashed iomem "moves" (see above), and in the second case the pci_msix_writel() messages show up for the first VF, now that the squashed region "moved over" to the range belonging to the 2nd (= last) VF.

In the qemu-dm logs for the 5.8 guests, there are no such messages. In the qemu-dm logs for the 6.3 guests, the message always shows up in a block like this:

+Unknown PV product 3 loaded in guest
+PV driver build 1
+region type 0 at [f4000000,f4020000).
+squash iomem [f4024000, f4024030).
+region type 1 at [c200,c240).

The "squash iomem" message is printed on the following call path:

  platform_fixed_ioport_write2  [tools/ioemu/hw/xen_platform.c]
    pci_unplug_netifs           [tools/ioemu/hw/pci.c]
      unregister_iomem          [tools/ioemu/target-i386-dm/exec-dm.c]

The platform_fixed_ioport_write2() --> pci_unplug_netifs() call depends on UNPLUG_ALL_NICS, which I think is something that the RHEL-6 guest requests.

... Confirmed, see xen_unplug_emulated_devices() [arch/x86/xen/platform-pci-unplug.c] in the RHEL-6 kernel. In the absence of the "xen_emul_unplug" command line parameter, a default value is used, which is composed to have XEN_UNPLUG_ALL_NICS. (See the rhel63 guest dmesgs; all three contain the "unplug emulated NICs" message.) The RHEL-5 guest has no xen_emul_unplug support.

pci_unplug_netifs() iterates over all netifs, but it will not unplug one if test_pci_slot() returns 1 for it.
I'll have to dig deeper into that test. "xen_emul_unplug=never" (or "xen_emul_unplug=ide-disks" too, see comment 96) prevents this, but I think qemu-dm should not unplug the VF. We backported that check for bug 665032. Our version of test_pci_slot() will happily allow pci_unplug_netifs() to unplug any ethernet device not in { 00:06.0, 00:07.0 }. When we want to pass through more than two PCI devs, we have to use xen_emul_unplug=..., so that the UNPLUG_ALL_NICS default is not in effect, and we don't reach pci_unplug_netifs(). When we pass through <= 2 devices, test_pci_slot() nonetheless returns 0 for the last one, its dpci_infos.php_devs[php_slot].valid entry is "false". That flag is set to 1 in __insert_to_pci_slot(), and I don't see why, based on the qemu-dm logs and register_real_device(). I'll have to extend the debug patch. ... commits ea4860c1 and f3460ff8 from <git://xenbits.xen.org/qemu-xen-unstable.git> seem somewhat relevant, but I think they're too intrusive. (In reply to comment #133) > In the qemu-dm logs for the 6.3 guests, the message always shows up in such > a block: > > +Unknown PV product 3 loaded in guest > +PV driver build 1 > +region type 0 at [f4000000,f4020000). > +squash iomem [f4024000, f4024030). > +region type 1 at [c200,c240). I re-checked an older qemu-dm log from a RHEL-6.3 guest here, from comment 61. (At that time we were still looking for PCI_D0 in the guest, but it's irrelevant wrt. "squash iomem" in qemu-dm.) It only has Unknown PV product 3 loaded in guest PV driver build 1 no "squash iomem". I've reviewed some others as well: - comment 32 : 1 VF, 1 squash - comment 38 : 4 VFs, 3 squashes (although only two squashes match VFs) - comment 117: 1 VF, 1 squash (guest had PCI_D0 patch) - comment 118: 2 VFs, 1 squash (ditto) qemu-dm works differently in our respective environments in this regard. So hmm.. do you want me to try something? disable unplug on the guest cmdline, perhaps? Thanks. Yes, if you could re-run the three 6.3 guest tests (same RPMs as last time), with "xen_emul_unplug=ide-disks" (or "xen_emul_unplug=never") on the guest cmdline, that would be great, just to verify the theory. But I'll instrument qemu-dm some more and upload a new build shortly. Created attachment 618009 [details]
qemu-dm debug messages (v3), track "valid" flag too
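For orientation, a toy sketch of the unplug decision being instrumented here: pci_unplug_netifs() keeps a NIC only when test_pci_slot() says its slot holds a registered passthrough device. The slot numbers and the "valid" flag mirror the qemu-dm log fields quoted above; the data structures are invented and are not the real RHEL-5 qemu-dm code.

#include <stdio.h>

/* Toy sketch only.  Slot 4 stands for the emulated e1000 NIC, slot 6 for the
 * passed-through VF registered by register_real_device(). */
#define PHP_SLOT_START 6
#define PHP_SLOT_END   8

struct toy_php_dev { int valid; };
static struct toy_php_dev php_devs[PHP_SLOT_END - PHP_SLOT_START];

/* Returns 1 if the slot holds a registered passthrough device (keep it),
 * 0 otherwise (an emulated NIC in that slot may be unplugged). */
static int toy_test_pci_slot(int slot)
{
        if (slot < PHP_SLOT_START || slot >= PHP_SLOT_END)
                return 0;
        return php_devs[slot - PHP_SLOT_START].valid;
}

int main(void)
{
        php_devs[0].valid = 1;   /* VF at guest 00:06.0 */

        printf("slot 4 (emulated e1000):    keep=%d\n", toy_test_pci_slot(4));
        printf("slot 6 (passed-through VF): keep=%d\n", toy_test_pci_slot(6));
        return 0;
}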
Pasi, (In reply to comment #140) > Created attachment 618009 [details] > qemu-dm debug messages (v3), track "valid" flag too please repeat the "max_vfs1-1vf-el63" test with this Xen package. (Reusing the guest kernel from comment 123, ie. from the most recent tests.) Please make sure that "xen_emul_unplug" is absent from the guest cmdline. Thanks! (In reply to comment #139) > Yes, if you could re-run the three 6.3 guest tests (same RPMs as last time), > with "xen_emul_unplug=ide-disks" (or "xen_emul_unplug=never") on the guest > cmdline, that would be great, just to verify the theory. > I just quickly did the "max_vfs1_1vf_el63" test with "xen_emul_unplug=never" on the guest kernel cmdline, and now the VF works in the guest !! (I used the previous version of xen - I'll try the latest instrumented xen soon). Created attachment 618145 [details] sr-iov vf passthru to el6.3 test results for comment #142 (In reply to comment #142) > Pasi, > > (In reply to comment #140) > > Created attachment 618009 [details] > > qemu-dm debug messages (v3), track "valid" flag too > > please repeat the "max_vfs1-1vf-el63" test with this Xen package. (Reusing > the guest kernel from comment 123, ie. from the most recent tests.) > > Please make sure that "xen_emul_unplug" is absent from the guest cmdline. > > Thanks! Done and logs uploaded. Now the VF didn't work in the el6.3 guest, as expected. I didn't use xen_emul_unplug. (In reply to comment #145) > Done and logs uploaded. Now the VF didn't work in the el6.3 guest, as > expected. > I didn't use xen_emul_unplug. Thanks, we're getting closer. pci_unplug_netifs: x=32: ethernet controller test_pci_slot: 1: slot=4 region type 0 at [f4000000,f4020000). squash iomem [f4024000, f4024030). region type 1 at [c200,c240). This log segment is generated when qemu-dm unplugs the emulated NIC, 00:04.0. pci_unplug_netifs: x=48: ethernet controller test_pci_slot: 1: slot=6 test_pci_slot: 2: php_slot=0 valid=1 This log segment is generated when qemu-dm (pci_unplug_netifs()) investigates and correctly skips (ie. does not unplug) the VF, 00:06.0. The iomem required to set up MSI-X for the VF (00:06.0) is squashed when qemu-dm (correctly) unplugs the emulated card (00:04.0). "Region type 0 at [f4000000, f4020000)" is from #define PNPMMIO_SIZE 0x20000 in [tools/ioemu/hw/e1000.c] -- see "vif = [ '..., model=e1000' ]" in comment 0. Note that these memory ranges don't overlap: region type 0 at [f4000000,f4020000) <--- e1000 squash iomem [f4024000, f4024030) <--- ixgbevf The bug is in unregister_iomem() [tools/ioemu/target-i386-dm/exec-dm.c]. See the following two references: http://xenbits.xen.org/gitweb/?p=qemu-xen-unstable.git;a=commitdiff;h=8cc8a365 http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1805 I'll build a xen package with that patch backported soon. Created attachment 618240 [details]
[1/2] Backport a single hunk from qemu-xen-unstable commit 13669683
commit 13669683830d4508b6c8ed87de088785fa95ed3c
Author: Ian Jackson <ian.jackson.com>
Date: Mon Mar 16 13:47:18 2009 +0000
Post-merge compilation fixes
Signed-off-by: Ian Jackson <ian.jackson.com>
as a dependency for the next patch.
---
tools/ioemu/target-i386-dm/exec-dm.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)
Created attachment 618241 [details] [2/2] qemu-dm: fix unregister_iomem() Backport of qemu-xen-unstable... commit 8cc8a3651c9c5bc2d0086d12f4b870fc525b9387 Author: Jan Beulich <JBeulich> Date: Tue Feb 7 18:42:56 2012 +0000 This function (introduced quite a long time ago in e7911109f4321e9ba0cc56a253b653600aa46bea - "disable qemu PCI devices in HVM domains") appears to be completely broken, causing the regression reported in http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1805 (due to the newly added caller of it in 56d7747a3cf811910c4cf865e1ebcb8b82502005 - "qemu: clean up MSI-X table handling"). It's unclear how the function can ever have fulfilled its purpose: the value returned by iomem_index() is *not* an index into mmio[]. Additionally, fix two problems: - unregister_iomem() must not clear mmio[].start, otherwise cpu_register_physical_memory() won't be able to re-use the previous slot, thus causing a leak - cpu_unregister_io_memory() must not check mmio[].size, otherwise it won't properly clean up entries (temporarily) squashed through unregister_iomem() Signed-off-by: Jan Beulich <jbeulich> Tested-by: Stefano Stabellini <stefano.stabellini.com> Tested-by: Yongjie Ren <yongjie.ren> --- tools/ioemu/target-i386-dm/exec-dm.c | 12 ++++++++---- 1 files changed, 8 insertions(+), 4 deletions(-) Pasi, as I wrote in my email, please - pick a guest kernel with the PCI_D0 patch in comment 87 / comment 92, - optionally with the v2 debug patch in comment 123, and - pick a xen userspace with the series in comment 147 - comment 148, - optionally with the v3 debug patch in comment 140. Then please repeat the three 6.3 test from comment 129: - please reboot the host again between tests, - do not specify the xen_emul_unplug cmdline param in the guest. Thanks! New tests with PCI_D0 patched and debug enabled el6.3 guest kernel (87+92+123) and with patched + debug-enabled (140+147+148) xen/qemu-dm rpms: +-------------------------+-------------+---------------------------+ | host (pciback in sync) | max_vfs=1 | max_vfs=2 | +-------------------------+-------------+-------------+-------------+ | # of passed-through VFs | 1 | 1 | 2 | +-------------------------+------+------+------+------+------+------+ | guest | 5.8 | 6.3 | 5.8 | 6.3 | 5.8 | 6.3 | +-------------------------+------+------+------+------+------+------+ | results of the test | pass | pass | pass | pass | pass | pass | +-------------------------+------+------+------+------+------+------+ Both rhel5.8 and rhel6.3 HVM guests work OK now ! (ok, almost, rhel5 still has the weird kernel crash during the first time the VM is started, but that's a separate issue, and I'll file a separate bug about that). So it looks like solving this bug needs: - rhel6 kernel patch for the PCI_D0 issue. - rhel5 xen qemu-dm patch for the nic unplug / iomem issue. Thanks a lot ! Created attachment 618312 [details] sr-iov vf passthru test results for comment 150 to el5.8 and el6.3 hvm guests Great job Laszlo, and thanks Pasi for the collaboration!!! I don't think we need to fix RHEL5 passthrough though. I cloned this bug to bug 861349 for the RHEL5 qemu-dm fix, and requested an exception for RHEL5.9. Thank you for the persistent testing! I've cloned this BZ to bug 861352 for the userspace patch. That was a great race condition, 861352 - 861349 = 3 :) Since Paolo was first, I'm closing my clone as a duplicate of his. 
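For reference, a toy reconstruction of the unregister_iomem() failure mode described in the [2/2] commit message above. The array layout and helpers are invented for illustration; the real code lives in tools/ioemu/target-i386-dm/exec-dm.c. The two points modeled are the ones named in the commit message: the buggy version indexed mmio[] with a value that is not actually a position in that table (squashing the wrong range), and a correct version must look the range up by start address and clear only its size so the slot stays reusable.

#include <stdio.h>

#define MAX_MMIO 8

struct mmio_range { unsigned long start, size; };
static struct mmio_range mmio[MAX_MMIO];   /* registered MMIO handlers */

/* Buggy flavour: 'index' comes from something that is NOT a position in
 * mmio[], so unplugging the emulated e1000 can squash the VF's MSI-X range
 * instead of the e1000's own range -- the "squash iomem [f4024000, ...)"
 * lines seen in the qemu-dm logs. */
static void unregister_iomem_buggy(unsigned int index)
{
        printf("squash iomem [%lx, %lx).\n", mmio[index].start,
               mmio[index].start + mmio[index].size);
        mmio[index].start = mmio[index].size = 0;
}

/* Fixed flavour, per the commit message: find the range by its start address
 * and clear only the size, so the slot can be re-used later. */
static void unregister_iomem_fixed(unsigned long start)
{
        unsigned int i;

        for (i = 0; i < MAX_MMIO; i++)
                if (mmio[i].size && mmio[i].start == start) {
                        mmio[i].size = 0;
                        return;
                }
}

int main(void)
{
        mmio[0] = (struct mmio_range){ 0xf4000000UL, 0x20000UL }; /* e1000    */
        mmio[1] = (struct mmio_range){ 0xf4024000UL, 0x30UL };    /* VF MSI-X */

        unregister_iomem_buggy(1);              /* wrong range squashed      */
        unregister_iomem_fixed(0xf4000000UL);   /* only the e1000 range goes */
        return 0;
}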
Thanks a lot to everyone involved, it took a while and a lot of testing, but luckily it's figured out now :) btw will there be another bugzilla id for the rhel6 kernel pci_d0 patch? (In reply to comment #156) > btw will there be another bugzilla id for the rhel6 kernel pci_d0 patch? I posted the pci_d0 patch with reference to this BZ (bug 849223), and the short qemu-dm series for the clone bug 861349. Originally we couldn't decide if this BZ should belong to RHEL-6, component kernel, or RHEL-5, component xen. Ultimately both had to be modified. By the time we figured it out, I had moved this bug to RHEL-6, component kernel (see comment 50 and comment 51, and click the History link and look for the Component/Version change), so the clone was made for RHEL-5, component xen. This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release. (In reply to comment #157) > (In reply to comment #156) > > > btw will there be another bugzilla id for the rhel6 kernel pci_d0 patch? > > I posted the pci_d0 patch with reference to this BZ (bug 849223), and the > short qemu-dm series for the clone bug 861349. > > Originally we couldn't decide if this BZ should belong to RHEL-6, component > kernel, or RHEL-5, component xen. Ultimately both had to be modified. By the > time we figured it out, I had moved this bug to RHEL-6, component kernel > (see comment 50 and comment 51, and click the History link and look for the > Component/Version change), so the clone was made for RHEL-5, component xen. > Yep, makes sense. Thanks again for the big amount of debugging / instrumenting / research work for this bug ! FYI: I added the RHEL5.8 HVM guest "kernel crash when running ifup" issue as a separate bug #862862: https://bugzilla.redhat.com/show_bug.cgi?id=862862 Patch(es) available on kernel-2.6.32-318.el6 This bug reproduced on the same machine, verify it with: Version: Host(RHEL5.9): - kernel version: 2.6.18-343.el5xen - Xen version: xen-3.0.3-142.el5 - machine/CPU: dell-per510/Intel Xeon Guest(RHEL6.4): - Kernel version: 2.6.32-335 Steps: 1. enable VFs in host 2. assign VFs to guest 3. 
ping each vf of guest from host

Results:

[in guest]
[root@dhcp-8-202 ~]# lspci | grep 82599
00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:05.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:06.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:07.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:08.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:09.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:0a.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:0b.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:0c.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:0d.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:0e.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:0f.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:10.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:11.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:12.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)

[in host]
ping each vf of guest successfully from host

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0496.html