Description of problem:
When assigning a Broadcom network card to a RHEL5 PV guest via PCI pass-through, the bnx2 driver cannot be loaded and bound to the PCI device, so the network card cannot be used within the guest.

Version-Release number of selected component (if applicable):
guest: 2.6.18-274.el5xen
host: 2.6.18-274.el5xen

How reproducible:
Always

Steps to Reproduce:
1. In the host, detach the network card from domain0 and bind it to the pciback driver.
2. Start the PV guest with the network card assigned.
3. After the guest boots, the passed-through PCI device (network card) is visible, but the bnx2 driver cannot be loaded.

In the guest:
# lspci
00:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
00:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
# dmesg | grep -i bnx
bnx2: Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v2.0.21 (Dec 23, 2010)
bnx2 0000:00:00.0: Cannot map register space, aborting
bnx2: probe of 0000:00:00.0 failed with error -12
bnx2 0000:00:00.1: Cannot map register space, aborting
bnx2: probe of 0000:00:00.1 failed with error -12
bnx2 0000:00:00.0: Cannot map register space, aborting
bnx2: probe of 0000:00:00.0 failed with error -12

Actual results:
The bnx2 driver cannot be bound to the PCI device (Broadcom network card) assigned to the PV guest, so the network card cannot be used in the guest.

Expected results:
The network card can be used within the guest.

Additional info:
1. The igb driver works well in the same scenario.
2. Config file of the guest:
name = "test-6"
memory = "1024"
vcpus = 1
disk = [ 'phy:/dev/xenvg1/test-6,sda1,w']
bootloader="/usr/bin/pygrub"
on_reboot = 'restart'
on_crash = 'restart'
pci = ['0000:01:00.1','0000:01:00.0']
3.
lspci -v output in the guest:
# lspci -v
00:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
        Subsystem: Dell PowerEdge R710 BCM5709 Gigabit Ethernet
        Flags: fast devsel, IRQ 23
        Memory at d4000000 (64-bit, non-prefetchable) [disabled] [size=32M]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/4 Enable-
        Capabilities: [a0] MSI-X: Enable- Mask- TabSize=9
        Capabilities: [ac] Express Endpoint IRQ 0

00:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
        Subsystem: Dell PowerEdge R710 BCM5709 Gigabit Ethernet
        Flags: fast devsel, IRQ 24
        Memory at d6000000 (64-bit, non-prefetchable) [disabled] [size=32M]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/4 Enable-
        Capabilities: [a0] MSI-X: Enable- Mask- TabSize=9
        Capabilities: [ac] Express Endpoint IRQ 0
Created attachment 521564 [details] xm dmesg output
dmesg output of domain0 after binding the PCI device to the pciback driver:

pciback: vpci: 0000:01:00.1: assign to virtual slot 0
pciback: vpci: 0000:01:00.0: assign to virtual slot 0 func 0
blkback: ring-ref 9, event-channel 7, protocol 1 (x86_64-abi)
pciback 0000:01:00.0: Driver tried to write to a read-only configuration space field at offset 0x4c, size 2. This may be harmless, but if you have problems with your device:
1) see permissive attribute in sysfs
2) report problems to the xen-devel mailing list along with details of your device obtained from lspci.
PCI: Enabling device 0000:01:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:01:00.0[A] -> GSI 36 (level, low) -> IRQ 23
PCI: Setting latency timer of device 0000:01:00.0 to 64
ACPI: PCI interrupt for device 0000:01:00.0 disabled
pciback 0000:01:00.1: Driver tried to write to a read-only configuration space field at offset 0x4c, size 2. This may be harmless, but if you have problems with your device:
1) see permissive attribute in sysfs
2) report problems to the xen-devel mailing list along with details of your device obtained from lspci.
PCI: Enabling device 0000:01:00.1 (0000 -> 0002)
ACPI: PCI Interrupt 0000:01:00.1[B] -> GSI 48 (level, low) -> IRQ 24
PCI: Setting latency timer of device 0000:01:00.1 to 64
ACPI: PCI interrupt for device 0000:01:00.1 disabled
ACPI: PCI interrupt for device 0000:01:00.1 disabled
ACPI: PCI interrupt for device 0000:01:00.0 disabled
pciback: vpci: 0000:01:00.1: assign to virtual slot 0
pciback: vpci: 0000:01:00.0: assign to virtual slot 0 func 0
blkback: ring-ref 9, event-channel 7, protocol 1 (x86_64-abi)
PCI: Enabling device 0000:01:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:01:00.0[A] -> GSI 36 (level, low) -> IRQ 23
PCI: Setting latency timer of device 0000:01:00.0 to 64
ACPI: PCI interrupt for device 0000:01:00.0 disabled
PCI: Enabling device 0000:01:00.1 (0000 -> 0002)
ACPI: PCI Interrupt 0000:01:00.1[B] -> GSI 48 (level, low) -> IRQ 24
PCI: Setting latency timer of device 0000:01:00.1 to 64
ACPI: PCI interrupt for device 0000:01:00.1 disabled
PCI: Enabling device 0000:01:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:01:00.0[A] -> GSI 36 (level, low) -> IRQ 23
PCI: Setting latency timer of device 0000:01:00.0 to 64
ACPI: PCI interrupt for device 0000:01:00.0 disabled
QE can reproduce this issue with a RHEL5.7 PV x86_64 guest. The same NIC can be passed through to a RHEL5.7 x86_64 HVM guest and works within that guest.
Hi Yufang, is this a regression?
(In reply to comment #4)
> is this a regression?

Not a regression; it can be reproduced with RHEL5.5 and 5.6 PV guests.
bnx2_init_board() [drivers/net/bnx2.c]
  -> ioremap_nocache() [arch/i386/mm/ioremap-xen.c]
[...]

Found a perhaps relevant report on xen-devel:
http://lists.xensource.com/archives/html/xen-devel/2008-12/msg00244.html
Can you try that?

Also, could you try writing the full PCI identifier to
"/sys/bus/pci/drivers/pciback/permissive" before starting the guest?

permissive_add() [drivers/xen/pciback/pci_stub.c]
  -> str_to_slot()
     -> sscanf(buf, " %x:%x:%x.%x", domain, bus, slot, func)

I'll try to get a Beaker machine with two NICs, one of them being bnx2, and test.
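For reference, the sscanf format above means the "permissive" attribute expects a full domain:bus:slot.function identifier. A minimal Python sketch of that parsing (mine, not taken from the pciback source, just illustrating the accepted syntax):

```python
import re

# Mirrors str_to_slot()'s sscanf(" %x:%x:%x.%x", ...): four hex fields,
# "domain:bus:slot.function", e.g. "0000:01:00.1".
_SLOT_RE = re.compile(r"\s*([0-9a-fA-F]+):([0-9a-fA-F]+):([0-9a-fA-F]+)\.([0-9a-fA-F]+)")

def str_to_slot(buf):
    m = _SLOT_RE.match(buf)
    if not m:
        raise ValueError("expected dddd:bb:ss.f, got %r" % buf)
    return tuple(int(g, 16) for g in m.groups())

# echo 0000:01:00.1 > /sys/bus/pci/drivers/pciback/permissive
# would be parsed as:
print(str_to_slot("0000:01:00.1"))  # (0, 1, 0, 1)
```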
Thanks Qixiang, this worked; I managed to reproduce the problem.

Guest said:
bnx2: Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v2.0.21 (Dec 23, 2010)
PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.0 to 64
bnx2 0000:00:00.0: Cannot map register space, aborting
bnx2: probe of 0000:00:00.0 failed with error -12
PCI: Enabling device 0000:00:00.1 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.1 to 64
bnx2 0000:00:00.1: Cannot map register space, aborting
bnx2: probe of 0000:00:00.1 failed with error -12

HV said (log possibly truncated):
(XEN) mm.c:630:d1 Non-privileged (1) attempt to map I/O space 00000000
(XEN) mm.c:630:d1 Non-privileged (1) attempt to map I/O space 00000000
(XEN) mm.c:630:d1 Non-privileged (1) attempt to map I/O space 000da00c
(XEN) mm.c:630:d1 Non-privileged (1) attempt to map I/O space 000dc00c

dom0 said (twice):
pciback 0000:01:00.0: Driver tried to write to a read-only configuration space field at offset 0x4c, size 2. This may be harmless, but if you have problems with your device:
1) see permissive attribute in sysfs
2) report problems to the xen-devel mailing list along with details of your device obtained from lspci.
PCI: Enabling device 0000:01:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:01:00.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:01:00.0 to 64
ACPI: PCI interrupt for device 0000:01:00.0 disabled
After echoing all functions (ports) of the card to "permissive" as well:

modprobe pciback
for UNB in \
    0000:01:00.0 \
    0000:01:00.1
do
    for F in \
        bnx2/unbind \
        pciback/new_slot \
        pciback/bind \
        pciback/permissive
    do
        echo $UNB > /sys/bus/pci/drivers/$F
    done
done
xm pci-list-assignable-devices

dom0 logged as follows:
pciback 0000:01:00.0: enabling permissive mode configuration space accesses!
pciback 0000:01:00.0: permissive mode is potentially unsafe!
pciback 0000:01:00.1: enabling permissive mode configuration space accesses!
pciback 0000:01:00.1: permissive mode is potentially unsafe!

and when the domU was started, the "Driver tried to write to a read-only configuration space" dom0 messages disappeared. However, the domU and the HV still logged the same errors as in comment 11.
I tried to explore the call chain with gdbsx, continuing from comment 6, but the binary is so heavily optimized that I can't really check anything -- not even breakpoints seem to work. And unfortunately there's no "kernel-xen-debug" RPM. I'm adding tracing code.
Created attachment 521772 [details]
add some tracing

Prints:

remaptrace = A298
bnx2 0000:00:00.0: Cannot map register space, aborting

111111
5432109876543210
1010001010011000

binary = 0xA298

The last HYPERVISOR_mmu_update() call fails in __direct_remap_pfn_range(). More tomorrow.
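The three-line ruler above (tens digits of the bit positions, then units digits, then the value in binary) can be regenerated with a short helper (a hypothetical convenience, not part of the tracing patch):

```python
def bit_ruler(value, width=16):
    """Render a bit-position ruler plus the value's binary form,
    in the same layout the tracing patch prints."""
    tens = "".join(str(b // 10) if b >= 10 else " " for b in range(width - 1, -1, -1))
    ones = "".join(str(b % 10) for b in range(width - 1, -1, -1))
    bits = format(value, "0%db" % width)
    return "\n".join((tens, ones, bits))

# For remaptrace = 0xA298, the bottom row shows which remap
# iterations set their trace bit:
print(bit_ruler(0xA298))
```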
I tried to browse the hv source and google a bit. The problem seems to be:

get_page_from_l1e() [arch/x86/mm.c]
  -> iomem_access_permitted()

Then get_page_from_l1e() prints "Non-privileged (1) attempt to map I/O space 00000000".

The iomem access can be set up via the XEN_DOMCTL_iomem_permission domctl, and xend even supports that.
http://lists.xensource.com/archives/html/xen-users/2008-06/msg01275.html

setupDevice() [tools/python/xen/xend/server/pciif.py]

It seems that this method calls PCIQuirk(); then, somewhat further down, there's a loop enabling dev.iomem ranges. dev.iomem is filled in by

PciDevice::get_info_from_sysfs() [tools/python/xen/util/pci.py]

which reads seven (7, PROC_PCI_NUM_RESOURCES) lines from "/sys/bus/pci/devices/0000:01:00.0/resource". Each line has three fields: start, end, and flags. If "flags" has PCI_BAR_IO set (0x01), then the range is added as an ioport range; otherwise it's added as an iomem range. (Except if the start field equals zero, then the line is skipped.) Chapter 12 of the Linux Device Drivers book (3rd edition) states that there are six (6) PCI I/O regions.

Now, on dell-pet110-01.lab.bos.redhat.com, this is what I can see after a clean reboot to kernel-xen-2.6.18-274:

- bnx2 (one of the ports)
Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)

0x00000000da000000 0x00000000dbffffff 0x0000000000020204
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000

A single iomem range. There's no ioport range. (No line has the least significant bit set in the flags field.)
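A minimal sketch of that classification logic (mine, not copied from xend's pci.py) applied to the bnx2 resource listing above:

```python
PCI_BAR_IO = 0x01

def classify_resources(lines):
    """Split /sys/bus/pci/devices/*/resource lines into ioport and
    iomem (start, end) ranges, skipping lines whose start is zero --
    the behavior described for PciDevice::get_info_from_sysfs()."""
    ioports, iomem = [], []
    for line in lines:
        start, end, flags = (int(field, 16) for field in line.split())
        if start == 0:
            continue
        (ioports if flags & PCI_BAR_IO else iomem).append((start, end))
    return ioports, iomem

# The bnx2 port's resource file from the listing above: one populated
# line, six all-zero lines.
bnx2 = ["0x00000000da000000 0x00000000dbffffff 0x0000000000020204"] + \
       ["0x0000000000000000 0x0000000000000000 0x0000000000000000"] * 6
ioports, iomem = classify_resources(bnx2)
print(ioports, [(hex(s), hex(e)) for s, e in iomem])
# [] [('0xda000000', '0xdbffffff')]
```

This reproduces the observation: a single 32M iomem range and no ioport range.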
- igb (one of the ports)
Intel Corporation 82580 Gigabit Network Connection (rev 01)

0x00000000df400000 0x00000000df47ffff 0x0000000000020200
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x000000000000ec80 0x000000000000ec9f 0x0000000000020101
0x00000000df3f0000 0x00000000df3f3fff 0x0000000000020200
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000

There's an ioport range from ec80 to ec9f, and 2 iomem ranges.

- tg3 (single port)
Broadcom Corporation NetXtreme BCM5722 Gigabit Ethernet PCI Express

0x00000000df6f0000 0x00000000df6fffff 0x0000000000020204
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000

Same situation as with bnx2. The bnx2 card lists the same resources under bare metal.

Looking at bnx2_init_board(), it grabs the "first line" via "pci_resource_start(pdev, 0)", ignores/recomputes the length field (it does not call "pci_resource_end(pdev, 0)"), and tries to remap the resultant region.

There must be a misunderstanding between xend, which sets up the iomem ranges in dom0 for the guest based on what the dom0 bnx2 driver exports under sysfs, and the domU driver, which tries to map those ranges.
(In reply to comment #15)
> PciDevice::get_info_from_sysfs() [tools/python/xen/util/pci.py]
>
> which reads seven (7, PROC_PCI_NUM_RESOURCES) lines from
> "/sys/bus/pci/devices/0000:01:00.0/resource". Each line has three fields:
> start, end, and flags. If "flags" has PCI_BAR_IO set (0x01), then the range
> is added as an ioport range, otherwise it's added as an iomem range. (Except
> if the start field equals zero, then the line is skipped.)

I neglected to mention that whenever an ioport range or an iomem range is added, any overlapping MSI-X iomem ranges are removed. (The dev.msix_iomem list has a "negative", "to-be-removed" meaning.) This can be tracked in xend.log (two bnx2 ports):

NO quirks found for PCI device [14e4:1639:14e4:1917]
Permissive mode NOT enabled for PCI device [14e4:1639:14e4:1917]
pci: enabling iomem 0xda000000/0x2000000 pfn 0xda000/0x2000
pci-msix: remove permission for 0xda00c000/0x9000 0xda00c/0x9
pci-msix: remove permission for 0xda00e000/0x1000 0xda00e/0x1
pci: enabling irq 16
NO quirks found for PCI device [14e4:1639:14e4:1917]
Permissive mode NOT enabled for PCI device [14e4:1639:14e4:1917]
pci: enabling iomem 0xdc000000/0x2000000 pfn 0xdc000/0x2000
pci-msix: remove permission for 0xdc00c000/0x9000 0xdc00c/0x9
pci-msix: remove permission for 0xdc00e000/0x1000 0xdc00e/0x1
pci: enabling irq 19

The removed MSI-X ranges seem to overlap with those that the hypervisor denies (see comment 11):

(XEN) mm.c:630:d1 Non-privileged (1) attempt to map I/O space 000da00c
(XEN) mm.c:630:d1 Non-privileged (1) attempt to map I/O space 000dc00c

effectively punching holes in the iomem ranges that the bnx2 driver tries to remap.

This functionality was added to RHEL-5 xend in commit f39cc73, which seems to be a big union of patches for many BZs.
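The hole-punching can be checked with a little arithmetic. This sketch (my own, using the pfn ranges from the xend.log excerpt above) models the permission outcome per pfn: mappable iff inside a granted iomem range and outside every revoked MSI-X range.

```python
def permitted(pfn, grants, revocations):
    """Model iomem permission for one pfn: it must fall inside a
    granted range and inside no revoked (MSI-X) range.
    Ranges are (start_pfn, count) pairs, as logged by xend."""
    inside = lambda ranges: any(s <= pfn < s + n for s, n in ranges)
    return inside(grants) and not inside(revocations)

grants = [(0xda000, 0x2000)]                     # pci: enabling iomem ... pfn 0xda000/0x2000
revocations = [(0xda00c, 0x9), (0xda00e, 0x1)]   # pci-msix: remove permission ...

# The start of the BAR maps fine, but the MSI-X table pfn is a hole --
# exactly the pfn the HV rejects ("attempt to map I/O space 000da00c"):
print(permitted(0xda000, grants, revocations))  # True
print(permitted(0xda00c, grants, revocations))  # False
```

Since bnx2 remaps the whole BAR in one go, a single denied pfn inside it is enough for ioremap_nocache() to fail, matching the -12 (ENOMEM) probe errors in the guest.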
It's not easy to see which BZ required the backporting of remove_msix_iomem() from upstream c/s 17536:a0ebceaf41ff, but at any rate, this c/s was later reverted in 20171:55ef198e63c7:

http://xenbits.xensource.com/hg/xen-unstable.hg/rev/55ef198e63c7
"xend: Revert c/s 17536 which breaks PV passthru of MSI-X devices."

This patch reverts populating the "dev.msix_iomem" negative-meaning array during sysfs parsing, and also reverts the permission revocation for iomem ranges stored in that array.

Here's a thread touching on these changesets; I was unable to find anything more relevant:
http://lists.xensource.com/archives/html/xen-devel/2010-07/msg00281.html

Changing component to xen. Options:

- WONTFIX. The problem is a can of worms, and PV passthrough is vulnerable anyway.
- Try to backport the patch and see if it works. If it does, then based on the linked xen-devel message above, it would make PV passthrough even more vulnerable: the guest would get access to the MSI-X tables via the iomem ranges incorporating them, and then Bad Things (TM) could happen.

I think it's not worth it.
Nice 2am debug work Laszlo! I agree with your options and your "not worth it" statement. IMHO, if we have any customers using insecure PV passthrough, then we should be pushing them towards a secure solution. Maintaining this bad idea (indeed making it even more vulnerable - a worse idea) isn't the right path for software at large. I'm just going to squash this bug now. Thanks again for the careful analysis and documentation.
*** Bug 796677 has been marked as a duplicate of this bug. ***