Bug 735890 - xend revokes iomem permissions inside PCI iomem ranges to protect against MSI-X table access
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xen
Version: 5.7
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: rc
Assignee: Xen Maintainance List
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Duplicates: 796677
Depends On:
Blocks:
Reported: 2011-09-06 03:28 UTC by Yufang Zhang
Modified: 2018-11-26 19:16 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-09-07 07:10:24 UTC
Target Upstream Version:
Embargoed:


Attachments
xm dmesg output (8.77 KB, text/plain), 2011-09-06 03:37 UTC, Yufang Zhang
add some tracing (5.37 KB, patch), 2011-09-06 22:05 UTC, Laszlo Ersek

Description Yufang Zhang 2011-09-06 03:28:05 UTC
Description of problem:
When a Broadcom network card is assigned to a RHEL 5 PV guest via PCI pass-through, the bnx2 driver cannot be loaded and bound to the PCI device, so the network card cannot be used within the guest.

Version-Release number of selected component (if applicable):
guest: 2.6.18-274.el5xen
host: 2.6.18-274.el5xen

How reproducible:
Always

Steps to Reproduce:
1. In the host, detach the network card from domain0 and bind it to the pciback driver.
2. Start the PV guest with the network card assigned.
3. After the guest boots, the passed-through PCI device (network card) is visible, but the bnx2 driver cannot be loaded. In the guest:

# lspci
00:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
00:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)

# dmesg | grep -i bnx
bnx2: Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v2.0.21 (Dec 23, 2010)
bnx2 0000:00:00.0: Cannot map register space, aborting
bnx2: probe of 0000:00:00.0 failed with error -12
bnx2 0000:00:00.1: Cannot map register space, aborting
bnx2: probe of 0000:00:00.1 failed with error -12
bnx2 0000:00:00.0: Cannot map register space, aborting
bnx2: probe of 0000:00:00.0 failed with error -12
  
Actual results:
The bnx2 driver cannot be bound to the PCI device (Broadcom network card) assigned to the PV guest, so the network card cannot be used in the guest.

Expected results:
The network card can be used within the guest.

Additional info:
1. The igb driver works well in the same scenario.
2. Config file of the guest:
   name = "test-6"
   memory = "1024"
   vcpus = 1
   disk = [ 'phy:/dev/xenvg1/test-6,sda1,w']
   bootloader="/usr/bin/pygrub"
   on_reboot = 'restart'
   on_crash = 'restart'
   pci = ['0000:01:00.1','0000:01:00.0']
3. lspci -v output in the guest:

# lspci -v
00:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
	Subsystem: Dell PowerEdge R710 BCM5709 Gigabit Ethernet
	Flags: fast devsel, IRQ 23
	Memory at d4000000 (64-bit, non-prefetchable) [disabled] [size=32M]
	Capabilities: [48] Power Management version 3
	Capabilities: [50] Vital Product Data
	Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/4 Enable-
	Capabilities: [a0] MSI-X: Enable- Mask- TabSize=9
	Capabilities: [ac] Express Endpoint IRQ 0

00:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
	Subsystem: Dell PowerEdge R710 BCM5709 Gigabit Ethernet
	Flags: fast devsel, IRQ 24
	Memory at d6000000 (64-bit, non-prefetchable) [disabled] [size=32M]
	Capabilities: [48] Power Management version 3
	Capabilities: [50] Vital Product Data
	Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/4 Enable-
	Capabilities: [a0] MSI-X: Enable- Mask- TabSize=9
	Capabilities: [ac] Express Endpoint IRQ 0

Comment 1 Yufang Zhang 2011-09-06 03:37:07 UTC
Created attachment 521564
xm dmesg output

Comment 2 Yufang Zhang 2011-09-06 03:47:52 UTC
dmesg output of domain0 after binding the PCI device to the pciback driver:

pciback: vpci: 0000:01:00.1: assign to virtual slot 0
pciback: vpci: 0000:01:00.0: assign to virtual slot 0 func 0
blkback: ring-ref 9, event-channel 7, protocol 1 (x86_64-abi)
pciback 0000:01:00.0: Driver tried to write to a read-only configuration space field at offset 0x4c, size 2. This may be harmless, but if you have problems with your device:
1) see permissive attribute in sysfs
2) report problems to the xen-devel mailing list along with details of your device obtained from lspci.
PCI: Enabling device 0000:01:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:01:00.0[A] -> GSI 36 (level, low) -> IRQ 23
PCI: Setting latency timer of device 0000:01:00.0 to 64
ACPI: PCI interrupt for device 0000:01:00.0 disabled
pciback 0000:01:00.1: Driver tried to write to a read-only configuration space field at offset 0x4c, size 2. This may be harmless, but if you have problems with your device:
1) see permissive attribute in sysfs
2) report problems to the xen-devel mailing list along with details of your device obtained from lspci.
PCI: Enabling device 0000:01:00.1 (0000 -> 0002)
ACPI: PCI Interrupt 0000:01:00.1[B] -> GSI 48 (level, low) -> IRQ 24
PCI: Setting latency timer of device 0000:01:00.1 to 64
ACPI: PCI interrupt for device 0000:01:00.1 disabled
ACPI: PCI interrupt for device 0000:01:00.1 disabled
ACPI: PCI interrupt for device 0000:01:00.0 disabled
pciback: vpci: 0000:01:00.1: assign to virtual slot 0
pciback: vpci: 0000:01:00.0: assign to virtual slot 0 func 0
blkback: ring-ref 9, event-channel 7, protocol 1 (x86_64-abi)
PCI: Enabling device 0000:01:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:01:00.0[A] -> GSI 36 (level, low) -> IRQ 23
PCI: Setting latency timer of device 0000:01:00.0 to 64
ACPI: PCI interrupt for device 0000:01:00.0 disabled
PCI: Enabling device 0000:01:00.1 (0000 -> 0002)
ACPI: PCI Interrupt 0000:01:00.1[B] -> GSI 48 (level, low) -> IRQ 24
PCI: Setting latency timer of device 0000:01:00.1 to 64
ACPI: PCI interrupt for device 0000:01:00.1 disabled
PCI: Enabling device 0000:01:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:01:00.0[A] -> GSI 36 (level, low) -> IRQ 23
PCI: Setting latency timer of device 0000:01:00.0 to 64
ACPI: PCI interrupt for device 0000:01:00.0 disabled

Comment 3 Qixiang Wan 2011-09-06 06:59:32 UTC
QE can reproduce this issue with a RHEL 5.7 PV x86_64 guest. The same NIC can be passed through to a RHEL 5.7 x86_64 HVM guest and works within that guest.

Comment 4 Laszlo Ersek 2011-09-06 09:01:31 UTC
Hi Yufang,

is this a regression?

Comment 5 Qixiang Wan 2011-09-06 09:26:04 UTC
(In reply to comment #4)
> is this a regression?

Not a regression; it can be reproduced with RHEL 5.5 and 5.6 PV guests.

Comment 6 Laszlo Ersek 2011-09-06 09:29:13 UTC
bnx2_init_board() [drivers/net/bnx2.c]
-> ioremap_nocache() [arch/i386/mm/ioremap-xen.c]
[...]

Found a perhaps relevant report on xen-devel:

http://lists.xensource.com/archives/html/xen-devel/2008-12/msg00244.html

Can you try that?

Also, could you try writing the full PCI identifier to "/sys/bus/pci/drivers/pciback/permissive" before starting the guest?

permissive_add() [drivers/xen/pciback/pci_stub.c]
-> str_to_slot()
  -> sscanf(buf, " %x:%x:%x.%x", domain, bus, slot, func)
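
For reference, the identifier format that str_to_slot() expects can be mimicked in a few lines of Python. This is only an illustrative sketch: parse_slot() is a hypothetical helper, not part of pciback or xend, and the regular expression merely approximates the sscanf() format quoted above (four hexadecimal fields: domain, bus, slot, function).

```python
import re

# Approximates the sscanf() format " %x:%x:%x.%x" used by str_to_slot().
# parse_slot() is a hypothetical helper for illustration only.
_SLOT_RE = re.compile(
    r"\s*([0-9a-fA-F]+):([0-9a-fA-F]+):([0-9a-fA-F]+)\.([0-9a-fA-F]+)"
)


def parse_slot(text):
    """Return (domain, bus, slot, func) as integers, or None on mismatch."""
    match = _SLOT_RE.match(text)
    if match is None:
        return None
    return tuple(int(field, 16) for field in match.groups())


print(parse_slot("0000:01:00.1"))
```

A string such as "0000:01:00.1" is therefore what should be written to the "permissive" file, one function per write.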

I'll try to get a Beaker machine with two NICs, one of them being bnx2 and test.

Comment 11 Laszlo Ersek 2011-09-06 20:01:48 UTC
Thanks Qixiang, this worked; I managed to reproduce the problem. Guest said:

bnx2: Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v2.0.21 (Dec 23, 2010)
PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.0 to 64
bnx2 0000:00:00.0: Cannot map register space, aborting
bnx2: probe of 0000:00:00.0 failed with error -12
PCI: Enabling device 0000:00:00.1 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.1 to 64
bnx2 0000:00:00.1: Cannot map register space, aborting
bnx2: probe of 0000:00:00.1 failed with error -12

HV said (log possibly truncated):

(XEN) mm.c:630:d1 Non-privileged (1) attempt to map I/O space 00000000
(XEN) mm.c:630:d1 Non-privileged (1) attempt to map I/O space 00000000
(XEN) mm.c:630:d1 Non-privileged (1) attempt to map I/O space 000da00c
(XEN) mm.c:630:d1 Non-privileged (1) attempt to map I/O space 000dc00c

dom0 said (twice):

pciback 0000:01:00.0: Driver tried to write to a read-only configuration space field at offset 0x4c, size 2. This may be harmless, but if you have problems with your device:
1) see permissive attribute in sysfs
2) report problems to the xen-devel mailing list along with details of your device obtained from lspci.
PCI: Enabling device 0000:01:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:01:00.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:01:00.0 to 64
ACPI: PCI interrupt for device 0000:01:00.0 disabled

Comment 12 Laszlo Ersek 2011-09-06 20:19:55 UTC
After echoing all functions (ports) of the card to "permissive" as well:

  modprobe pciback
  for UNB in \
      0000:01:00.0 \
      0000:01:00.1
  do
    for F in \
        bnx2/unbind \
        pciback/new_slot \
        pciback/bind \
        pciback/permissive
    do
      echo "$UNB" > "/sys/bus/pci/drivers/$F"
    done
  done
  xm pci-list-assignable-devices

dom0 logged as follows:

  pciback 0000:01:00.0: enabling permissive mode configuration space accesses!
  pciback 0000:01:00.0: permissive mode is potentially unsafe!
  pciback 0000:01:00.1: enabling permissive mode configuration space accesses!
  pciback 0000:01:00.1: permissive mode is potentially unsafe!

and when the domU was started, the "Driver tried to write to a read-only configuration space" dom0 messages disappeared. However, the domU and the HV still logged the same errors as in comment 11.

Comment 13 Laszlo Ersek 2011-09-06 21:24:26 UTC
I tried to explore the call chain with gdbsx, continuing from comment 6, but the binary is so heavily optimized that I can't really check anything -- not even breakpoints seem to work. And unfortunately there's no "kernel-xen-debug" RPM. I'm adding tracing code.

Comment 14 Laszlo Ersek 2011-09-06 22:05:29 UTC
Created attachment 521772
add some tracing

Prints:

    remaptrace = A298
    bnx2 0000:00:00.0: Cannot map register space, aborting

111111
5432109876543210
1010001010011000 binary = 0xA298

The last HYPERVISOR_mmu_update() call fails in __direct_remap_pfn_range(). More tomorrow.
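
The bit layout of the trace value can be double-checked with a short Python sketch. It only lists which bit positions are set in 0xA298; what each bit means is defined by the tracing patch in the attachment and is not reproduced here. set_bits() is a hypothetical helper name.

```python
def set_bits(value):
    """Return the positions of the bits set in value, lowest first."""
    return [pos for pos in range(value.bit_length()) if value & (1 << pos)]


trace = 0xA298
print(format(trace, "016b"))  # 16-digit binary form, as in the ruler above
print(set_bits(trace))
```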

Comment 15 Laszlo Ersek 2011-09-06 23:15:47 UTC
I tried to browse the hv source and google a bit. Seems like the problem is:

get_page_from_l1e() [arch/x86/mm.c]
-> iomem_access_permitted()

Then get_page_from_l1e() prints "Non-privileged (1) attempt to map I/O space 00000000". The iomem access can be set up by the XEN_DOMCTL_iomem_permission domctl, and xend even supports that.

http://lists.xensource.com/archives/html/xen-users/2008-06/msg01275.html

setupDevice() [tools/python/xen/xend/server/pciif.py]

It seems that this method calls PCIQuirk(), then somewhat down the page there's a loop enabling dev.iomem ranges. dev.iomem is filled in by

PciDevice::get_info_from_sysfs() [tools/python/xen/util/pci.py]

which reads seven (7, PROC_PCI_NUM_RESOURCES) lines from "/sys/bus/pci/devices/0000:01:00.0/resource". Each line has three fields: start, end, and flags. If "flags" has PCI_BAR_IO set (0x01), then the range is added as an ioport range, otherwise it's added as an iomem range. (Except if the start field equals zero, then the line is skipped.) Chapter 12 of Linux Device Drivers, 3rd Edition, states that there are six (6) PCI I/O regions.

Now, on dell-pet110-01.lab.bos.redhat.com, this is what I can see after a clean reboot to kernel-xen-2.6.18-274:

- bnx2 (one of the ports)
  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)

  0x00000000da000000 0x00000000dbffffff 0x0000000000020204
  0x0000000000000000 0x0000000000000000 0x0000000000000000
  0x0000000000000000 0x0000000000000000 0x0000000000000000
  0x0000000000000000 0x0000000000000000 0x0000000000000000
  0x0000000000000000 0x0000000000000000 0x0000000000000000
  0x0000000000000000 0x0000000000000000 0x0000000000000000
  0x0000000000000000 0x0000000000000000 0x0000000000000000

A single iomem range. There's no ioport range. (No line has the least significant bit set in the flags field.)

- igb (one of the ports)
  Intel Corporation 82580 Gigabit Network Connection (rev 01)

  0x00000000df400000 0x00000000df47ffff 0x0000000000020200
  0x0000000000000000 0x0000000000000000 0x0000000000000000
  0x000000000000ec80 0x000000000000ec9f 0x0000000000020101
  0x00000000df3f0000 0x00000000df3f3fff 0x0000000000020200
  0x0000000000000000 0x0000000000000000 0x0000000000000000
  0x0000000000000000 0x0000000000000000 0x0000000000000000
  0x0000000000000000 0x0000000000000000 0x0000000000000000

There's an ioport range from ec80 to ec9f, and 2 iomem ranges.

- tg3 (single port)
  Broadcom Corporation NetXtreme BCM5722 Gigabit Ethernet PCI Express
  0x00000000df6f0000 0x00000000df6fffff 0x0000000000020204
  0x0000000000000000 0x0000000000000000 0x0000000000000000
  0x0000000000000000 0x0000000000000000 0x0000000000000000
  0x0000000000000000 0x0000000000000000 0x0000000000000000
  0x0000000000000000 0x0000000000000000 0x0000000000000000
  0x0000000000000000 0x0000000000000000 0x0000000000000000
  0x0000000000000000 0x0000000000000000 0x0000000000000000

Same situation as with bnx2.
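
The classification rule described above (skip lines whose start is zero, treat a set 0x01 flag bit as an ioport range, everything else as iomem) can be sketched in Python as follows. This is a simplified illustration written for this report, not the actual code of get_info_from_sysfs(); classify_resources() and the IGB sample constant are hypothetical names, and the sample data is copied from the igb listing above.

```python
PCI_BAR_IO = 0x01  # flag bit that marks an I/O port range, as described above


def classify_resources(text):
    """Split sysfs 'resource' lines into (ioports, iomem) lists of (start, end)."""
    ioports, iomem = [], []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        start, end, flags = (int(field, 16) for field in fields)
        if start == 0:
            continue  # unused resource slot
        if flags & PCI_BAR_IO:
            ioports.append((start, end))
        else:
            iomem.append((start, end))
    return ioports, iomem


# First four lines of the igb listing above (the rest are all zero).
IGB = """\
0x00000000df400000 0x00000000df47ffff 0x0000000000020200
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x000000000000ec80 0x000000000000ec9f 0x0000000000020101
0x00000000df3f0000 0x00000000df3f3fff 0x0000000000020200
"""

ioports, iomem = classify_resources(IGB)
print([(hex(s), hex(e)) for s, e in ioports])
print([(hex(s), hex(e)) for s, e in iomem])
```

Fed the bnx2 or tg3 listing instead, the same function yields one iomem range and no ioport range.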

The bnx2 card lists the same resources on bare metal. Looking at bnx2_init_board(), it grabs the "first line" via "pci_resource_start(pdev, 0)", ignores/recomputes the length field (it does not call "pci_resource_end(pdev, 0)"), and tries to remap the resultant region.

There must be a mismatch between the iomem ranges that xend sets up in dom0 for the guest, based on what the dom0 bnx2 driver exports under sysfs, and the ranges that the domU driver tries to map.

Comment 16 Laszlo Ersek 2011-09-07 00:03:04 UTC
(In reply to comment #15)

> PciDevice::get_info_from_sysfs() [tools/python/xen/util/pci.py]
> 
> which reads seven (7, PROC_PCI_NUM_RESOURCES) lines from
> "/sys/bus/pci/devices/0000:01:00.0/resource". Each line has three fields:
> start, end, and flags. If "flags" has PCI_BAR_IO set (0x01), then the range
> is added as an ioport range, otherwise it's added as an iomem range. (Except
> if the start field equals zero, then the line is skipped.)

I failed to document that whenever an ioport range or an iomem range is added, any overlapping MSI-X iomem ranges are removed. (The dev.msix_iomem list has a "negative", "to-be-removed" meaning.) This can be tracked in xend.log (two bnx2 ports):

    NO quirks found for PCI device [14e4:1639:14e4:1917]
    Permissive mode NOT enabled for PCI device [14e4:1639:14e4:1917]
    pci: enabling iomem 0xda000000/0x2000000 pfn 0xda000/0x2000
    pci-msix: remove permission for 0xda00c000/0x9000 0xda00c/0x9
    pci-msix: remove permission for 0xda00e000/0x1000 0xda00e/0x1
    pci: enabling irq 16

    NO quirks found for PCI device [14e4:1639:14e4:1917]
    Permissive mode NOT enabled for PCI device [14e4:1639:14e4:1917]
    pci: enabling iomem 0xdc000000/0x2000000 pfn 0xdc000/0x2000
    pci-msix: remove permission for 0xdc00c000/0x9000 0xdc00c/0x9
    pci-msix: remove permission for 0xdc00e000/0x1000 0xdc00e/0x1
    pci: enabling irq 19

The removed msi-x ranges seem to overlap with those that the hypervisor denies (see comment 11):

    (XEN) mm.c:630:d1 Non-privileged (1) attempt to map I/O space 000da00c

    (XEN) mm.c:630:d1 Non-privileged (1) attempt to map I/O space 000dc00c

effectively punching holes in the iomem ranges that the bnx2 driver tries to remap.
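
The hole punching can be illustrated with the numbers from xend.log for the first port. The Python sketch below is a simplified model written for this comment: it treats the granted and revoked permissions as plain lists of page frame ranges and assumes 4 KiB pages (consistent with the address/pfn pairs logged above). It is not how xend or the hypervisor stores permissions, and to_pfn_range() and permitted() are hypothetical names.

```python
PAGE_SHIFT = 12  # 4 KiB pages, matching the addr -> pfn pairs in xend.log


def to_pfn_range(addr, size):
    """Convert a byte range to a half-open page frame range [first, last)."""
    first = addr >> PAGE_SHIFT
    last = (addr + size + (1 << PAGE_SHIFT) - 1) >> PAGE_SHIFT
    return first, last


def permitted(pfn, granted, revoked):
    """True if pfn lies in a granted range and in no revoked range."""

    def inside(ranges):
        return any(lo <= pfn < hi for lo, hi in ranges)

    return inside(granted) and not inside(revoked)


# "pci: enabling iomem 0xda000000/0x2000000"
granted = [to_pfn_range(0xDA000000, 0x2000000)]
# "pci-msix: remove permission for ..." (two entries)
revoked = [to_pfn_range(0xDA00C000, 0x9000), to_pfn_range(0xDA00E000, 0x1000)]

for pfn in (0xDA000, 0xDA00C, 0xDA015):
    print(hex(pfn), permitted(pfn, granted, revoked))
```

In this model, frame 0xda00c (the one named in the hypervisor message) falls inside the revoked window, while frames just outside the window, such as 0xda000 and 0xda015, remain accessible.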

This functionality was added to RHEL-5 xend in commit f39cc73, which seems to be a big union of patches for many BZs. It's not easy to see which BZ required the backporting of remove_msix_iomem() from upstream c/s 17536:a0ebceaf41ff, but at any rate, this c/s was later reverted in 20171:55ef198e63c7:

http://xenbits.xensource.com/hg/xen-unstable.hg/rev/55ef198e63c7

"xend: Revert c/s 17536 which breaks PV passthru of MSI-X devices."

This patch reverts populating the "dev.msix_iomem" negative-meaning array during sysfs parsing, and also reverts the permission revocation for iomem ranges stored in that array.

Here's a thread touching on these changesets. I was unable to find anything more relevant.

http://lists.xensource.com/archives/html/xen-devel/2010-07/msg00281.html

Changing component to xen. Options:
- WONTFIX. The problem is a can of worms, and PV passthrough is vulnerable anyway.
- Try to backport the patch and see if it works. If it does, based on the linked xen-devel message above, it would make PV passthrough even more vulnerable: the guest would get access to the MSI-X tables via the iomem ranges incorporating them, and then Bad Things (TM) could happen.

I think it's not worth it.

Comment 17 Andrew Jones 2011-09-07 07:10:24 UTC
Nice 2am debug work Laszlo! I agree with your options and your "not worth it" statement. IMHO, if we have any customers using insecure PV passthrough, then we should be pushing them towards a secure solution. Maintaining this bad idea (indeed making it even more vulnerable - a worse idea) isn't the right path for software at large.

I'm just going to squash this bug now. Thanks again for the careful analysis and documentation.

Comment 18 Laszlo Ersek 2012-02-27 18:58:49 UTC
*** Bug 796677 has been marked as a duplicate of this bug. ***

