+++ This bug was initially created as a clone of Bug #1344299 +++

Even if the firmware skips assigning IO ranges to PCIe ports (root ports/downstream ports), Linux guests will still try to assign IO to them. We want to add a parameter "disable-io" to PCIe ports to disable IO support. It will work by making the IO base/limit registers read-only, so both the firmware and guest OSes will comply.

--- Additional comment from RHEL Product and Program Management on 2016-06-09 08:04:38 EDT ---

Since this bug report was entered in bugzilla, the release flag has been set to ? to ensure that it is properly evaluated for this release.

--- Additional comment from Marcel Apfelbaum on 2016-06-23 06:07:07 EDT ---

We intend to provide a 'thin' version of Q35 for 7.3 to be used mainly with virtio devices, which are PCIe, so the IO limitation will not be an issue.

Add libvirt support for the device command line parameters.
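For context: the tunable that eventually shipped in QEMU (2.12 and later) is the generic "io-reserve" property on pcie-root-port rather than a literal "disable-io" flag; setting it to 0 asks the firmware not to reserve any I/O window for that port. On the QEMU command line this looks roughly like the following (a sketch; the IDs, chassis and port numbers are placeholders):

  qemu-system-x86_64 -machine q35 ... \
      -device pcie-root-port,id=pci.1,bus=pcie.0,chassis=1,port=0x10,io-reserve=0 \
      -netdev user,id=net0 \
      -device virtio-net-pci,netdev=net0,bus=pci.1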
*** Bug 1504111 has been marked as a duplicate of this bug. ***
Based on my understanding of this bug, I reproduced the following scenarios:

SC1: Boot a guest with 15 e1000 devices, each plugged into its own pcie-to-pci-bridge controller. The PCIe topology: 15 e1000 devices --> 15 pcie-to-pci-bridge --> 15 pcie-root-port
SC2: Test with 10 e1000 devices instead --- the guest starts successfully in this scenario.

Reproduced version:
qemu-kvm-rhev-2.12.0-19.el7_6.2.x86_64
libvirt-4.5.0-10.el7_6.3.x86_64
kernel-3.10.0-986.el7.x86_64

Reproduced steps:

1. Prepare a guest xml with 15 e1000 devices plugged into 15 pcie-to-pci-bridge controllers.
(refer to the lmn.xml in the attachment)

2. Define and start the guest.
# virsh define lmn.xml
Domain lmn defined from lmn.xml
# virsh start lmn
Domain lmn started
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 16    lmn                            running

3. Check the boot process via virt-manager.
The screen doesn't display properly; the guest may panic and halt.
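For reference, a single branch of the topology described above would look roughly like this in the domain XML (an illustrative sketch, not the actual lmn.xml attachment; indexes and PCI addresses are placeholders):

  <controller type='pci' index='1' model='pcie-root-port'/>
  <controller type='pci' index='2' model='pcie-to-pci-bridge'>
    <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  </controller>
  <interface type='network'>
    <source network='default'/>
    <model type='e1000'/>
    <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/>
  </interface>

The pattern is repeated 15 times with distinct indexes and bus numbers.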
(In reply to Meina Li from comment #7)
> Based on my understanding of this bug, I reproduced the following
> scenarios:
>
> SC1: Boot a guest with 15 e1000 devices, each plugged into its own
> pcie-to-pci-bridge controller. The PCIe topology: 15 e1000 devices -->
> 15 pcie-to-pci-bridge --> 15 pcie-root-port
> SC2: Test with 10 e1000 devices instead --- the guest starts successfully
> in this scenario.

Correction for SC2: it was tested with 9 e1000 devices, not 10.

> Reproduced version:
> qemu-kvm-rhev-2.12.0-19.el7_6.2.x86_64
> libvirt-4.5.0-10.el7_6.3.x86_64
> kernel-3.10.0-986.el7.x86_64
>
> Reproduced steps:
>
> 1. Prepare a guest xml with 15 e1000 devices plugged into 15
> pcie-to-pci-bridge controllers.
> (refer to the lmn.xml in the attachment)
>
> 2. Define and start the guest.
> # virsh define lmn.xml
> Domain lmn defined from lmn.xml
> # virsh start lmn
> Domain lmn started
> # virsh list --all
>  Id    Name                           State
> ----------------------------------------------------
>  16    lmn                            running
>
> 3. Check the boot process via virt-manager.
> The screen doesn't display properly; the guest may panic and halt.
Created attachment 1519673 [details] lmn.xml for testing 15 e1000 devices
Some relevant discussion is happening on qemu-devel:

https://lists.nongnu.org/archive/html/qemu-devel/2019-06/msg01093.html
I spent some time trying to figure out whether this is something that makes sense to expose in libvirt after all.

As I understand it, the situation that led to the QEMU option being implemented is as follows:

* PCI I/O port space is a fairly limited resource on q35 (on the order of 64 KiB, if I'm not mistaken);

* some devices, often those that are assigned from the host, require I/O space to work;

* other devices, notably virtio 1.0 ones, don't;

* since QEMU / the firmware / the kernel can't know in advance what kind of device will end up being hotplugged into a PCI slot, the only reasonable course of action is to set aside some I/O space "just in case";

* however, with 4 KiB reserved for each pcie-root-port and only a single PCI function available on it, I/O space runs out relatively quickly, leading to various issues.

The plan was apparently to make *not* reserving I/O space the default, but somehow that hasn't happened so far, and each pcie-root-port still reserves 4 KiB of I/O space unless told otherwise.

To see this in action, create a q35 VM with 15 of these

  <interface type='network'>
    <source network='default'/>
    <model type='virtio'/>
  </interface>

in addition to the usual virtio devices: libvirt will automatically allocate a pcie-root-port for each of them, and the result will look like

  $ lspci -vt
  -[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
             +-01.0-[01]----00.0  Red Hat, Inc. Virtio network device
             +-01.1-[02]----00.0  Red Hat, Inc. Virtio console
             +-01.2-[03]----00.0  Red Hat, Inc. Virtio block device
             +-01.3-[04]----00.0  Red Hat, Inc. Virtio memory balloon
             +-01.4-[05]----00.0  Red Hat, Inc. Virtio RNG
             +-01.5-[06]----00.0  Red Hat, Inc. Virtio network device
             +-01.6-[07]----00.0  Red Hat, Inc. Virtio network device
             +-01.7-[08]----00.0  Red Hat, Inc. Virtio network device
             +-02.0-[09]----00.0  Red Hat, Inc. Virtio network device
             +-02.1-[0a]----00.0  Red Hat, Inc. Virtio network device
             +-02.2-[0b]----00.0  Red Hat, Inc. Virtio network device
             +-02.3-[0c]----00.0  Red Hat, Inc. Virtio network device
             +-02.4-[0d]----00.0  Red Hat, Inc. Virtio network device
             +-02.5-[0e]----00.0  Red Hat, Inc. Virtio network device
             +-02.6-[0f]----00.0  Red Hat, Inc. Virtio network device
             +-02.7-[10]----00.0  Red Hat, Inc. Virtio network device
             +-03.0-[11]----00.0  Red Hat, Inc. Virtio network device
             +-03.1-[12]----00.0  Red Hat, Inc. Virtio network device
             +-03.2-[13]----00.0  Red Hat, Inc. Virtio network device
             +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller
             +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode]
             \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller

If we look at the kernel messages, we can see that I/O space allocation wasn't successful for all devices:

  $ sudo dmesg | grep BAR
  [ 0.681478] pci 0000:00:01.0: BAR 13: assigned [io 0x1000-0x1fff]
  [ 0.681479] pci 0000:00:01.1: BAR 13: assigned [io 0x2000-0x2fff]
  [ 0.681480] pci 0000:00:01.2: BAR 13: assigned [io 0x3000-0x3fff]
  [ 0.681481] pci 0000:00:01.3: BAR 13: assigned [io 0x4000-0x4fff]
  [ 0.681482] pci 0000:00:01.4: BAR 13: assigned [io 0x5000-0x5fff]
  [ 0.681482] pci 0000:00:01.5: BAR 13: assigned [io 0x6000-0x6fff]
  [ 0.681483] pci 0000:00:01.6: BAR 13: assigned [io 0x7000-0x7fff]
  [ 0.681484] pci 0000:00:01.7: BAR 13: assigned [io 0x8000-0x8fff]
  [ 0.681485] pci 0000:00:02.0: BAR 13: assigned [io 0x9000-0x9fff]
  [ 0.681486] pci 0000:00:02.1: BAR 13: assigned [io 0xa000-0xafff]
  [ 0.681486] pci 0000:00:02.2: BAR 13: assigned [io 0xb000-0xbfff]
  [ 0.681487] pci 0000:00:02.3: BAR 13: assigned [io 0xd000-0xdfff]
  [ 0.681488] pci 0000:00:02.4: BAR 13: assigned [io 0xe000-0xefff]
  [ 0.681489] pci 0000:00:02.5: BAR 13: assigned [io 0xf000-0xffff]
  [ 0.681490] pci 0000:00:02.6: BAR 13: no space for [io size 0x1000]
  [ 0.681491] pci 0000:00:02.6: BAR 13: failed to assign [io size 0x1000]
  [ 0.681492] pci 0000:00:02.7: BAR 13: no space for [io size 0x1000]
  [ 0.681492] pci 0000:00:02.7: BAR 13: failed to assign [io size 0x1000]
  [ 0.681493] pci 0000:00:03.0: BAR 13: no space for [io size 0x1000]
  [ 0.681494] pci 0000:00:03.0: BAR 13: failed to assign [io size 0x1000]
  [ 0.681495] pci 0000:00:03.1: BAR 13: no space for [io size 0x1000]
  [ 0.681495] pci 0000:00:03.1: BAR 13: failed to assign [io size 0x1000]
  [ 0.681496] pci 0000:00:03.2: BAR 13: no space for [io size 0x1000]
  [ 0.681497] pci 0000:00:03.2: BAR 13: failed to assign [io size 0x1000]
  [ 0.681498] pci 0000:00:03.2: BAR 13: assigned [io 0x1000-0x1fff]
  [ 0.681499] pci 0000:00:03.1: BAR 13: assigned [io 0x2000-0x2fff]
  [ 0.681500] pci 0000:00:03.0: BAR 13: assigned [io 0x3000-0x3fff]
  [ 0.681501] pci 0000:00:02.7: BAR 13: assigned [io 0x4000-0x4fff]
  [ 0.681502] pci 0000:00:02.6: BAR 13: assigned [io 0x5000-0x5fff]
  [ 0.681502] pci 0000:00:02.5: BAR 13: assigned [io 0x6000-0x6fff]
  [ 0.681503] pci 0000:00:02.4: BAR 13: assigned [io 0x7000-0x7fff]
  [ 0.681504] pci 0000:00:02.3: BAR 13: assigned [io 0x8000-0x8fff]
  [ 0.681505] pci 0000:00:02.2: BAR 13: assigned [io 0x9000-0x9fff]
  [ 0.681506] pci 0000:00:02.1: BAR 13: assigned [io 0xa000-0xafff]
  [ 0.681507] pci 0000:00:02.0: BAR 13: assigned [io 0xb000-0xbfff]
  [ 0.681507] pci 0000:00:01.7: BAR 13: assigned [io 0xd000-0xdfff]
  [ 0.681508] pci 0000:00:01.6: BAR 13: assigned [io 0xe000-0xefff]
  [ 0.681509] pci 0000:00:01.5: BAR 13: assigned [io 0xf000-0xffff]
  [ 0.681510] pci 0000:00:01.4: BAR 13: no space for [io size 0x1000]
  [ 0.681511] pci 0000:00:01.4: BAR 13: failed to assign [io size 0x1000]
  [ 0.681511] pci 0000:00:01.3: BAR 13: no space for [io size 0x1000]
  [ 0.681512] pci 0000:00:01.3: BAR 13: failed to assign [io size 0x1000]
  [ 0.681513] pci 0000:00:01.2: BAR 13: no space for [io size 0x1000]
  [ 0.681513] pci 0000:00:01.2: BAR 13: failed to assign [io size 0x1000]
  [ 0.681514] pci 0000:00:01.1: BAR 13: no space for [io size 0x1000]
  [ 0.681515] pci 0000:00:01.1: BAR 13: failed to assign [io size 0x1000]
  [ 0.681516] pci 0000:00:01.0: BAR 13: no space for [io size 0x1000]
  [ 0.681516] pci 0000:00:01.0: BAR 13: failed to assign [io size 0x1000]

If we check the amount of I/O space reserved by each controller and used by each device, we can see that a few of the pcie-root-ports don't have any:

  $ sudo lspci -vv | grep -E '^[0-9]|I/O '
  00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
  00:01.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: [disabled]
  00:01.1 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: [disabled]
  00:01.2 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: [disabled]
  00:01.3 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: [disabled]
  00:01.4 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: [disabled]
  00:01.5 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: 0000f000-0000ffff [size=4K]
  00:01.6 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: 0000e000-0000efff [size=4K]
  00:01.7 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: 0000d000-0000dfff [size=4K]
  00:02.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: 0000b000-0000bfff [size=4K]
  00:02.1 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: 0000a000-0000afff [size=4K]
  00:02.2 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: 00009000-00009fff [size=4K]
  00:02.3 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: 00008000-00008fff [size=4K]
  00:02.4 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: 00007000-00007fff [size=4K]
  00:02.5 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: 00006000-00006fff [size=4K]
  00:02.6 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: 00005000-00005fff [size=4K]
  00:02.7 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: 00004000-00004fff [size=4K]
  00:03.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: 00003000-00003fff [size=4K]
  00:03.1 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: 00002000-00002fff [size=4K]
  00:03.2 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: 00001000-00001fff [size=4K]
  00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
  00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02) (prog-if 01 [AHCI 1.0])
          Region 4: I/O ports at c040 [size=32]
  00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
          Region 4: I/O ports at 0700 [size=64]
  01:00.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
  02:00.0 Communication controller: Red Hat, Inc. Virtio console (rev 01)
  03:00.0 SCSI storage controller: Red Hat, Inc. Virtio block device (rev 01)
  04:00.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon (rev 01)
  05:00.0 Unclassified device [00ff]: Red Hat, Inc. Virtio RNG (rev 01)
  06:00.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
  07:00.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
  08:00.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
  09:00.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
  0a:00.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
  0b:00.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
  0c:00.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
  0d:00.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
  0e:00.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
  0f:00.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
  10:00.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
  11:00.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
  12:00.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
  13:00.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)

We can also see that none of the virtio-net devices are using any I/O space: this explains why, despite the kernel messages, the system is okay and the network interfaces all work fine. In particular,

  $ ip addr show dev enp1s0
  2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
      link/ether 52:54:00:7a:15:9b brd ff:ff:ff:ff:ff:ff
      inet 192.168.122.65/24 brd 192.168.122.255 scope global dynamic enp1s0
         valid_lft 3442sec preferred_lft 3442sec
      inet6 fe80::5054:ff:fe7a:159b/64 scope link
         valid_lft forever preferred_lft forever

  $ ls -l /sys/class/net/enp1s0
  lrwxrwxrwx 1 root root 0 Jan 14 19:43 /sys/class/net/enp1s0 -> ../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/virtio0/net/enp1s0

So enp1s0, aka 0000:01:00.0, is plugged into 0000:00:01.0, which according to the output above has no I/O space reserved for it, and yet that's the very interface that was used when I ssh'd into the VM.

So far so good! If you try, however, to replace all virtio-net devices with e1000e devices, which apparently need a tiny amount of I/O space to operate, then the guest will no longer boot even to the point where the kernel is loaded. I guess SeaBIOS can't figure out how to assign I/O space to all devices that want some, and so it simply gives up?

Note that adding io-reserve=0 to all pcie-root-ports does exactly *nothing* to address this situation: whether or not the flag is present, boot will always get stuck before the kernel is even loaded.

Another interesting / confusing thing that I noticed: if I disable I/O space for all pcie-root-ports and then attach an e1000e network device to one, I end up with this situation:

  -[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
             +-01.0-[01]----00.0  Intel Corporation 82574L Gigabit Network Connection

  00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
  00:01.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: [disabled]
  01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
          Region 2: I/O ports at 1000 [size=32]

However, I also have

  16: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
      link/ether 52:54:00:7a:15:9b brd ff:ff:ff:ff:ff:ff
      inet 192.168.122.65/24 brd 192.168.122.255 scope global dynamic enp1s0
         valid_lft 3474sec preferred_lft 3474sec
      inet6 fe80::5054:ff:fe7a:159b/64 scope link
         valid_lft forever preferred_lft forever

and guest <-> host communication works perfectly.
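An editorial aside for anyone reading this later: until libvirt exposes the property natively, newer libvirt releases (6.4.0 and later) offer a generic escape hatch for overriding QEMU device properties, which could be used to set io-reserve on a root port. A sketch, assuming 'pci.1' is whatever alias libvirt assigned to the root port in question:

  <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
    ...
    <qemu:override>
      <qemu:device alias='pci.1'>
        <qemu:frontend>
          <qemu:property name='io-reserve' type='unsigned' value='0'/>
        </qemu:frontend>
      </qemu:device>
    </qemu:override>
  </domain>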
I remember learning from someone, at some point, that PCI Express devices are required to work without I/O space, so maybe the e1000e uses it when available but can cope with it being absent? That would explain it... However, I also tried using e1000, which is a conventional PCI device, instead of e1000e, and it still worked.
After evaluating this issue, we have no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, the bug can be reopened.
(In reply to Andrea Bolognani from comment #14)
> So far so good! If you try, however, to replace all virtio-net
> devices with e1000e devices, which apparently need a tiny amount of
> I/O space to operate, then the guest will no longer boot even to the
> point where the kernel is loaded. I guess SeaBIOS can't figure out
> how to assign I/O space to all devices that want some, and so it
> simply gives up?

Basically: yes. If you capture the SeaBIOS log, you'll see more details, I believe.

> Note that adding io-reserve=0 to all pcie-root-ports does exactly
> *nothing* to address this situation: whether or not the flag is
> present, boot will always get stuck before the kernel is even loaded.

It seems plausible that SeaBIOS's resource allocation for a PCIe root port is primarily driven by the needs of the device behind that port. After all, the property "io-reserve" says "reservation" in the name, and if there's a cold-plugged device in the port, it makes sense that you can't "un-reserve" the IO space that the device is actively asking for. IOW, I think io-reserve=0 might only make a difference if the port is empty at boot. The property decides about reserving vs. not reserving for hotplug purposes; it likely cannot override actual resource needs coming from the downstream side of the port.

(I'm saying "likely" because I'm not familiar with the SeaBIOS internals.)

I think it does make sense to expose this property in the domain XML. You could want to have 20 *empty* PCIe root ports at boot, with a plan to hot-plug up to 20 PCI Express devices, with none of those devices needing any IO space.
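To illustrate that use case concretely: the guest would be started with empty root ports that reserve no I/O window, and I/O-less PCI Express devices hot-plugged at runtime. A minimal sketch using the QEMU monitor (the IDs are placeholders):

  # boot with an empty root port that reserves no I/O window
  qemu-system-x86_64 -machine q35 ... \
      -device pcie-root-port,id=pci.1,bus=pcie.0,chassis=1,port=0x10,io-reserve=0

  # later, hot-plug a virtio (I/O-less) NIC into the empty port
  (qemu) netdev_add user,id=net1
  (qemu) device_add virtio-net-pci,netdev=net1,bus=pci.1,id=nic1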
(In reply to Laszlo Ersek from comment #16)
> (In reply to Andrea Bolognani from comment #14)
> > So far so good! If you try, however, to replace all virtio-net
> > devices with e1000e devices, which apparently need a tiny amount of
> > I/O space to operate, then the guest will no longer boot even to the
> > point where the kernel is loaded. I guess SeaBIOS can't figure out
> > how to assign I/O space to all devices that want some, and so it
> > simply gives up?
>
> Basically: yes. If you capture the SeaBIOS log, you'll see more details, I
> believe.

You were right: when SeaBIOS gets stuck, it prints

  PCI: out of I/O address space

to the log.

> > Note that adding io-reserve=0 to all pcie-root-ports does exactly
> > *nothing* to address this situation: whether or not the flag is
> > present, boot will always get stuck before the kernel is even loaded.
>
> It seems plausible that SeaBIOS's resource allocation for a PCIe root port
> is primarily driven by the needs of the device behind that port. After all,
> the property "io-reserve" says "reservation" in the name, and if there's a
> cold-plugged device in the port, it makes sense that you can't "un-reserve"
> the IO space that the device is actively asking for. IOW, I think
> io-reserve=0 might only make a difference if the port is empty at boot. The
> property decides about reserving vs. not reserving for hotplug purposes; it
> likely cannot override actual resource needs coming from the downstream side
> of the port.
>
> (I'm saying "likely" because I'm not familiar with the SeaBIOS internals.)

This makes sense, and testing at least partially confirms it: if I boot with an e1000e plugged into a pcie-root-port with io-reserve=0, I get

  00:01.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: [disabled]
  01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
          Region 2: I/O ports at 1000 [size=32]

but if I leave the port empty and hotplug the network adapter, the result is

  00:01.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: [disabled]
  01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
          Region 2: I/O ports at 1000 [disabled] [size=32]

The network connection works in both cases.

I still can't understand how the e1000 can work in this scenario: regardless of whether coldplug or hotplug is used, it still shows up as

  00:01.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
          I/O behind bridge: [disabled]
  01:00.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
          Region 1: I/O ports at 1000 [size=64]

and works fine...

> I think it does make sense to expose this property in the domain XML. You
> could want to have 20 *empty* PCIe root ports at boot, with a plan to
> hot-plug up to 20 PCI Express devices, with none of those devices needing
> any IO space.

Yeah, I can see that being a reasonable use case, at least in theory. In practice, it looks like whether or not I/O space is reserved, required or used is not such a clear-cut question... But it probably still makes sense to expose this knob at the libvirt level. Reopening the bug.

Thank you for all your useful input, Laszlo! :)
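For anyone reproducing the hotplug half of the test above through libvirt, the attach step would look something like this (a sketch; the domain name 'guest' and the interface details are placeholders, and the io-reserve=0 setting on the port still has to be arranged outside of libvirt proper):

  $ cat e1000e-nic.xml
  <interface type='network'>
    <source network='default'/>
    <model type='e1000e'/>
  </interface>

  $ virsh attach-device guest e1000e-nic.xml --live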