This bug has been migrated to another issue-tracking site. It has been closed here and may no longer be monitored.

If you would like to get updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets there.

Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against the components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED".

If you cannot log in to RH Jira, please consult article #7032570. Failing that, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user-management inquiry. The e-mail creates a ServiceNow ticket with Red Hat.

Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of the form "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). The same link will also appear in a blue banner at the top of the page informing you that the bug has been migrated.
Bug 2024818 - [Windows_vm][Q35+ OVMF] Some hot-plugged PF/VF can not find enough free resources that it can use
Summary: [Windows_vm][Q35+ OVMF] Some hot-plugged PF/VF can not find enough free resou...
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: edk2
Version: 9.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Gerd Hoffmann
QA Contact: Yanghang Liu
URL:
Whiteboard:
Depends On: 2203094
Blocks: 2084533
 
Reported: 2021-11-19 07:04 UTC by Yanghang Liu
Modified: 2023-10-29 04:25 UTC
CC List: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2084533 (view as bug list)
Environment:
Last Closed: 2023-06-30 18:38:35 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
fw log with pci.1 bridge not being initialized properly (115.61 KB, text/plain)
2023-05-19 09:34 UTC, Igor Mammedov
OVMF log with vfio_hotplug_config (950.25 KB, text/plain)
2023-05-19 13:29 UTC, Igor Mammedov
domain config vfio_hotplug_config.xml (8.61 KB, text/plain)
2023-05-19 13:32 UTC, Igor Mammedov


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1408810 0 low CLOSED PCIe: Add an option to PCIe ports to disable IO port space support 2023-05-19 10:43:07 UTC
Red Hat Issue Tracker   RHEL-698 0 None None None 2023-06-30 18:38:34 UTC
Red Hat Issue Tracker RHELPLAN-103219 0 None None None 2021-11-19 07:06:53 UTC

Description Yanghang Liu 2021-11-19 07:04:50 UTC
Description of problem:
The MT2892 PF (mlx5_core) cannot find enough free resources that it can use after it is hot-plugged into a Q35 + OVMF Windows VM.

Version-Release number of selected component (if applicable):
qemu-kvm-6.1.0-6.el9.x86_64
edk2-ovmf-20210527gite1999b264f1f-6.el9.noarch
5.14.0-17.el9.x86_64
seabios-bin-1.14.0-7.el9.noarch


How reproducible:
100%

Steps to Reproduce:
1. import a Q35 + OVMF Windows domain
# virt-install --machine=q35 --noreboot --name=win2022  --boot=uefi  --network bridge=switch,model=virtio,mac=52:54:00:01:22:22  --memory=4096 --vcpus=4 --graphics type=vnc,port=5922,listen=0.0.0.0 --import --noautoconsole --disk path=/home/images/win2022-64-virtio.qcow2,bus=virtio,cache=none,format=qcow2,io=threads,size=20

2. hot-plug a MT2892(mlx5_core) into the Windows domain
# virsh attach-device $windows_vm 0000\:3b\:00.0.xml
# cat 0000\:3b\:00.0.xml
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x3b' slot='0x00' function='0x0'/>
      </source>
    </hostdev>

The related qmp:
> {"execute":"device_add","arguments":{"driver":"vfio-pci","host":"0000:3b:00.0","id":"hostdev0","bus":"pci.4","addr":"0x0"},"id":"libvirt-389"}
<  {"return": {}, "id": "libvirt-389"}


3. check the device info in Windows domain "Device Manager"
The "Device Manager" shows "The device cannot find enough free resources that it can use (Code 12)"

Actual results:
The hot-plugged MT2892 PF can not find enough free resources that it can use

Expected results:
The hot-plugged MT2892 PF works well.

Additional info:
(1) This problem can be reproduced by:
Hotplug 6 MT2892 VFs(mlx5_core) into the vm 
Hotplug 6 QL41112 VFs(qede) into the vm 
Hotplug 2 QL41112 PFs(qede) into the vm

(2) The drivers:
MT2892 Driver : https://www.mellanox.com/products/adapter-software/ethernet/windows/winof-2
SFC9220 Driver: https://support-nic.xilinx.com/wp/drivers

(3) This problem *can still be* reproduced when adding "-global ICH9-LPC.acpi-pci-hotplug-with-bridge-support=off" into the domain

Comment 1 Yanghang Liu 2021-11-19 07:49:49 UTC
> This problem *can still be* reproduced when adding "-global ICH9-LPC.acpi-pci-hotplug-with-bridge-support=off" into the domain

I know there are several similar Q35 + hot-plug bugs that have not been fixed yet.
 
 Bug 2001732 - [virtual network][qemu-6.1.0-1] Fail to hotplug nic with rtl8139 driver
 Bug 2001719 - fail to hotplug NIC with edk2 firmware
 Bug 2004829 - [ovmf] The guest does not present hot-plugged disk
 Bug 2007129 - pcie hotplug emulation has various problems due to insufficient state tracking 
 ...

But it seems to me that this bug *may have a different root cause* from the above bugs, because *this bug can still be reproduced after adding '-global ICH9-LPC.acpi-pci-hotplug-with-bridge-support=off' to the domain*.


I will do the related tests as well once the above bugs have been fixed.

Comment 3 Yanghang Liu 2021-12-27 14:15:12 UTC
The bug can still be reproduced in the following test environment:
qemu-kvm-6.2.0-1.el9.x86_64
libvirt-7.10.0-1.el9.x86_64
edk2-ovmf-20210527gite1999b264f1f-7.el9.noarch
seabios-bin-1.14.0-7.el9.noarch
5.14.0-39.el9.x86_64

Comment 12 Yanghang Liu 2022-04-18 06:13:35 UTC
This bug can still be reproduced in the following test env:
5.14.0-78.el9.x86_64
qemu-kvm-7.0.0-0.rc3.el9.wrb220406.x86_64
edk2-ovmf-20220221gitb24306f15d-1.el9.noarch
seabios-bin-1.15.0-1.el9.noarch

Comment 13 Yanghang Liu 2022-04-24 09:11:19 UTC
A workaround:

The PF/VF can be hot-plugged into the VM after adding the following setting to the VM configuration:

-global pcie-root-port.pref64-reserve=64M  
or, in the domain XML (note that <qemu:commandline> only takes effect if the root <domain> element declares the namespace xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'):
  <qemu:commandline>
    <qemu:arg value='-global'/>
    <qemu:arg value='pcie-root-port.pref64-reserve=64M'/>
  </qemu:commandline>



A similar bug tracking the same issue for RHEL VMs:
Bug 2055123 - [Q35] Failed to hot-plug a device whose membar > 2M into the vm

Comment 17 Alex Williamson 2022-05-11 14:39:43 UTC
(In reply to Laine Stump from comment #15)
> I don't consider myself qualified to provide a "safe" answer to that
> question, but isn't the memory available for PCI devices a very limited
> resource? If so, then adding extra for every device would lead to
> limitations on the number of devices that could be attached to a guest,
> which itself would be seen as a bug.

I/O port resources are very limited.

32-bit MMIO is somewhat limited.

64-bit MMIO is theoretically plentiful; we only need VM RAM + MMIO pool to fit within what the CPU physical address bits can cover

This device requires 64-bit MMIO.  The trouble is that there is no single right answer to how big to make bridge apertures; here we need 32MB, another device could come along requiring an incremental bump, then QE could try to hot-add a GPU and we'd potentially need many GB of aperture per bridge.  If the VM provides sufficient 64-bit MMIO space then the guest OS does have the option to re-allocate a given bridge, so there's an aspect here that depends on the capabilities of the guest OS.  It's not clear to me that a default VM configuration can ever be guaranteed to support hot-add of any device, independent of the guest OS support.

A reasonable approach might be to provide a substantially increased aperture per root port (256MB?) as well as allow xml tuning of the bridge apertures with supported options.  IIRC, there's also a parameter affecting the overall 64-bit MMIO pool size which may need a multiplier based on the number of root ports configured, or potentially QEMU could expose all remaining address bits after VM RAM size as 64-bit MMIO by default.  A limiting factor might be the conflict between hot-plug memory address space and potential MMIO usage.
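The "VM RAM + MMIO pool <= CPU physical address bits" constraint can be sketched with a quick back-of-the-envelope calculation (the phys-bits and RAM values below are illustrative assumptions, not taken from this bug):

```shell
# Rough sketch of the address-space budget: whatever guest-physical space
# is left above VM RAM is the ceiling for the 64-bit MMIO pool.
phys_bits=40
addr_space=$(( 1 << phys_bits ))         # 1 TiB of guest-physical address space
vm_ram=$(( 4 * 1073741824 ))             # 4 GiB of VM RAM
mmio_headroom=$(( addr_space - vm_ram ))
echo "64-bit MMIO headroom: $(( mmio_headroom / 1073741824 )) GiB"
```

A GPU-sized (multi-GiB) reservation per hot-pluggable root port eats into this headroom quickly, which is the trade-off described above.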

Comment 18 Gerd Hoffmann 2022-05-12 11:42:44 UTC
> 64-bit MMIO is theoretically plentiful, we only need VM RAM + MMIO pool <=
> cpu physical address bits

edk2 reserves 32G for 64-bit memory bars by default.
seabios takes whatever is needed.

By default hot-pluggable bridges get a minimum of 2M assigned
(more in case a device is plugged which actually needs more),
in both edk2 and seabios.

The property mentioned in comment 13 changes that default.
It's also possible to change a specific root port instead
of tweaking the global default for all root ports.
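For illustration, the per-port alternative might look like this on the QEMU command line (a sketch only; the root-port layout, addresses, and the id pci.4 are assumptions, not taken from this bug):

```shell
# Global default: every pcie-root-port reserves a 64M 64-bit prefetchable window.
qemu-system-x86_64 ... \
  -global pcie-root-port.pref64-reserve=64M ...

# Per-port override (assumed layout): only pci.4, intended for hotplug,
# gets the larger window; the other root ports keep the built-in default.
qemu-system-x86_64 ... \
  -device pcie-root-port,port=19,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x3,pref64-reserve=64M ...
```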

So, yes, going with moderately larger 64-bit bridge windows
like 32M or 64M shouldn't be much of a problem, address space
shortage shouldn't be an issue unless we talk about several
hundred pcie root ports.

Going with very large bridge windows (so you can hot-plug GPUs
which can have gigabyte-sized memory bars these days) would
quickly exhaust address space though.

<rant>
  The whole issue sort of circles back to the physical address
  space problem we have been discussing on and off for years.  The guest
  firmware still can't reliably figure out the physical address space
  size, so edk2 is conservative and tries to avoid using more
  than 64G (aka phys-bits=36) to be on the safe side.
</rant>

Ideally edk2 would look at the physical address space available
and pick better defaults based on that, and (for example) use a
memory window larger than 32G if possible, and also larger pcie
root port windows.

Comment 19 Gerd Hoffmann 2022-05-12 11:46:41 UTC
> <rant>
>   The whole issue sort of circles back to the physical address
>   space problem we have been discussing on and off for years.  The guest
>   firmware still can't reliably figure out the physical address space
>   size, so edk2 is conservative and tries to avoid using more
>   than 64G (aka phys-bits=36) to be on the safe side.
> </rant>

This is bug 2084533 now.

Comment 20 Yanghang Liu 2022-07-19 06:42:45 UTC
Hi Julia,

May I ask if there is any chance that we can fix this issue on current 9.1 ? 

If so, could you please help set the ITR ?

Comment 24 Yvugenfi@redhat.com 2023-04-20 12:13:33 UTC
(In reply to Yanghang Liu from comment #20)
> Hi Julia,
> 
> May I ask if there is any chance that we can fix this issue on current 9.1 ? 
> 
> If so, could you please help set the ITR ?

Hi Yanghan,

Did you check with the changes mentioned in:

https://bugzilla.redhat.com/show_bug.cgi?id=2055123#c4

Comment 25 Yanghang Liu 2023-04-27 04:19:43 UTC
Hi Yan,

This issue can still be reproduced in the edk2-ovmf-20230301gitf80f052277c8-2.el9.noarch.


The main check point:
[1] start a Q35 + OVMF Win2022 domain

[2] hot-plug a XL710 PF into the Win2022 domain

# virsh attach-device win2022 /tmp/device/0000\:87\:00.0.xml
Device attached successfully

[3] check the PF status in the Win2022 domain
The Device Manager shows "The device cannot find enough free resources that it can use (Code 12)"

Comment 26 Yvugenfi@redhat.com 2023-04-27 06:21:02 UTC
(In reply to Yanghang Liu from comment #25)
> Hi Yan,
> 
> This issue can still be reproduced in the
> edk2-ovmf-20230301gitf80f052277c8-2.el9.noarch.
> 
> 
> The main check point:
> [1] start a Q35 + OVMF Win2022 domain
> 
> [2] hot-plug a XL710 PF into the Win2022 domain
> 
> # virsh attach-device win2022 /tmp/device/0000\:87\:00.0.xml
> Device attached successfully
> 
> [3] check the PF status in the Win2022 domain
> The Device Manager shows "The device cannot find enough free resources that
> it can use(code 12)"

So this happens only with one device, right?

Comment 27 Yanghang Liu 2023-04-27 06:31:23 UTC
(In reply to Yvugenfi from comment #26)
> (In reply to Yanghang Liu from comment #25)
> > Hi Yan,
> > 
> > This issue can still be reproduced in the
> > edk2-ovmf-20230301gitf80f052277c8-2.el9.noarch.
> > 
> > 
> > The main check point:
> > [1] start a Q35 + OVMF Win2022 domain
> > 
> > [2] hot-plug a XL710 PF into the Win2022 domain
> > 
> > # virsh attach-device win2022 /tmp/device/0000\:87\:00.0.xml
> > Device attached successfully
> > 
> > [3] check the PF status in the Win2022 domain
> > The Device Manager shows "The device cannot find enough free resources that
> > it can use(code 12)"
> 
> So this happens only with one device, right?

Yep.

Comment 28 Igor Mammedov 2023-05-16 15:21:24 UTC
Windows usually does resource reallocation without any issues (given there is an unused portion somewhere to reclaim).
So I've tried to reproduce it with an XL710, and the results vary depending on which root port the device ends up plugged into.

SeaBIOS: hotplug works fine (it enables bridge windows for every root port).
OVMF:
 1. if the firmware has enabled the windows on a root port (even if the programmed window is too small), then Windows will reassign resources during hotplug just fine.
 2. if the bridge windows aren't enabled, then Windows will not enable them either -> the bridge is not really usable, and hotplug into such a root port will fail.
     Here is how the root ports look before the firmware jumps to the OS bootloader:
 
 * non working root port:
  Bus  0, device   2, function 3:
    PCI bridge: PCI device 1b36:000c
      IRQ 11, pin A
      BUS 0.
      secondary bus 4.
      subordinate bus 4.
      IO range [0xf000, 0x0fff]
      memory range [0xfff00000, 0x000fffff]
      prefetchable memory range [0xfffffffffff00000, 0x000fffff]
      BAR0: 32 bit memory at 0xc224b000 [0xc224bfff].
      id "pci.4"

 * usable root port
  Bus  0, device   2, function 6:
    PCI bridge: PCI device 1b36:000c
      IRQ 11, pin A
      BUS 0.
      secondary bus 7.
      subordinate bus 7.
      IO range [0xd000, 0xdfff]
      memory range [0xc1e00000, 0xc1ffffff]
      prefetchable memory range [0x380000000000, 0x3807ffffffff]
      BAR0: 32 bit memory at 0xc2248000 [0xc2248fff].
      id "pci.7"

CCing firmware folks for an opinion on where this goes wrong

(
qemu-kvm-7.2.0-11.el9_2.x86_64
edk2-ovmf-20221207gitfff6d81270b5-7.el9.noarch
)

Comment 29 Laszlo Ersek 2023-05-18 06:35:43 UTC
(In reply to Igor Mammedov from comment #28)

>  * non working root port:
>   Bus  0, device   2, function 3:
>     PCI bridge: PCI device 1b36:000c
>       IRQ 11, pin A
>       BUS 0.
>       secondary bus 4.
>       subordinate bus 4.
>       IO range [0xf000, 0x0fff]
>       memory range [0xfff00000, 0x000fffff]
>       prefetchable memory range [0xfffffffffff00000, 0x000fffff]

These ranges look busted. The upper boundaries are actually less than the lower boundaries. Maybe those values are leftovers from probing. Not sure.

The firmware log should be helpful, please attach it to the BZ -- it contains much information on PCI resource assignment.

Comment 30 Igor Mammedov 2023-05-18 09:16:05 UTC
(In reply to Laszlo Ersek from comment #29)
> (In reply to Igor Mammedov from comment #28)
> 
> >  * non working root port:
> >   Bus  0, device   2, function 3:
> >     PCI bridge: PCI device 1b36:000c
> >       IRQ 11, pin A
> >       BUS 0.
> >       secondary bus 4.
> >       subordinate bus 4.
> >       IO range [0xf000, 0x0fff]
> >       memory range [0xfff00000, 0x000fffff]
> >       prefetchable memory range [0xfffffffffff00000, 0x000fffff]
> 
> These ranges look busted. The upper boundaries are actually less than the
> lower boundaries. Maybe those values are leftovers from probing. Not sure.

I think this is an indication that the ranges haven't been programmed (I might be wrong, though)

> 
> The firmware log should be helpful, please attach it to the BZ -- it
> contains much information on PCI resource assignment.

Can you point me to 'how to' do that?

Comment 31 Laszlo Ersek 2023-05-18 12:35:42 UTC
Yes, of course; sorry for not describing it at once.

* If you have libvirt 8.1+, then:

<domain type='kvm'>
  <devices>
    <serial type='file'>
      <target type='isa-debug'/>
      <address type='isa' iobase='0x402'/>
      <source path='/tmp/DOMAIN-ovmf.log'/>
    </serial>
  </devices>
</domain>

(update "DOMAIN" in the above snippet as necessary)

* If you have libvirt <= 8.0, then:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <qemu:commandline>
    <qemu:arg value='-chardev'/>
    <qemu:arg value='file,id=debugfile,path=/tmp/DOMAIN-ovmf.log'/>
    <qemu:arg value='-device'/>
    <qemu:arg value='isa-debugcon,iobase=0x402,chardev=debugfile'/>
  </qemu:commandline>
</domain>

(Again, update "DOMAIN" as necessary, *plus* don't forget to add the xmlns:qemu attribute (namespace definition) to the root <domain> element, as shown above! Otherwise the "qemu:" namespace prefix in the <qemu:arg> elements will not work!)

* Using the QEMU command line:

  -chardev file,id=debugfile,path=/tmp/DOMAIN-ovmf.log \
  -device isa-debugcon,iobase=0x402,chardev=debugfile \

Comment 32 Igor Mammedov 2023-05-19 09:34:33 UTC
Created attachment 1965639 [details]
fw log with pci.1 bridge not being initialized properly

Comment 33 Igor Mammedov 2023-05-19 09:39:24 UTC
No need for a host with a fancy NIC and PCI passthrough.

This reproduces with upstream QEMU on my RHEL 8.9 host;
the minimal reproducer is:

./qemu-system-x86_64 \
 -monitor stdio \
 -drive if=pflash,format=raw,unit=0,readonly=on,file=./pc-bios/edk2-x86_64-code.fd \
 -machine q35 \
 -accel kvm \
 -cpu host \
 -m 4096 \
 -nodefaults \
 -device pcie-root-port,port=16,chassis=1,id=pci.1,bus=pcie.0,multifunction=true,addr=0x2 \
 -device pcie-root-port,port=17,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 \
 -device pcie-root-port,port=18,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 \
 -device pcie-root-port,port=19,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x3 \
 -device pcie-root-port,port=20,chassis=5,id=pci.5,bus=pcie.0,addr=0x2.0x4 \
 -device pcie-root-port,port=21,chassis=6,id=pci.6,bus=pcie.0,addr=0x2.0x5 \
 -device pcie-root-port,port=22,chassis=7,id=pci.7,bus=pcie.0,addr=0x2.0x6 \
 -device pcie-root-port,port=23,chassis=8,id=pci.8,bus=pcie.0,addr=0x2.0x7 \
 -device pcie-root-port,port=24,chassis=9,id=pci.9,bus=pcie.0,multifunction=true,addr=0x3 \
 -device pcie-root-port,port=25,chassis=10,id=pci.10,bus=pcie.0,addr=0x3.0x1 \
 -device pcie-root-port,port=26,chassis=11,id=pci.11,bus=pcie.0,addr=0x3.0x2 \
 -chardev file,id=debugfile,path=/tmp/DOMAIN-ovmf.log \
 -device isa-debugcon,iobase=0x402,chardev=debugfile

---
  Bus  0, device   2, function 0:
    PCI bridge: PCI device 1b36:000c
      IRQ 11, pin A
      BUS 0.
      secondary bus 1.
      subordinate bus 1.
      IO range [0xf000, 0x0fff]
      memory range [0xfff00000, 0x000fffff]
      prefetchable memory range [0xfffffffffff00000, 0x000fffff]
      BAR0: 32 bit memory at 0xc140b000 [0xc140bfff].
      id "pci.1"

OVMF log is attached.

bonus points:
  1. Booted with TCG / no -cpu: same as above (modulo the addresses for initialized bridges being different (32-bit)).
  2. Funnily, if one keeps KVM but drops '-cpu host' altogether, none of the root ports are initialized.

Comment 34 Laszlo Ersek 2023-05-19 10:13:01 UTC
Your "bonus points" #2 is quite telling here, so I'm going to hazard a guess even before I look at the OVMF log:

The CPU model may play a part here because OVMF now fetches the "physical address width" from CPUID, and sizes the 64-bit MMIO aperture accordingly.

Again this is just a guess, for now, I'll attempt to look at the log later, if Gerd doesn't beat me to it.

Comment 35 Laszlo Ersek 2023-05-19 10:43:07 UTC
Based on the log, you are running out of IO Port space.

(1) The port is correctly discovered:

> PciBus: Discovered PPB @ [00|02|00]  [VID = 0x1B36, DID = 0xC]
>    Padding: Type = PMem64; Alignment = 0x7FFFFFFFF;	Length = 0x800000000
>    Padding: Type =  Mem32; Alignment = 0x1FFFFF;	Length = 0x200000
>    Padding: Type =     Io; Alignment = 0x1FF;	Length = 0x200
>    BAR[0]: Type =  Mem32; Alignment = 0xFFF;	Length = 0x1000;	Offset = 0x10

(2) The 11 ports (bridges) altogether attempt to get 11 * 4KB IO Port
space:

> PciHostBridge: SubmitResources for PciRoot(0x0)
>  I/O: Granularity/SpecificFlag = 0 / 01
>       Length/Alignment = 0xB000 / 0xFFF

(3) That can't work; earlier in the log, we record the aperture
available on Q35 -- note the "Io" entry:

> PciHostBridgeUtilityInitRootBridge: populated root bus 0, with room for 255 subordinate bus(es)
> RootBridge: PciRoot(0x0)
>   Support/Attr: 70069 / 70069
>     DmaAbove4G: No
> NoExtConfSpace: No
>      AllocAttr: 3 (CombineMemPMem Mem64Decode)
>            Bus: 0 - FF Translation=0
>             Io: 6000 - FFFF Translation=0
>            Mem: C0000000 - FBFFFFFF Translation=0
>     MemAbove4G: 380000000000 - 3FFFFFFFFFFF Translation=0
>           PMem: FFFFFFFFFFFFFFFF - 0 Translation=0
>    PMemAbove4G: FFFFFFFFFFFFFFFF - 0 Translation=0

That's 10 * 4KB.
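The mismatch can be verified with a quick calculation (constants taken from the quoted log: 11 bridges, 4 KiB-aligned IO windows, aperture 0x6000..0xFFFF):

```shell
# 11 hot-pluggable root ports each need a 4 KiB-aligned IO window, but the
# Q35 root bridge IO aperture only spans 0x6000..0xFFFF (room for 10).
ports=11
per_port=$(( 0x1000 ))                 # 4 KiB granularity (alignment 0xFFF)
requested=$(( ports * per_port ))      # 0xB000, as in the SubmitResources log
available=$(( 0xFFFF - 0x6000 + 1 ))   # 0xA000
printf 'requested=0x%X available=0x%X ports_that_fit=%d\n' \
  "$requested" "$available" $(( available / per_port ))
```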

(4) Error message(s):

>   I/O: Base/Length/Alignment = FFFFFFFFFFFFFFFF/B000/FFF - Out Of Resource!
> ...
> PciHostBridge: Resource conflict happens!
> ...
> PciBus: HostBridge->NotifyPhase(AllocateResources) - Out of Resources
> PciBus: [00|02|00] was rejected due to resource confliction.

(Note especially the last line quoted.)

(5) Regarding the non-rejected ports, the IO space is handed out as
follows:

> PciBus: Resource Map for Root Bridge PciRoot(0x0)
> Type =   Io16; Base = 0x6000;	Length = 0xA000;	Alignment = 0xFFF
>    Base = 0x6000;	Length = 0x200;	Alignment = 0xFFF;	Owner = PPB [00|03|02:**]
>    Base = 0x7000;	Length = 0x200;	Alignment = 0xFFF;	Owner = PPB [00|03|01:**]
>    Base = 0x8000;	Length = 0x200;	Alignment = 0xFFF;	Owner = PPB [00|03|00:**]
>    Base = 0x9000;	Length = 0x200;	Alignment = 0xFFF;	Owner = PPB [00|02|07:**]
>    Base = 0xA000;	Length = 0x200;	Alignment = 0xFFF;	Owner = PPB [00|02|06:**]
>    Base = 0xB000;	Length = 0x200;	Alignment = 0xFFF;	Owner = PPB [00|02|05:**]
>    Base = 0xC000;	Length = 0x200;	Alignment = 0xFFF;	Owner = PPB [00|02|04:**]
>    Base = 0xD000;	Length = 0x200;	Alignment = 0xFFF;	Owner = PPB [00|02|03:**]
>    Base = 0xE000;	Length = 0x200;	Alignment = 0xFFF;	Owner = PPB [00|02|02:**]
>    Base = 0xF000;	Length = 0x200;	Alignment = 0xFFF;	Owner = PPB [00|02|01:**]

So, in this particular case, you run out of IO Port space, and that
prevents one of the root ports from being set up. Whereas under "bonus
point #2" in comment#33, you likely ran out of MMIO with regard to *all*
ports.

For remedying the IO Port space problem, configure all the ports such
that they ask for no IO at all (all PCIe devices are required to work
without IO BARs). The QEMU option for this is:

  -global pcie-root-port.io-reserve=0

Regarding the libvirt domain XML, I'm unaware of any element or
attribute that can do this. That feature was the topic of bug#1408810,
but it was closed as WONTFIX ultimately. So, in a domain XML, you can
only insert, for now (see comment#31):

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <qemu:commandline>
    <qemu:arg value='-global'/>
    <qemu:arg value='pcie-root-port.io-reserve=0'/>
  </qemu:commandline>
</domain>

In summary, there are multiple *kinds* of resources that we can run out
of.

- We can run out of IO if too many root ports are configured on the QEMU
  cmdline, or in the domain XML, and IO reservation is not disabled on
  those root ports.

- We can (independently) run out of 64-bit MMIO, if the 64-bit MMIO
  aperture established by OVMF is too small, relative to how large BARs,
  and how many BARs, the individual endpoints request cumulatively --
  and this can happen if the VCPU model reports a too small physical
  address width to OVMF, via CPUID.

- We can (independently) run out of 32-bit MMIO as well. The 32-bit MMIO
  aperture is fixed in OVMF (namely fixed to board-dependent constants:
  i440fx differs from Q35; I forget the exact details, and I think Gerd
  has rearranged the low memmap anyway recently). There's no way to
  increase this aperture, and there's also no way to expect devices to
  work without their 32-bit MMIO BARs. So this is a hard limit, but in
  practice, I've never seen an issue with this. Devices usually require
  pretty small and few 32-bit BARs.

(Side reminder: non-prefetchable MMIO is 32-bit only, while prefetchable
MMIO may be *either* 32-bit *or* 64-bit, but not both. Furthermore, OVMF
does not maintain separate prefetchable and non-prefetchable apertures;
it combines them. OVMF has one 32-bit aperture that can accommodate the
32-bit-only non-pref MMIO, plus the pref MMIO *if* the pref MMIO is
32-bit too; and OVMF has one 64-bit aperture that can accommodate the
64-bit pref MMIO *if* the pref MMIO is 64-bit.)

Either way, I'm not sure that Igor's latest reproducer, from comment#33,
actually reproduces the originally reported issue (comment#0). I
recommend for the original symptom to be reproduced, and then please
upload the firmware log *that* belongs to that failed test.

Comment 36 Laszlo Ersek 2023-05-19 10:45:17 UTC
(In reply to Laszlo Ersek from comment #35)

> Either way, I'm not sure that Igor's latest reproducer, from comment#33,
> actually reproduces the originally reported issue (comment#0). I
> recommend for the original symptom to be reproduced, and then please
> upload the firmware log *that* belongs to that failed test.

Sorry, I'm in a rush, and misplaced the emphasis in the last phrase. It should go like this:

I recommend for the original symptom to be reproduced, and then please upload the firmware log that belongs to *THAT* failed test.

Comment 37 Laszlo Ersek 2023-05-19 10:58:06 UTC
Ideally:

* Disable IO Port space reservation on all the PCIe root ports (libvirt doesn't directly support this at the moment, so you'll have to hack in the <qemu:arg> elements; see above).

  -global pcie-root-port.io-reserve=0

* Similarly, on all PCIe root ports, specify a large 64-bit MMIO reservation. Such that any one of these ports can accept the device that (a) you intend to hotplug later and that (b) has the largest *cumulative* 64-bit MMIO BAR demands. For example, reserve 256GB for each port:

  -global pcie-root-port.pref64-reserve=0x4000000000

* Set up the VCPU model (including using KVM) such that it exposes a *large* phys address width, such as 46 bits or more. I'm unsure about the logic in OVMF that derives the 64-bit aperture size from the phys address width, but from the log in comment 32, phys addr width = 46 bits results in 8192 GB of 64-bit MMIO space, so you should be able to fit about 32 ports / devices (256GB reserved each) in that.
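The sizing claim can be sanity-checked with a little shell arithmetic (8192 GB aperture per the comment 32 log; 0x4000000000 is the pref64-reserve value suggested earlier in this comment):

```shell
# How many root ports with a 256 GiB pref64 reservation fit into an
# 8192 GiB 64-bit MMIO aperture (phys addr width 46, per the log)?
per_port_gib=$(( 0x4000000000 / 1073741824 ))   # 256 GiB per root port
aperture_gib=8192
echo "ports that fit: $(( aperture_gib / per_port_gib ))"
```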

Again, unfortunately, there are many different limits here, and unless you consult the firmware log every time, you won't really know which one you hit.

Comment 38 Igor Mammedov 2023-05-19 12:18:10 UTC
(In reply to Laszlo Ersek from comment #36)
> (In reply to Laszlo Ersek from comment #35)
> 
> > Either way, I'm not sure that Igor's latest reproducer, from comment#33,
> > actually reproduces the originally reported issue (comment#0). I
> > recommend for the original symptom to be reproduced, and then please
> > upload the firmware log *that* belongs to that failed test.
> 
> Sorry, I'm in a rush, and misplaced the emphasis in the last phrase. It
> should go like this:
> 
> I recommend for the original symptom to be reproduced, and then please
> upload the firmware log that belongs to *THAT* failed test.

Hotplug fails *because* the plug happens into an uninitialized port;
plugging into a properly initialized port works fine.
So I don't think it's a PCI pass-through or NIC-related issue.

As long as the port is initialized (even with small bridge windows),
hotplug works, since Windows dynamically re-assigns unused resources.

doesn't help:  -global pcie-root-port.pref64-reserve=0x4000000000

What helps is "-global pcie-root-port.io-reserve=0",
so the questions are:
 1. how many root ports are too many
 2. whether virt-install/virt-manager should create that many when using UEFI
 3. whether OVMF should abandon a port if there is not enough IO (PCIe devices behind a root port should work even without IO resources)
 4. why IO resources are exhausted in such a small, trivial configuration.

Anyway,
I'll re-provision the Windows VM and get a log with the config reported in comment 0.

Comment 39 Igor Mammedov 2023-05-19 12:24:50 UTC
(In reply to Laszlo Ersek from comment #37)
[...]
> 
> - set up the VCPU model (including using KVM) such that it exposes a *large*
> phys address width, such as 46 bits, or more. I'm unsure about the logic in
> OVMF that derives the 64-bit aperture size from the phys address with, but
> from the log in comment 32, phys addr width = 46 bits results in 8192 GB
> 64-bit MMIO space, so you should be able to fit like 32 ports / devices
> (256GB reserved) in that.

Not likely; this even happens with the RHEL 9.2 OVMF build, which should not have
the recent address space improvements. All allocated bridge mem[64] windows are
relatively small and should fit fine.

PS:
edk2-ovmf-20221207gitfff6d81270b5-7.el9.noarch

Comment 40 Igor Mammedov 2023-05-19 13:29:00 UTC
Created attachment 1965696 [details]
OVMF log with vfio_hotplug_config

Comment 41 Igor Mammedov 2023-05-19 13:32:19 UTC
Created attachment 1965697 [details]
domain config vfio_hotplug_config.xml

Used with guest WS2022 Data center ed.

Comment 42 Igor Mammedov 2023-05-19 13:46:08 UTC
Hotplugged the host's XL710 NIC with the following snippet, which should put the NIC on the uninitialized root port (pci.4):

<hostdev mode="subsystem" type="pci" managed="yes">
  <driver name="vfio"/>
  <source>
    <address domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
  </source>
  <alias name="hostdev0"/>
  <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
</hostdev>

# virsh attach-device win2k22 xl710.xml

result from QEMU side:

  Bus  0, device   2, function 3:
    PCI bridge: PCI device 1b36:000c
      IRQ 11, pin A
      BUS 0.
      secondary bus 4.
      subordinate bus 4.
      IO range [0xf000, 0x0fff]
      memory range [0xfff00000, 0x000fffff]
      prefetchable memory range [0xfff00000, 0x000fffff]
      BAR0: 32 bit memory at 0xc224b000 [0xc224bfff].
      id "pci.4"
  Bus  4, device   0, function 0:
    Ethernet controller: PCI device 8086:1583
      PCI subsystem 8086:0006
      BAR0: 64 bit prefetchable memory at 0xffffffffffffffff [0x00fffffe].
      BAR3: 64 bit prefetchable memory at 0xffffffffffffffff [0x00007ffe].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0007fffe].
      id "hostdev0"

From the guest side, the error is "not enough free resources".

As for edk2-ovmf-20221207gitfff6d81270b5-7.el9.noarch peeking at the host phys-bits to decide the size of the reserved prefetchable window:
it actually might be happening (see the prefetchable address range in comment 28, which is ~32G).

Comment 43 Igor Mammedov 2023-05-19 13:58:40 UTC
Another experiment to confirm that resource reassignment in Windows works.
Usable port after boot (with a forced 2MB prefetch window):


   Bus  0, device   2, function 6:
    PCI bridge: PCI device 1b36:000c
      IRQ 11, pin A
      BUS 0.
      secondary bus 7.
      subordinate bus 7.
      IO range [0xd000, 0xdfff]
      memory range [0xc1e00000, 0xc1ffffff]
      prefetchable memory range [0x380000000000, 0x3800001fffff]
      BAR0: 32 bit memory at 0xc2248000 [0xc2248fff].
      id "pci.7"

After hotplugging the NIC (with its larger BARs), the window is resized and the hotplug is successful:

  Bus  0, device   2, function 6:
    PCI bridge: PCI device 1b36:000c
      IRQ 11, pin A
      BUS 0.
      secondary bus 7.
      subordinate bus 7.
      IO range [0xd000, 0xdfff]
      memory range [0xc1e00000, 0xc1ffffff]
      prefetchable memory range [0x3807fe000000, 0x3807ff0fffff]
      BAR0: 32 bit memory at 0xc2248000 [0xc2248fff].
      id "pci.7"
  Bus  7, device   0, function 0:
    Ethernet controller: PCI device 8086:1583
      PCI subsystem 8086:0006
      BAR0: 64 bit prefetchable memory at 0x3807fe000000 [0x3807feffffff].
      BAR3: 64 bit prefetchable memory at 0x3807ff0f8000 [0x3807ff0fffff].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0007fffe].
      id "hostdev0"

Comment 44 Laszlo Ersek 2023-05-19 15:03:28 UTC
(In reply to Igor Mammedov from comment #38)

> doesn't help:  -global pcie-root-port.pref64-reserve=0x4000000000

Meanwhile it has occurred to me that Gerd had made even this logic dynamic in OVMF -- IIRC not only is the aperture sized dynamically now (based on VCPU phys address width), but the 64-bit MMIO reservation for root ports is, too.

> What helps is "-global pcie-root-port.io-reserve=0",

OK!

> so question is:
>  1. how many root-ports are too much

10 should work, 11 are too much.

>  2. whether virt-install/virt-manager should create that many when using UEFI

Not sure. The number 10 is arbitrary; it's the largest contiguous IO port space (0x6000 and upwards) that we had managed to carve out, given the Q35 board's fixed IO ports.

>  3. should OVMF abandon port if there is not enough IO

This is already what's happening. It's performed by the generic PCI bus driver in edk2 that OVMF uses. When the driver runs out of resources, it attempts to identify the most resource-hungry device(s) and abandon them, thereby allowing multiple less-hungry devices to be configured.

> (PCIE devices behind root-port should work even without IO resources)

Correct, but it's not that simple, of course :)

OVMF uses edk2's generic PciBusDxe driver. OVMF has some platform code that can "guide" the generic PCI bus driver.

By default, the PCI bus driver allocates 4KB IO for a bridge. Be it a traditional PCI-to-PCI bridge, or a PCI Express root port.

(NB, if you try to configure a bunch of (conventional) PCI bridges on i440fx, you'll get the same problem -- in fact you'll hit it much earlier, because on the i440fx board, the largest contiguous IO port space (aperture) is much smaller than on Q35, so you'll see problems after 4-5 bridges or so. Of course, you don't *need* that many bridges on i440fx, so in practice people never see that issue.)

OVMF can convince the edk2 PCI bus driver not to reserve IO port space (and this works for individual bridges / root ports), but it doesn't do that *by default*. The defaults were chosen several years ago. See <https://github.com/tianocore/edk2/commit/fe4049471bdf>.

The generic PCI bus driver will *not* do the following: attempt to allocate 4KB IO for a root port (by default behavior), fail to allocate the IO, but still configure the port (*without* IO), and then expect devices behind the root port to work without IO. This is not going to happen, the generic PCI bus driver in edk2 just doesn't work this way. If any resource demand for any bridge cannot be satisfied, then the driver will enter this state where it will try to kick out (leave unconfigured) the most resource-hungry devices / bridges. The bridge triggering the "out of resources" condition may not be the bridge that gets left behind. (A bit similarly to how the OOM killer in Linux works -- it's not necessarily the tiny process that runs out of memory that gets reaped.)
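That degradation strategy can be sketched as a toy model (illustrative only; this is not PciBusDxe's actual algorithm, just the behavior described above: when total demand exceeds the aperture, repeatedly abandon the hungriest consumer until the rest fits):

```python
def fit_devices(demands: dict, budget: int):
    """Toy model of 'kick out the most resource-hungry device' degradation.

    demands: mapping of device name -> IO bytes demanded (hypothetical names).
    budget:  total IO aperture available.
    Returns (devices that fit, devices abandoned).
    """
    placed = sorted(demands.items(), key=lambda kv: kv[1])  # hungriest last
    abandoned = []
    while sum(d for _, d in placed) > budget:
        name, _ = placed.pop()          # abandon the biggest consumer
        abandoned.append(name)
    return dict(placed), abandoned

# Two 4 KiB bridges plus one 32 KiB monster, against a 16 KiB budget:
ok, dropped = fit_devices({"pci.1": 0x1000, "pci.2": 0x1000, "big": 0x8000},
                          budget=0x4000)
print(dropped)   # the hungriest device is left unconfigured, the rest fit
```

Note how this mirrors the OOM-killer analogy: "big" is abandoned even though the smaller bridges contributed to exhausting the budget.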

The following does work: the user sets the IO reservation hint on the root port to zero, OVMF finds that in a PCI capability of the root port, and steers the generic PCI bus driver accordingly. Then the PCI bus driver doesn't even *attempt* to allocate IO for the root port, so it can never run out of IO space -- for that particular bridge anyways. If *none* of the bridges / endpoints run out of resources, then nothing will be abandoned, and then we can expect all devices behind those ports to work, even not having IO.

So you can control the IO reservation on a port by port basis, but in general, setting the property to zero, with a -global option, is the most sensible idea.
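The two options described above look like this as QEMU command-line fragments (a sketch only; `io-reserve` is the real `pcie-root-port` reservation-hint property, but the id/bus/addr values and the elided rest of the command line are illustrative):

```shell
# Global variant: disable the IO reservation hint on every root port
qemu-system-x86_64 -machine q35 \
    -global pcie-root-port.io-reserve=0 \
    ...

# Per-port variant: disable it for a single root port only
# (id/bus/addr values here are made up for illustration)
qemu-system-x86_64 -machine q35 \
    -device pcie-root-port,id=pci.4,bus=pcie.0,addr=0x2.0x3,io-reserve=0 \
    ...
```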

>  4. why IO resources are exhausted on such small trivial configuration.

It's not trivial. 11 root ports with each of them requiring IO (and the generic PCI bus driver insisting on a 4KB alignment for each IO block) is not trivial.
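A back-of-the-envelope check of that budget, using only numbers from this thread (the "10 should work, 11 are too much" above follows directly from the aperture size):

```python
# PciBusDxe reserves one 4 KiB IO block per bridge, 4 KiB aligned, and the
# largest contiguous IO aperture on Q35 is 0x6000..0xFFFF (per comment 44).
IO_BLOCK = 0x1000
aperture_start, aperture_end = 0x6000, 0xFFFF

blocks = (aperture_end + 1 - aperture_start) // IO_BLOCK
print(blocks)  # 10 -> ten 4 KiB blocks fit, so an 11th root port cannot
```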

Comment 45 Laszlo Ersek 2023-05-19 15:13:44 UTC
(In reply to Igor Mammedov from comment #40)
> Created attachment 1965696 [details]
> OVMF log with vfio_hotplug_config

Same issue, just worse. 14 root ports requiring 14 * 4KB IO space. From those, 00:02.1 through 00:02.5 are rejected (5 ports), while 00:02.0, 00:02.6, 00:02.7, plus 00:03.0 through 00:03.5 (9 ports) are allocated fine. The 9 * 4KB space [0x6000 .. 0xEFFF] gets assigned to those root ports, and the 1 * 4KB space at [0xF000 .. 0xFFFF] belongs to the root complex / root bridge (that's where devices integrated in the root complex get their IO BARs assigned from).
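Redoing that arithmetic explicitly (a sketch using the numbers above):

```python
# 14 root ports each want a 4 KiB IO block, but the top block of the
# 0x6000..0xFFFF aperture ([0xF000, 0xFFFF]) belongs to the root bridge,
# leaving only [0x6000, 0xEFFF] for the root ports.
IO_BLOCK = 0x1000
ports_wanted = 14
available_for_ports = (0xF000 - 0x6000) // IO_BLOCK

rejected = ports_wanted - available_for_ports
print(available_for_ports, rejected)  # 9 ports get IO, 5 are rejected
```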

Comment 46 Laszlo Ersek 2023-05-19 15:24:53 UTC
(In reply to Igor Mammedov from comment #42)
> Hotplugged host's XL710 nic:
> with following snippet which should put NIC on uninitialized root-port
> (pci.4)
> 
> <hostdev mode="subsystem" type="pci" managed="yes">
>   <driver name="vfio"/>
>   <source>
>     <address domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>

(this is the host-side B/D/F, IIUC)

>   </source>
>   <alias name="hostdev0"/>
>   <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>

(this is the guest-side B/D/F)

So bus="0x04" here corresponds to index='4' from comment#41:

    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0x13'/>
      <alias name='pci.4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
    </controller>

The PCI B/D/F for that root port is 00:02.3. Indeed, that root port is rejected by the firmware, because there isn't enough IO available.

> </hostdev>
> 
> # virsh attach-device win2k22 xl710.xml
> 
> result from QEMU side:
> 
>   Bus  0, device   2, function 3:
>     PCI bridge: PCI device 1b36:000c
>       IRQ 11, pin A
>       BUS 0.
>       secondary bus 4.
>       subordinate bus 4.
>       IO range [0xf000, 0x0fff]
>       memory range [0xfff00000, 0x000fffff]
>       prefetchable memory range [0xfff00000, 0x000fffff]
>       BAR0: 32 bit memory at 0xc224b000 [0xc224bfff].
>       id "pci.4"
>   Bus  4, device   0, function 0:
>     Ethernet controller: PCI device 8086:1583
>       PCI subsystem 8086:0006
>       BAR0: 64 bit prefetchable memory at 0xffffffffffffffff [0x00fffffe].
>       BAR3: 64 bit prefetchable memory at 0xffffffffffffffff [0x00007ffe].
>       BAR6: 32 bit memory at 0xffffffffffffffff [0x0007fffe].
>       id "hostdev0"
> 
> from guest side, error: "not enough free resources'

Right, all that looks consistent.

> As for edk2-ovmf-20221207gitfff6d81270b5-7.el9.noarch peaking at hostbits to
> decide on size of reserved prefetchable window
> it actually might be happening (see pref address range in comment 28, which
> is ~32G).

Right: "prefetchable memory range [0x380000000000, 0x3807ffffffff]" -- this is confirmed by the comment#32 log too:

> PlatformDynamicMmioWindow:   Pci64 Base 0x380000000000
> PlatformDynamicMmioWindow:   Pci64 Size 0x80000000000
> ...

Anyway I think all of this stuff should just work now if you set the IO reservation for all root ports to zero:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <qemu:commandline>
    <qemu:arg value='-global'/>
    <qemu:arg value='pcie-root-port.io-reserve=0'/>
  </qemu:commandline>
</domain>

... Gerd may also decide, going forward, that the firmware should not reserve IO *by default* for PCIe root ports. I'm not sure, perhaps that's the practical expectation nowadays.

Comment 47 Igor Mammedov 2023-05-22 10:22:02 UTC
Just tested with SeaBIOS, and it does a better job of distributing IO resources:
all 30 ports come out properly initialized.

Anyway, it's not a QEMU issue; reassigning it to OVMF to decide
where/whether/how it should be fixed.

Comment 48 Gerd Hoffmann 2023-05-22 10:29:29 UTC
> ... Gerd may also decide, going forward, that the firmware should not
> reserve IO *by default* for PCIe root ports. I'm not sure, perhaps that's
> the practical expectation nowadays.

Already came to that conclusion too ;)

See https://bugzilla.redhat.com/show_bug.cgi?id=2174749#c18

Has links to patch + scratch build.
There is a fair chance that the scratch build fixes this bug too.

Comment 49 Igor Mammedov 2023-05-22 10:50:44 UTC
(In reply to Gerd Hoffmann from comment #48)
> > ... Gerd may also decide, going forward, that the firmware should not
> > reserve IO *by default* for PCIe root ports. I'm not sure, perhaps that's
> > the practical expectation nowadays.
> 
> Already came to that conclusion too ;)
> 
> See https://bugzilla.redhat.com/show_bug.cgi?id=2174749#c18
> 
> Has links to patch + scratch build.
> There is a fair chance that the scratch build fixes this bug too.

I have tried the scratch build; it looks like it should help
(all empty root ports have IO disabled and their memory ranges initialized).

Comment 50 Gerd Hoffmann 2023-05-22 14:27:19 UTC
> > See https://bugzilla.redhat.com/show_bug.cgi?id=2174749#c18
> > 
> > Has links to patch + scratch build.

> I have tried scratch build,
> looks like it should help
> (all empty root-ports have IO disabled and memory ranges initialized)

Which is expected behaviour.  Good.

Comment 52 Gerd Hoffmann 2023-06-01 08:39:28 UTC
(In reply to Gerd Hoffmann from comment #50)
> > > See https://bugzilla.redhat.com/show_bug.cgi?id=2174749#c18
> > > 
> > > Has links to patch + scratch build.
> 
> > I have tried scratch build,
> > looks like it should help
> > (all empty root-ports have IO disabled and memory ranges initialized)
> 
> Which is expected behaviour.  Good.

So the fix is the same one we need for bug 2203094.

Keeping the bugs separate because the test scenarios are quite different,
so adding a dependency and marking this as TestOnly instead of closing it as a duplicate.

Comment 53 Red Hat Bugzilla 2023-10-29 04:25:02 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
