Bug 2024818
Summary: | [Windows_vm][Q35+ OVMF] Some hot-plugged PF/VF can not find enough free resources that it can use | |
---|---|---|---
Product: | Red Hat Enterprise Linux 9 | Reporter: | Yanghang Liu <yanghliu>
Component: | edk2 | Assignee: | Gerd Hoffmann <kraxel>
Status: | CLOSED MIGRATED | QA Contact: | Yanghang Liu <yanghliu>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 9.0 | CC: | ailan, alex.williamson, berrange, chayang, coli, imammedo, jinzhao, jusual, juzhang, kraxel, laine, lersek, mkedzier, mst, nilal, pbonzini, virt-maint, xuwei, yama, yvugenfi
Target Milestone: | rc | Keywords: | CustomerScenariosInitiative, MigratedToJIRA, TestOnly, Triaged
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 2084533 | Environment: |
Last Closed: | 2023-06-30 18:38:35 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 2203094 | |
Bug Blocks: | 2084533 | |
Attachments: | | |
Description
Yanghang Liu
2021-11-19 07:04:50 UTC
> This problem *can still be* reproduced when adding "-global ICH9-LPC.acpi-pci-hotplug-with-bridge-support=off" into the domain

I know there are several similar Q35 + hot-plug bugs that have not been fixed yet.

Bug 2001732 - [virtual network][qemu-6.1.0-1] Fail to hotplug nic with rtl8139 driver
Bug 2001719 - fail to hotplug NIC with edk2 firmware
Bug 2004829 - [ovmf] The guest does not present hot-plugged disk
Bug 2007129 - pcie hotplug emulation has various problems due to insufficient state tracking
...

But it seems to me that this bug *may have a different root cause* from the above bugs, because *this bug can still be reproduced after adding '-global ICH9-LPC.acpi-pci-hotplug-with-bridge-support=off' into the domain*.

I will do related tests as well once the above bugs have been fixed.

The bug can still be reproduced in the following test environment:
qemu-kvm-6.2.0-1.el9.x86_64
libvirt-7.10.0-1.el9.x86_64
edk2-ovmf-20210527gite1999b264f1f-7.el9.noarch
seabios-bin-1.14.0-7.el9.noarch
5.14.0-39.el9.x86_64

This bug can still be reproduced in the following test environment:
5.14.0-78.el9.x86_64
qemu-kvm-7.0.0-0.rc3.el9.wrb220406.x86_64
edk2-ovmf-20220221gitb24306f15d-1.el9.noarch
seabios-bin-1.15.0-1.el9.noarch

A workaround: this PF/VF can be hot-plugged into the VM after adding the following setting to the VM configuration:

-global pcie-root-port.pref64-reserve=64M

or

<qemu:commandline>
  <qemu:arg value='-global'/>
  <qemu:arg value='pcie-root-port.pref64-reserve=64M'/>
</qemu:commandline>

A similar bug which tracks the RHEL VM case:
Bug 2055123 - [Q35] Failed to hot-plug a device whose membar > 2M into the vm

(In reply to Laine Stump from comment #15)
> I don't consider myself qualified to provide a "safe" answer to that
> question, but isn't the memory available for PCI devices a very limited
> resource? If so, then adding extra for every device would lead to
> limitations on the number of devices that could be attached to a guest,
> which itself would be seen as a bug.
I/O port resources are very limited. 32-bit MMIO is somewhat limited. 64-bit MMIO is theoretically plentiful, we only need VM RAM + MMIO pool <= cpu physical address bits. This device requires 64-bit MMIO.

The trouble is that there is no single right answer to how big to make bridge apertures: here we need 32MB, another device could come along requiring an incremental bump, then QE could try to hot-add a GPU and we'd potentially need many GB of aperture per bridge.

If the VM provides sufficient 64-bit MMIO space then the guest OS does have the option to re-allocate a given bridge, so there's an aspect here that depends on the capabilities of the guest OS. It's not clear to me that a default VM configuration can ever be guaranteed to support hot-add of any device, independent of the guest OS support.

A reasonable approach might be to provide a substantially increased aperture per root port (256MB?) as well as allow xml tuning of the bridge apertures with supported options. IIRC, there's also a parameter affecting the overall 64-bit MMIO pool size which may need a multiplier based on the number of root ports configured, or potentially QEMU could expose all remaining address bits after VM RAM size as 64-bit MMIO by default. A limiting factor might be the conflict between hot-plug memory address space and potential MMIO usage.

> 64-bit MMIO is theoretically plentiful, we only need VM RAM + MMIO pool <=
> cpu physical address bits

edk2 reserves 32G for 64-bit memory bars by default.
seabios takes whatever is needed.

By default hot-pluggable bridges get a minimum of 2M assigned (more in case a device is plugged which actually needs more), in both edk2 and seabios. The property mentioned in comment 13 changes that default. It's also possible to change a specific root port instead of tweaking the global default for all root ports.
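As a rough illustration of the numbers quoted above, the following Python sketch counts how many equally sized root-port windows would fit into the 32G pool that edk2 is said to reserve by default. The function name and the list of window sizes are illustrative assumptions, and the calculation ignores BARs of devices that are not behind root ports.

```python
# Rough capacity check: how many equally sized 64-bit bridge windows fit into
# the 64-bit MMIO pool. This is a simplification (assumption): it ignores any
# other consumers of the pool and treats every root port identically.
MIB = 1 << 20
GIB = 1 << 30

EDK2_DEFAULT_POOL = 32 * GIB  # "edk2 reserves 32G for 64-bit memory bars by default"


def windows_that_fit(pool_bytes, window_bytes):
    """Number of naturally aligned windows of one size that fit in the pool."""
    return pool_bytes // window_bytes


if __name__ == "__main__":
    for label, size in [("2M", 2 * MIB), ("64M", 64 * MIB),
                        ("256M", 256 * MIB), ("1G", GIB)]:
        count = windows_that_fit(EDK2_DEFAULT_POOL, size)
        print(f"{label:>5} window per port: room for {count} root ports")
```

The point of the sketch is only the order of magnitude: small windows leave room for hundreds of ports, while gigabyte-sized windows use up the default pool after a few dozen.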
So, yes, going with moderately larger 64-bit bridge windows like 32M or 64M shouldn't be much of a problem; address space shortage shouldn't be an issue unless we talk about several hundred pcie root ports. Going with very large bridge windows (so you can hot-plug GPUs which can have gigabyte-sized memory bars these days) would quickly exhaust address space though.

<rant>
The whole issue sort of circles back to the physical address space problem we have been discussing on and off for years. The guest firmware still can't reliably figure out the physical address space size, so edk2 is conservative and tries to avoid using more than 64G (aka phys-bits=36) to be on the safe side.
</rant>

Ideally edk2 would look at the physical address space available and pick better defaults based on that and (for example) use a memory window larger than 32G if possible, and also larger pcie root port windows.

> <rant>
> The whole issue sort of circles back to the physical address
> space problem we have been discussing on and off for years. The guest
> firmware still can't reliably figure out the physical address space
> size, so edk2 is conservative and tries to avoid using more
> than 64G (aka phys-bits=36) to be on the safe side.
> </rant>

This is bug 2084533 now.

Hi Julia,

May I ask if there is any chance that we can fix this issue on current 9.1 ?

If so, could you please help set the ITR ?

(In reply to Yanghang Liu from comment #20)
> Hi Julia,
>
> May I ask if there is any chance that we can fix this issue on current 9.1 ?
>
> If so, could you please help set the ITR ?

Hi Yanghan,

Did you check with the changes mentioned in: https://bugzilla.redhat.com/show_bug.cgi?id=2055123#c4

Hi Yan,

This issue can still be reproduced in the edk2-ovmf-20230301gitf80f052277c8-2.el9.noarch.
The main check point:
[1] start a Q35 + OVMF Win2022 domain

[2] hot-plug a XL710 PF into the Win2022 domain

# # virsh attach-device win2022 /tmp/device/0000\:87\:00.0.xml
Device attached successfully

[3] check the PF status in the Win2022 domain
The Device Manager shows "The device cannot find enough free resources that it can use(code 12)"

(In reply to Yanghang Liu from comment #25)
> Hi Yan,
>
> This issue can still be reproduced in the
> edk2-ovmf-20230301gitf80f052277c8-2.el9.noarch.
>
>
> The main check point:
> [1] start a Q35 + OVMF Win2022 domain
>
> [2] hot-plug a XL710 PF into the Win2022 domain
>
> # # virsh attach-device win2022 /tmp/device/0000\:87\:00.0.xml
> Device attached successfully
>
> [3] check the PF status in the Win2022 domain
> The Device Manager shows "The device cannot find enough free resources that
> it can use(code 12)"

So this happens only with one device, right?

(In reply to Yvugenfi from comment #26)
> (In reply to Yanghang Liu from comment #25)
> > Hi Yan,
> >
> > This issue can still be reproduced in the
> > edk2-ovmf-20230301gitf80f052277c8-2.el9.noarch.
> >
> >
> > The main check point:
> > [1] start a Q35 + OVMF Win2022 domain
> >
> > [2] hot-plug a XL710 PF into the Win2022 domain
> >
> > # # virsh attach-device win2022 /tmp/device/0000\:87\:00.0.xml
> > Device attached successfully
> >
> > [3] check the PF status in the Win2022 domain
> > The Device Manager shows "The device cannot find enough free resources that
> > it can use(code 12)"
>
> So this happens only with one device, right?

Yep.

Windows usually does resource reallocation without any issues (given there is an unused portion somewhere to reclaim).

So I've tried to reproduce it with XL710, and the results vary depending on which root port the device ends up plugged into.

SeaBIOS:
hotplug works fine (it enables bridge windows for every root port).

OVMF:
1. if the firmware has enabled windows on the root port (even if the programmed window is too small) then Windows will reassign resources during hotplug just fine.
2. if bridge windows aren't enabled, then Windows will not enable them either -> not really a usable bridge. And hotplug to such a root port will fail.

Here is how root ports look before the firmware jumps to the OS bootloader:

* non working root port:
  Bus 0, device 2, function 3:
    PCI bridge: PCI device 1b36:000c
      IRQ 11, pin A
      BUS 0.
      secondary bus 4.
      subordinate bus 4.
      IO range [0xf000, 0x0fff]
      memory range [0xfff00000, 0x000fffff]
      prefetchable memory range [0xfffffffffff00000, 0x000fffff]
      BAR0: 32 bit memory at 0xc224b000 [0xc224bfff].
      id "pci.4"

* usable root port
  Bus 0, device 2, function 6:
    PCI bridge: PCI device 1b36:000c
      IRQ 11, pin A
      BUS 0.
      secondary bus 7.
      subordinate bus 7.
      IO range [0xd000, 0xdfff]
      memory range [0xc1e00000, 0xc1ffffff]
      prefetchable memory range [0x380000000000, 0x3807ffffffff]
      BAR0: 32 bit memory at 0xc2248000 [0xc2248fff].
      id "pci.7"

CCing firmware folks for an opinion on where it goes wrong
(
qemu-kvm-7.2.0-11.el9_2.x86_64
edk2-ovmf-20221207gitfff6d81270b5-7.el9.noarch
)

(In reply to Igor Mammedov from comment #28)
> * non working root port:
> Bus 0, device 2, function 3:
> PCI bridge: PCI device 1b36:000c
> IRQ 11, pin A
> BUS 0.
> secondary bus 4.
> subordinate bus 4.
> IO range [0xf000, 0x0fff]
> memory range [0xfff00000, 0x000fffff]
> prefetchable memory range [0xfffffffffff00000, 0x000fffff]

These ranges look busted. The upper boundaries are actually less than the lower boundaries. Maybe those values are leftovers from probing. Not sure.

The firmware log should be helpful, please attach it to the BZ -- it contains much information on PCI resource assignment.

(In reply to Laszlo Ersek from comment #29)
> (In reply to Igor Mammedov from comment #28)
>
> > * non working root port:
> > Bus 0, device 2, function 3:
> > PCI bridge: PCI device 1b36:000c
> > IRQ 11, pin A
> > BUS 0.
> > secondary bus 4.
> > subordinate bus 4.
> > IO range [0xf000, 0x0fff]
> > memory range [0xfff00000, 0x000fffff]
> > prefetchable memory range [0xfffffffffff00000, 0x000fffff]
>
> These ranges look busted. The upper boundaries are actually less than the
> lower boundaries. Maybe those values are leftovers from probing. Not sure.

I think this is an indication that the ranges haven't been programmed (I might be wrong though)

>
> The firmware log should be helpful, please attach it to the BZ -- it
> contains much information on PCI resource assignment.

Can you point me to 'how to' do that?

Yes, of course; sorry for not describing it at once.

* If you have libvirt 8.1+, then:

<domain type='kvm'>
  <devices>
    <serial type='file'>
      <target type='isa-debug'/>
      <address type='isa' iobase='0x402'/>
      <source path='/tmp/DOMAIN-ovmf.log'/>
    </serial>
  </devices>
</domain>

(update "DOMAIN" in the above snippet as necessary)

* If you have libvirt <= 8.0, then:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <qemu:commandline>
    <qemu:arg value='-chardev'/>
    <qemu:arg value='file,id=debugfile,path=/tmp/DOMAIN-ovmf.log'/>
    <qemu:arg value='-device'/>
    <qemu:arg value='isa-debugcon,iobase=0x402,chardev=debugfile'/>
  </qemu:commandline>
</domain>

(Again, update "DOMAIN" as necessary, *plus* don't forget to add the xmlns:qemu attribute (namespace definition) to the root <domain> element, as shown above! Otherwise the "qemu:" namespace prefix in the <qemu:arg> elements will not work!)

* Using the QEMU command line:

  -chardev file,id=debugfile,path=/tmp/DOMAIN-ovmf.log \
  -device isa-debugcon,iobase=0x402,chardev=debugfile \

Created attachment 1965639 [details]
fw log with pci.1 bridge not being initialized properly
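The "limit below base" pattern discussed above can be checked mechanically. The following Python sketch scans text in the style of the QEMU monitor dumps quoted in this bug and reports bridge windows whose upper boundary is lower than the lower boundary. The function name and the regular expression are illustrative assumptions, and the interpretation (such a window was never programmed by the firmware) is the one suggested in the comments above, not something the script can prove.

```python
import re

# Matches lines such as "IO range [0xf000, 0x0fff]" in the monitor output
# quoted in this bug. "prefetchable memory" is listed first so that it is not
# mistaken for a plain "memory" window.
RANGE_RE = re.compile(
    r"(prefetchable memory|memory|IO) range "
    r"\[0x([0-9a-fA-F]+), 0x([0-9a-fA-F]+)\]"
)


def unprogrammed_windows(info_pci_text):
    """Return the kinds of bridge windows whose limit is below their base."""
    bad = []
    for kind, base, limit in RANGE_RE.findall(info_pci_text):
        if int(limit, 16) < int(base, 16):
            bad.append(kind)
    return bad


NON_WORKING = """
IO range [0xf000, 0x0fff]
memory range [0xfff00000, 0x000fffff]
prefetchable memory range [0xfffffffffff00000, 0x000fffff]
BAR0: 32 bit memory at 0xc224b000 [0xc224bfff].
"""

USABLE = """
IO range [0xd000, 0xdfff]
memory range [0xc1e00000, 0xc1ffffff]
prefetchable memory range [0x380000000000, 0x3807ffffffff]
BAR0: 32 bit memory at 0xc2248000 [0xc2248fff].
"""

if __name__ == "__main__":
    print("non working port:", unprogrammed_windows(NON_WORKING))
    print("usable port:     ", unprogrammed_windows(USABLE))
```

Running it against the two root ports from comment 28 flags all three windows of the non-working port and none of the usable one.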
No need for a host with a fancy NIC and PCI passthrough.
It reproduces with upstream on my RHEL8.9 host; the minimal reproducer is:

./qemu-system-x86_64 \
 -monitor stdio \
 -drive if=pflash,format=raw,unit=0,readonly=on,file=./pc-bios/edk2-x86_64-code.fd \
 -machine q35 \
 -accel kvm \
 -cpu host \
 -m 4096 \
 -nodefaults \
 -device pcie-root-port,port=16,chassis=1,id=pci.1,bus=pcie.0,multifunction=true,addr=0x2 \
 -device pcie-root-port,port=17,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 \
 -device pcie-root-port,port=18,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 \
 -device pcie-root-port,port=19,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x3 \
 -device pcie-root-port,port=20,chassis=5,id=pci.5,bus=pcie.0,addr=0x2.0x4 \
 -device pcie-root-port,port=21,chassis=6,id=pci.6,bus=pcie.0,addr=0x2.0x5 \
 -device pcie-root-port,port=22,chassis=7,id=pci.7,bus=pcie.0,addr=0x2.0x6 \
 -device pcie-root-port,port=23,chassis=8,id=pci.8,bus=pcie.0,addr=0x2.0x7 \
 -device pcie-root-port,port=24,chassis=9,id=pci.9,bus=pcie.0,multifunction=true,addr=0x3 \
 -device pcie-root-port,port=25,chassis=10,id=pci.10,bus=pcie.0,addr=0x3.0x1 \
 -device pcie-root-port,port=26,chassis=11,id=pci.11,bus=pcie.0,addr=0x3.0x2 \
 -chardev file,id=debugfile,path=/tmp/DOMAIN-ovmf.log \
 -device isa-debugcon,iobase=0x402,chardev=debugfile

---
  Bus 0, device 2, function 0:
    PCI bridge: PCI device 1b36:000c
      IRQ 11, pin A
      BUS 0.
      secondary bus 1.
      subordinate bus 1.
      IO range [0xf000, 0x0fff]
      memory range [0xfff00000, 0x000fffff]
      prefetchable memory range [0xfffffffffff00000, 0x000fffff]
      BAR0: 32 bit memory at 0xc140b000 [0xc140bfff].
      id "pci.1"

OVMF log is attached.

bonus points:
1. Booted with TCG/no -cpu, same as above (modulo addresses for initialized bridges being different (32bit))
2. funnily, if one keeps KVM but drops '-cpu host' altogether, none of the root-ports are initialized

Your "bonus points" #2 is quite telling here, so I'm going to hazard a guess even before I look at the OVMF log:

The CPU model may play a part here because OVMF now fetches the "physical address width" from CPUID, and sizes the 64-bit MMIO aperture accordingly.

Again this is just a guess, for now, I'll attempt to look at the log later, if Gerd doesn't beat me to it.

Based on the log, you are running out of IO Port space.

(1) The port is correctly discovered:

> PciBus: Discovered PPB @ [00|02|00] [VID = 0x1B36, DID = 0xC]
>    Padding: Type = PMem64; Alignment = 0x7FFFFFFFF; Length = 0x800000000
>    Padding: Type = Mem32; Alignment = 0x1FFFFF; Length = 0x200000
>    Padding: Type = Io; Alignment = 0x1FF; Length = 0x200
>    BAR[0]: Type = Mem32; Alignment = 0xFFF; Length = 0x1000; Offset = 0x10

(2) The 11 ports (bridges) altogether attempt to get 11 * 4KB IO Port space:

> PciHostBridge: SubmitResources for PciRoot(0x0)
>  I/O: Granularity/SpecificFlag = 0 / 01
>       Length/Alignment = 0xB000 / 0xFFF

(3) That can't work; earlier in the log, we record the aperture available on Q35 -- note the "Io" entry:

> PciHostBridgeUtilityInitRootBridge: populated root bus 0, with room for 255 subordinate bus(es)
> RootBridge: PciRoot(0x0)
>   Support/Attr: 70069 / 70069
>     DmaAbove4G: No
> NoExtConfSpace: No
>      AllocAttr: 3 (CombineMemPMem Mem64Decode)
>            Bus: 0 - FF Translation=0
>             Io: 6000 - FFFF Translation=0
>            Mem: C0000000 - FBFFFFFF Translation=0
>     MemAbove4G: 380000000000 - 3FFFFFFFFFFF Translation=0
>           PMem: FFFFFFFFFFFFFFFF - 0 Translation=0
>    PMemAbove4G: FFFFFFFFFFFFFFFF - 0 Translation=0

That's 10 * 4KB.

(4) Error message(s):

>  I/O: Base/Length/Alignment = FFFFFFFFFFFFFFFF/B000/FFF - Out Of Resource!
> ...
> PciHostBridge: Resource conflict happens!
> ...
> PciBus: HostBridge->NotifyPhase(AllocateResources) - Out of Resources
> PciBus: [00|02|00] was rejected due to resource confliction.
(Note especially the last line quoted.)

(5) Regarding the non-rejected ports, the IO space is handed out as follows:

> PciBus: Resource Map for Root Bridge PciRoot(0x0)
> Type = Io16; Base = 0x6000; Length = 0xA000; Alignment = 0xFFF
>    Base = 0x6000; Length = 0x200; Alignment = 0xFFF; Owner = PPB [00|03|02:**]
>    Base = 0x7000; Length = 0x200; Alignment = 0xFFF; Owner = PPB [00|03|01:**]
>    Base = 0x8000; Length = 0x200; Alignment = 0xFFF; Owner = PPB [00|03|00:**]
>    Base = 0x9000; Length = 0x200; Alignment = 0xFFF; Owner = PPB [00|02|07:**]
>    Base = 0xA000; Length = 0x200; Alignment = 0xFFF; Owner = PPB [00|02|06:**]
>    Base = 0xB000; Length = 0x200; Alignment = 0xFFF; Owner = PPB [00|02|05:**]
>    Base = 0xC000; Length = 0x200; Alignment = 0xFFF; Owner = PPB [00|02|04:**]
>    Base = 0xD000; Length = 0x200; Alignment = 0xFFF; Owner = PPB [00|02|03:**]
>    Base = 0xE000; Length = 0x200; Alignment = 0xFFF; Owner = PPB [00|02|02:**]
>    Base = 0xF000; Length = 0x200; Alignment = 0xFFF; Owner = PPB [00|02|01:**]

So, in this particular case, you run out of IO Port Space, and that prevents one of the root ports from being set up.

Whereas under "bonus point #2" in comment#33, you likely ran out of MMIO with regard to *all* ports.

For remedying the IO Port space problem, configure all the ports such that they ask for no IO at all (all PCIe devices are required to work without IO BARs). The QEMU option for this is:

  -global pcie-root-port.io-reserve=0

Regarding the libvirt domain XML, I'm unaware of any element or attribute that can do this. That feature was the topic of bug#1408810, but it was closed as WONTFIX ultimately. So, in a domain XML, you can only insert, for now (see comment#31):

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <qemu:commandline>
    <qemu:arg value='-global'/>
    <qemu:arg value='pcie-root-port.io-reserve=0'/>
  </qemu:commandline>
</domain>

In summary, there are multiple *kinds* of resources that we can run out of.
- We can run out of IO if too many root ports are configured on the QEMU cmdline, or in the domain XML, and IO reservation is not disabled on those root ports.

- We can (independently) run out of 64-bit MMIO, if the 64-bit MMIO aperture established by OVMF is too small, relative to how large BARs, and how many BARs, the individual endpoints request cumulatively -- and this can happen if the VCPU model reports a too small physical address width to OVMF, via CPUID.

- We can (independently) run out of 32-bit MMIO as well. The 32-bit MMIO aperture is fixed in OVMF (namely fixed to board-dependent constants: i440fx differs from Q35; I forget the exact details, and I think Gerd has rearranged the low memmap anyway recently). There's no way to increase this aperture, and there's also no way to expect devices to work without their 32-bit MMIO BARs. So this is a hard limit, but in practice, I've never seen an issue with this. Devices usually require pretty small and few 32-bit BARs.

(Side reminder: non-prefetchable MMIO is 32-bit only, while prefetchable MMIO may be *either* 32-bit *or* 64-bit, but not both. Furthermore, OVMF does not maintain separate prefetchable and non-prefetchable apertures; it combines them. OVMF has one 32-bit aperture that can accommodate the 32-bit-only non-pref MMIO, plus the pref MMIO *if* the pref MMIO is 32-bit too; and OVMF has one 64-bit aperture that can accommodate the 64-bit pref MMIO *if* the pref MMIO is 64-bit.)

Either way, I'm not sure that Igor's latest reproducer, from comment#33, actually reproduces the originally reported issue (comment#0). I recommend for the original symptom to be reproduced, and then please upload the firmware log *that* belongs to that failed test.

(In reply to Laszlo Ersek from comment #35)

> Either way, I'm not sure that Igor's latest reproducer, from comment#33,
> actually reproduces the originally reported issue (comment#0). I
> recommend for the original symptom to be reproduced, and then please
> upload the firmware log *that* belongs to that failed test.

Sorry, I'm in a rush, and misplaced the emphasis in the last phrase. It should go like this:

I recommend for the original symptom to be reproduced, and then please upload the firmware log that belongs to *THAT* failed test.

Ideally:

* Disable IO Port space reservation on all the PCIe root ports (libvirt doesn't directly support this at the moment, so you'll have to hack in the <qemu:arg> elements; see above).

  -global pcie-root-port.io-reserve=0

* Similarly, on all PCIe root ports, specify a large 64-bit MMIO reservation, such that any one of these ports can accept the device that (a) you intend to hotplug later and that (b) has the largest *cumulative* 64-bit MMIO BAR demands. For example, reserve 256GB for each port:

  -global pcie-root-port.pref64-reserve=0x4000000000

- set up the VCPU model (including using KVM) such that it exposes a *large* phys address width, such as 46 bits, or more. I'm unsure about the logic in OVMF that derives the 64-bit aperture size from the phys address width, but from the log in comment 32, phys addr width = 46 bits results in 8192 GB 64-bit MMIO space, so you should be able to fit like 32 ports / devices (256GB reserved) in that.

Again, unfortunately, there are many different limits here, and unless you consult the firmware log every time, you won't really know which one you hit.

(In reply to Laszlo Ersek from comment #36)
> (In reply to Laszlo Ersek from comment #35)
>
> > Either way, I'm not sure that Igor's latest reproducer, from comment#33,
> > actually reproduces the originally reported issue (comment#0). I
> > recommend for the original symptom to be reproduced, and then please
> > upload the firmware log *that* belongs to that failed test.
>
> Sorry, I'm in a rush, and misplaced the emphasis in the last phrase. It
> should go like this:
>
> I recommend for the original symptom to be reproduced, and then please
> upload the firmware log that belongs to *THAT* failed test.

Hotplug fails *because* the plug happens into an uninitialized port; plugging into a properly initialized port worked fine. So I don't think it's a PCI pass-through/NIC related issue. As long as the port is initialized (even with small bridge windows), hotplug works since Windows dynamically re-assigns unused resources.

doesn't help: -global pcie-root-port.pref64-reserve=0x4000000000

What helps is "-global pcie-root-port.io-reserve=0",
so question is:
1. how many root-ports are too much
2. whether virt-install/virt-manager should create that many when using UEFI
3. should OVMF abandon port if there is not enough IO
   (PCIE devices behind root-port should work even without IO resources)
4. why IO resources are exhausted on such small trivial configuration.

Anyways, I'll re-provision the Windows VM and get a log with the config reported in comment 0

(In reply to Laszlo Ersek from comment #37)
[...]
>
> - set up the VCPU model (including using KVM) such that it exposes a *large*
> phys address width, such as 46 bits, or more. I'm unsure about the logic in
> OVMF that derives the 64-bit aperture size from the phys address width, but
> from the log in comment 32, phys addr width = 46 bits results in 8192 GB
> 64-bit MMIO space, so you should be able to fit like 32 ports / devices
> (256GB reserved) in that.

not likely; this even happens with the rhel9.2 ovmf build, which should not have the recent address space improvements. All allocated bridge mem[64] windows are relatively small and should fit fine.

PS:
edk2-ovmf-20221207gitfff6d81270b5-7.el9.noarch

Created attachment 1965696 [details]
OVMF log with vfio_hotplug_config
Created attachment 1965697 [details]
domain config vfio_hotplug_config.xml
Used with guest WS2022 Data center ed.
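Firmware logs like the ones attached here can be scanned for the two messages that comment 35 singled out: the "Out Of Resource!" line and the "was rejected due to resource confliction" line. The Python sketch below does that. The function name, the return shape and the B/D/F formatting are illustrative assumptions, and only the two message formats quoted in this bug are recognized.

```python
import re

# "PciBus: [00|02|00] was rejected due to resource confliction."
# The three bracketed fields are bus, device and function in hex.
REJECTED_RE = re.compile(
    r"PciBus: \[([0-9A-Fa-f]{2})\|([0-9A-Fa-f]{2})\|([0-9A-Fa-f]{2})\] was rejected"
)


def scan_ovmf_log(text):
    """Return (rejected_devices, exhausted_resource_types) found in the log.

    rejected_devices is a list of "bus:dev.func" strings; exhausted types are
    taken from the label in front of an "Out Of Resource!" line, e.g. "I/O".
    """
    rejected, exhausted = [], []
    for line in text.splitlines():
        match = REJECTED_RE.search(line)
        if match:
            bus, dev, func = match.groups()
            rejected.append(f"{bus}:{dev}.{int(func, 16):x}")
        elif "Out Of Resource!" in line:
            exhausted.append(line.split(":", 1)[0].strip())
    return rejected, exhausted


SAMPLE_LOG = """\
 I/O: Base/Length/Alignment = FFFFFFFFFFFFFFFF/B000/FFF - Out Of Resource!
PciHostBridge: Resource conflict happens!
PciBus: HostBridge->NotifyPhase(AllocateResources) - Out of Resources
PciBus: [00|02|00] was rejected due to resource confliction.
"""

if __name__ == "__main__":
    rejected, exhausted = scan_ovmf_log(SAMPLE_LOG)
    print("rejected devices:   ", rejected)
    print("exhausted resources:", exhausted)
```

For a real run, the text would come from the debug console file (for example the /tmp/DOMAIN-ovmf.log path used earlier in this bug) instead of the embedded sample.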
Hotplugged the host's XL710 NIC with the following snippet, which should put the NIC on an uninitialized root port (pci.4):

<hostdev mode="subsystem" type="pci" managed="yes">
  <driver name="vfio"/>
  <source>
    <address domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
  </source>
  <alias name="hostdev0"/>
  <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
</hostdev>

# virsh attach-device win2k22 xl710.xml

result from QEMU side:

  Bus 0, device 2, function 3:
    PCI bridge: PCI device 1b36:000c
      IRQ 11, pin A
      BUS 0.
      secondary bus 4.
      subordinate bus 4.
      IO range [0xf000, 0x0fff]
      memory range [0xfff00000, 0x000fffff]
      prefetchable memory range [0xfff00000, 0x000fffff]
      BAR0: 32 bit memory at 0xc224b000 [0xc224bfff].
      id "pci.4"
  Bus 4, device 0, function 0:
    Ethernet controller: PCI device 8086:1583
      PCI subsystem 8086:0006
      BAR0: 64 bit prefetchable memory at 0xffffffffffffffff [0x00fffffe].
      BAR3: 64 bit prefetchable memory at 0xffffffffffffffff [0x00007ffe].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0007fffe].
      id "hostdev0"

from guest side, error: "not enough free resources"

As for edk2-ovmf-20221207gitfff6d81270b5-7.el9.noarch peeking at hostbits to decide on the size of the reserved prefetchable window, it actually might be happening (see pref address range in comment 28, which is ~32G).

Another experiment to confirm that resource reassignment in Windows works:

usable port after boot (with forced 2Mb prefetch window):

  Bus 0, device 2, function 6:
    PCI bridge: PCI device 1b36:000c
      IRQ 11, pin A
      BUS 0.
      secondary bus 7.
      subordinate bus 7.
      IO range [0xd000, 0xdfff]
      memory range [0xc1e00000, 0xc1ffffff]
      prefetchable memory range [0x380000000000, 0x3800001fffff]
      BAR0: 32 bit memory at 0xc2248000 [0xc2248fff].
      id "pci.7"

after hotplug of the NIC (with a larger BAR), the window is resized and the hotplug is successful:

  Bus 0, device 2, function 6:
    PCI bridge: PCI device 1b36:000c
      IRQ 11, pin A
      BUS 0.
      secondary bus 7.
      subordinate bus 7.
      IO range [0xd000, 0xdfff]
      memory range [0xc1e00000, 0xc1ffffff]
      prefetchable memory range [0x3807fe000000, 0x3807ff0fffff]
      BAR0: 32 bit memory at 0xc2248000 [0xc2248fff].
      id "pci.7"
  Bus 7, device 0, function 0:
    Ethernet controller: PCI device 8086:1583
      PCI subsystem 8086:0006
      BAR0: 64 bit prefetchable memory at 0x3807fe000000 [0x3807feffffff].
      BAR3: 64 bit prefetchable memory at 0x3807ff0f8000 [0x3807ff0fffff].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0007fffe].
      id "hostdev0"

(In reply to Igor Mammedov from comment #38)

> doesn't help: -global pcie-root-port.pref64-reserve=0x4000000000

Meanwhile it has occurred to me that Gerd had made even this logic dynamic in OVMF -- IIRC not only is the aperture sized dynamically now (based on VCPU phys address width), but the 64-bit MMIO reservation for root ports, too.

> What helps is "-global pcie-root-port.io-reserve=0",

OK!

> so question is:
> 1. how many root-ports are too much

10 should work, 11 are too much.

> 2. whether virt-install/virt-manager should create that many when using UEFI

Not sure. The number 10 is arbitrary, it's the largest contiguous port space (0x6000 and upwards) that we had managed to carve out, given the Q35 board's fixed IO ports.

> 3. should OVMF abandon port if there is not enough IO

This is already what's happening. It's performed by the generic PCI bus driver in edk2 that OVMF uses. When the driver runs out of resources, it attempts to identify the most resource-hungry device(s), and to abandon those, thereby allowing multiple, less-hungry, devices to be configured.

> (PCIE devices behind root-port should work even without IO resources)

Correct, but it's not that simple, of course :)

OVMF uses edk2's generic PciBusDxe driver. OVMF has some platform code that can "guide" the generic PCI bus driver.

By default, the PCI bus driver allocates 4KB IO for a bridge. Be it a traditional PCI-to-PCI bridge, or a PCI Express root port.
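The "10 should work, 11 are too much" arithmetic can be written down as a small Python sketch. It assumes the Q35 IO aperture from the firmware log quoted in comment 35 (6000 - FFFF) and one 4KB block per bridge; the function name and its parameters are illustrative assumptions, and the model ignores IO BARs of devices that sit directly on the root bus.

```python
# Simplified IO port budget for root ports on Q35 under OVMF.
# Values are taken from the log lines quoted in this bug; everything else
# (names, the all-ports-identical model) is an assumption for illustration.
IO_APERTURE_BASE = 0x6000
IO_APERTURE_END = 0xFFFF   # inclusive, from "Io: 6000 - FFFF"
IO_PER_BRIDGE = 0x1000     # 4KB IO block per bridge by default


def io_budget(num_ports, io_per_port=IO_PER_BRIDGE):
    """Return (needed, available, fits) for num_ports identical root ports."""
    available = IO_APERTURE_END - IO_APERTURE_BASE + 1
    needed = num_ports * io_per_port
    return needed, available, needed <= available


if __name__ == "__main__":
    for ports, per_port in [(10, IO_PER_BRIDGE), (11, IO_PER_BRIDGE), (11, 0)]:
        needed, available, fits = io_budget(ports, per_port)
        print(f"{ports} ports x {per_port:#x}: need {needed:#x}, "
              f"have {available:#x}, fits={fits}")
```

The last case models io-reserve=0 on every root port: no IO is requested at all, so this particular limit can no longer be hit.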
(NB, if you try to configure a bunch of (conventional) PCI bridges on i440fx, you'll get the same problem -- in fact you'll hit it much earlier, because on the i440fx board, the largest contiguous IO port space (aperture) is much smaller than on Q35, so you'll see problems after 4-5 bridges or so. Of course, you don't *need* that many bridges on i440fx, so in practice people never see that issue.)

OVMF can convince the edk2 PCI bus driver not to reserve IO port space (and this works for individual bridges / root ports), but it doesn't do that *by default*. The defaults were chosen several years ago. See <https://github.com/tianocore/edk2/commit/fe4049471bdf>.

The generic PCI bus driver will *not* do the following: attempt to allocate 4KB IO for a root port (by default behavior), fail to allocate the IO, but still configure the port (*without* IO), and then expect devices behind the root port to work without IO. This is not going to happen, the generic PCI bus driver in edk2 just doesn't work this way. If any resource demand for any bridge cannot be satisfied, then the driver will enter this state where it will try to kick out (leave unconfigured) the most resource-hungry devices / bridges. The bridge triggering the "out of resources" condition may not be the bridge that gets left behind. (A bit similarly to how the OOM killer in Linux works -- it's not necessarily the tiny process that runs out of memory that gets reaped.)

The following does work: the user sets the IO reservation hint on the root port to zero, OVMF finds that in a PCI capability of the root port, and steers the generic PCI bus driver accordingly. Then the PCI bus driver doesn't even *attempt* to allocate IO for the root port, so it can never run out of IO space -- for that particular bridge anyways. If *none* of the bridges / endpoints run out of resources, then nothing will be abandoned, and then we can expect all devices behind those ports to work, even not having IO.
So you can control the IO reservation on a port by port basis, but in general, setting the property to zero, with a -global option, is the most sensible idea.

> 4. why IO resources are exhausted on such small trivial configuration.

It's not trivial. 11 root ports with each of them requiring IO (and the generic PCI bus driver insisting on a 4KB alignment for each IO block) is not trivial.

(In reply to Igor Mammedov from comment #40)
> Created attachment 1965696 [details]
> OVMF log with vfio_hotplug_config

Same issue, just worse. 14 root ports requiring 14 * 4KB IO space. From those, 00:02.1 through 00:02.5 are rejected (5 ports), while 00:02.0, 00:02.6, 00:02.7, plus 00:03.0 through 00:03.5 (9 ports) are allocated fine. The 9 * 4KB space [0x6000 .. 0xEFFF] gets assigned to those root ports, and the 1 * 4KB space at [0xF000 .. 0xFFFF] belongs to the root complex / root bridge (that's where devices integrated in the root complex get their IO BARs assigned from).

(In reply to Igor Mammedov from comment #42)
> Hotplugged host's XL710 nic:
> with following snippet which should put NIC on uninitialized root-port
> (pci.4)
>
> <hostdev mode="subsystem" type="pci" managed="yes">
> <driver name="vfio"/>
> <source>
> <address domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>

(this is the host-side B/D/F, IIUC)

> </source>
> <alias name="hostdev0"/>
> <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>

(this is the guest-side B/D/F)

So bus="0x04" here corresponds to index='4' from comment#41:

<controller type='pci' index='4' model='pcie-root-port'>
  <model name='pcie-root-port'/>
  <target chassis='4' port='0x13'/>
  <alias name='pci.4'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
</controller>

The PCI B/D/F for that root port is 00:02.3. Indeed, that root port is rejected by the firmware, because there isn't enough IO available.

> </hostdev>
>
> # virsh attach-device win2k22 xl710.xml
>
> result from QEMU side:
>
> Bus 0, device 2, function 3:
> PCI bridge: PCI device 1b36:000c
> IRQ 11, pin A
> BUS 0.
> secondary bus 4.
> subordinate bus 4.
> IO range [0xf000, 0x0fff]
> memory range [0xfff00000, 0x000fffff]
> prefetchable memory range [0xfff00000, 0x000fffff]
> BAR0: 32 bit memory at 0xc224b000 [0xc224bfff].
> id "pci.4"
> Bus 4, device 0, function 0:
> Ethernet controller: PCI device 8086:1583
> PCI subsystem 8086:0006
> BAR0: 64 bit prefetchable memory at 0xffffffffffffffff [0x00fffffe].
> BAR3: 64 bit prefetchable memory at 0xffffffffffffffff [0x00007ffe].
> BAR6: 32 bit memory at 0xffffffffffffffff [0x0007fffe].
> id "hostdev0"
>
> from guest side, error: "not enough free resources"

Right, all that looks consistent.

> As for edk2-ovmf-20221207gitfff6d81270b5-7.el9.noarch peeking at hostbits to
> decide on size of reserved prefetchable window
> it actually might be happening (see pref address range in comment 28, which
> is ~32G).

Right: "prefetchable memory range [0x380000000000, 0x3807ffffffff]" -- this is confirmed by the comment#32 log too:

> PlatformDynamicMmioWindow: Pci64 Base 0x380000000000
> PlatformDynamicMmioWindow: Pci64 Size 0x80000000000

... Anyway I think all of this stuff should just work now if you set the IO reservation for all root ports to zero:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <qemu:commandline>
    <qemu:arg value='-global'/>
    <qemu:arg value='pcie-root-port.io-reserve=0'/>
  </qemu:commandline>
</domain>

... Gerd may also decide, going forward, that the firmware should not reserve IO *by default* for PCIe root ports. I'm not sure, perhaps that's the practical expectation nowadays.

Just tested with SeaBIOS, and it does a better job of distributing IO resources. All 30 ports come out properly initialized.

Anyways, it's not a QEMU issue; reassigning it to OVMF for deciding where/whether/how it should be fixed.

> ... Gerd may also decide, going forward, that the firmware should not
> reserve IO *by default* for PCIe root ports. I'm not sure, perhaps that's
> the practical expectation nowadays.

Already came to that conclusion too ;)

See https://bugzilla.redhat.com/show_bug.cgi?id=2174749#c18

Has links to patch + scratch build.
There is a fair chance that the scratch build fixes this bug too.

(In reply to Gerd Hoffmann from comment #48)
> > ... Gerd may also decide, going forward, that the firmware should not
> > reserve IO *by default* for PCIe root ports. I'm not sure, perhaps that's
> > the practical expectation nowadays.
>
> Already came to that conclusion too ;)
>
> See https://bugzilla.redhat.com/show_bug.cgi?id=2174749#c18
>
> Has links to patch + scratch build.
> There is a fair chance that the scratch build fixes this bug too.

I have tried the scratch build,
looks like it should help
(all empty root-ports have IO disabled and memory ranges initialized)

> > See https://bugzilla.redhat.com/show_bug.cgi?id=2174749#c18
> >
> > Has links to patch + scratch build.

> I have tried the scratch build,
> looks like it should help
> (all empty root-ports have IO disabled and memory ranges initialized)

Which is expected behaviour. Good.

(In reply to Gerd Hoffmann from comment #50)
> > > See https://bugzilla.redhat.com/show_bug.cgi?id=2174749#c18
> > >
> > > Has links to patch + scratch build.
>
> > I have tried the scratch build,
> > looks like it should help
> > (all empty root-ports have IO disabled and memory ranges initialized)
>
> Which is expected behaviour. Good.

So the fix is the same one we need for 2203094.

Keeping the bugs separate because the test scenarios are quite different.
So adding a dependency and marking as testonly instead of closing as dup.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days