Bug 1967494
Summary: | kernel BUG at mm/ioremap.c:76 for a guest exposed with pcie expander bridge/root port | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 9 | Reporter: | Eric Auger <eric.auger> |
Component: | kernel | Assignee: | Virtualization Maintenance <virt-maint> |
kernel sub component: | KVM | QA Contact: | Virtualization Bugs <virt-bugs> |
Status: | CLOSED NOTABUG | Docs Contact: | |
Severity: | low | ||
Priority: | low | CC: | abologna, drjones, gshan, lersek, mstowe, rvr, virt-maint |
Version: | 9.0 | ||
Target Milestone: | beta | ||
Target Release: | --- | ||
Hardware: | aarch64 | ||
OS: | Linux | ||
Last Closed: | 2021-06-24 08:06:41 UTC | Type: | Bug |
Description
Eric Auger
2021-06-03 08:49:18 UTC
Created attachment 1793082
Traces featuring upstream EDK2 in DEBUG mode and linux BUG_ON()
To me it looks like an EDK2 issue:

    PciBus: Resource Map for Root Bridge PciRoot(0x0)    -> GPEX
    Type = Io16; Base = 0x0; Length = 0x3000; Alignment = 0xFFF

    PciBus: Resource Map for Root Bridge PciRoot(0x4)    -> PCIe expander bridge
    Type = Io16; Base = 0x3000; Length = 0x1000; Alignment = 0xFFF

The second base (0x3000) is not aligned with the guest 64kB page.

Then, on the PCIe expander bridge Io16 remap, we have:

    pci_remap_iospace vaddr=0xfffffffefe800000, size=0x1000, phys_addr=0x3eff3000

which hits a BUG_ON in vmap_range because, I think, the page already has an entry that was created when ioremapping the GPEX Io16.

Do you share my understanding? Should the alignment be 0xFFFF instead?

UEFI (per spec) only deals with a single (last level) page size, and that's 4KB.

Short version: please set the following option on the QEMU command line, and retry:

    -global pcie-root-port.io-reserve=0

Long version: here's an excerpt from my email that I sent to Eric basically in parallel to the above needinfo being set on me.

I agree with your analysis that it's an alignment issue. The PCI bus driver in the firmware, "MdeModulePkg/Bus/Pci/PciBusDxe", assigns bridge IO port resources with a 4KB alignment (see "BridgeIoAlignment"). Furthermore, on aarch64/virt, the IO port aperture is simulated through a special MMIO aperture. The guest kernel, however, maps in units of 64KiB and is therefore unable to map the 4KB MMIO areas in question separately.

Now, the guest firmware actually lets you dictate the "padding size" for resource reservation. This was enabled for the above BZ in commit e843a21e23ea ("ArmVirtPkg/ArmVirtQemu: Add support for HotPlug", 2021-01-20). The syntax is

    -device pcie-root-port,[properties],io-reserve=...

While this is primarily for hotplug purposes, I think you could theoretically use it for enforcing alignment as well...

Unfortunately however, the entire IO port aperture (which is simulated through MMIO on aarch64/virt), is only 64KiB!
All root bridges on the sole host bridge share that aperture, and it is only 64KiB. This is parsed by the firmware from QEMU's DTB, and it is logged as:

    ProcessPciHost: Config[0x4010000000+0x10000000) Bus[0x0..0xFF]
    Io[0x0+0x10000)@0x3EFF0000 Mem32[0x10000000+0x2EFF0000)@0x0
    Mem64[0x8000000000+0x8000000000)@0x0

The relevant part is "Io[0x0+0x10000)@0x3EFF0000". It says that the full size is 64KiB (0x10000), and it is based at MMIO 0x3EFF0000 (you can see that constant in the kernel messages too).

So... even if you could force the IO port reservation *per root port* to be 64KiB, that wouldn't work, because you have 64KiB IO port space for all root ports together.

How about this instead: PCIe devices are required to function properly without IO resources. So what if you explicitly state that *zero* IO port space should be reserved for the root port in question?

    -device pcie-root-port,[properties],io-reserve=0
                                        ^^^^^^^^^^^^

This will definitely cause the guest firmware to reserve no IO port space for the root port; therefore the guest kernel should not attempt to io-map any such MMIO range (regardless of page size).

In fact, you can do this for *all* PCIe root ports at once:

    -global pcie-root-port.io-reserve=0

This is what I have in one of my libvirt domain XMLs:

    <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
      <qemu:commandline>
        <qemu:arg value='-global'/>
        <qemu:arg value='pcie-root-port.io-reserve=0'/>
      </qemu:commandline>
    </domain>

... I have now tried this with one of my long-term aarch64 libvirt domains (no pxb-pcie usage, just one root bridge with several root ports on it), and *all* Io16 resources have disappeared from the firmware log.

Hi Laszlo,

many thanks for your detailed answer here and in the separate email. Indeed, this works fine in my case at the QEMU level, and I don't see any Io16 allocation anymore in the EDK2 log.
Before submitting a new BZ at the libvirt level, I would like to double-check with all the stakeholders what the overall consequences of globally setting io-reserve=0 could be for the supported guest PCIe topology. Does anyone foresee any possible regression for supported PCIe/PCI devices?

Thanks

Eric

Actually, we can reduce the scope of io-reserve=0 to the root port plugged onto the PCIe expander bridge. Then an Io16 region is allocated once for the GPEX and none is attempted for the PXB. I checked that this works too and that it does not bring any regression on existing supported use cases. So I am going to close this bug as NOTABUG and add this information to the associated libvirt BZ.
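
For readers who want to check the alignment arithmetic discussed in this report, here is a minimal Python sketch. It only reuses the numbers quoted in the firmware and kernel logs above (aperture "Io[0x0+0x10000)@0x3EFF0000", Io16 windows at 0x0 and 0x3000). The helper names (pages_touched, is_page_aligned) and the "pxb-pcie" label are hypothetical and exist only for illustration; the sketch does not model the kernel's vmap code, it merely shows that the two 4KiB-aligned windows land in the same 64KiB guest page, which is consistent with the suspected cause of the BUG_ON.

```python
# Values copied from the logs quoted in this report.
IO_APERTURE_MMIO_BASE = 0x3EFF0000  # from "Io[0x0+0x10000)@0x3EFF0000"
IO_APERTURE_SIZE = 0x10000          # whole IO aperture: 64KiB
GUEST_PAGE_SIZE = 0x10000           # guest kernel uses 64KiB pages
FIRMWARE_PAGE_SIZE = 0x1000         # PciBusDxe aligns bridge IO to 4KiB


def pages_touched(io_base, length, page_size):
    """Page frame numbers covered by the MMIO range backing an Io16 window.

    Hypothetical helper: not a kernel or EDK2 function.
    """
    start = IO_APERTURE_MMIO_BASE + io_base
    end = start + length - 1
    return set(range(start // page_size, end // page_size + 1))


def is_page_aligned(io_base, page_size):
    """True if the MMIO address backing io_base starts on a page boundary."""
    return (IO_APERTURE_MMIO_BASE + io_base) % page_size == 0


# Io16 windows assigned by the firmware: (label, base, length).
gpex = ("GPEX", 0x0, 0x3000)
pxb = ("pxb-pcie", 0x3000, 0x1000)

# With 64KiB guest pages, both windows fall into one and the same page.
shared_64k = (pages_touched(gpex[1], gpex[2], GUEST_PAGE_SIZE)
              & pages_touched(pxb[1], pxb[2], GUEST_PAGE_SIZE))

# With 4KiB pages (the firmware's view), the windows do not overlap.
shared_4k = (pages_touched(gpex[1], gpex[2], FIRMWARE_PAGE_SIZE)
             & pages_touched(pxb[1], pxb[2], FIRMWARE_PAGE_SIZE))

# Number of 64KiB-aligned windows that fit into the whole aperture.
max_aligned_windows = IO_APERTURE_SIZE // GUEST_PAGE_SIZE

print("pxb window 64KiB-aligned:", is_page_aligned(pxb[1], GUEST_PAGE_SIZE))
print("shared 64KiB pages:", [hex(p) for p in sorted(shared_64k)])
print("shared 4KiB pages:", [hex(p) for p in sorted(shared_4k)])
print("64KiB-aligned windows that fit:", max_aligned_windows)
```

The last figure illustrates why padding each root port's reservation to 64KiB cannot work here: only one such window fits into the aperture, which is why reserving no IO space (io-reserve=0) for the root port behind the expander bridge was chosen instead.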