Bug 2103533
| Summary: | Booting up guest with virtio transitional devices failed on RHEL9 | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Min Deng <mdeng> |
| Component: | Documentation | Assignee: | Jiri Herrmann <jherrman> |
| Documentation sub component: | default | QA Contact: | |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | high | ||
| Priority: | high | CC: | berrange, coli, jinzhao, juzhang, kraxel, lijin, mdeng, meili, osteffen, pbonzini, rhel-docs, virt-maint |
| Version: | 9.1 | Keywords: | Documentation, Reopened, Triaged |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-09-30 09:54:51 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Min Deng
2022-07-04 02:58:41 UTC
Klaus - this is probably not assigned to the right component, but I believe this is either for edk2/UEFI or network (Ariel's team), as the failing boot option notes network remnants. I'll start with you for UEFI/edk2, but I also wouldn't be surprised if there is some sort of configuration problem too. To my somewhat untrained eye, the provided guest startup command line has the following transitional devices and their complementary devices:

...
-device ahci,id=ahci0,bus=pcie.0,addr=0x3
...
-device pcie-root-port,port=0x10,chassis=1,id=pcie-root-port0,bus=pcie.0,multifunction=on,addr=0x4
-device virtio-scsi-pci-transitional,id=scsi0,bus=pcie-root-port0
...
-device pcie-root-port,port=0x12,chassis=3,id=pcie-root-port2,bus=pcie.0,addr=0x4.0x2
-device virtio-serial-pci-transitional,id=virtio-serial0,bus=pcie-root-port2
...
-device pcie-root-port,port=0x14,chassis=5,id=pcie-root-port4,bus=pcie.0,addr=0x4.0x4
-device virtio-net-pci-transitional,netdev=hostnet2,id=virtio-net-pci2,mac=00:52:68:26:31:04,bus=pcie-root-port4
-netdev tap,id=hostnet2,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown
...
-device pcie-root-port,port=0x18,chassis=9,id=pcie-root-port8,bus=pcie.0,multifunction=on,addr=0x10
...
-device pcie-root-port,port=0x23,chassis=14,id=pcie-root-port13,bus=pcie.0,addr=0x10.0x5
-device virtio-rng-pci-transitional,id=rng0,bus=pcie-root-port13
...
-device pcie-root-port,port=0x24,chassis=15,id=pcie-root-port14,bus=pcie.0,addr=0x10.0x6
-device virtio-balloon-pci-transitional,id=balloon0,bus=pcie-root-port14

If the guest is attached with the following devices instead, it boots up fine.

Device list:
virtio-scsi-pci
virtio-serial-pci
virtio-net-pci
virtio-balloon-pci
virtio-rng-pci

(In reply to John Ferlan from comment #3)
> Klaus - probably not assigned to the right component, but I believe this is
> either for edk2/UEFI or network (Ariel's team) as the failing boot option
> notes network remnants. I'll start with you for UEFI/edk2, but I also
> wouldn't be surprised if there's some sort of configuration problem too.

Let's start with Gerd, who is around this Friday for anything obvious; next week he is on PTO, so assigning to Oliver. It really feels like the "transitional" variants of the devices are not wired up correctly in the EDK2 pre-boot environment.

Min, could you please post the firmware log?
To generate this you could, for example, change this option
-chardev socket,id=seabioslog_id,path=/home/seabios,server=on,wait=off
to
-chardev file,id=seabioslog_id,path="/home/firmware.log"
and rerun the test.
The problem seems to be that the guest runs out of I/O address space
and can't set up all devices.
(Thanks Gerd for pointing in that direction).
The firmware log shows messages like these:
PciHostBridge: SubmitResources for PciRoot(0x0)
I/O: Granularity/SpecificFlag = 0 / 01
Length/Alignment = 0x10000 / 0xFFF
Mem: Granularity/SpecificFlag = 32 / 00
Length/Alignment = 0x3100000 / 0xFFFFFF
Mem: Granularity/SpecificFlag = 64 / 00
Length/Alignment = 0x600000 / 0xFFFFF
PciBus: HostBridge->SubmitResources() - Success
PciHostBridge: NotifyPhase (AllocateResources)
RootBridge: PciRoot(0x0)
Mem: Base/Length/Alignment = C0000000/3100000/FFFFFF - Success
Mem64: Base/Length/Alignment = 5000000000/600000/FFFFF - Success
I/O: Base/Length/Alignment = FFFFFFFFFFFFFFFF/10000/FFF - Out Of Resource!
Call PciHostBridgeResourceConflict().
PciHostBridge: Resource conflict happens!
RootBridge[0]:
I/O: Length/Alignment = 0x10000 / 0xFFF
Mem: Length/Alignment = 0x3100000 / 0xFFFFFF
Granularity/SpecificFlag = 32 / 00
Mem: Length/Alignment = 0x0 / 0x0
Granularity/SpecificFlag = 32 / 06 (Prefetchable)
Mem: Length/Alignment = 0x600000 / 0xFFFFF
Granularity/SpecificFlag = 64 / 00
Mem: Length/Alignment = 0x0 / 0x0
Granularity/SpecificFlag = 64 / 06 (Prefetchable)
Bus: Length/Alignment = 0x12 / 0x0
PciBus: HostBridge->NotifyPhase(AllocateResources) - Out of Resources
PciBus: [00|04|07] was rejected due to resource confliction.
This repeats for multiple devices.
The QEMU command posted above uses a lot of PCI devices.
The "transitional" variants support both the new and the old virtio standard and therefore require additional I/O ranges.
Because of this, the problem occurs when using the "transitional" variants and not with the default ones.
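For illustration, a hedged sketch of the difference for the network device. The mapping to the disable-legacy/disable-modern virtio-pci properties is my understanding of the standard QEMU behavior, not something stated in this report:

-device virtio-net-pci-transitional,netdev=hostnet2,id=virtio-net-pci2,bus=pcie-root-port4
-device virtio-net-pci,netdev=hostnet2,id=virtio-net-pci2,bus=pcie-root-port4

The first (transitional) form enables both the legacy I/O-port interface and the modern MMIO interface, roughly like virtio-net-pci,disable-legacy=off,disable-modern=off, so the device needs an I/O BAR. The second (plain) form on a PCIe slot defaults to modern-only (disable-legacy=auto), so no I/O BAR is required.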
Additionally, the setup puts each device into a separate PCIe root port,
which further increases the resource requirements.
Workarounds:
1) Reduce the number of devices used.
2) Group several devices together in a shared PCIe root port (see the sketch below).
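As a hedged sketch of workaround 2 (untested; the port/device IDs and addresses are illustrative, not taken from the reported command line), several transitional devices can be placed as functions of one slot behind a single root port, so they share that port's single 4 KiB I/O window:

-device pcie-root-port,port=0x10,chassis=1,id=port0,bus=pcie.0,addr=0x4
-device virtio-scsi-pci-transitional,id=scsi0,bus=port0,addr=0x0.0x0,multifunction=on
-device virtio-serial-pci-transitional,id=serial0,bus=port0,addr=0x0.0x1
-device virtio-rng-pci-transitional,id=rng0,bus=port0,addr=0x0.0x2

Only function 0 carries multifunction=on; the other devices reuse the same slot as additional functions, so the I/O window behind port0 is allocated once rather than once per device.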
Has this exact setup worked with previous software versions?
If not, I would say this is not a bug.
If yes, maybe something causes the guest to run out of resources quicker.
Thanks for the comments. According to past test results, the issue did not happen on the RHEL 8 product line.

Interesting. Can you also post a firmware log of that? I would like to compare.

It works fine on a RHEL 8.4 guest with a SeaBIOS configuration, according to a piece of test results for the stable guest ABI:
kernel-4.18.0-305.58.1.el8_4.x86_64
qemu-kvm-docs-5.2.0-16.module+el8.4.0+15200+eac37ce4.17.x86_64
For OVMF it is still ongoing; I will let you know once I have the log.

QE covers the transitional devices test on RHEL 8 and RHEL 9 for the stable guest ABI feature, and all of the devices in the command are necessary in the test plan :)

On RHEL 8:
1. SeaBIOS is 1st priority. The guest with SeaBIOS works well.
2. OVMF is 2nd priority. The guest with OVMF has issues; please take a look at the log. Should QE file a new bug for RHEL 8?

On RHEL 9:
1. OVMF is 1st priority. The guest with OVMF hits this bug; the firmware.log has already been uploaded.
2. SeaBIOS is 2nd priority and does not hit the issue.

Looking at both firmware log files:

$ grep "was rejected due to resource confliction." firmware_log_rhel8
PciBus: [00|04|00] was rejected due to resource confliction.
PciBus: [00|04|01] was rejected due to resource confliction.
PciBus: [00|04|02] was rejected due to resource confliction.
PciBus: [00|04|03] was rejected due to resource confliction.
PciBus: [00|04|04] was rejected due to resource confliction.
PciBus: [00|04|05] was rejected due to resource confliction.
PciBus: [00|04|06] was rejected due to resource confliction.
PciBus: [00|04|07] was rejected due to resource confliction.
PciBus: [00|10|00] was rejected due to resource confliction.
PciBus: [00|10|01] was rejected due to resource confliction.
PciBus: [00|10|02] was rejected due to resource confliction.
PciBus: [00|10|03] was rejected due to resource confliction.
PciBus: [00|10|04] was rejected due to resource confliction.
PciBus: [00|10|05] was rejected due to resource confliction.
PciBus: [00|10|06] was rejected due to resource confliction.
PciBus: [00|10|07] was rejected due to resource confliction.
PciBus: [10|00|00] was rejected due to resource confliction.
PciBus: [01|00|00] was rejected due to resource confliction.

$ grep "was rejected due to resource confliction." firmware_log_rhel9
PciBus: [00|04|00] was rejected due to resource confliction.
PciBus: [00|04|01] was rejected due to resource confliction.
PciBus: [00|04|02] was rejected due to resource confliction.
PciBus: [00|04|03] was rejected due to resource confliction.
PciBus: [00|04|04] was rejected due to resource confliction.
PciBus: [00|04|05] was rejected due to resource confliction.
PciBus: [00|04|06] was rejected due to resource confliction.
PciBus: [00|04|07] was rejected due to resource confliction.
PciBus: [00|10|00] was rejected due to resource confliction.
PciBus: [00|10|01] was rejected due to resource confliction.
PciBus: [00|10|02] was rejected due to resource confliction.
PciBus: [00|10|03] was rejected due to resource confliction.
PciBus: [00|10|04] was rejected due to resource confliction.
PciBus: [00|10|05] was rejected due to resource confliction.
PciBus: [00|10|06] was rejected due to resource confliction.
PciBus: [00|10|07] was rejected due to resource confliction.
PciBus: [10|00|00] was rejected due to resource confliction.
PciBus: [01|00|00] was rejected due to resource confliction.

In both cases (RHEL 8, RHEL 9) some root ports are rejected.

I wanted to figure out how many root ports can fit in while still being able to boot.
I created a simple QEMU setup and added more and more root ports until the VM would not boot anymore. It turns out 9 root ports is the maximum for OVMF, for both RHEL 8 and RHEL 9 hosts. SeaBIOS: at least 14 (stopped testing).

Using lspci to look at the I/O address ranges used, I can see that in the OVMF case the port ranges start at a higher number than in the SeaBIOS case (0x6000 vs 0x1000). That's why we run out of addresses earlier on OVMF.

OVMF (RHEL 8 and RHEL 9 identical):

$ lspci -v | grep I/O
I/O ports at f060 [size=32]
I/O behind bridge: 0000e000-0000efff [size=4K]
I/O behind bridge: 0000d000-0000dfff [size=4K]
I/O behind bridge: 0000c000-0000cfff [size=4K]
I/O behind bridge: 0000b000-0000bfff [size=4K]
I/O behind bridge: 0000a000-0000afff [size=4K]
I/O behind bridge: 00009000-00009fff [size=4K]
I/O behind bridge: 00008000-00008fff [size=4K]
I/O behind bridge: 00007000-00007fff [size=4K]
I/O behind bridge: 00006000-00006fff [size=4K]
I/O ports at f040 [size=32]
I/O ports at f000 [size=64]
I/O ports at e000 [size=64]
I/O ports at d000 [size=64]
I/O ports at c000 [size=64]
I/O ports at b000 [size=64]
I/O ports at a000 [size=64]
I/O ports at 9000 [size=64]
I/O ports at 8000 [size=64]
I/O ports at 7000 [size=64]
I/O ports at 6000 [size=64]

SeaBIOS (RHEL 8 and RHEL 9 identical):

$ lspci -v | grep I/O
I/O ports at a040 [size=32]
I/O behind bridge: 00009000-00009fff [size=4K]
I/O behind bridge: 00008000-00008fff [size=4K]
I/O behind bridge: 00007000-00007fff [size=4K]
I/O behind bridge: 00006000-00006fff [size=4K]
I/O behind bridge: 00005000-00005fff [size=4K]
I/O behind bridge: 00004000-00004fff [size=4K]
I/O behind bridge: 00003000-00003fff [size=4K]
I/O behind bridge: 00002000-00002fff [size=4K]
I/O behind bridge: 00001000-00001fff [size=4K]
I/O ports at a060 [size=32]
I/O ports at 0700 [size=64]
I/O ports at 9000 [size=64]
I/O ports at 8000 [size=64]
I/O ports at 7000 [size=64]
I/O ports at 6000 [size=64]
I/O ports at 5000 [size=64]
I/O ports at 4000 [size=64]
I/O ports at 3000 [size=64]
I/O ports at 2000 [size=64]
I/O ports at 1000 [size=64]

When using the "non-transitional" flavor, the I/O port ranges are not present in the lspci output (both SeaBIOS and OVMF), as expected.

The question is: why do the I/O address ranges start at higher values in the OVMF case?

> In both cases (RHEL 8, RHEL 9) some root ports are rejected.
>
> I wanted to figure out how many root ports can fit in while still being
> able to boot. I created a simple Qemu setup and added more and more
> root ports until the VM would not boot anymore.
>
> Turns out: 9 root ports is the maximum for OVMF, for both RHEL8 and
> RHEL9 hosts. Seabios: at least 14 (stopped testing).

Probably 15, certainly not more than 16. It might be that SeaBIOS also has a slightly different allocation strategy and doesn't consider lack of I/O address space for PCIe devices a fatal error, so some root ports get I/O ports and some don't if you have more than 16.

> Using lspci to look at the I/O address ranges used, I can see that
> in the OVMF case port ranges start at a higher number than in the Seabios
> case (0x6000 vs 0x1000). That's why we run out of addresses earlier on OVMF.

There is vmport somewhere in 0x5xxx. OVMF is more conservative and avoids placing anything there by starting at 0x6000.
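For reference, the observed limit of 9 root ports lines up with that 0x6000 starting offset. A rough back-of-the-envelope check, based on my reading of the lspci output above rather than anything stated by the firmware:

echo $(( (0x10000 - 0x6000) / 0x1000 ))   # prints 10: the number of 4 KiB windows available above 0x6000

Of those 10 windows, the topmost range (0xF000-0xFFFF) is occupied by the I/O BARs of the root-complex devices themselves (the f000/f040/f060 entries in the listing above), which leaves roughly 9 windows for PCIe root-port bridges, matching the experiment.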
Laszlo Ersek commented:

(1) The fact that OVMF has less I/O space than SeaBIOS has been known for a long time. Refer to <https://bugzilla.redhat.com/show_bug.cgi?id=1333238> (comment 27 identifies the upstream edk2 commit range where we increased the I/O space *to* 40KB). See also section "3. IO space issues" in QEMU's "docs/pcie.txt" file.

(2) The solution to the I/O space shortage is to use non-transitional (= virtio-1.0-only) devices. Trying to bring a virtio-transitional device set to OVMF on Q35 (to the PCIe hierarchy) is effectively sitting face backwards on the horse. You *can* use virtio-transitional if you really insist, but then you should create a PCI sub-hierarchy (see again "docs/pcie.txt") and move the virtio-transitional devices there. The difference this makes is that, with the PCIe root ports eliminated from the picture, you won't lose 4KB of I/O space per virtio device (because a single device will no longer imply a separate PCIe root port, hence a separate bridge, hence a separate 4KB I/O window). Instead, all virtio-transitional devices (with their piece-wise small I/O window requirements) will be collected into a larger, common bridge (same as the root bridge on i440fx), so the "large" I/O window allocation only needs to happen once.
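A minimal sketch of such a PCI sub-hierarchy (my illustration, not taken from the bug; the IDs and addresses are arbitrary, netdev=hostnet2 refers to the -netdev from the original command line, and "docs/pcie.txt" remains the authoritative reference):

-device pcie-root-port,port=0x10,chassis=1,id=port0,bus=pcie.0,addr=0x4
-device pcie-pci-bridge,id=pci_bridge0,bus=port0
-device virtio-scsi-pci-transitional,id=scsi0,bus=pci_bridge0,addr=0x1
-device virtio-net-pci-transitional,netdev=hostnet2,id=virtio-net-pci2,bus=pci_bridge0,addr=0x2
-device virtio-balloon-pci-transitional,id=balloon0,bus=pci_bridge0,addr=0x3

All the transitional devices then share the conventional PCI bus provided by pci_bridge0, so the firmware allocates one I/O window for that bridge instead of one 4 KiB window per root port.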
-----------------------------------

With this, I am closing this BZ as not-a-bug. The associated test case should either be deleted or redesigned. I am not sure what exactly it tries to assert, so I can't comment on that now.

See also this email from Daniel Berrangé with additional information:

On Tue, Jul 26, 2022 at 06:50:25PM +0200, Oliver Steffen wrote:
> Hi everyone,
>
> I am working on https://bugzilla.redhat.com/show_bug.cgi?id=2103533 . The
> problem is that a special and somewhat particular configuration of
> virtio-transitional devices works when using Seabios, but fails due to
> insufficient I/O address space when using OVMF. This does not happen
> when using the "normal" virtio flavor.
>
> We end up with the following combinations (Yes=works, No=does not):
>
>         | virtio | virtio-transitional |
> --------|--------|---------------------|
> Seabios | Yes    | Yes                 |
> OVMF    | Yes    | No                  |
>
> The transitional flavor of the virtio devices is backwards compatible
> with an earlier standard which makes use of I/O ports. The current version
> of virtio does not, and relies on memory mapping only.
>
> The reason for running out of I/O addresses earlier on OVMF is that
> OVMF starts allocating I/O regions at 0x6000 (reserving some areas),
> while Seabios starts at 0x1000. There is only 65k of I/O space.
>
> My question is:
> - Is this a problem that we need to fix? Or do we need to change the
>   test case instead? I have to point out that the PCI configuration is
>   probably not very realistic and seems crafted. It uses a large number of
>   root ports (see BZ).

Yes, that configuration doesn't look like something we'd see in the real world.

Historically with i440fx, the devices would all be transitional, as they get plugged into PCI buses. When we switched to q35, because libvirt puts all the devices on a PCI-e bus (the pcie-root-port), the devices get legacy mode auto-disabled by QEMU and so become modern-only devices. This caused a problem with RHEL-6 and earlier guest compatibility, so we introduced the -transitional device variants. This basically causes libvirt to put the devices behind a PCIe-to-PCI bridge, such that legacy mode remains enabled in QEMU.

The test case in the BZ is putting -transitional devices on a pcie-root-port, and as such looks almost entirely pointless. It doesn't address any use case I know of. We don't need to forbid this, but IMHO it isn't worth worrying about running out of I/O space with such a config.

> - Is switching between OVMF<->Seabios, and independently
>   virtio<->virtio-transitional expected to be possible at all
>   conditions?

In general I'd expect a VM to be able to be switched between OVMF/SeaBIOS, provided it is partitioned in a way that allows it to boot in both scenarios and isn't relying on UEFI-only features like Secure Boot. I wouldn't really expect the virtio vs virtio-transitional choice to be related to the firmware, though. This is a decision tied to the machine type (i440fx vs q35) and guest OS combination.

> - If we need to fix OVMF, then how?

Unless I've missed something, this issue doesn't look too important to me.

Thanks for the developers' comments. Should we document this issue somewhere? As OVMF is P1 on RHEL 9, we'd better have one, in my opinion.
...
> We end up with the following combinations (Yes=works, No=does not):
>
>         | virtio | virtio-transitional |
> --------|--------|---------------------|
> Seabios | Yes    | Yes                 |
> OVMF    | Yes    | No                  |
>
I will re-open it and request a documentation bug; feel free to assign it to other components. Thanks a lot.

Hi Jiri,

Thanks for your comments! I suggested adding this issue to the documentation because OVMF is P2 on RHEL 8 but P1 on RHEL 9, so our users will face different scenarios on RHEL 9. The reason we have this virtio-transitional test is bug 2025468. If we don't support this (bug), we should let users know that migration of virtio-transitional devices is not supported from RHEL 8 to RHEL 9, or from RHEL 9 to RHEL 9, with an OVMF configuration, since virtio-transitional devices are not supported with OVMF on RHEL 9. You can also refer to the screenshot attached to this bug for reference. Feel free to let me know if you have any concerns.

Thank you,
Min