Bug 2179405
| Summary: | [RFE] [qemu] Better PCI hotplug on Q35 machines | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Vivek Goyal <vgoyal> |
| Component: | qemu-kvm | Assignee: | Michael S. Tsirkin <mst> |
| qemu-kvm sub component: | PCI | QA Contact: | Yiqian Wei <yiwei> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | abologna, ailan, berrange, edwardh, imammedo, jinzhao, jsuvorov, juzhang, kwolf, laine, stefanha, virt-maint, zhguo |
| Version: | 9.2 | Keywords: | FutureFeature |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-05-29 14:01:01 UTC | Type: | Feature Request |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Vivek Goyal
2023-03-17 17:23:14 UTC
> My very basic understanding is that I need to keep some root PCI ports configured in the VM at the time of VM start, and that allows for limited PCI hotplug. If that's the case, that does not scale well and it's difficult to plan ahead.

Yes, with any PCI-e based machine type (q35 on x86, virt on aarch64) you need to pre-create enough root ports by repeating <controller type='pci' model='pcie-root-port'/>. This is documented at https://libvirt.org/pci-hotplug.html

If apps don't want to think about it, and just want hotplug to be on a par with i440fx, then just repeat <controller type='pci' model='pcie-root-port'/> 31 times. They'll be able to add up to 31 devices, whether hot or cold plugged.

There are a great many other possible ways to approach the problem. One option is to set the PCI <address> for the controllers to make them appear as multi-function devices, which allows for over 200 devices on the root controller. Another option is to add PCI-e expander buses and attach the root ports to those. This is useful if the VM spans multiple virtual NUMA nodes and you want to express affinity between guest NUMA nodes and the devices it is given.

In general, though, I found that PCI did not scale up well to large device counts in QEMU. When I previously tried to boot a VM with 512 PCI devices (virtio-net NICs for the sake of testing) http://file.rdu.redhat.com/~berrange/tiny.xml I gave up waiting for Linux to finish booting after 15 minutes. Profiling showed QEMU spending all its time doing memory view updates during boot. This was so bad it couldn't even keep up with injecting timer interrupts; IOW, the Linux kernel dmesg timestamps tell you 6 seconds have passed, but in reality it has been 5 minutes of wall clock time. I found that as you add PCI devices the performance drops by a power rule (possibly O(n^2)), not linearly (O(n)) as you might naively hope for.

For storage, the alternative is to use SCSI, which lets you add effectively unlimited devices to a single controller. This is way easier for mgmt apps and the performance scales better in the tests I've seen. For example, this old blog post: https://rwmj.wordpress.com/2017/04/25/how-many-disks-can-you-add-to-a-virtual-linux-machine/ shows adding 4000 disks to a QEMU guest, and while boot time certainly slows down, it still completes in 10 minutes, on hardware that's significantly older than what I tested with for PCI. I'm not even sure it's possible to create a PCI topology servicing that many devices, as I think you run out of other resources first.

IOW, even ignoring the complexity of setting up a PCI topology for handling many devices, I'd still recommend using SCSI for the performance benefits in handling large numbers of devices. That frees up PCI slots to be used for other devices like NICs/assigned host devices, where you don't have the luxury of alternatives to PCI.
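To make the guidance above concrete, here is a minimal, hypothetical sketch in libvirt domain XML. The controller indexes and PCI address values are illustrative only (libvirt normally auto-assigns them); none of this is taken from an actual config attached to this bug.

```xml
<devices>
  <!-- Pre-created hotplug capacity: one pcie-root-port per device you may
       hotplug later (repeat as many times as needed, e.g. 31 times for
       rough parity with i440fx). -->
  <controller type='pci' model='pcie-root-port'/>
  <controller type='pci' model='pcie-root-port'/>
  <controller type='pci' model='pcie-root-port'/>

  <!-- Denser layout: root ports packed 8-per-slot on pcie.0 by giving them
       explicit PCI addresses and marking function 0 as multifunction
       (slot/function values here are examples only). -->
  <controller type='pci' index='4' model='pcie-root-port'>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
  </controller>
  <controller type='pci' index='5' model='pcie-root-port'>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
  </controller>
</devices>
```

The pcie-expander-bus option mentioned above follows the same pattern, with the root ports attached to expander buses that are associated with guest NUMA nodes.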
(In reply to Daniel Berrangé from comment #2)
> > My very basic understanding is that I need to keep some root pci ports configured in the VM at the time of VM start and that allows for limited PCI hotplug. If that's the case, that does not scale well and its difficult to plan ahead.
...snip...

For a comparison based on a somewhat plausible high-end VM config comprising 64 disks (or equivalently 32 disks and 32 NICs, since the type of device doesn't really affect the PCI scalability):

* http://file.rdu.redhat.com/~berrange/tiny-q35-64.xml
  The 64 virtio-blk disks take 10 seconds to launch QEMU and boot Linux until it runs init.
* http://file.rdu.redhat.com/~berrange/tiny-scsi-64.xml
  The 64 virtio-scsi disks take 2 seconds to launch QEMU and boot Linux until it runs init.

So there's a noticeable difference in performance even at 64 disks, but still small enough that users will probably be OK with it. At the high end, I was not able to get past 192 devices with Q35 before Linux reported that it ran out of interrupts.

* http://file.rdu.redhat.com/~berrange/tiny-q35-192.xml
  The 192 virtio-blk disks take 1 minute 35 seconds to launch QEMU and boot Linux until it runs init.
* http://file.rdu.redhat.com/~berrange/tiny-scsi-512.xml
  The 512 virtio-scsi disks take 4 seconds to launch QEMU and boot Linux until it runs init.

Here we see how well SCSI scales: we can't even get to 512 devices with virtio-blk, and given the degradation trend we would be waiting an incredibly long time for it to boot Linux, whereas the virtio-scsi boot time increase is negligible as the disk count grows.

> IOW, even ignoring the complexity of setting up a PCI topology for handling many devices, I'd still recommend using SCSI for the performance benefits in handling large numbers of devices. That frees up PCI slots to be used for other devices like NICs/assigned host devices, where you don't have the luxury of alternatives to PCI.

The question is what level of scalability we need to target for the number of devices. If 32 devices is sufficient, the scalability of SCSI vs PCI isn't especially noticeable. At 64 devices you notice the win of SCSI, but PCI is likely still just about fast enough to be acceptable. At >100, PCI is going to be a very hard sell.
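For reference, the structural difference between the two layouts being compared boils down to something like the following domain XML. This is a hedged sketch with placeholder image paths, not the contents of the tiny-*.xml configs linked above.

```xml
<!-- Sketch A (assumption): each virtio-blk disk is its own PCI device, so a
     large disk count means a correspondingly large number of pcie-root-ports
     present at boot. -->
<controller type='pci' model='pcie-root-port'/>
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/blk-disk0.qcow2'/>
  <target dev='vda' bus='virtio'/>
</disk>

<!-- Sketch B (assumption): one virtio-scsi controller occupies a single PCI
     slot, and every additional disk is just another LUN behind it. -->
<controller type='scsi' index='0' model='virtio-scsi'/>
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/scsi-disk0.qcow2'/>
  <target dev='sda' bus='scsi'/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
```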
(In reply to Daniel Berrangé from comment #2)
> I found that as you add PCI devices the performance drops by a power rule
> (possible O(n^2)), not linearly (O(n)) as you might naively hope for.

Is there any fundamental reason why it has to be like this, or is this basically a bug?

> IOW, even ignoring the complexity of setting up a PCI topology for handling
> many devices, I'd still recommend using SCSI for the performance benefits in
> handling large numbers of devices. That frees up PCI slots to be used for
> other devices like NICs/assigned host devices, where you don't have the
> luxury of alternatives to PCI.

That's a very one-sided view, though, that prioritises hotplugging lots of disks over everything else. In practice, most VMs probably have only very few disks, and hotplugging isn't the primary use case for disks anyway; you want to actually use them.

Apart from not using up PCI slots, SCSI comes with the advantage of supporting pretty much any feature you could think of, especially if you're passing through SCSI devices from the host. But it comes at the cost of much higher complexity and lower maintainability. Even looking only at QEMU, performance is reported to be a bit lower with virtio-scsi than with virtio-blk, we're seeing more bugs in it (especially related to iothreads), and while implementing things like multiqueue seems fairly straightforward on the device side for virtio-blk, there are quite a few complications with SCSI.

So if we want virtio-scsi not to fall even further behind virtio-blk in terms of performance and scalability, this comes with a considerable cost. And that doesn't even touch on the external implementations, where a SCSI implementation doesn't already exist the way it does in QEMU. vhost-user-blk is a relatively simple protocol that can easily be implemented; for example, we support it in qemu-storage-daemon as an export option, and in libblkio as a client. These things simply don't exist yet for virtio-scsi, and implementing them would mean creating a full SCSI implementation in each. There are community efforts around vdpa-blk and vhost-blk, but nothing based on virtio-scsi, which I'm sure is at least partially related to the complexity of SCSI, too.

So if we're going all in on virtio-scsi, we might improve the hotplug situation for the small percentage of VMs where it is even relevant, but we would lose a lot of other things.

(In reply to Kevin Wolf from comment #4)
> (In reply to Daniel Berrangé from comment #2)
> > I found that as you add PCI devices the performance drops by a power rule
> > (possible O(n^2)), not linearly (O(n)) as you might naively hope for.
>
> Is there any fundamental reason why it has to be like this, or is this
> basically a bug?

I queried this with Paolo a couple of years back and his response was:

[quote]
Yeah, and unfortunately this is known to be O(n^2) (N initializations, each adding 1..N regions to the memory map) and we cannot really do anything about it, but this seems definitely too much.
[/quote]

I believe there was a bug where we were maintaining duplicate mappings that has since been fixed, but these results with git master show the scalability is still degrading by a power. SCSI avoids this problem because adding more LUNs does not imply adding more guest memory mappings.

> > IOW, even ignoring the complexity of setting up a PCI topology for handling
> > many devices, I'd still recommend using SCSI for the performance benefits in
> > handling large numbers of devices. That frees up PCI slots to be used for
> > other devices like NICs/assigned host devices, where you don't have the
> > luxury of alternatives to PCI.
>
> That's a very one-sided view, though, that prioritises hotplugging lots of
> disks over everything else. In practice, most VMs probably have only very
> few disks and only hotplugging isn't the primary use case for disks, but you
> want to actually use them.

NB, I'm not specifically thinking of hotplug when mentioning the performance issue. With PCIe we would see this problem whether cold plugging or hotplugging lots of disks, because in both cases we need lots of pcie-root-ports present at boot time. With PCI it was less of an issue, as there were fewer memory mappings with PCI devices than with PCI-e.

Indeed, most VMs only have one or two disks, so they don't experience any problem. If there is, however, a desire to support many disks, then we need to be aware of the scalability of virtio-blk vs virtio-scsi.

> So if we're going all in on virtio-scsi, we might improve the hotplug
> situation for the small percentage of VMs where it is even relevant, but
> would lose a lot of other things.

NB, not just hotplug; large cold-plugged disk counts too. I do agree, though, that this is a minority of VMs. The question for mgmt apps is what kind of scaling they want to cater for with the number of disks, as this impacts the choices to be made, combined with whether they want to support multiple different disk types or standardize on one disk type.
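As an aside to the vhost-user-blk point above, a hedged sketch of what consuming such an export can look like on the libvirt side; the socket path is a placeholder and attribute details may vary by libvirt version.

```xml
<!-- Sketch only: a disk backed by a vhost-user-blk export, for example one
     created by qemu-storage-daemon listening on the (placeholder) socket path
     below. The guest sees an ordinary virtio-blk device; no SCSI emulation is
     involved anywhere in the path. -->
<disk type='vhostuser' device='disk'>
  <driver name='qemu' type='raw'/>
  <source type='unix' path='/var/run/qsd-vhost-user-blk.sock'/>
  <target dev='vdb' bus='virtio'/>
</disk>
```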
> NB, I'm not specifically thinking of hotplug when mentioning the performance
> issue. With PCIe we would see this problem whether cold plugging or hotplugging
> lots of disks, because in both cases we need lots of pcie-root-ports present
> at boot time.
Is the increase dependent on the number of pcie-root-ports, or simply on the total number of PCIe devices (including controllers and endpoint devices)? In either case, for "cold-plugged" (yech) devices the overhead could be decreased by using all 8 functions of each pcie-root-port rather than placing each endpoint device on a separate pcie-root-port. This would require more intelligence in our device placement strategy, since we currently have to assume that any device could potentially be hot-unplugged (and so has to have its own pcie-root-port). (We *do* at least auto-place the pcie-root-ports themselves 8-per-slot on the pcie root complex, since they can't be hotplugged/unplugged, so we end up running out of other resources before we use up all the slots on pcie-root.)
Anyway, on a system with, for example, 64 PCIe endpoint devices that weren't being hotplugged, placing 8 devices on each pcie-root-port would let us use 8 pcie-root-ports rather than 64 (see the sketch below).
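A hedged illustration of that packing in domain XML; the bus/slot values assume the root port ends up as controller index 1 and are examples only, not taken from this bug.

```xml
<!-- Two cold-plugged virtio disks sharing the single slot behind one
     pcie-root-port as functions 0 and 1 (up to 8 functions fit per slot).
     Neither device can then be hot-unplugged on its own. -->
<controller type='pci' index='1' model='pcie-root-port'/>
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/d0.qcow2'/>
  <target dev='vda' bus='virtio'/>
  <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0' multifunction='on'/>
</disk>
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/d1.qcow2'/>
  <target dev='vdb' bus='virtio'/>
  <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
</disk>
```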
(Of course this is kind of getting away from the original topic of the BZ, which was specifically about hotplugging devices.)
Making this type of "tuned" topology was part of the aim of the pci-devaddr project, which we talked about but which never really took off. The idea was a pre-processing step, prior to defining a domain in libvirt: pci-devaddr would be fed a higher-level description of which devices are desired, whether or not they should be hotpluggable, possibly which NUMA node each device should be on, how many free slots for potential hotplug are requested, and possibly some other details, and it would send back a full list of PCI controllers and endpoint devices with PCI addresses fully specified.
In the end, even with that you're still going to run up against the hard pcie limits though; you may just be able to delay it somewhat.
(In reply to Daniel Berrangé from comment #5)
[..]
> With PCIe we would see this problem whether cold plugging or hotplugging
> lots of disks, because in both cases we need lots of pcie-root-ports present
> at boot time.

Is it a technology limitation that we have to have additional pcie-root-ports cold plugged to allow future device hotplug? If we can solve this problem, then normal VMs don't have to carry around lots of *potentially unneeded/unused* pcie-root-ports, and we pay the performance penalty only if devices are actually hotplugged. IOW, is it possible to hot plug a pcie-root-port?

(In reply to Vivek Goyal from comment #7)
> Is it a technology limitation that we have to have additional pcie-root-ports
> cold plugged to allow future device hotplug. If we can solve this problem,
> then normal VMs don't have to carry around lots of *potentially
> unneeded/unused* pcie-root-ports and pay the performance penalty only if
> devices are hotplugged.
>
> IOW, is it possible to hot plug pcie-root-port.

The PCIe root complex does not permit hotplugging:

    $ qemu-system-x86_64 -M q35 -monitor stdio -display none
    QEMU 7.0.0 monitor - type 'help' for more information
    (qemu) device_add pcie-root-port
    Error: Bus 'pcie.0' does not support hotplugging

I don't know the specific reasons why, but it is a well known limitation, documented by the QEMU PCI subsystem maintainers: https://gitlab.com/qemu-project/qemu/-/blob/master/docs/pcie.txt#L220 and the recommended practice is to pre-create pcie-root-ports at cold boot to allow future hotplug: https://gitlab.com/qemu-project/qemu/-/blob/master/docs/pcie.txt#L254

Additionally, even if the prohibition on hot-plugging pcie-root-ports (or pcie-switch-downstream-ports) were overcome, my understanding is that the OSes running in the guest don't rescan their PCIe topology after boot time, so the new controllers/devices wouldn't be seen anyway.

This made me think of Bug 1408810. I wonder if allocating IO space is causing at least some of the slowness? And if so, could we modify things across the stack so that it's possible to convince the firmware and kernel to not allocate any, even when a PCIe device is already plugged into the port at boot time? If KubeVirt knows in advance that it's only ever going to use virtio-blk-non-transitional devices, it would be possible for it to take advantage of such an optimization.

Anyway, 64 devices is already quite a lot, and even though virtio-blk shows some boot time penalty in this scenario, it's really not too bad overall. Can we simply offer virtio-scsi as an alternative, and tell users that when they outgrow the virtio-blk default they can reach for it or, even better, try to rearchitect their deployment so that using that many devices becomes unnecessary? After all, there is no free lunch to be had here... Switching from virtio-blk to virtio-scsi makes bigger deployments possible and reduces boot time, but also results in a loss of runtime performance and potentially flexibility, due to the factors that Kevin listed.

Another CNV PR triggered another round of discussions: https://github.com/kubevirt/community/pull/220

We were discussing this BZ in the last Machine&PCI team meeting and we tend towards closing it:

1. It's too broad; we will need more specific and actionable BZs.
The "up to 64 devices" approach seems to make sense. In such a configuration, the layered product can populate 64 root ports and the boot time penalty is reasonable 3. In the long run, we may try to improve the boot time (e.g. try to target 240 devices with a reasonable boot time and a reasonable memory overhead) but we can slowly work towards it, and it is not relevant for the short term If no objections, we will close this BZ (but do push back if you see things differently) |