Bug 2179405
| Summary: | [RFE] [qemu] Better PCI hotplug on Q35 machines | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Vivek Goyal <vgoyal> |
| Component: | qemu-kvm | Assignee: | Michael S. Tsirkin <mst> |
| qemu-kvm sub component: | PCI | QA Contact: | Yiqian Wei <yiwei> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | abologna, ailan, berrange, edwardh, imammedo, jinzhao, jsuvorov, juzhang, kwolf, laine, stefanha, virt-maint, zhguo |
| Version: | 9.2 | Keywords: | FutureFeature |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-05-29 14:01:01 UTC | Type: | Feature Request |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Vivek Goyal
2023-03-17 17:23:14 UTC
> My very basic understanding is that I need to keep some root PCI ports configured in the VM at the time of VM start, and that allows for limited PCI hotplug. If that's the case, that does not scale well and it's difficult to plan ahead.

Yes, with any PCI-e based machine type (q35 on x86, virt on aarch64) you need to pre-create enough root ports by repeating <controller type='pci' model='pcie-root-port'/>. This is documented at https://libvirt.org/pci-hotplug.html

If apps don't want to think about it, and just want hotplug to be on a par with i440fx, then just repeat <controller type='pci' model='pcie-root-port'/> 31 times. They'll be able to add up to 31 devices, whether hot or cold plugged.

There are a great many other possible ways to approach the problem. One option is to set the PCI <address> for the controllers to make them appear as multi-function devices, which allows for over 200 devices on the root controller. Another option is to add PCI-e expander buses and attach the root ports to those. This is useful if the VM spans multiple virtual NUMA nodes and you want to express affinity between guest NUMA nodes and the devices it is given.

In general, though, I found that PCI did not scale up well to large device counts in QEMU. When I previously tried to boot a VM with 512 PCI devices (virtio-net NICs for the sake of testing) http://file.rdu.redhat.com/~berrange/tiny.xml I gave up waiting for Linux to finish booting after 15 minutes. Profiling showed QEMU spending all its time doing memory view updates during boot. This was so bad it couldn't even keep up with injecting timer interrupts; IOW, the Linux kernel dmesg timestamps tell you 6 seconds have passed, but in reality it has been 5 minutes of wall clock time. I found that as you add PCI devices the performance drops by a power rule (possibly O(n^2)), not linearly (O(n)) as you might naively hope for.

For storage, the alternative is to use SCSI, which lets you add effectively unlimited devices to a single controller. This is way easier for mgmt apps and the performance scales better in the tests I've seen. For example, this old blog post: https://rwmj.wordpress.com/2017/04/25/how-many-disks-can-you-add-to-a-virtual-linux-machine/ shows adding 4000 disks to a QEMU guest, and while boot time certainly slows down, it still completes in 10 minutes, on hardware that's significantly older than what I tested with for PCI. I'm not even sure it's possible to create a PCI topology servicing that many devices, as I think you run out of other resources first.

IOW, even ignoring the complexity of setting up a PCI topology for handling many devices, I'd still recommend using SCSI for the performance benefits in handling large numbers of devices. That frees up PCI slots to be used for other devices like NICs/assigned host devices, where you don't have the luxury of alternatives to PCI.
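To make the guidance above concrete, here is a minimal, hypothetical sketch in libvirt domain XML. The controller indexes and PCI address values are illustrative only (libvirt normally auto-assigns them); none of this is taken from an actual config attached to this bug.

```xml
<devices>
  <!-- Pre-created hotplug capacity: one pcie-root-port per device you may
       hotplug later (repeat as many times as needed, e.g. 31 times for
       rough parity with i440fx). -->
  <controller type='pci' model='pcie-root-port'/>
  <controller type='pci' model='pcie-root-port'/>
  <controller type='pci' model='pcie-root-port'/>

  <!-- Denser layout: root ports packed 8-per-slot on pcie.0 by giving them
       explicit PCI addresses and marking function 0 as multifunction
       (slot/function values here are examples only). -->
  <controller type='pci' index='4' model='pcie-root-port'>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
  </controller>
  <controller type='pci' index='5' model='pcie-root-port'>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
  </controller>
</devices>
```

The pcie-expander-bus option mentioned above follows the same pattern, with the root ports attached to expander buses that are associated with guest NUMA nodes.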
(In reply to Daniel Berrangé from comment #2)
> > My very basic understanding is that I need to keep some root pci ports configured in the VM at the time of VM start and that allows for limited PCI hotplug. If that's the case, that does not scale well and its difficult to plan ahead.
...snip...

For a comparison based on a somewhat plausible high-end VM config comprising 64 disks (or equivalently 32 disks and 32 NICs, since the type of device doesn't really affect the PCI scalability):

* http://file.rdu.redhat.com/~berrange/tiny-q35-64.xml
  The 64 virtio-blk disks take 10 seconds to launch QEMU and boot Linux until it runs init.
* http://file.rdu.redhat.com/~berrange/tiny-scsi-64.xml
  The 64 virtio-scsi disks take 2 seconds to launch QEMU and boot Linux until it runs init.

So there's a noticeable difference in performance even at 64 disks, but still small enough that users will probably be OK with it. At the high end, I was not able to get past 192 devices with Q35 before Linux reported that it ran out of interrupts.

* http://file.rdu.redhat.com/~berrange/tiny-q35-192.xml
  The 192 virtio-blk disks take 1 minute 35 seconds to launch QEMU and boot Linux until it runs init.
* http://file.rdu.redhat.com/~berrange/tiny-scsi-512.xml
  The 512 virtio-scsi disks take 4 seconds to launch QEMU and boot Linux until it runs init.

Here we see how well SCSI scales: we can't even get to 512 devices with virtio-blk, and given the degradation trend we would be waiting an incredibly long time for it to boot Linux, whereas the virtio-scsi boot time increase is negligible as the disk count grows.

> IOW, even ignoring the complexity of setting up a PCI topology for handling many devices, I'd still recommend using SCSI for the performance benefits in handling large numbers of devices. That frees up PCI slots to be used for other devices like NICs/assigned host devices, where you don't have the luxury of alternatives to PCI.

The question is what level of scalability we need to target for the number of devices. If 32 devices is sufficient, the scalability of SCSI vs PCI isn't especially noticeable. At 64 devices you notice the win of SCSI, but PCI is likely still just about fast enough to be acceptable. At >100, PCI is going to be a very hard sell.
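For reference, the structural difference between the two layouts being compared boils down to something like the following domain XML. This is a hedged sketch with placeholder image paths, not the contents of the tiny-*.xml configs linked above.

```xml
<!-- Sketch A (assumption): each virtio-blk disk is its own PCI device, so a
     large disk count means a correspondingly large number of pcie-root-ports
     present at boot. -->
<controller type='pci' model='pcie-root-port'/>
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/blk-disk0.qcow2'/>
  <target dev='vda' bus='virtio'/>
</disk>

<!-- Sketch B (assumption): one virtio-scsi controller occupies a single PCI
     slot, and every additional disk is just another LUN behind it. -->
<controller type='scsi' index='0' model='virtio-scsi'/>
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/scsi-disk0.qcow2'/>
  <target dev='sda' bus='scsi'/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
```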
(In reply to Daniel Berrangé from comment #2)
> I found that as you add PCI devices the performance drops by a power rule
> (possible O(n^2)), not linearly (O(n)) as you might naively hope for.

Is there any fundamental reason why it has to be like this, or is this basically a bug?

> IOW, even ignoring the complexity of setting up a PCI topology for handling
> many devices, I'd still recommend using SCSI for the performance benefits in
> handling large numbers of devices. That frees up PCI slots to be used for
> other devices like NICs/assigned host devices, where you don't have the
> luxury of alternatives to PCI.

That's a very one-sided view, though, that prioritises hotplugging lots of disks over everything else. In practice, most VMs probably have only very few disks, and hotplugging isn't the primary use case for disks anyway; you want to actually use them.

Apart from not using up PCI slots, SCSI comes with the advantage of supporting pretty much any feature you could think of, especially if you're passing through SCSI devices from the host. But it comes at the cost of much higher complexity and lower maintainability. Even looking only at QEMU, performance is reported to be a bit lower with virtio-scsi than with virtio-blk, we're seeing more bugs in it (especially related to iothreads), and while implementing things like multiqueue seems fairly straightforward on the device side for virtio-blk, there are quite a few complications with SCSI.

So if we want virtio-scsi not to fall even further behind virtio-blk in terms of performance and scalability, this comes with a considerable cost. And that doesn't even touch on the external implementations, where a SCSI implementation doesn't already exist the way it does in QEMU. vhost-user-blk is a relatively simple protocol that can easily be implemented; for example, we support it in qemu-storage-daemon as an export option, and in libblkio as a client. These things simply don't exist yet for virtio-scsi, and implementing them would mean creating a full SCSI implementation in each. There are community efforts around vdpa-blk and vhost-blk, but nothing based on virtio-scsi, which I'm sure is at least partially related to the complexity of SCSI, too.

So if we're going all in on virtio-scsi, we might improve the hotplug situation for the small percentage of VMs where it is even relevant, but we would lose a lot of other things.

(In reply to Kevin Wolf from comment #4)
> (In reply to Daniel Berrangé from comment #2)
> > I found that as you add PCI devices the performance drops by a power rule
> > (possible O(n^2)), not linearly (O(n)) as you might naively hope for.
>
> Is there any fundamental reason why it has to be like this, or is this
> basically a bug?

I queried this with Paolo a couple of years back and his response was:

[quote]
Yeah, and unfortunately this is known to be O(n^2) (N initializations, each adding 1..N regions to the memory map) and we cannot really do anything about it, but this seems definitely too much.
[/quote]

I believe there was a bug where we were maintaining duplicate mappings that has since been fixed, but these results with git master show the scalability is still degrading by a power. SCSI avoids this problem because adding more LUNs does not imply adding more guest memory mappings.

> > IOW, even ignoring the complexity of setting up a PCI topology for handling
> > many devices, I'd still recommend using SCSI for the performance benefits in
> > handling large numbers of devices. That frees up PCI slots to be used for
> > other devices like NICs/assigned host devices, where you don't have the
> > luxury of alternatives to PCI.
>
> That's a very one-sided view, though, that prioritises hotplugging lots of
> disks over everything else. In practice, most VMs probably have only very
> few disks and only hotplugging isn't the primary use case for disks, but you
> want to actually use them.

NB, I'm not specifically thinking of hotplug when mentioning the performance issue. With PCIe we would see this problem whether cold plugging or hotplugging lots of disks, because in both cases we need lots of pcie-root-ports present at boot time. With PCI it was less of an issue, as there were fewer memory mappings with PCI devices than with PCI-e.

Indeed, most VMs only have one or two disks, so they don't experience any problem. If there is, however, a desire to support many disks, then we need to be aware of the scalability of virtio-blk vs virtio-scsi.

> So if we're going all in on virtio-scsi, we might improve the hotplug
> situation for the small percentage of VMs where it is even relevant, but
> would lose a lot of other things.

NB, not just hotplug; large cold-plugged disk counts too. I do agree, though, that this is a minority of VMs. The question for mgmt apps is what kind of scaling they want to cater for with the number of disks, as this impacts the choices to be made, combined with whether they want to support multiple different disk types or standardize on one disk type.
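As an aside to the vhost-user-blk point above, a hedged sketch of what consuming such an export can look like on the libvirt side; the socket path is a placeholder and attribute details may vary by libvirt version.

```xml
<!-- Sketch only: a disk backed by a vhost-user-blk export, for example one
     created by qemu-storage-daemon listening on the (placeholder) socket path
     below. The guest sees an ordinary virtio-blk device; no SCSI emulation is
     involved anywhere in the path. -->
<disk type='vhostuser' device='disk'>
  <driver name='qemu' type='raw'/>
  <source type='unix' path='/var/run/qsd-vhost-user-blk.sock'/>
  <target dev='vdb' bus='virtio'/>
</disk>
```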
> NB, I'm not specifically thinking of hotplug when mentioning the performance
> issue. With PCIe we would see this problem whether cold plugging or hotplugging
> lots of disks, because in both cases we need lots of pcie-root-ports present
> at boot time.
Is the increase dependent on the number of pcie-root-ports, or simply on the total number of PCIe devices (including controllers and endpoint devices)? In either case, for "cold-plugged" (yech) devices the overhead could be decreased by using all 8 functions of each pcie-root-port rather than placing each endpoint device on a separate pcie-root-port. This would require more intelligence in our device placement strategy, since we currently have to assume that any device could potentially be hot-unplugged (and so has to have its own pcie-root-port). (We *do* at least auto-place the pcie-root-ports themselves 8-per-slot on the pcie root complex, since they can't be hotplugged/unplugged, so we end up running out of other resources before we use up all the slots on pcie-root.)
Anyway, on a system with, for example, 64 PCIe endpoint devices that weren't being hotplugged, placing 8 devices on each pcie-root-port would let us use 8 pcie-root-ports rather than 64 (see the sketch below).
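A hedged illustration of that packing in domain XML; the bus/slot values assume the root port ends up as controller index 1 and are examples only, not taken from this bug.

```xml
<!-- Two cold-plugged virtio disks sharing the single slot behind one
     pcie-root-port as functions 0 and 1 (up to 8 functions fit per slot).
     Neither device can then be hot-unplugged on its own. -->
<controller type='pci' index='1' model='pcie-root-port'/>
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/d0.qcow2'/>
  <target dev='vda' bus='virtio'/>
  <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0' multifunction='on'/>
</disk>
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/d1.qcow2'/>
  <target dev='vdb' bus='virtio'/>
  <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
</disk>
```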
(Of course this is kind of getting away from the original topic of the BZ, which was specifically about hotplugging devices.)
Making this type of "tuned" topology was part of the aim of the pci-devaddr project, which we talked about but which never really took off. The idea was a pre-processing step, prior to defining a domain in libvirt: pci-devaddr would be fed a higher-level description of which devices are desired, whether or not they should be hotpluggable, possibly which NUMA node each device should be on, how many free slots for potential hotplug are requested, and possibly some other details, and it would send back a full list of PCI controllers and endpoint devices with PCI addresses fully specified.
In the end, even with that you're still going to run up against the hard pcie limits though; you may just be able to delay it somewhat.
(In reply to Daniel Berrangé from comment #5)
[..]
> With PCIe we would see this problem whether cold plugging or hotplugging
> lots of disks, because in both cases we need lots of pcie-root-ports present
> at boot time.

Is it a technology limitation that we have to have additional pcie-root-ports cold plugged to allow future device hotplug? If we can solve this problem, then normal VMs don't have to carry around lots of *potentially unneeded/unused* pcie-root-ports, and we pay the performance penalty only if devices are actually hotplugged. IOW, is it possible to hot plug a pcie-root-port?

(In reply to Vivek Goyal from comment #7)
> Is it a technology limitation that we have to have additional pcie-root-ports
> cold plugged to allow future device hotplug. If we can solve this problem,
> then normal VMs don't have to carry around lots of *potentially
> unneeded/unused* pcie-root-ports and pay the performance penalty only if
> devices are hotplugged.
>
> IOW, is it possible to hot plug pcie-root-port.

The PCIe root complex does not permit hotplugging:

    $ qemu-system-x86_64 -M q35 -monitor stdio -display none
    QEMU 7.0.0 monitor - type 'help' for more information
    (qemu) device_add pcie-root-port
    Error: Bus 'pcie.0' does not support hotplugging

I don't know the specific reasons why, but it is a well known limitation, documented by the QEMU PCI subsystem maintainers: https://gitlab.com/qemu-project/qemu/-/blob/master/docs/pcie.txt#L220 and the recommended practice is to pre-create pcie-root-ports at cold boot to allow future hotplug: https://gitlab.com/qemu-project/qemu/-/blob/master/docs/pcie.txt#L254

Additionally, even if the prohibition on hot-plugging pcie-root-ports (or pcie-switch-downstream-ports) were overcome, my understanding is that the OSes running in the guest don't rescan their PCIe topology after boot time, so the new controllers/devices wouldn't be seen anyway.

This made me think of Bug 1408810. I wonder if allocating IO space is causing at least some of the slowness? And if so, could we modify things across the stack so that it's possible to convince the firmware and kernel to not allocate any, even when a PCIe device is already plugged into the port at boot time? If KubeVirt knows in advance that it's only ever going to use virtio-blk-non-transitional devices, it would be possible for it to take advantage of such an optimization.

Anyway, 64 devices is already quite a lot, and even though virtio-blk shows some boot time penalty in this scenario, it's really not too bad overall. Can we simply offer virtio-scsi as an alternative, and tell users that when they outgrow the virtio-blk default they can reach for it or, even better, try to rearchitect their deployment so that using that many devices becomes unnecessary? After all, there is no free lunch to be had here... Switching from virtio-blk to virtio-scsi makes bigger deployments possible and reduces boot time, but also results in a loss of runtime performance and potentially flexibility, due to the factors that Kevin listed.

Another CNV PR triggered another round of discussions: https://github.com/kubevirt/community/pull/220

We were discussing this BZ in the last Machine&PCI team meeting and we tend towards closing it:

1. It's too broad; we will need more specific and actionable BZs.
The "up to 64 devices" approach seems to make sense. In such a configuration, the layered product can populate 64 root ports and the boot time penalty is reasonable 3. In the long run, we may try to improve the boot time (e.g. try to target 240 devices with a reasonable boot time and a reasonable memory overhead) but we can slowly work towards it, and it is not relevant for the short term If no objections, we will close this BZ (but do push back if you see things differently) |