Bug 2050175 - VM with q35, maxcpus=256 and two host devices from the same IOMMU group cannot be started
Summary: VM with q35, maxcpus=256 and two host devices from the same IOMMU group cannot be started
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: qemu-kvm
Version: 8.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Amnon Ilan
QA Contact: Yanghang Liu
URL:
Whiteboard:
Depends On:
Blocks: 2048429 2081241
 
Reported: 2022-02-03 12:15 UTC by Milan Zamazal
Modified: 2022-05-13 01:30 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-12 17:24:44 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
QEMU command line (6.94 KB, text/plain) - 2022-02-03 12:15 UTC, Milan Zamazal


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-110965 0 None None None 2022-02-03 12:18:59 UTC

Description Milan Zamazal 2022-02-03 12:15:31 UTC
Created attachment 1858860: QEMU command line

Description of problem:

When a VM has the following properties:

- Q35 chipset
- maximum number of CPUs = 256
- two passthrough PCIe devices belonging to the same IOMMU group (for example, two network cards in the same IOMMU group, or a GPU together with its audio function)

then it fails to start, with an error message like:

"vfio 0000:09:00.1: group 16 used in multiple address spaces"

With the i440FX chipset, or with the maximum number of CPUs below 256, the VM starts.
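
For orientation, here is a trimmed, hypothetical sketch of that combination on a QEMU command line (the real command line is in the attachment; the intel-iommu and vfio-pci options mirror the ones quoted in comment 1 below, everything else is illustrative):

    # Abridged sketch only; 0000:09:00.0 and 0000:09:00.1 stand in for the two
    # host functions that share an IOMMU group.
    /usr/libexec/qemu-kvm \
      -machine q35,kernel_irqchip=split \
      -smp 4,maxcpus=256 \
      -device intel-iommu,intremap=on,caching-mode=on,eim=on \
      -device vfio-pci,host=0000:09:00.0 \
      -device vfio-pci,host=0000:09:00.1 \
      ...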

Version-Release number of selected component (if applicable):

QEMU 6.2.0-5.module+el8.6.0+14015+259232db
libvirt 8.0.0-1.module+el8.6.0+13896+a8fa8f67
kernel 4.18.0-348.12.2.el8_5.x86_64

It also happens with QEMU 6.0 and libvirt 7.6.

How reproducible:

100% in the tested environments

Steps to Reproduce:
1. Create a VM with the properties above (see the attachment for an example QEMU command line).
2. Try to start it.

Actual results:

The VM fails to start.

Expected results:

The VM starts.

Additional info:

It has been observed in an RHV environment; see BZ 2048429 for more details.

A different error message was also observed under similar circumstances, with a more complex combination of passed-through host devices (see the bug above for details):

"vfio 0000:af:00.1: failed to setup container for group 150: memory listener initialization failed: Region ram-node0: vfio_dma_map(0x5630dd4b9af0, 0x0, 0x80000000, 0x7f5a73e00000) = -12 (Cannot allocate memory)"

Comment 1 Yanghang Liu 2022-02-07 06:09:12 UTC
Hi Milan,

It seems to me that this bug is invalid.

The intel-iommu device and the two PFs (which are in the same IOMMU group) conflict with each other over address spaces, which prevents the VM from starting.

> -machine pc-q35-rhel8.4.0,usb=off,dump-guest-core=off,kernel_irqchip=split,pflash0=libvirt-pflash0-format,pflash1=libvirt-pflash1-format,graphics=off \
> -device intel-iommu,intremap=on,caching-mode=on,eim=on \
> -device vfio-pci,host=0000:09:00.0,id=ua-21dbb711-7f4f-4958-b6da-f9f052587e6d,bus=pci.4,addr=0x0 \
> -device vfio-pci,host=0000:09:00.1,id=ua-b0231a3f-07ad-442c-9308-62a0fefc9b52,bus=pci.5,addr=0x0 \

The VM is likely to start successfully after you remove either the intel-iommu device or the two PFs (in the same IOMMU group) from the VM configuration.
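
For illustration, the members of the group can be listed on the host via sysfs (address taken from the quoted command line):

    # The iommu_group symlink resolves to /sys/kernel/iommu_groups/<N>;
    # its devices/ directory lists every device placed in that group.
    ls /sys/bus/pci/devices/0000:09:00.0/iommu_group/devices/
    # If the two functions really share a group, this lists both,
    # e.g. 0000:09:00.0  0000:09:00.1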

Can you retest and confirm this in your test environment?

Comment 2 Yanghang Liu 2022-02-07 06:24:25 UTC

The following bugs likely have the same root cause as this bug.

Bug 1619739 - [RFE] vfio non-singleton group + viommu support
Bug 1627499 - [RFE] Account for AddressSpace aliases due to conventional PCI buses
Bug 1715724 - The same iommu_group NICs can not be assigned to a Win2019 guest at the same time

Comment 3 Milan Zamazal 2022-02-14 20:09:07 UTC
Hi Yanghang, thank you for the explanation. It indeed looks like the same cause: we add an IOMMU when the maximum number of vCPUs is >= 256, which is when the problem occurs if there are additionally two devices in the same IOMMU group.

IOMMU is required for max vCPUs >= 256. That means the number of vCPUs is limited when there are multiple devices in the same IOMMU group.
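
To make the trade-off concrete, here is a hypothetical sketch of the two configurations implied above that do start (options abridged, host addresses reused from the quoted command line):

    # (a) Keep both functions from the group, but stay below the vIOMMU
    #     threshold so no intel-iommu device gets added:
    -smp 4,maxcpus=255 \
    -device vfio-pci,host=0000:09:00.0 \
    -device vfio-pci,host=0000:09:00.1

    # (b) Keep maxcpus=256 and the intel-iommu device, but assign only one
    #     function from the group:
    -smp 4,maxcpus=256 \
    -device intel-iommu,intremap=on,caching-mode=on,eim=on \
    -device vfio-pci,host=0000:09:00.0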

Comment 8 Igor Mammedov 2022-03-21 15:45:04 UTC
I don't see an obvious connection with maxcpus.
Perhaps the best person to look into it is the one who might know more about vfio&co.
CCing Alex.

Comment 9 Alex Williamson 2022-03-21 15:59:49 UTC
As YangHang correctly identifies, the addition of the vIOMMU at >= 256 vCPUs introduces multiple address spaces for devices, making this configuration invalid.

If we were to assign only one device from the IOMMU group, this would be a valid configuration; for instance, the GPU could be installed without the audio function.

Alternatively, we'd need to move to a guest PCI topology that doesn't require multiple device address spaces. This can be accomplished with a pcie-to-pci bridge device. All devices on the conventional PCI side of the bridge share an address space; therefore the restriction of a single address space within an IOMMU group is satisfied.
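
A hypothetical sketch of what that could look like on the QEMU command line (device IDs and slot numbers are made up; host addresses reused from above):

    # Conventional PCI bridge behind a root port; both group members sit behind
    # it and therefore share one address space under the vIOMMU.
    -device pcie-root-port,id=rp1,bus=pcie.0,chassis=1,addr=0x4 \
    -device pcie-pci-bridge,id=conv_pci1,bus=rp1 \
    -device vfio-pci,host=0000:09:00.0,bus=conv_pci1,addr=0x1 \
    -device vfio-pci,host=0000:09:00.1,bus=conv_pci1,addr=0x2

In libvirt terms this should correspond to a <controller type='pci' model='pcie-to-pci-bridge'/> with both hostdev devices addressed onto that bus.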

Comment 11 Alex Williamson 2022-03-30 20:36:19 UTC
(In reply to Amnon Ilan from comment #10)
> Alex, Is it related to bug#1619734?

No, this is an isolation and address space issue, not an accounting issue. I wish it were an accounting issue; upstream work on iommufd should eventually fix that, though the timeline is not insignificant.

> How do you see the next steps here?

IOMMU groups represent the smallest unit of isolation for device assignment. In some cases devices are grouped together because the IOMMU cannot distinguish separate devices; in other cases it's because we cannot conclusively determine that untranslated DMA between the devices is prevented.

The former is often a topology issue on the host; for instance, host devices on a conventional PCI bus all use the same requester ID. This case cannot be solved; the devices necessarily share an IOV address space.

The latter case is more common for multi-function devices, or is the result of host interconnect devices that might allow redirection, i.e. root ports and switches. For these cases we recommend that system, interconnect, and device vendors support PCIe Access Control Services (ACS), which allows the OS to identify, and in some cases control, the isolation. For existing hardware, our only option is to consult with the hardware vendor to determine whether equivalent isolation exists in routing or between functions, and to add software quirks to the kernel to expose the inherent isolation via smaller groups.
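
For illustration, whether a device (or the root port/switch port above it) exposes ACS can be checked from the host with lspci (address hypothetical, reused from above):

    # Look for an Access Control Services capability in extended config space;
    # run as root so lspci can read the extended capabilities.
    lspci -s 0000:09:00.0 -vvv | grep -i -A2 'Access Control Services'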

Upstream work largely focuses on singleton groups, which is where we expect hardware designed for these sorts of use cases to converge. So while there might be some opportunity to create separate address spaces within a group using the developments in iommufd, I don't necessarily expect that to be a focus.

The path forward here is to handle multi-device groups on a case-by-case basis: identify whether the grouping is the result of the system, the interconnects, or the device itself, and work with the appropriate partner to determine if an isolation quirk is appropriate. Meanwhile, recommend to customers devices and systems that don't have such issues.

Prior to this requirement to support vIOMMU for large numbers of vCPUs, I think the worlds of VMs with both assigned GPUs and vIOMMU didn't often cross paths.  We've already worked actively to provide quirks for many NIC devices.

Comment 12 Yanghang Liu 2022-04-15 02:38:13 UTC
Hi Milan,

May I ask if you have any other concerns about this bug?

Is it OK with you if we close this bug?

Comment 13 Milan Zamazal 2022-04-19 13:59:36 UTC
(In reply to Yanghang Liu from comment #12)

> Is it ok for you that we close this bug ?

Hi Yanghang, as I understand the explanations above, there is no single solution and such problems must be handled case by case. In that case, let's close the bug. If there is a specific case we need to handle in the future, we'll open a separate bug.

Comment 14 Yash Mankad 2022-05-12 17:24:44 UTC
Closing this bug as NOTABUG based on Milan's comment #c13

