Bug 2050175

Summary: VM with q35, maxcpus=256 and two host devices from the same IOMMU group cannot be started
Product: Red Hat Enterprise Linux 8
Component: qemu-kvm
qemu-kvm sub component: Devices
Version: 8.6
Hardware: Unspecified
OS: Unspecified
Status: CLOSED NOTABUG
Severity: unspecified
Priority: unspecified
Target Milestone: rc
Target Release: ---
Reporter: Milan Zamazal <mzamazal>
Assignee: Amnon Ilan <ailan>
QA Contact: Yanghang Liu <yanghliu>
Docs Contact:
CC: ahadas, alex.williamson, chayang, coli, gilboad, imammedo, jinzhao, juzhang, mst, virt-maint, yanghliu, ymankad
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-12 17:24:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2048429, 2081241
Attachments: QEMU command line

Description Milan Zamazal 2022-02-03 12:15:31 UTC
Created attachment 1858860 [details]
QEMU command line

Description of problem:

When a VM has the following properties:

- Q35 chipset
- maximum number of CPUs = 256
- two passthrough PCIe devices belonging to the same IOMMU group (for example two network cards from the same IOMMU group or a GPU equipped with an audio device)

then it fails to start, with an error message like:

"vfio 0000:09:00.1: group 16 used in multiple address spaces"

With i440FX chipset or the maximum number of CPUs less than 256, the VM starts.
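
For reference, which host devices share an IOMMU group can be listed on the host through sysfs (standard kernel interface; the PCI address below is the one from the error message):

# list all devices in the same IOMMU group as 0000:09:00.1
ls /sys/bus/pci/devices/0000:09:00.1/iommu_group/devices/
# print the group link itself (it ends with the group number, 16 above)
readlink /sys/bus/pci/devices/0000:09:00.1/iommu_group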

Version-Release number of selected component (if applicable):

QEMU 6.2.0-5.module+el8.6.0+14015+259232db
libvirt 8.0.0-1.module+el8.6.0+13896+a8fa8f67
kernel 4.18.0-348.12.2.el8_5.x86_64

It also happens with QEMU 6.0 and libvirt 7.6.

How reproducible:

100% in the tested environments

Steps to Reproduce:
1. Create a VM with the properties above (see the attachment for an example QEMU command line).
2. Try to start it.

Actual results:

The VM fails to start.

Expected results:

The VM starts.

Additional info:

This has been observed in an RHV environment; see BZ 2048429 for more details.

A different error message was also observed under similar circumstances, with a more complex combination of passed-through host devices (see the bug above for details):

"vfio 0000:af:00.1: failed to setup container for group 150: memory listener initialization failed: Region ram-node0: vfio_dma_map(0x5630dd4b9af0, 0x0, 0x80000000, 0x7f5a73e00000) = -12 (Cannot allocate memory)"

Comment 1 Yanghang Liu 2022-02-07 06:09:12 UTC
Hi Milan,

It seems to me that this bug is invalid.

The intel-iommu device and the two PFs (which are in the same IOMMU group) conflict with each other over address spaces, which prevents the VM from starting.

> -machine pc-q35-rhel8.4.0,usb=off,dump-guest-core=off,kernel_irqchip=split,pflash0=libvirt-pflash0-format,pflash1=libvirt-pflash1-format,graphics=off \
> -device intel-iommu,intremap=on,caching-mode=on,eim=on \
> -device vfio-pci,host=0000:09:00.0,id=ua-21dbb711-7f4f-4958-b6da-f9f052587e6d,bus=pci.4,addr=0x0 \
> -device vfio-pci,host=0000:09:00.1,id=ua-b0231a3f-07ad-442c-9308-62a0fefc9b52,bus=pci.5,addr=0x0 \

The VM will likely start successfully after you remove either the intel-iommu device or the two PFs (in the same IOMMU group) from the VM configuration.
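
For illustration, a sketch of the first variant, i.e. the same configuration with the intel-iommu device dropped and both vfio-pci devices kept (excerpt only, adapted from the quoted command line and not tested here):

# the '-device intel-iommu,...' line is removed; both assigned functions stay
-device vfio-pci,host=0000:09:00.0,id=ua-21dbb711-7f4f-4958-b6da-f9f052587e6d,bus=pci.4,addr=0x0 \
-device vfio-pci,host=0000:09:00.1,id=ua-b0231a3f-07ad-442c-9308-62a0fefc9b52,bus=pci.5,addr=0x0 \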

Could you retest and confirm this in your test environment?

Comment 2 Yanghang Liu 2022-02-07 06:24:25 UTC

The following bugs likely have the same root cause as this one.

Bug 1619739 - [RFE] vfio non-singleton group + viommu support
Bug 1627499 - [RFE] Account for AddressSpace aliases due to conventional PCI buses
Bug 1715724 - The same iommu_group NICs can not be assigned to a Win2019 guest at the same time

Comment 3 Milan Zamazal 2022-02-14 20:09:07 UTC
Hi Yanghang, thank you for the explanation. It indeed looks like the same cause: we add an IOMMU when the maximum number of vCPUs is >= 256, which is when the problem occurs if there are additionally two devices in the same IOMMU group.

An IOMMU is required for max vCPUs >= 256, which means the number of vCPUs is limited when there are multiple devices in the same IOMMU group.
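
For context, the coupling as I understand it: with maxcpus >= 256 the guest needs x2APIC and interrupt remapping, which in this configuration comes from the vIOMMU, so the generated command line ends up containing something like the following (the -smp value is hypothetical, the other options are from the attachment):

# >= 256 possible vCPUs -> interrupt remapping (intremap=on, eim=on) via the vIOMMU
-smp 16,maxcpus=256 \
-machine pc-q35-rhel8.4.0,kernel_irqchip=split \
-device intel-iommu,intremap=on,caching-mode=on,eim=on \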

Comment 8 Igor Mammedov 2022-03-21 15:45:04 UTC
I don't see an obvious connection with maxcpus.
Perhaps the best person to look into it is someone who knows more about vfio & co.
CCing Alex.

Comment 9 Alex Williamson 2022-03-21 15:59:49 UTC
As YangHang correctly identifies, the addition of the vIOMMU at >=256 vCPU introduces multiple address spaces for devices, making this configuration invalid.

If we were to assign only one device from the IOMMU group, this would be a valid configuration; for instance, the GPU could be installed without the audio function.

Alternatively, we'd need to move to a guest PCI topology that doesn't require multiple device address spaces. This can be accomplished with a pcie-to-pci bridge device. All devices on the conventional PCI side of the bridge share an address space, therefore the restriction of a single address space within an IOMMU group is satisfied.
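
A rough sketch of what that could look like on the QEMU command line (the bridge id and slot addresses below are made up, and the layout has not been validated against the attached configuration):

# a conventional PCI bridge below the PCIe root complex; both functions of the
# IOMMU group sit behind it and therefore share a single address space
-device pcie-pci-bridge,id=pci-br0,bus=pcie.0,addr=0x5 \
-device vfio-pci,host=0000:09:00.0,bus=pci-br0,addr=0x1 \
-device vfio-pci,host=0000:09:00.1,bus=pci-br0,addr=0x2 \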

Comment 11 Alex Williamson 2022-03-30 20:36:19 UTC
(In reply to Amnon Ilan from comment #10)
> Alex, Is it related to bug#1619734?

No, this is an isolation and address space issue, not an accounting issue. I wish it were an accounting issue; upstream work with iommufd should eventually fix that, though the timeline is not insignificant.

> How do you see the next steps here?

IOMMU groups represent the smallest unit of isolation for device assignment. In some cases devices are grouped together because the IOMMU cannot distinguish separate devices; in other cases it's because we cannot conclusively determine that untranslated DMA between the devices is prevented.

The former is often a topology issue on the host; for instance, host devices on a conventional PCI bus all use the same requester ID. This case cannot be solved: the devices necessarily share an IOV address space.

The latter case is more common for multi-function devices, or it results from host interconnect devices that might allow redirection, i.e. root ports and switches. For these cases we recommend that system, interconnect, and device vendors support PCIe Access Control Services (ACS), which allows the OS to identify, and in some cases control, the isolation. For existing hardware, our only option is to consult with the hardware vendor to determine whether equivalent isolation exists in routing or between functions, and to add software quirks to the kernel that expose the inherent isolation via smaller groups.
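
For reference, whether a given port or function exposes ACS is visible in lspci output (the address below is just the device from this report, used as an example):

# look for the Access Control Services capability and its enabled bits
sudo lspci -s 0000:09:00.0 -vvv | grep -A 2 'Access Control Services'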

Upstream work largely focuses on singleton groups, which is where we expect hardware designed for these sorts of use cases to converge. So while there might be some opportunity to create separate address spaces within a group using the developments in iommufd, I don't necessarily expect that to be a focus.

The path forward here is to handle multi-device groups on a case-by-case basis: identify whether the grouping is the result of the system, the interconnects, or the device itself, and work with the appropriate partner to determine whether an isolation quirk is appropriate. Meanwhile, recommend devices and systems that don't have such issues to customers.

Prior to this requirement to support vIOMMU for large numbers of vCPUs, I think the worlds of VMs with both assigned GPUs and vIOMMU didn't often cross paths.  We've already worked actively to provide quirks for many NIC devices.

Comment 12 Yanghang Liu 2022-04-15 02:38:13 UTC
Hi Milan,

May I ask if you have any other concerns about this bug?

Is it OK with you if we close this bug?

Comment 13 Milan Zamazal 2022-04-19 13:59:36 UTC
(In reply to Yanghang Liu from comment #12)

> Is it ok for you that we close this bug ?

Hi Yanghang, as I understand the explanations above, there is no single solution and such problems must be handled case by case. In that case, let's close the bug. If there is a specific case in the future that we need to handle, we'll open a separate bug.

Comment 14 Yash Mankad 2022-05-12 17:24:44 UTC
Closing this bug as NOTABUG based on Milan's comment #c13.