Description of problem:

Cannot start a VM with this configuration:
- GPU passthrough (device is pci-stubbed in grub; see the sketch at the end of this comment)
- Q35 machine type
- 16 vCPUs in a single virtual socket (s:c:th = 1:8:2)
(The server has a NUMA configuration - not sure if this is relevant.)

The VM fails with:

2022-01-28T09:03:08.866199Z qemu-kvm: -device vfio-pci,host=0000:af:00.1,id=ua-582527a5-a6ea-475a-9c86-816f973ea027,bus=pci.6,addr=0x0: VFIO_MAP_DMA failed: Cannot allocate memory
2022-01-28T09:03:08.912870Z qemu-kvm: -device vfio-pci,host=0000:af:00.1,id=ua-582527a5-a6ea-475a-9c86-816f973ea027,bus=pci.6,addr=0x0: VFIO_MAP_DMA failed: Cannot allocate memory
2022-01-28T09:03:08.912990Z qemu-kvm: -device vfio-pci,host=0000:af:00.1,id=ua-582527a5-a6ea-475a-9c86-816f973ea027,bus=pci.6,addr=0x0: vfio 0000:af:00.1: failed to setup container for group 150: memory listener initialization failed: Region ram-node0: vfio_dma_map(0x5630dd4b9af0, 0x0, 0x80000000, 0x7f5a73e00000) = -12 (Cannot allocate memory)
2022-01-28 09:03:09.005+0000: shutting down, reason=failed

Workarounds:
- changing to i440FX solves the issue
- changing the CPU layout to more sockets also works -> (s:c:th = 2:4:2)

This looks very much like the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=2026893, which wasn't fixed by caching-mode=on.

Version-Release number of selected component (if applicable):
qemu-kvm-6.1.0-5.module_el8.6.0+1040+0ae94936.x86_64
libvirt-7.10.0-1.module_el8.6.0+1046+bd8eec5e.x86_64
vdsm-4.40.100.2-1.el8.x86_64

How reproducible:
always

Steps to Reproduce:
- GPU passthrough (device is pci-stubbed in grub)
- Q35 machine type
- 16 vCPUs in a single virtual socket (s:c:th = 1:8:2)

Additional info:

Logs from libvirt:

LC_ALL=C \
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin \
HOME=/var/lib/libvirt/qemu/domain-1-Windows-01-GPU0 \
XDG_DATA_HOME=/var/lib/libvirt/qemu/domain-1-Windows-01-GPU0/.local/share \
XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain-1-Windows-01-GPU0/.cache \
XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain-1-Windows-01-GPU0/.config \
/usr/libexec/qemu-kvm \
-name guest=Windows-01-GPU0,debug-threads=on \
-S \
-object '{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-1-Windows-01-GPU0/master-key.aes"}' \
-blockdev '{"driver":"file","filename":"/usr/share/OVMF/OVMF_CODE.secboot.fd","node-name":"libvirt-pflash0-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-pflash0-format","read-only":true,"driver":"raw","file":"libvirt-pflash0-storage"}' \
-blockdev '{"driver":"file","filename":"/var/lib/libvirt/qemu/nvram/d5ba5e0e-339e-40bb-90d8-e34d3d158261.fd","node-name":"libvirt-pflash1-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-pflash1-format","read-only":false,"driver":"raw","file":"libvirt-pflash1-storage"}' \
-machine pc-q35-rhel8.4.0,usb=off,dump-guest-core=off,kernel_irqchip=split,pflash0=libvirt-pflash0-format,pflash1=libvirt-pflash1-format \
-accel kvm \
-cpu host,migratable=on,vmx=on,kvm=off \
-m size=33554432k,slots=16,maxmem=134217728k \
-overcommit mem-lock=off \
-smp 16,maxcpus=256,sockets=16,dies=1,cores=8,threads=2 \
-object '{"qom-type":"iothread","id":"iothread1"}' \
-object '{"qom-type":"memory-backend-ram","id":"ram-node0","size":34359738368}' \
-numa node,nodeid=0,cpus=0-255,memdev=ram-node0 \
-uuid d5ba5e0e-339e-40bb-90d8-e34d3d158261 \
-smbios type=1,manufacturer=oVirt,product=RHEL,version=8.6-1.el8,serial=d7cb7a89-958c-2af6-c6b0-fc349767d7db,uuid=d5ba5e0e-339e-40bb-90d8-e34d3d158261,family=oVirt \
-display none \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=39,server=on,wait=off \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=2022-01-28T09:03:00,driftfix=slew \
-global kvm-pit.lost_tick_policy=delay \
-no-hpet \
-no-shutdown \
-global ICH9-LPC.disable_s3=1 \
-global ICH9-LPC.disable_s4=1 \
-boot strict=on \
-device intel-iommu,intremap=on,caching-mode=on,eim=on \
-device pcie-root-port,port=8,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x1 \
-device pcie-root-port,port=9,chassis=2,id=pci.2,bus=pcie.0,addr=0x1.0x1 \
-device pcie-root-port,port=10,chassis=3,id=pci.3,bus=pcie.0,addr=0x1.0x2 \
-device pcie-root-port,port=11,chassis=4,id=pci.4,bus=pcie.0,addr=0x1.0x3 \
-device pcie-root-port,port=12,chassis=5,id=pci.5,bus=pcie.0,addr=0x1.0x4 \
-device pcie-root-port,port=13,chassis=6,id=pci.6,bus=pcie.0,addr=0x1.0x5 \
-device pcie-root-port,port=14,chassis=7,id=pci.7,bus=pcie.0,addr=0x1.0x6 \
-device pcie-root-port,port=15,chassis=8,id=pci.8,bus=pcie.0,addr=0x1.0x7 \
-device pcie-root-port,port=16,chassis=9,id=pci.9,bus=pcie.0,multifunction=on,addr=0x2 \
-device pcie-root-port,port=17,chassis=10,id=pci.10,bus=pcie.0,addr=0x2.0x1 \
-device pcie-root-port,port=18,chassis=11,id=pci.11,bus=pcie.0,addr=0x2.0x2 \
-device pcie-root-port,port=19,chassis=12,id=pci.12,bus=pcie.0,addr=0x2.0x3 \
-device pcie-root-port,port=20,chassis=13,id=pci.13,bus=pcie.0,addr=0x2.0x4 \
-device pcie-root-port,port=21,chassis=14,id=pci.14,bus=pcie.0,addr=0x2.0x5 \
-device pcie-root-port,port=22,chassis=15,id=pci.15,bus=pcie.0,addr=0x2.0x6 \
-device pcie-root-port,port=23,chassis=16,id=pci.16,bus=pcie.0,addr=0x2.0x7 \
-device pcie-root-port,port=24,chassis=17,id=pci.17,bus=pcie.0,addr=0x3 \
-device pcie-pci-bridge,id=pci.18,bus=pci.4,addr=0x0 \
-device qemu-xhci,p2=8,p3=8,id=ua-2d5407e4-258d-4858-a667-7c8d68e6e079,bus=pci.2,addr=0x0 \
-device virtio-scsi-pci,iothread=iothread1,id=ua-c8f1426f-0f5c-4acb-8c65-f90ae8781eff,bus=pci.7,addr=0x0 \
-device virtio-serial-pci,id=ua-c43f87f4-49fd-409a-abb4-e243eaf3ac8e,max_ports=16,bus=pci.3,addr=0x0 \
-device ide-cd,bus=ide.2,id=ua-87b896ca-ac92-4645-bd7d-ef89927a6989,werror=report,rerror=report \
-blockdev '{"driver":"file","filename":"/rhev/data-center/mnt/localhost:_srv_rhv/bd9ac7ab-b2cc-4588-8fb9-da497907378c/images/adcac7da-7c0a-46a9-815a-8455c3327af0/0346ab51-49c8-450e-86e6-7bbc5f86ae99","aio":"threads","node-name":"libvirt-2-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-2-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-2-storage"}' \
-device scsi-hd,bus=ua-c8f1426f-0f5c-4acb-8c65-f90ae8781eff.0,channel=0,scsi-id=0,lun=0,device_id=adcac7da-7c0a-46a9-815a-8455c3327af0,drive=libvirt-2-format,id=ua-adcac7da-7c0a-46a9-815a-8455c3327af0,bootindex=1,write-cache=on,serial=adcac7da-7c0a-46a9-815a-8455c3327af0,werror=stop,rerror=stop \
-blockdev '{"driver":"file","filename":"/rhev/data-center/mnt/localhost:_srv_store_rhv/8de83773-d704-424c-b5b3-101a671c8954/images/8e146613-a4b7-41e8-b2f2-e2fbdf5246d7/e3a5b4e8-7781-4212-b98a-9f2c99e17337","aio":"threads","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"}' \
-device scsi-hd,bus=ua-c8f1426f-0f5c-4acb-8c65-f90ae8781eff.0,channel=0,scsi-id=0,lun=1,device_id=8e146613-a4b7-41e8-b2f2-e2fbdf5246d7,drive=libvirt-1-format,id=ua-8e146613-a4b7-41e8-b2f2-e2fbdf5246d7,write-cache=on,serial=8e146613-a4b7-41e8-b2f2-e2fbdf5246d7,werror=stop,rerror=stop \
-netdev tap,fds=41:43:44:45,id=hostua-daac2d8b-38cf-4ac0-b48c-b6fedebe2944,vhost=on,vhostfds=46:47:48:49 \
-device virtio-net-pci,mq=on,vectors=10,host_mtu=1500,netdev=hostua-daac2d8b-38cf-4ac0-b48c-b6fedebe2944,id=ua-daac2d8b-38cf-4ac0-b48c-b6fedebe2944,mac=56:6f:ca:5e:00:00,bus=pci.1,addr=0x0 \
-chardev socket,id=charchannel0,fd=50,server=on,wait=off \
-device virtserialport,bus=ua-c43f87f4-49fd-409a-abb4-e243eaf3ac8e.0,nr=1,chardev=charchannel0,id=channel0,name=ovirt-guest-agent.0 \
-chardev socket,id=charchannel1,fd=51,server=on,wait=off \
-device virtserialport,bus=ua-c43f87f4-49fd-409a-abb4-e243eaf3ac8e.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 \
-audiodev '{"id":"audio1","driver":"none"}' \
-device vfio-pci,host=0000:04:00.0,id=ua-3f3c1a1c-4b88-47b5-9119-f4056a96e93c,bus=pci.5,addr=0x0 \
-device vfio-pci,host=0000:af:00.1,id=ua-582527a5-a6ea-475a-9c86-816f973ea027,bus=pci.6,addr=0x0 \
-device vfio-pci,host=0000:3c:00.0,id=ua-5c49a9c0-8557-447a-9189-97bb3952a062,bus=pci.10,addr=0x0 \
-device vfio-pci,host=0000:af:00.0,id=ua-d3af7b38-0840-425b-8fcd-773e0d7dd03c,bus=pci.8,addr=0x0 \
-device virtio-balloon-pci,id=ua-71e3afb3-6c29-4201-87a3-5ed93fd873b1,bus=pci.9,addr=0x0 \
-object '{"qom-type":"rng-random","id":"objua-e8be8bc4-167f-4223-8628-b218a38c9ead","filename":"/dev/urandom"}' \
-device virtio-rng-pci,rng=objua-e8be8bc4-167f-4223-8628-b218a38c9ead,id=ua-e8be8bc4-167f-4223-8628-b218a38c9ead,bus=pci.11,addr=0x0 \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on

2022-01-28T09:03:08.866199Z qemu-kvm: -device vfio-pci,host=0000:af:00.1,id=ua-582527a5-a6ea-475a-9c86-816f973ea027,bus=pci.6,addr=0x0: VFIO_MAP_DMA failed: Cannot allocate memory
2022-01-28T09:03:08.912870Z qemu-kvm: -device vfio-pci,host=0000:af:00.1,id=ua-582527a5-a6ea-475a-9c86-816f973ea027,bus=pci.6,addr=0x0: VFIO_MAP_DMA failed: Cannot allocate memory
2022-01-28T09:03:08.912990Z qemu-kvm: -device vfio-pci,host=0000:af:00.1,id=ua-582527a5-a6ea-475a-9c86-816f973ea027,bus=pci.6,addr=0x0: vfio 0000:af:00.1: failed to setup container for group 150: memory listener initialization failed: Region ram-node0: vfio_dma_map(0x5630dd4b9af0, 0x0, 0x80000000, 0x7f5a73e00000) = -12 (Cannot allocate memory)
2022-01-28 09:03:09.005+0000: shutting down, reason=failed
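For reference, a minimal sketch of the "pci-stubbed in grub" setup mentioned above, assuming a typical kernel-command-line approach; the vendor:device IDs are placeholders (use the ones reported by lspci -nn for the actual card), and the grub.cfg path may differ on UEFI hosts:

# Illustrative only, not taken from this report: reserve the GPU and its audio
# function at boot via the kernel command line in /etc/default/grub.
GRUB_CMDLINE_LINUX="... intel_iommu=on pci-stub.ids=10de:1b80,10de:10f0"
# Then regenerate the grub configuration and reboot, e.g.:
grub2-mkconfig -o /boot/grub2/grub.cfg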
I cannot reproduce the problem. I tried to reproduce it on:
1. a host which is itself a VM with an emulated GPU, with the same QEMU version;
2. a bare metal host, passing through its real GPU, with QEMU 6.0.

The VM starts for me in both cases, using q35 with UEFI.

Petr, do you have an idea what could be special in your environment? Looking at possibly significant differences in the QEMU command lines, I can see you have more RAM in the VM (I used just 1 GB) and you have multiple host devices. Does the problem appear when you create a new plain default VM and just pass a GPU to it?
I tried a simple VM. I just created an additional VM:
- configured only the number of sockets, cores and threads
- added a GPU (also with its HDMI audio function - a device within the same IOMMU group)

Now it is failing with a different error, but changing the CPU topology or using i440FX helps, just as with the previous error message:

2022-02-01T21:17:12.654397Z qemu-kvm: -device vfio-pci,host=0000:3b:00.1,id=ua-f97d6e80-287f-4e57-b739-2719cbfaf6d1,bus=pci.5,addr=0x0: vfio 0000:3b:00.1: group 77 used in multiple address spaces
2022-02-01 21:17:13.738+0000: shutting down, reason=failed

(complete output added at the end of this comment)

But I still have a VM which produces the error mentioned in the description, so there are two different errors with the same behavior at the same time.

I'm actually not sure how the environment could be special. From the HW perspective it should work:
- The only thing that could be considered special is the use of GTX 1080 graphics cards, which are considered gaming rather than compute units. But half a year ago NVIDIA provided the necessary drivers to allow them to run in a virtualized environment with passthrough.

From the SW perspective:
- Currently I have the host and the engine installed on the same machine (running together within a single OS). I know this is unsupported, but I'm guessing it shouldn't affect the virtualization capabilities.

Complete log from libvirt:

2022-02-01 21:17:12.326+0000: starting up libvirt version: 7.10.0, package: 1.module_el8.6.0+1046+bd8eec5e (CentOS Buildsys <bugs>, 2021-12-13-15:33:57, ), qemu version: 6.0.0qemu-kvm-6.0.0-33.el8s, kernel: 4.18.0-358.el8.x86_64, hostname: nanoxia.teranode.cz
LC_ALL=C \
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin \
HOME=/var/lib/libvirt/qemu/domain-11-TestVM \
XDG_DATA_HOME=/var/lib/libvirt/qemu/domain-11-TestVM/.local/share \
XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain-11-TestVM/.cache \
XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain-11-TestVM/.config \
/usr/libexec/qemu-kvm \
-name guest=TestVM,debug-threads=on \
-S \
-object '{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-11-TestVM/master-key.aes"}' \
-machine pc-q35-rhel8.4.0,usb=off,dump-guest-core=off,kernel_irqchip=split \
-accel kvm \
-cpu Cascadelake-Server-noTSX,mpx=off,vmx=on \
-m size=1048576k,slots=16,maxmem=4194304k \
-overcommit mem-lock=off \
-smp 16,maxcpus=256,sockets=16,dies=1,cores=8,threads=2 \
-object '{"qom-type":"iothread","id":"iothread1"}' \
-object '{"qom-type":"memory-backend-ram","id":"ram-node0","size":1073741824}' \
-numa node,nodeid=0,cpus=0-255,memdev=ram-node0 \
-uuid cc06a3b6-23fa-4655-9583-c5b1394a384e \
-smbios type=1,manufacturer=oVirt,product=RHEL,version=8.6-1.el8,serial=d7cb7a89-958c-2af6-c6b0-fc349767d7db,uuid=cc06a3b6-23fa-4655-9583-c5b1394a384e,family=oVirt \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=40,server=on,wait=off \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=2022-02-01T21:17:12,driftfix=slew \
-global kvm-pit.lost_tick_policy=delay \
-no-hpet \
-no-shutdown \
-global ICH9-LPC.disable_s3=1 \
-global ICH9-LPC.disable_s4=1 \
-boot strict=on \
-device intel-iommu,intremap=on,caching-mode=on,eim=on \
-device pcie-root-port,port=16,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x2 \
-device pcie-root-port,port=17,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 \
-device pcie-root-port,port=18,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 \
-device pcie-root-port,port=19,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x3 \
-device pcie-root-port,port=20,chassis=5,id=pci.5,bus=pcie.0,addr=0x2.0x4 \
-device pcie-root-port,port=21,chassis=6,id=pci.6,bus=pcie.0,addr=0x2.0x5 \
-device pcie-root-port,port=22,chassis=7,id=pci.7,bus=pcie.0,addr=0x2.0x6 \
-device pcie-root-port,port=23,chassis=8,id=pci.8,bus=pcie.0,addr=0x2.0x7 \
-device pcie-root-port,port=24,chassis=9,id=pci.9,bus=pcie.0,multifunction=on,addr=0x3 \
-device pcie-root-port,port=25,chassis=10,id=pci.10,bus=pcie.0,addr=0x3.0x1 \
-device pcie-root-port,port=26,chassis=11,id=pci.11,bus=pcie.0,addr=0x3.0x2 \
-device pcie-root-port,port=27,chassis=12,id=pci.12,bus=pcie.0,addr=0x3.0x3 \
-device pcie-root-port,port=28,chassis=13,id=pci.13,bus=pcie.0,addr=0x3.0x4 \
-device pcie-root-port,port=29,chassis=14,id=pci.14,bus=pcie.0,addr=0x3.0x5 \
-device pcie-root-port,port=30,chassis=15,id=pci.15,bus=pcie.0,addr=0x3.0x6 \
-device pcie-root-port,port=31,chassis=16,id=pci.16,bus=pcie.0,addr=0x3.0x7 \
-device qemu-xhci,p2=8,p3=8,id=ua-094a02a3-a1bf-4666-9e32-cc0a6f453bf2,bus=pci.1,addr=0x0 \
-device virtio-scsi-pci,iothread=iothread1,id=ua-a8dbd624-62f0-4a99-bedc-9ee1dd31c696,bus=pci.3,addr=0x0 \
-device virtio-serial-pci,id=ua-14377d59-66cf-4d3b-902a-79c13eac3373,max_ports=16,bus=pci.2,addr=0x0 \
-device ide-cd,bus=ide.2,id=ua-77570a79-f1a9-4399-80b5-c130a7492ae2,werror=report,rerror=report \
-blockdev '{"driver":"file","filename":"/rhev/data-center/mnt/localhost:_srv_rhv/bd9ac7ab-b2cc-4588-8fb9-da497907378c/images/c18a08e3-ea43-42a9-b861-b4aaf631aeb7/68c9cbec-2940-489c-8639-65f4a6995634","aio":"threads","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"}' \
-device scsi-hd,bus=ua-a8dbd624-62f0-4a99-bedc-9ee1dd31c696.0,channel=0,scsi-id=0,lun=0,device_id=c18a08e3-ea43-42a9-b861-b4aaf631aeb7,drive=libvirt-1-format,id=ua-c18a08e3-ea43-42a9-b861-b4aaf631aeb7,bootindex=1,write-cache=on,serial=c18a08e3-ea43-42a9-b861-b4aaf631aeb7,werror=stop,rerror=stop \
-chardev socket,id=charchannel0,fd=51,server=on,wait=off \
-device virtserialport,bus=ua-14377d59-66cf-4d3b-902a-79c13eac3373.0,nr=1,chardev=charchannel0,id=channel0,name=ovirt-guest-agent.0 \
-chardev socket,id=charchannel1,fd=52,server=on,wait=off \
-device virtserialport,bus=ua-14377d59-66cf-4d3b-902a-79c13eac3373.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 \
-chardev spicevmc,id=charchannel2,name=vdagent \
-device virtserialport,bus=ua-14377d59-66cf-4d3b-902a-79c13eac3373.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 \
-device usb-tablet,id=input0,bus=ua-094a02a3-a1bf-4666-9e32-cc0a6f453bf2.0,port=1 \
-audiodev '{"id":"audio1","driver":"spice"}' \
-vnc 192.168.255.10:0,password=on,audiodev=audio1 \
-k en-us \
-spice port=5901,tls-port=5902,addr=192.168.255.10,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on \
-device qxl-vga,id=ua-fc879323-fd1a-450e-ac0c-ba25b10bd0d3,ram_size=67108864,vram_size=8388608,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pcie.0,addr=0x1 \
-device vfio-pci,host=0000:3b:00.0,id=ua-ee072d9a-53cc-4d6b-8e31-a77ed85ef9b5,bus=pci.4,addr=0x0 \
-device vfio-pci,host=0000:3b:00.1,id=ua-f97d6e80-287f-4e57-b739-2719cbfaf6d1,bus=pci.5,addr=0x0 \
-device virtio-balloon-pci,id=ua-41ea4488-54d8-406e-aaa5-071590085e60,bus=pci.6,addr=0x0 \
-object '{"qom-type":"rng-random","id":"objua-3693924a-8ffa-4048-b05e-8450763d998a","filename":"/dev/urandom"}' \
-device virtio-rng-pci,rng=objua-3693924a-8ffa-4048-b05e-8450763d998a,id=ua-3693924a-8ffa-4048-b05e-8450763d998a,bus=pci.7,addr=0x0 \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on

2022-02-01T21:17:12.654397Z qemu-kvm: -device vfio-pci,host=0000:3b:00.1,id=ua-f97d6e80-287f-4e57-b739-2719cbfaf6d1,bus=pci.5,addr=0x0: vfio 0000:3b:00.1: group 77 used in multiple address spaces
2022-02-01 21:17:13.738+0000: shutting down, reason=failed

I will add information about the IOMMU groups as an attachment.
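A generic sketch of how such IOMMU group information can be collected (this is a common approach, not necessarily the exact script used for the attachment):

# List every PCI device per IOMMU group on the host
for dev in /sys/kernel/iommu_groups/*/devices/*; do
    group=$(basename "$(dirname "$(dirname "$dev")")")
    printf 'group %s: ' "$group"
    lspci -nns "$(basename "$dev")"
done | sort -V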
Based on the information from Comment 2 and examination of the problem in Petr's environment, I could reproduce the error "group N used in multiple address spaces" under the following circumstances:
- q35 chipset
- max vCPUs >= 256
- two passthrough devices from the same IOMMU group

If any of these conditions is not met, the VM starts, so it looks like a QEMU bug (a minimal QEMU-level sketch follows this comment).

The original error, "-12 (Cannot allocate memory)", apparently occurs when a GPU equipped with an audio device (both in the same IOMMU group) is passed through. Since it happens under the same circumstances, it's probably related.
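For illustration, a hand-run sketch of the same conditions might look like the following; the device addresses 0000:3b:00.0/0000:3b:00.1 are borrowed from Comment 2 as an example of two functions in one IOMMU group bound to vfio-pci, and all other values are illustrative. At the QEMU level the vIOMMU device is what matters; oVirt/libvirt add it automatically once maxcpus >= 256 (eim=on requires intremap and a split irqchip):

/usr/libexec/qemu-kvm \
    -machine q35,kernel_irqchip=split -accel kvm -m 1G -nodefaults -display none \
    -smp 16,maxcpus=256,sockets=16,cores=8,threads=2 \
    -device intel-iommu,intremap=on,caching-mode=on,eim=on \
    -device pcie-root-port,id=rp1,chassis=1,bus=pcie.0,addr=0x2 \
    -device pcie-root-port,id=rp2,chassis=2,bus=pcie.0,addr=0x3 \
    -device vfio-pci,host=0000:3b:00.0,bus=rp1 \
    -device vfio-pci,host=0000:3b:00.1,bus=rp2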
(In reply to Milan Zamazal from comment #3)
> The original error, "-12 (Cannot allocate memory)", apparently occurs when a
> GPU equipped with an audio device (both in the same IOMMU group) is passed
> through.

Actually not just that, see Petr's comments above.

> it looks like a QEMU bug

Reported: BZ 2050175
This bug reminds me of two bugs which I have handled before. Please take a look and see if it's a bug with the same root cause.

Bug 1912093 - Failed to hotplug 2 PFs into a vm which has an iommu device.
Bug 1619734 - [RFE] vfio page accounting enhancements for viommu
didn't make it in time for 4.5.0, deferring
It seems there are two issues being confused here, but both are likely a result of creating configurations which automatically introduce a vIOMMU into the VM configuration. The vIOMMU is required when we exceed specific vCPU counts which require support for x2apic. A side-effect of enabling the vIOMMU is that it provides per-device address spaces in QEMU. The effect of this is two-fold relative to assigned devices.

First, locked memory accounting is done per vfio container, where QEMU necessarily creates a vfio container per supported device address space. There is no sharing of pinned page accounting between containers. Therefore a VM with, for example, 4 assigned devices in a vIOMMU configuration needs to be able to lock 4x the VM RAM size. I suspect this is where we're seeing "VFIO_MAP_DMA failed: Cannot allocate memory" errors. The host dmesg log should be able to confirm this (a rough check is sketched after this comment). We do not expect to have a fix for this until we switch to using an iommufd interface on the host, which is currently RFC upstream. The current solution is to increase the VM locked memory capabilities to account for this.

The second issue is that our granularity of attaching devices to address spaces is the IOMMU group. Devices are grouped together based on a lack of discernible isolation between the devices. This might be a result of a multi-function device which does not expose PCIe ACS (Access Control Services) to prove isolation, a PCIe host topology that does not support ACS to enforce isolation, or a conventional PCI topology on the host which implicitly cannot provide isolation. Therefore, when we create a VM configuration that introduces per-device address spaces, we need to account for the vfio group composition. In the case of assigning multiple functions from a graphics card, ex. GPU and audio functions, where the device does not support ACS to prove isolation, those devices cannot be placed into separate address spaces.

Possible solutions to this include:
- Only make use of configurations that do not require a vIOMMU for such devices.
- Assign only one function from the device, for example leaving the other function(s) unused on the host and bound to pci-stub or the vfio-pci driver.
- Move all devices in the group under a PCIe-to-PCI bridge in the guest topology, which serves to reduce the configuration to a shared address space for those devices.
- Work with the device hardware vendor to determine whether ACS equivalent isolation exists between the functions of the device and if so, implement quirks in the host kernel PCI code to expose that isolation, splitting the IOMMU group.
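A rough way to see the per-container accounting effect described above is to compare the locked-memory demand against the limit applied to the running QEMU process; the numbers and the guest name below are illustrative (the name is taken from comment 0, assuming a single matching process):

# With a vIOMMU, each assigned device gets its own vfio container, so the
# locked-memory demand is roughly (number of assigned devices) x (VM RAM),
# e.g. 4 devices x 32 GiB guest RAM = 128 GiB that must fit under the limit.
# Inspect the memlock limit actually applied to the running QEMU process:
prlimit --memlock --pid "$(pgrep -f 'guest=Windows-01-GPU0')"
# Failures show up in the host dmesg as RLIMIT_MEMLOCK being exceeded.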
Thank you, Alex, for the comprehensive explanation. Could you please clarify the memory locking issue?

> QEMU necessarily creates a vfio container per supported device address space ... a VM with, for example, 4 assigned devices in a vIOMMU configuration needs to be able to lock 4x the VM RAM size.

Does it mean that the vIOMMU address space size for a device corresponds to the VM RAM size?

> The current solution is to increase the VM locked memory capabilities to account for this.

What solution do you mean here? Something already available or something being worked on? And what are the VM locked memory capabilities you talk about?

All I can see in the libvirt documentation is the possibility of locking all the VM memory pages on the host (https://libvirt.org/formatdomain.html#elementsMemoryBacking), which is not recommended. I guess this is not very helpful with this problem. Is there anything else that can be done, perhaps in QEMU only?

I guess we haven't hit the locked memory problem yet and all we've experienced so far is the second problem with IOMMU groups. But let's be ready for it.
(In reply to Milan Zamazal from comment #10)
> Thank you, Alex, for comprehensive explanation. Could you please clarify the
> memory locking issue?

This is the locked memory limit for the VM process, as accessible via prlimit -l.

> > QEMU necessarily creates a vfio container per supported device address space ... a VM with, for example, 4 assigned devices in a vIOMMU configuration needs to be able to lock 4x the VM RAM size.
>
> Does it mean that the vIOMMU address space size for a device corresponds to
> the VM RAM size?

Essentially, yes.

> > The current solution is to increase the VM locked memory capabilities to account for this.
>
> What solution do you mean here? Something already available or something
> being worked on? And what are the VM locked memory capabilities you talk
> about?

The current solution is essentially a workaround of using <memtune> in the VM xml to increase the hard_limit, which affects the locked memory limit applied to the QEMU process (sketched below). prlimit is also an option for a hot-add scenario.

> All I can see in the libvirt documentation is the possibility of locking all
> the VM memory pages on the host
> (https://libvirt.org/formatdomain.html#elementsMemoryBacking), which is not
> recommended to use. I guess this is not very helpful with this problem. Is
> there anything else that can be done, perhaps in QEMU only?

The locked memory limit is something imposed on QEMU, QEMU cannot do anything about this. The workaround is to increase the VM locked memory limits; the long term solution will be a new IOMMU interface that QEMU will make use of that allows multiple address spaces within the same context and can therefore avoid duplicate pinned memory accounting. This solution is some time off, has not been accepted upstream yet, nor is QEMU or libvirt support available.

> I guess we haven't hit the locked memory problem yet and all what we've
> experienced so far is the second problem with IOMMU groups. But let's be
> ready for it.

I suspect comment 0 is an example of the locked memory limit issue.
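A sketch of the two workarounds mentioned above; the domain name is taken from comment 0 and the limit value is purely illustrative (roughly 4 x 32 GiB RAM plus overhead):

# 1) Persistently, via libvirt's <memtune><hard_limit> (here set through virsh
#    memtune; the value is in KiB and takes effect on the next VM start):
virsh memtune Windows-01-GPU0 --hard-limit 135266304 --config
# 2) For a hot-add scenario on a VM that is already running, raise the limit
#    of the live QEMU process directly:
prlimit --memlock=unlimited --pid "$(pgrep -f 'guest=Windows-01-GPU0')"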
Thank you for the clarification. Let's summarize the current oVirt/RHV concerns regarding the locked memory limits, as far as I understand it:

- If we don't have it already, we are going to have trouble with q35 + vCPUs >= 256 + a device using VFIO.
- We can use <hard_limit> (https://libvirt.org/formatdomain.html#memory-tuning) to work around the issue, but it's most likely going to be guesswork that is not fully reliable.
- The limit must be able to include the VM RAM size for each VFIO device.
- Memory hot plugging must also be considered. The easiest way would be to set <hard_limit> based on the VM maximum (rather than current) memory (see the sizing sketch after this comment). The simplest solution would be to set a sufficiently high <hard_limit>. As I understand the documentation, the only problem could be that the guest can keep that amount of guest memory locked in the host memory, preventing it from swapping out, which is not a problem for us because we avoid swapping. In QEMU, it just lifts the locked memory *limit*, so this shouldn't be a problem either.
- A QEMU solution is planned for this but it's still far away and won't be available for RHV for sure.
- Alex suspects that we've already hit the issue in this bug. There is a customer case in BZ 2081241 with a very similar error but I cannot see anything very obviously related in the dmesg output there (well, there can be something less obvious for me there or it may not be present in dmesg). It would be best to look at a QE environment exposing the problem. Petr, do you think you would be able to use the environment again and reproduce the problem there?
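A sizing sketch following the summary above; all numbers are illustrative and the overhead figure is only a guess, not a documented value:

# Base the limit on the VM's *maximum* memory so hotplug stays covered.
MAX_MEM_KIB=$((64 * 1024 * 1024))   # 64 GiB maxMemory, illustrative
VFIO_DEVICES=2                      # number of assigned host devices
OVERHEAD_KIB=$((1 * 1024 * 1024))   # ~1 GiB slack for QEMU/VFIO overhead (a guess)
HARD_LIMIT_KIB=$(( MAX_MEM_KIB * VFIO_DEVICES + OVERHEAD_KIB ))
echo "<hard_limit unit='KiB'>${HARD_LIMIT_KIB}</hard_limit>"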
(In reply to Milan Zamazal from comment #12)
> - Alex suspects that we've already hit the issue in this bug. There is a
> customer case in BZ 2081241 with a very similar error but I cannot see
> anything very obviously related in the dmesg output there (well, there can
> be something less obvious for me there or it may not be present in dmesg).

What you're looking for in dmesg is something like:

vfio_pin_pages_remote: RLIMIT_MEMLOCK (398274330624) exceeded

which I do see some of in the most recent sosreport related to that issue. Those usually correspond to the -ENOMEM, cannot allocate memory, errors seen in the upper level logs. RLIMIT_MEMLOCK is the internal representation of the locked memory limit.
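A quick host-side check for the symptom described above (the exact wording may vary slightly between kernel versions):

dmesg | grep -i 'RLIMIT_MEMLOCK'
# e.g. lines like: vfio_pin_pages_remote: RLIMIT_MEMLOCK (398274330624) exceeded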
Ah, I see it now, thanks again. So we have a customer issue, which should hopefully be solvable this time by reducing the excessive maximum number of vCPUs. We should get ready for setups with many vCPUs in the way advised by Alex in the preceding comments. I'll see whether I can reproduce *this* kind of issue on my setup to check whether the suggested workaround of adjusting the limits works in oVirt.
I was able to reproduce the error and to check that the VM starts when hard_limit is set to a very high value.
Created attachment 1881298 [details]
after_get_caps hook to fake the number of CPUs

A hook that can be used to make oVirt think there are many CPUs on a host. Not everything may work correctly with it, but it is good enough to test this bug.
(In reply to Milan Zamazal from comment #15)
> I was able to reproduce the error and to check that the VM starts when
> hard_limit is set to a very high value.

Notes to QE on how to reproduce the memory locking issue:
- I used the hook posted in the previous comment to fake the number of host CPUs.
- I created a q35 VM with as much RAM as possible (it needn't be that much, my hardware allowed only 8 GB).
- I passed through a GPU host device and a VFIO NIC. The more devices the better, but it's important to check that the reported error is about memory allocation (best checked in dmesg, see Comment 13) and not the other one about multiple devices in a single IOMMU group; it's better to pass through devices from different IOMMU groups if possible (see the check sketched below).
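A small sketch for checking that two candidate devices really live in different IOMMU groups before assigning them; the PCI addresses below are examples taken from this report:

readlink /sys/bus/pci/devices/0000:3b:00.0/iommu_group
readlink /sys/bus/pci/devices/0000:af:00.1/iommu_group
# Different trailing group numbers mean different groups.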
Verified:
ovirt-engine-4.5.1.1-0.14.el8ev
vdsm-4.50.1.2-1.el8ev.x86_64
qemu-kvm-6.2.0-11.module+el8.6.0+15489+bc23efef.1.x86_64
libvirt-daemon-8.0.0-5.2.module+el8.6.0+15256+3a0914fe.x86_64

Verification scenario:
1. Create a VM with Q35 UEFI, 16 GB memory and 16 CPUs (s:c:th = 1:8:2) (if required, use the hook from the bug attachments).
2. Add VFIO and passthrough NICs to the VM host devices.
3. Run the VM. Verify the VM is running with the host devices, for example:
09:00.0 VGA compatible controller: NVIDIA Corporation GK104GL [Quadro K4200] (rev a1)
06:00.0 Audio device: NVIDIA Corporation GK104 HDMI Audio Controller (rev a1)
07:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
08:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
Observe vdsm.log, libvirt.log and engine.log and verify there are no errors. Observe dmesg and verify there is no message like "vfio_pin_pages_remote: RLIMIT_MEMLOCK (398274330624) exceeded" (see https://bugzilla.redhat.com/show_bug.cgi?id=2048429#c13).
4. Repeat step 3, this time changing CPUs to 32 (1:16:2), and run the VM.
5. Repeat step 3, this time changing CPUs to 64 (1:32:2), and run the VM.
This bugzilla is included in oVirt 4.5.1 release, published on June 22nd 2022. Since the problem described in this bug report should be resolved in oVirt 4.5.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.