Description of problem:

- Trying to run a VM with 2 vGPU instances failed with a libvirtError (from vdsm.log):

2022-04-28 10:51:37,947+0300 ERROR (vm/8761f4dc) [virt.vm] (vmId='8761f4dc-3ed6-4fad-8b77-ccbbcc0deafb') The vm start process failed (vm:1010)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 937, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 2849, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 1385, in createWithFlags
    raise libvirtError('virDomainCreateWithFlags() failed')
libvirt.libvirtError: internal error: qemu unexpectedly closed the monitor: 2022-04-28T07:51:32.394495Z qemu-kvm: -device vfio-pci-nohotplug,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/14b43053-f75e-49c7-bb1b-848b8d1fe63e,display=on,ramfb=on,bus=pci.6,addr=0x0: warning: vfio 14b43053-f75e-49c7-bb1b-848b8d1fe63e: Could not enable error recovery for the device

- After disabling ramfb in the engine db, the VM runs successfully with more than 1 vGPU instance (see engine.log 2022-04-28 10:46:44,209+03 INFO [org.ovirt.engine.core.bll.RunVmCommand]).

Version-Release number of selected component (if applicable):
ovirt-engine-4.5.0.4-0.1.el8ev
vdsm-4.50.0.13-1.el8ev.x86_64
qemu-kvm-6.2.0-11.module+el8.6.0+14707+5aa4b42d.x86_64
libvirt-daemon-8.0.0-5.module+el8.6.0+14480+c0a3aa0f.x86_64
Nvidia drivers 14.0 GA (NVIDIA-vGPU-rhel-8.5-510.47.03.x86_64)

How reproducible:
100%

Steps to Reproduce:
1. Run a VM with 2 vGPU instances.

Actual results:
VM fails to run.

Expected results:
VM should support running multiple vGPU devices.

Additional info:
vdsm.log, engine.log and libvirt/qemu/vm.log attached.
@Gerd, could you help to check this? Do we support using a VM with multiple ramfb devices?
(In reply to Guo, Zhiyi from comment #4)
> @Gerd, could you help to check this? Do we support using a VM with multiple
> ramfb devices?

It might be a regression, because it worked for me after ramfb was enabled by default when using display=on (see https://bugzilla.redhat.com/show_bug.cgi?id=1958081).
(In reply to Guo, Zhiyi from comment #4)
> @Gerd, could you help to check this? Do we support using a VM with multiple
> ramfb devices?

No. Only one of the two vGPU instances should have ramfb enabled.
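For illustration, a minimal sketch of the hostdev configuration Gerd describes, using placeholder mdev UUIDs (the attribute names match the verified domxml later in this bug): only the first display-enabled vGPU carries ramfb='on', the second has display='on' without ramfb.

  <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on' ramfb='on'>
    <source>
      <address uuid='PLACEHOLDER-UUID-1'/>
    </source>
  </hostdev>
  <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on'>
    <source>
      <address uuid='PLACEHOLDER-UUID-2'/>
    </source>
  </hostdev>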
Jonathon, please triage this one. Thanks.
Thanks Greg

Lucia, it sounds like we can leverage the ability to specify the parameters per device now. What do you think about adding the notion of ramfb to the vGPU dialog and ensuring it is specified on a single device? (I'd still enable it by default, as we do now, for the first device that is set with display, but allow the user to switch it to a different device with display=on)
Arik, in the UI, it is not possible to specify the parameters per device - all configuration is common for all devices. We would need to redesign the whole dialog to enable configuration per device. We can either set ramfb=true only on one device in LibvirtVmXmlBuilder or we can make it configurable in the UI but also set it only on 1 device.
(In reply to Lucia Jelinkova from comment #14)
> Arik, in the UI, it is not possible to specify the parameters per device -
> all configuration is common for all devices. We would need to redesign the
> whole dialog to enable configuration per device.

Yeah, I know we didn't change the UI to leverage the ability to set parameters per device that was added to the backend. I'm not suggesting changing it for every configuration - that still sounds like a big effort - but only for this one (ramfb), which is not mentioned on the UI side at the moment.

> We can either set ramfb=true only on one device in LibvirtVmXmlBuilder or we
> can make it configurable in the UI but also set it only on 1 device.

Setting it on one device in LibvirtVmXmlBuilder is a possible workaround, but it's not that great to do that behind the scenes without the user noticing it. I'd prefer the latter - when display=on is set in the dialog, there would be one device that is set with ramfb by default; the user would be able to change that or switch that setting to another device, so eventually the UI will either set it on a single device or not set it at all.
I agree that we can add a new ramfb option to the dialog, but AFAIK there is no way to distinguish the devices (they are all the same as far as the UI is concerned), so we will always just pick one device at random and set that option on it.
Fix proposal by Jonathon sent to libvirt upstream mailing list: https://listman.redhat.com/archives/libvir-list/2022-April/230616.html
As Jarda mentioned, I did send a patch upstream to libvirt to enforce this restriction. But that's mostly a user experience improvement: it will warn you that only one vGPU can have ramfb set, instead of failing with a more obscure error when you attempt to start qemu. The bigger issue is that the management software is configuring things incorrectly, so it seems to me that this bug should perhaps be transferred back to a different software component to fix the issue being discussed by @ljelinko and @ahadas.
Yeah, I agree. We discussed it offline and it seems that it would be best to pick a random vGPU device and set it with ramfb in LibvirtVmXmlBuilder.
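A minimal sketch of that approach, assuming the domxml builder has a list of the VM's mdev devices: enable ramfb on only the first display-enabled one. This is not the actual LibvirtVmXmlBuilder code; the class, field and method names below are hypothetical and only the ramfb/display semantics come from this discussion.

  import java.util.List;

  class MdevRamfbExample {

      // Hypothetical stand-in for the engine's mdev device representation.
      static class MdevDevice {
          final String uuid;
          boolean display;
          boolean ramfb;

          MdevDevice(String uuid, boolean display) {
              this.uuid = uuid;
              this.display = display;
          }
      }

      // Enable ramfb on at most one display-enabled mdev device,
      // so qemu is never started with more than one ramfb.
      static void assignRamfb(List<MdevDevice> mdevs) {
          boolean ramfbAssigned = false;
          for (MdevDevice mdev : mdevs) {
              if (mdev.display && !ramfbAssigned) {
                  mdev.ramfb = true;
                  ramfbAssigned = true;
              } else {
                  mdev.ramfb = false;
              }
          }
      }

      public static void main(String[] args) {
          List<MdevDevice> mdevs = List.of(
                  new MdevDevice("uuid-1", true),
                  new MdevDevice("uuid-2", true));
          assignRamfb(mdevs);
          mdevs.forEach(m -> System.out.println(m.uuid + " ramfb=" + m.ramfb));
      }
  }

The point of the sketch is only that the generated domxml ends up with ramfb='on' on a single hostdev element, which is what the verification below checks.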
Verified:
ovirt-engine-4.5.1.2-0.11.el8ev
vdsm-4.50.1.3-1.el8ev.x86_64
qemu-kvm-6.2.0-11.module+el8.6.0+15489+bc23efef.1.x86_64
libvirt-8.0.0-5.2.module+el8.6.0+15256+3a0914fe.x86_64
NVIDIA-vGPU-rhel-8.6-510.73.06.x86_64

Verification scenario:
1. Run a VM with the maximum number of vGPU instances available.
2. Verify the VM is running.
   Verify the console shows the initial boot (TianoCore and boot start).
   After the VM is booted, verify the console shows the OS UI and the nvidia drivers are running.
   Observe the VM domxml and verify ramfb is enabled on only one of the mdevs, for example:

   <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on' ramfb='on'>
     <source>
       <address uuid='639a9969-087b-4f4d-9806-506716125e6b'/>
     </source>
     <alias name='hostdev0'/>
     <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
   </hostdev>
   <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on'>
     <source>
       <address uuid='bfcafdde-5aa7-4f3f-85ab-0713c1fa51ac'/>
     </source>
     <alias name='hostdev1'/>
     <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
   </hostdev>
   <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on'>
     <source>
       <address uuid='e374af9c-04fa-4bd1-868e-d406fab0447d'/>
     </source>
     <alias name='hostdev2'/>
     <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
   </hostdev>
   <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on'>
     <source>
       <address uuid='e88eb4ff-1659-4918-9137-c7d85f091c46'/>
     </source>
     <alias name='hostdev3'/>
     <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
   </hostdev>

3. Power off the VM and repeat steps 1-2 a few more times.
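A quick way to spot-check this on the host (assuming shell access) is to dump the running domain XML, for example with virsh -r dumpxml <vm-name>, and confirm that exactly one hostdev entry carries ramfb='on'.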
This bugzilla is included in oVirt 4.5.1 release, published on June 22nd 2022. Since the problem described in this bug report should be resolved in oVirt 4.5.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.