Bug 2079760 - vGPU: VM failed to run with more than one vGPU instance
Summary: vGPU: VM failed to run with more than one vGPU instance
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.5.0.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.1
Target Release: ---
Assignee: Milan Zamazal
QA Contact: Nisim Simsolo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-04-28 08:06 UTC by Nisim Simsolo
Modified: 2022-08-03 15:18 UTC (History)
CC List: 13 users

Fixed In Version: ovirt-engine-4.5.1.2
Clone Of:
Environment:
Last Closed: 2022-06-23 05:54:58 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.5?
pm-rhel: devel_ack+


Attachments


Links
GitHub oVirt ovirt-engine pull 460 (open): core: Make sure at most one ramfb device is enabled (last updated 2022-06-13 09:55:08 UTC)
Red Hat Issue Tracker RHV-45912 (last updated 2022-05-02 16:23:32 UTC)
Red Hat Knowledge Base (Solution) 6970563 (last updated 2022-08-03 15:18:00 UTC)

Description Nisim Simsolo 2022-04-28 08:06:46 UTC
Description of problem:
- Trying to run a VM with 2 vGPU instances fails with a libvirtError (from vdsm.log):
2022-04-28 10:51:37,947+0300 ERROR (vm/8761f4dc) [virt.vm] (vmId='8761f4dc-3ed6-4fad-8b77-ccbbcc0deafb') The vm start process failed (vm:1010)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 937, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 2849, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 1385, in createWithFlags
    raise libvirtError('virDomainCreateWithFlags() failed')
libvirt.libvirtError: internal error: qemu unexpectedly closed the monitor: 2022-04-28T07:51:32.394495Z qemu-kvm: -device vfio-pci-nohotplug,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/14b43053-f75e-49c7-bb1b-848b8d1fe63e,display=on,ramfb=on,
bus=pci.6,addr=0x0: warning: vfio 14b43053-f75e-49c7-bb1b-848b8d1fe63e: Could not enable error recovery for the device

- After disabling ramfb in the engine db, the VM runs successfully with more than 1 vGPU instance (see engine.log 2022-04-28 10:46:44,209+03 INFO  [org.ovirt.engine.core.bll.RunVmCommand])
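
For reference, the failure corresponds to engine-generated domain XML along these lines, with ramfb='on' emitted for every displayable mdev hostdev. This is an illustrative reconstruction based on the hostdev format shown in the verification below; the second UUID is a placeholder:

    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on' ramfb='on'>
      <source>
        <address uuid='14b43053-f75e-49c7-bb1b-848b8d1fe63e'/>
      </source>
    </hostdev>
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on' ramfb='on'>
      <source>
        <address uuid='SECOND-MDEV-UUID-PLACEHOLDER'/>
      </source>
    </hostdev>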

Version-Release number of selected component (if applicable):
ovirt-engine-4.5.0.4-0.1.el8ev
vdsm-4.50.0.13-1.el8ev.x86_64
qemu-kvm-6.2.0-11.module+el8.6.0+14707+5aa4b42d.x86_64
libvirt-daemon-8.0.0-5.module+el8.6.0+14480+c0a3aa0f.x86_64
Nvidia drivers 14.0 GA (NVIDIA-vGPU-rhel-8.5-510.47.03.x86_64)

How reproducible:
100%

Steps to Reproduce:
1. Run VM with 2 vGPU instances.

Actual results:
VM failed to run

Expected results:
VM should support running multiple vGPU devices.

Additional info:
vdsm.log, engine.log and libvirt/qemu/vm.log attached.

Comment 4 Guo, Zhiyi 2022-04-28 08:52:03 UTC
@Gerd, could you help check this? Do we support using a VM with multiple ramfb devices?

Comment 5 Nisim Simsolo 2022-04-28 11:12:17 UTC
(In reply to Guo, Zhiyi from comment #4)
> @Gerd, could you help check this? Do we support using a VM with multiple
> ramfb devices?

It might be a regression because it worked for me after ramfb was enabled by default when using display=on (see https://bugzilla.redhat.com/show_bug.cgi?id=1958081).

Comment 6 Gerd Hoffmann 2022-04-28 11:18:47 UTC
(In reply to Guo, Zhiyi from comment #4)
> @Gerd, could you help check this? Do we support using a VM with multiple
> ramfb devices?

No.  Only one of the two vgpu instances should have ramfb enabled.
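
Concretely, in terms of the qemu command line quoted in the description, only one of the vfio devices should carry ramfb=on. An illustrative sketch (the second mdev UUID and its PCI bus/address values are placeholders):

    -device vfio-pci-nohotplug,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/14b43053-f75e-49c7-bb1b-848b8d1fe63e,display=on,ramfb=on,bus=pci.6,addr=0x0
    -device vfio-pci-nohotplug,id=hostdev1,sysfsdev=/sys/bus/mdev/devices/SECOND-MDEV-UUID,display=on,bus=pci.7,addr=0x0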

Comment 7 Jaroslav Suchanek 2022-04-28 11:25:23 UTC
Jonathon, please triage this one. Thanks.

Comment 12 Arik 2022-04-29 06:53:51 UTC
Thanks Greg
Lucia, it sounds like we can leverage the ability to specify the parameters per device now. What do you think about adding the notion of ramfb to the vGPU dialog and ensuring it is specified on a single device? (I'd still enable it by default, as we do now, for the first device that is set with display, but allow the user to switch it to a different device with display=on)

Comment 14 Lucia Jelinkova 2022-04-29 10:41:38 UTC
Arik, in the UI, it is not possible to specify the parameters per device - all configuration is common for all devices. We would need to redesign the whole dialog to enable configuration per device.

We can either set ramfb=true only on one device in LibvirtVmXmlBuilder or we can make it configurable in the UI but also set it only on 1 device.

Comment 15 Arik 2022-05-01 07:40:48 UTC
(In reply to Lucia Jelinkova from comment #14)
> Arik, in the UI, it is not possible to specify the parameters per device -
> all configuration is common for all devices. We would need to redesign the
> whole dialog to enable configuration per device.

Yeah, I know we didn't change the UI to leverage the ability to set parameters per device that was added to the backend.
I'm not suggesting changing it for every configuration - that still sounds like a big effort - but only for this one (ramfb), which is not mentioned on the UI side at the moment.
 
> We can either set ramfb=true only on one device in LibvirtVmXmlBuilder or we
> can make it configurable in the UI but also set it only on 1 device.

Setting it on one device in LibvirtVmXmlBuilder is a possible workaround, but it's not great to do that behind the scenes without the user noticing.
I'd prefer the latter - when display=on is set in the dialog, one device would be set with ramfb by default, and the user would be able to change that or switch that setting to another device, so eventually the UI will either set it on a single device or not set it at all.

Comment 17 Lucia Jelinkova 2022-05-02 09:00:08 UTC
I agree that we can add a new ramfb option to the dialog, but AFAIK there is no way to distinguish the devices (they are all the same as far as the UI is concerned), so we will always just pick one device at random and set that option there.

Comment 18 Jaroslav Suchanek 2022-05-02 11:28:13 UTC
Fix proposal by Jonathon sent to libvirt upstream mailing list:
https://listman.redhat.com/archives/libvir-list/2022-April/230616.html

Comment 19 Jonathon Jongsma 2022-05-02 14:51:47 UTC
As Jarda mentioned, I did send a patch upstream to libvirt to enforce this restriction. But that's mostly a user experience improvement. It will warn you that only one vgpu can have ramfb set instead of failing with a more obscure error when you attempt to start qemu. The bigger issue is that the management software is configuring things incorrectly. So it seems to me that perhaps this bug should be transferred back to a different software component to fix the issue being discussed by @ljelinko and @ahadas

Comment 20 Arik 2022-05-02 16:15:22 UTC
Yeah, I agree.
We discussed it offline and it seems that it would be best to pick a random vGPU device and set it with ramfb in LibvirtVmXmlBuilder.
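
A minimal, hypothetical sketch of that approach, keeping ramfb on at most one displayable mdev device (here simply the first one) while the domain XML is built; the class and method names below are illustrative and do not reflect the actual LibvirtVmXmlBuilder code:

    import java.util.List;

    // Simplified stand-in for a vGPU (mdev) host device as the XML builder sees it.
    class MdevDevice {
        boolean display;   // whether display='on' is requested for this vGPU
        boolean ramfb;     // whether ramfb='on' will be written into the hostdev element

        MdevDevice(boolean display) {
            this.display = display;
        }
    }

    class RamfbAssigner {
        // Enable ramfb on the first device that has display=on and disable it on all
        // others, so at most one hostdev ends up with ramfb='on' in the generated XML.
        static void assignSingleRamfb(List<MdevDevice> mdevs) {
            boolean ramfbAssigned = false;
            for (MdevDevice dev : mdevs) {
                if (dev.display && !ramfbAssigned) {
                    dev.ramfb = true;
                    ramfbAssigned = true;
                } else {
                    dev.ramfb = false;
                }
            }
        }
    }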

Comment 21 Nisim Simsolo 2022-06-19 16:48:09 UTC
Verified:
ovirt-engine-4.5.1.2-0.11.el8ev
vdsm-4.50.1.3-1.el8ev.x86_64
qemu-kvm-6.2.0-11.module+el8.6.0+15489+bc23efef.1.x86_64
libvirt-8.0.0-5.2.module+el8.6.0+15256+3a0914fe.x86_64
NVIDIA-vGPU-rhel-8.6-510.73.06.x86_64

Verification scenario:
1. Run a VM with the maximum number of vGPU instances available.
2. Verify the VM is running.
   Verify the console shows the initial boot (TianoCore and boot start).
   After the VM has booted, verify the console shows the OS UI and the NVIDIA drivers are running.
   Observe the VM domxml and verify ramfb is enabled on only one of the mdevs, for example:
 
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on' ramfb='on'>
      <source>
        <address uuid='639a9969-087b-4f4d-9806-506716125e6b'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on'>
      <source>
        <address uuid='bfcafdde-5aa7-4f3f-85ab-0713c1fa51ac'/>
      </source>
      <alias name='hostdev1'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on'>
      <source>
        <address uuid='e374af9c-04fa-4bd1-868e-d406fab0447d'/>
      </source>
      <alias name='hostdev2'/>
      <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on'>
      <source>
        <address uuid='e88eb4ff-1659-4918-9137-c7d85f091c46'/>
      </source>
      <alias name='hostdev3'/>
      <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
    </hostdev>

3. Power off the VM and repeat steps 1-2 a few more times.

Comment 22 Sandro Bonazzola 2022-06-23 05:54:58 UTC
This bug is included in the oVirt 4.5.1 release, published on June 22nd 2022.
Since the problem described in this bug report should be resolved in the oVirt 4.5.1 release, it has been closed with a resolution of CURRENT RELEASE.
If the solution does not work for you, please open a new bug report.

