Bug 2079760 - vGPU: VM failed to run with more than one vGPU instance
Summary: vGPU: VM failed to run with more than one vGPU instance
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.5.0.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.1
Target Release: ---
Assignee: Milan Zamazal
QA Contact: Nisim Simsolo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-04-28 08:06 UTC by Nisim Simsolo
Modified: 2022-08-03 15:18 UTC (History)
CC List: 13 users

Fixed In Version: ovirt-engine-4.5.1.2
Clone Of:
Environment:
Last Closed: 2022-06-23 05:54:58 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.5?
pm-rhel: devel_ack+


Attachments


Links
GitHub oVirt ovirt-engine pull 460 (open): core: Make sure at most one ramfb device is enabled (last updated 2022-06-13 09:55:08 UTC)
Red Hat Issue Tracker RHV-45912 (last updated 2022-05-02 16:23:32 UTC)
Red Hat Knowledge Base (Solution) 6970563 (last updated 2022-08-03 15:18:00 UTC)

Description Nisim Simsolo 2022-04-28 08:06:46 UTC
Description of problem:
- Trying to run a VM with 2 vGPU instances fails with a libvirtError (from vdsm.log):
2022-04-28 10:51:37,947+0300 ERROR (vm/8761f4dc) [virt.vm] (vmId='8761f4dc-3ed6-4fad-8b77-ccbbcc0deafb') The vm start process failed (vm:1010)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 937, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 2849, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 1385, in createWithFlags
    raise libvirtError('virDomainCreateWithFlags() failed')
libvirt.libvirtError: internal error: qemu unexpectedly closed the monitor: 2022-04-28T07:51:32.394495Z qemu-kvm: -device vfio-pci-nohotplug,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/14b43053-f75e-49c7-bb1b-848b8d1fe63e,display=on,ramfb=on,
bus=pci.6,addr=0x0: warning: vfio 14b43053-f75e-49c7-bb1b-848b8d1fe63e: Could not enable error recovery for the device

- After disabling ramfb in the engine db, the VM runs successfully with more than 1 vGPU instance (see engine.log 2022-04-28 10:46:44,209+03 INFO  [org.ovirt.engine.core.bll.RunVmCommand])
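
For reference, the failure corresponds to engine-generated domain XML along these lines, with ramfb='on' emitted for every displayable mdev hostdev. This is an illustrative reconstruction based on the hostdev format shown in the verification below; the second UUID is a placeholder:

    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on' ramfb='on'>
      <source>
        <address uuid='14b43053-f75e-49c7-bb1b-848b8d1fe63e'/>
      </source>
    </hostdev>
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on' ramfb='on'>
      <source>
        <address uuid='SECOND-MDEV-UUID-PLACEHOLDER'/>
      </source>
    </hostdev>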

Version-Release number of selected component (if applicable):
ovirt-engine-4.5.0.4-0.1.el8ev
vdsm-4.50.0.13-1.el8ev.x86_64
qemu-kvm-6.2.0-11.module+el8.6.0+14707+5aa4b42d.x86_64
libvirt-daemon-8.0.0-5.module+el8.6.0+14480+c0a3aa0f.x86_64
Nvidia drivers 14.0 GA (NVIDIA-vGPU-rhel-8.5-510.47.03.x86_64)

How reproducible:
100%

Steps to Reproduce:
1. Run VM with 2 vGPU instances.

Actual results:
VM failed to run

Expected results:
VM should support running multiple vGPU devices.

Additional info:
vdsm.log, engine.log and libvirt/qemu/vm.log attached.

Comment 4 Guo, Zhiyi 2022-04-28 08:52:03 UTC
@Gerd, could you help check this? Do we support using a VM with multiple ramfb devices?

Comment 5 Nisim Simsolo 2022-04-28 11:12:17 UTC
(In reply to Guo, Zhiyi from comment #4)
> @Gerd, could you help check this? Do we support using a VM with multiple
> ramfb devices?

It might be a regression because it worked for me after ramfb was enabled by default when using display=on (see https://bugzilla.redhat.com/show_bug.cgi?id=1958081).

Comment 6 Gerd Hoffmann 2022-04-28 11:18:47 UTC
(In reply to Guo, Zhiyi from comment #4)
> @Gerd, could you help check this? Do we support using a VM with multiple
> ramfb devices?

No.  Only one of the two vgpu instances should have ramfb enabled.
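
Concretely, in terms of the qemu command line quoted in the description, only one of the vfio devices should carry ramfb=on. An illustrative sketch (the second mdev UUID and its PCI bus/address values are placeholders):

    -device vfio-pci-nohotplug,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/14b43053-f75e-49c7-bb1b-848b8d1fe63e,display=on,ramfb=on,bus=pci.6,addr=0x0
    -device vfio-pci-nohotplug,id=hostdev1,sysfsdev=/sys/bus/mdev/devices/SECOND-MDEV-UUID,display=on,bus=pci.7,addr=0x0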

Comment 7 Jaroslav Suchanek 2022-04-28 11:25:23 UTC
Jonathon, please triage this one. Thanks.

Comment 12 Arik 2022-04-29 06:53:51 UTC
Thanks Greg
Lucia, it sounds like we can leverage the ability to specify the parameters per device now. What do you think about adding the notion of ramfb to the vGPU dialog and ensuring it is specified on a single device? (I'd still enable it by default, as we do now, for the first device that is set with display, but allow the user to switch it to a different device with display=on)

Comment 14 Lucia Jelinkova 2022-04-29 10:41:38 UTC
Arik, in the UI, it is not possible to specify the parameters per device - all configuration is common for all devices. We would need to redesign the whole dialog to enable configuration per device.

We can either set ramfb=true only on one device in LibvirtVmXmlBuilder or we can make it configurable in the UI but also set it only on 1 device.

Comment 15 Arik 2022-05-01 07:40:48 UTC
(In reply to Lucia Jelinkova from comment #14)
> Arik, in the UI, it is not possible to specify the parameters per device -
> all configuration is common for all devices. We would need to redesign the
> whole dialog to enable configuration per device.

Yeah, I know we didn't change the UI to leverage the ability to set parameters per device that was added to the backend.
I'm not suggesting changing it for every configuration - that still sounds like a big effort - but only for this one (ramfb), which is not mentioned on the UI side at the moment.
 
> We can either set ramfb=true only on one device in LibvirtVmXmlBuilder or we
> can make it configurable in the UI but also set it only on 1 device.

Setting it on one device in LibvirtVmXmlBuilder is a possible workaround, but it's not great to do that behind the scenes without the user noticing.
I'd prefer the latter - when display=on is set in the dialog, one device would be set with ramfb by default, and the user would be able to change that or switch that setting to another device, so eventually the UI will either set it on a single device or not set it at all.

Comment 17 Lucia Jelinkova 2022-05-02 09:00:08 UTC
I agree that we can add a new ramfb option to the dialog, but AFAIK there is no way to distinguish the devices (they are all the same as far as the UI is concerned), so we will always just pick one device at random and set that option there.

Comment 18 Jaroslav Suchanek 2022-05-02 11:28:13 UTC
Fix proposal by Jonathon sent to libvirt upstream mailing list:
https://listman.redhat.com/archives/libvir-list/2022-April/230616.html

Comment 19 Jonathon Jongsma 2022-05-02 14:51:47 UTC
As Jarda mentioned, I did send a patch upstream to libvirt to enforce this restriction. But that's mostly a user experience improvement. It will warn you that only one vgpu can have ramfb set instead of failing with a more obscure error when you attempt to start qemu. The bigger issue is that the management software is configuring things incorrectly. So it seems to me that perhaps this bug should be transferred back to a different software component to fix the issue being discussed by @ljelinko and @ahadas

Comment 20 Arik 2022-05-02 16:15:22 UTC
Yeah, I agree.
We discussed it offline and it seems that it would be best to pick a random vGPU device and set it with ramfb in LibvirtVmXmlBuilder.
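
A minimal, hypothetical sketch of that approach, keeping ramfb on at most one displayable mdev device (here simply the first one) while the domain XML is built; the class and method names below are illustrative and do not reflect the actual LibvirtVmXmlBuilder code:

    import java.util.List;

    // Simplified stand-in for a vGPU (mdev) host device as the XML builder sees it.
    class MdevDevice {
        boolean display;   // whether display='on' is requested for this vGPU
        boolean ramfb;     // whether ramfb='on' will be written into the hostdev element

        MdevDevice(boolean display) {
            this.display = display;
        }
    }

    class RamfbAssigner {
        // Enable ramfb on the first device that has display=on and disable it on all
        // others, so at most one hostdev ends up with ramfb='on' in the generated XML.
        static void assignSingleRamfb(List<MdevDevice> mdevs) {
            boolean ramfbAssigned = false;
            for (MdevDevice dev : mdevs) {
                if (dev.display && !ramfbAssigned) {
                    dev.ramfb = true;
                    ramfbAssigned = true;
                } else {
                    dev.ramfb = false;
                }
            }
        }
    }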

Comment 21 Nisim Simsolo 2022-06-19 16:48:09 UTC
Verified:
ovirt-engine-4.5.1.2-0.11.el8ev
vdsm-4.50.1.3-1.el8ev.x86_64
qemu-kvm-6.2.0-11.module+el8.6.0+15489+bc23efef.1.x86_64
libvirt-8.0.0-5.2.module+el8.6.0+15256+3a0914fe.x86_64
NVIDIA-vGPU-rhel-8.6-510.73.06.x86_64

Verification scenario:
1. Run a VM with the maximum number of vGPU instances available.
2. Verify the VM is running.
   Verify the console shows the initial boot (TianoCore and boot start).
   After the VM has booted, verify the console shows the OS UI and the NVIDIA drivers are running.
   Observe the VM domxml and verify ramfb is enabled on only one of the mdevs, for example:
 
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on' ramfb='on'>
      <source>
        <address uuid='639a9969-087b-4f4d-9806-506716125e6b'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on'>
      <source>
        <address uuid='bfcafdde-5aa7-4f3f-85ab-0713c1fa51ac'/>
      </source>
      <alias name='hostdev1'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on'>
      <source>
        <address uuid='e374af9c-04fa-4bd1-868e-d406fab0447d'/>
      </source>
      <alias name='hostdev2'/>
      <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on'>
      <source>
        <address uuid='e88eb4ff-1659-4918-9137-c7d85f091c46'/>
      </source>
      <alias name='hostdev3'/>
      <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
    </hostdev>

3. Power off the VM and repeat steps 1-2 a few more times.

Comment 22 Sandro Bonazzola 2022-06-23 05:54:58 UTC
This bug is included in the oVirt 4.5.1 release, published on June 22nd 2022.
Since the problem described in this bug report should be resolved in the oVirt 4.5.1 release, it has been closed with a resolution of CURRENT RELEASE.
If the solution does not work for you, please open a new bug report.

