Bug 1987121 - [RFE] Support enabling nVidia Unified Memory on mdev vGPU
Summary: [RFE] Support enabling nVidia Unified Memory on mdev vGPU
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.4.6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.5.0
Target Release: 4.5.0
Assignee: Milan Zamazal
QA Contact: Nisim Simsolo
URL:
Whiteboard:
Depends On:
Blocks: 2000061 2052557
 
Reported: 2021-07-29 01:02 UTC by Germano Veit Michel
Modified: 2022-05-31 14:47 UTC
CC List: 6 users

Fixed In Version: ovirt-engine-4.5.0.1 ovirt-engine-ui-extensions-1.3.2-1
Doc Type: Enhancement
Doc Text:
The vGPU editing dialog was enhanced with an option to set driver parameters. The driver parameters are specified as arbitrary text, which is passed to the NVIDIA driver as-is, e.g. "`enable_uvm=1`". The given text is used for all the vGPUs of a given VM. The vGPU editing dialog was moved from the host devices tab to the VM devices tab. vGPU properties are no longer specified using the mdev_type VM custom property; they are specified as VM devices now. This change is transparent when using the vGPU editing dialog. In the REST API, the vGPU properties can be manipulated using the newly introduced `.../vms/.../mediateddevices` endpoint. The new API permits setting "nodisplay" and driver parameters for each of the vGPUs individually, but note that this is not supported in the vGPU editing dialog, where they can be set only to a single value common to all the vGPUs of a given VM.
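A minimal sketch of driving the new endpoint with curl follows; the request body element and the spec_params key names (mdevType, driverParams) are assumptions based on the linked api-model pull request, not verbatim from the released API reference:

# Hypothetical request; element and key names are assumptions.
curl -k -u 'admin@internal:password' -X POST \
  -H 'Content-Type: application/xml' \
  -d '<vm_mediated_device>
        <spec_params>
          <property><name>mdevType</name><value>nvidia-22</value></property>
          <property><name>driverParams</name><value>enable_uvm=1</value></property>
        </spec_params>
      </vm_mediated_device>' \
  'https://engine.example.com/ovirt-engine/api/vms/<vm_id>/mediateddevices'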
Clone Of:
Clones: 2000061
Environment:
Last Closed: 2022-05-26 16:22:29 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:


Attachments: none


Links
System | ID | Status | Summary | Last Updated
Github oVirt ovirt-engine-api-model | pull 26 | Merged | api-model: Add mediated devices | 2022-03-15 16:16:29 UTC
Github oVirt ovirt-engine-ui-extensions | pull 45 | open | Update mdevs | 2022-03-21 15:46:46 UTC
Github oVirt ovirt-engine | pull 166 | Draft | Followup patches to VM mediated devices | 2022-03-16 12:14:14 UTC
Github oVirt ovirt-engine | pull 183 | Merged | Followup patches to VM mediated devices - cont | 2022-03-27 15:00:27 UTC
Github oVirt ovirt-engine | pull 190 | Merged | Respect deprecated custom property mdev_type on add/update VM | 2022-03-27 14:59:56 UTC
Github oVirt ovirt-engine | pull 84 | Merged | core: Store mediated devices as VM devices | 2022-03-15 16:16:45 UTC
Github oVirt vdsm | pull 92 | Merged | virt: Add support for vGPU driver parameters | 2022-03-15 16:16:49 UTC
Red Hat Product Errata | RHSA-2022:4711 | None | None | 2022-05-26 16:22:55 UTC

Description Germano Veit Michel 2021-07-29 01:02:52 UTC
Description of problem:

Customer request to add support for optionally enabling Unified Memory on mdev vGPUs, similar to what can be done on RHEL KVM here:
https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#setting-vgpu-plugin-parameters-on-red-hat-el-kvm

The setting is per mdev:
# echo "enable_uvm=1" > /sys/bus/mdev/devices/cf46ed27-c42d-4697-8956-f200800c566f/nvidia/vgpu_params

It must be done before a VM starts with the mdev; otherwise the driver returns EPERM.

Given that VDSM manages the mdev (created on VM start and removed on VM stop), it is tricky to do this manually on RHV (the options are a VDSM hook on before_vm_start and/or possibly a udev rule; see the sketch below). Official support for this would be nice, as Unified Memory is important for performance on some workloads.
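As an illustration only, a minimal before_vm_start hook along the lines mentioned above might look like this; the "vgpu_params" custom property name and the XPath query are assumptions, not a tested or shipped hook:

#!/bin/bash
# Hypothetical workaround sketch: propagate an assumed "vgpu_params" VM
# custom property (VDSM exports custom properties to hooks as environment
# variables) to every mdev of the starting VM. The domain XML path is
# provided by VDSM in $_hook_domxml.
if [ -n "$vgpu_params" ]; then
    for uuid in $(xmllint --xpath '//hostdev[@type="mdev"]/source/address/@uuid' \
                      "$_hook_domxml" | tr ' ' '\n' | sed -e 's/uuid=//' -e 's/"//g'); do
        echo "$vgpu_params" > "/sys/bus/mdev/devices/$uuid/nvidia/vgpu_params"
    done
fi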

Comment 2 RHEL Program Management 2021-07-29 01:09:00 UTC
The documentation text flag should only be set after the 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 3 Milan Zamazal 2021-08-02 12:49:06 UTC
I think we can add the functionality on the Vdsm side easily, as part of mdev device setup/teardown. The only problem may be how to ensure the parameter is reset after each use if cleanup is not performed or fails for some reason. But hopefully that shouldn't be a major obstacle.

Now the question is how to specify the setting from Engine. If I understand it right, the setting should be VM specific. Do we want to use a custom property as suggested in the hook based workaround or something different?

Comment 4 Germano Veit Michel 2021-08-02 23:08:43 UTC
I think if you do something generic that allows setting anything to vgpu_params is best, so customer can set whatever knob they need, not just the unified memory. Might prevent new RFEs and also cover any future knob nvidia adds to their drivers.

Comment 5 Milan Zamazal 2021-08-03 07:54:01 UTC
The generic approach, allowing the user to specify an arbitrary vgpu_params string to use, should be possible. We create the corresponding mdev devices on each VM run and remove them afterwards, so their plugin parameters should be reset, I suppose. There is also an option to clear the settings manually by writing a space to vgpu_params.
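To make the lifecycle concrete, here is the set/clear sequence described above, reusing the UUID from the bug description:

# Set before the VM starts with the mdev:
echo "enable_uvm=1" > /sys/bus/mdev/devices/cf46ed27-c42d-4697-8956-f200800c566f/nvidia/vgpu_params
# Clear the settings manually by writing a space:
echo " " > /sys/bus/mdev/devices/cf46ed27-c42d-4697-8956-f200800c566f/nvidia/vgpu_params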

Comment 6 Lucia Jelinkova 2021-09-02 08:10:09 UTC
In the UI, we can support Unified Memory the same way we support the "Secondary display adapter": using a switch in the vGPU dialog.

As for the backend, we still use custom properties to configure the mdev devices, and that is becoming a problem as we plan to add more configuration there. We can add a special keyword, as we have for "nodisplay" (e.g. unifiedmemory), but I wouldn't add anything generic like you've suggested.

If we plan to add new configurations or want to support a generic configuration that is just appended, we should refactor the backend. One possible way is to create a VM device, specify the parameters in the spec_params field, and drop support for custom properties.
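For context, the deprecated mdev_type custom property discussed above takes values like these (illustrative; the "nodisplay," prefix follows the existing syntax referenced in this comment):

# Hypothetical examples of the deprecated custom property syntax:
mdev_type=nvidia-22              # one vGPU, with display
mdev_type=nodisplay,nvidia-22    # one vGPU, no secondary display adapter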

Comment 7 Milan Zamazal 2022-03-15 16:16:29 UTC
This feature request spawned a larger refactoring of vGPU parameter handling. Some patches are already in; here is a summary of what is still missing:

- vGPU dialog needs to be updated for new API.

- The update REST API operation doesn't work due to a permission problem (this doesn't block the vGPU dialog update but should be fixed for API completeness).

- mdev_type custom property support is still present in some pieces of code, although the custom property is no longer supported.

- There is no icon for mediated/vGPU VM devices.

Comment 8 Milan Zamazal 2022-03-15 17:14:38 UTC
(In reply to Milan Zamazal from comment #7)
> here is a summary of what is still missing:

And:

- Handling the old mdev_type property in the OVF reader.

- Writing the old mdev_type property in the OVF writer in cluster levels < 4.7.

Comment 12 Nisim Simsolo 2022-05-04 08:31:22 UTC
Verified:
ovirt-engine-4.5.0.5-0.7.el8ev
vdsm-4.50.0.13-1.el8ev.x86_64
qemu-kvm-6.2.0-11.module+el8.6.0+14707+5aa4b42d.x86_64
libvirt-daemon-8.0.0-5.module+el8.6.0+14480+c0a3aa0f.x86_64
Nvidia drivers 14.0 GA (NVIDIA-vGPU-rhel-8.5-510.47.03.x86_64)

Verification scenario:
Polarion test case added to RFE links

Comment 17 errata-xmlrpc 2022-05-26 16:22:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.0] security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4711

