Bug 1987121

Summary: [RFE] Support enabling NVIDIA Unified Memory on mdev vGPU
Product: Red Hat Enterprise Virtualization Manager
Reporter: Germano Veit Michel <gveitmic>
Component: ovirt-engine
Assignee: Milan Zamazal <mzamazal>
Status: CLOSED ERRATA
QA Contact: Nisim Simsolo <nsimsolo>
Severity: medium
Priority: unspecified    
Version: 4.4.6
CC: ahadas, ljelinko, lsurette, nsimsolo, srevivo, ycui
Target Milestone: ovirt-4.5.0
Keywords: FutureFeature, ZStream
Target Release: 4.5.0   
Hardware: x86_64   
OS: Linux   
Fixed In Version: ovirt-engine-4.5.0.1 ovirt-engine-ui-extensions-1.3.2-1
Doc Type: Enhancement
Doc Text:
The vGPU editing dialog was enhanced with an option to set driver parameters. The driver parameters are specified as arbitrary text, which is passed to the NVIDIA driver as is, e.g. "`enable_uvm=1`". The given text is used for all the vGPUs of a given VM. The vGPU editing dialog was moved from the host devices tab to the VM devices tab. vGPU properties are no longer specified using the mdev_type VM custom property; they are specified as VM devices now. This change is transparent when using the vGPU editing dialog. In the REST API, the vGPU properties can be manipulated using the newly introduced `.../vms/.../mediateddevices` endpoint. The new API permits setting "nodisplay" and driver parameters for each vGPU individually, but note that this is not supported in the vGPU editing dialog, where they can be set only to a single value common to all the vGPUs of a given VM.
Clone Of:
: 2000061 (view as bug list)
Last Closed: 2022-05-26 16:22:29 UTC
Type: Bug
oVirt Team: Virt
Bug Blocks: 2000061, 2052557    

Description Germano Veit Michel 2021-07-29 01:02:52 UTC
Description of problem:

A customer requests adding support for optionally enabling Unified Memory on mdev vGPUs, similar to what can be done on KVM here:
https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#setting-vgpu-plugin-parameters-on-red-hat-el-kvm

The setting is per mdev device:
# echo "enable_uvm=1"  > /sys/bus/mdev/devices/cf46ed27-c42d-4697-8956-f200800c566f/nvidia/vgpu_params

It must be done before a VM starts with the mdev; otherwise the driver returns EPERM.

Given that VDSM manages the mdev (created on VM start and removed on VM stop), doing this manually on RHV is tricky (the options are a VDSM hook on before_vm_start and/or possibly a udev rule), as shown in the sketch below. Official support for this would be nice, as it is important for performance on some workloads.
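
For context, the full lifecycle looks roughly like this. This is only a sketch based on the standard kernel mdev sysfs interface; the PCI address and mdev type are placeholders, and the UUID is the illustrative one from above:

# echo "cf46ed27-c42d-4697-8956-f200800c566f" > /sys/class/mdev_bus/<pci-address>/mdev_supported_types/<mdev-type>/create
# echo "enable_uvm=1" > /sys/bus/mdev/devices/cf46ed27-c42d-4697-8956-f200800c566f/nvidia/vgpu_params
<start the VM that uses the mdev; the parameters cannot be changed anymore>
# echo 1 > /sys/bus/mdev/devices/cf46ed27-c42d-4697-8956-f200800c566f/remove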

Comment 2 RHEL Program Management 2021-07-29 01:09:00 UTC
The documentation text flag should only be set after the 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 3 Milan Zamazal 2021-08-02 12:49:06 UTC
I think we can add the functionality on the Vdsm side easily, as part of the mdev device setup/teardown. The only problem may be how to ensure the parameter is reset after each use if cleanup is not performed or fails for some reason. But hopefully that shouldn't be a major obstacle.

Now the question is how to specify the setting from Engine. If I understand it right, the setting should be VM specific. Do we want to use a custom property, as suggested in the hook-based workaround, or something different?

Comment 4 Germano Veit Michel 2021-08-02 23:08:43 UTC
I think something generic that allows setting anything in vgpu_params is best, so customers can set whatever knob they need, not just unified memory. That might prevent new RFEs and also cover any future knobs NVIDIA adds to their drivers.

Comment 5 Milan Zamazal 2021-08-03 07:54:01 UTC
The generic approach, allowing the user to specify an arbitrary vgpu_params string to use, should be possible. We create the corresponding mdev devices on each VM run and remove them afterwards, so their plugin parameters should be reset, I suppose. There is also an option to clear the settings manually by writing a space to vgpu_params.
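
For reference, the manual clearing mentioned above would look like this (reusing the illustrative UUID from the description):

# echo " " > /sys/bus/mdev/devices/cf46ed27-c42d-4697-8956-f200800c566f/nvidia/vgpu_params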

Comment 6 Lucia Jelinkova 2021-09-02 08:10:09 UTC
In the UI, we can support Unified Memory the same way we support the "Secondary display adapter" - using a switch in the vGPU dialog.

As for the backend, we still use custom properties to configure the mdev devices, and that is becoming a problem as we plan to add more configuration there. We can add a special keyword, as we have for "nodisplay" (e.g. unifiedmemory), but I wouldn't add anything generic like you've suggested.

If we plan to add new configuration options, or would like to support a generic configuration string that is simply appended, we should refactor the backend. One possible way is to create a VM device, specify the parameters in its spec_params field, and drop support for custom properties.
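
This is the direction eventually taken (see the Doc Text above): vGPU properties became VM devices, manipulated via the `.../vms/.../mediateddevices` endpoint. Purely for illustration, a request against that endpoint might look roughly like the sketch below; the payload schema, property names and values are assumptions made for the sketch, not the shipped API definition:

POST /ovirt-engine/api/vms/<vm-id>/mediateddevices

<vm_mediated_device>
  <spec_params>
    <!-- property names and values are illustrative assumptions -->
    <property><name>mdevType</name><value>nvidia-63</value></property>
    <property><name>driverParameters</name><value>enable_uvm=1</value></property>
  </spec_params>
</vm_mediated_device>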

Comment 7 Milan Zamazal 2022-03-15 16:16:29 UTC
This feature request spawned a larger refactoring of vGPU parameter handling. Some patches are already in; here is a summary of what is still missing:

- The vGPU dialog needs to be updated for the new API.

- The update REST API operation doesn't work due to a permission problem (this doesn't block the vGPU dialog update but should be fixed for API completeness).

- mdev_type custom property support is still present in some pieces of code, although the custom property is not supported anymore.

- There is no icon for mediated/vGPU VM devices.

Comment 8 Milan Zamazal 2022-03-15 17:14:38 UTC
(In reply to Milan Zamazal from comment #7)
> here is a summary of what is still missing:

And:

- Handling the old mdev_type property in the OVF reader.

- Writing the old mdev_type property in the OVF writer in cluster levels < 4.7.

Comment 12 Nisim Simsolo 2022-05-04 08:31:22 UTC
Verified:
ovirt-engine-4.5.0.5-0.7.el8ev
vdsm-4.50.0.13-1.el8ev.x86_64
qemu-kvm-6.2.0-11.module+el8.6.0+14707+5aa4b42d.x86_64
libvirt-daemon-8.0.0-5.module+el8.6.0+14480+c0a3aa0f.x86_64
NVIDIA drivers 14.0 GA (NVIDIA-vGPU-rhel-8.5-510.47.03.x86_64)

Verification scenario:
Polarion test case added to the RFE links

Comment 17 errata-xmlrpc 2022-05-26 16:22:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.0] security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4711