Description of problem:
See https://bugzilla.redhat.com/show_bug.cgi?id=1846343 for more details.

After adding an Nvidia vGPU instance using WebAdmin -> VM -> host devices -> manage vGPU button, or using edit VM -> custom properties -> mdev_type, the VM fails to run with the following vdsm.log errors:

2020-06-11 15:04:14,007+0300 ERROR (vm/6099c96f) [virt.vm] (vmId='6099c96f-d79d-47ae-b39f-9489bc552cf0') The vm start process failed (vm:871)
Traceback (most recent call last):
.
.
libvirt.libvirtError: internal error: Process exited prior to exec: libvirt: error : failed to access '/sys/bus/mdev/devices/e1f27070-b062-4ea3-a689-89e37a56f677/iommu_group': No such file or directory

2020-06-11 15:04:18,533+0300 ERROR (jsonrpc/1) [root] Couldn't parse NVDIMM device data (hostdev:755)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/hostdev.py", line 753, in list_nvdimms
    data = json.loads(output)
  File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

--------------------------
The Nvidia vGPU drivers are installed and the Nvidia service is running.
Also, it is possible to see vGPU instances on the host, for example:

# /home/nsimsolo/vgpu_instances1.sh
mdev_type: nvidia-11 --- description: num_heads=2, frl_config=45, framebuffer=512M, max_resolution=2560x1600, max_instance=16 --- name: GRID M60-0B
mdev_type: nvidia-12 --- description: num_heads=2, frl_config=60, framebuffer=512M, max_resolution=2560x1600, max_instance=16 --- name: GRID M60-0Q
mdev_type: nvidia-13 --- description: num_heads=1, frl_config=60, framebuffer=1024M, max_resolution=1280x1024, max_instance=8 --- name: GRID M60-1A
mdev_type: nvidia-14 --- description: num_heads=4, frl_config=45, framebuffer=1024M, max_resolution=5120x2880, max_instance=8 --- name: GRID M60-1B

----------------
This issue is not related to the emulated machine type (the issue occurred on both pc-i440fx and Q35).

Version-Release number of selected component (if applicable):
ovirt-engine-4.4.1.2-0.10.el8ev
vdsm-4.40.19-1.el8ev.x86_64
libvirt-daemon-6.0.0-22.module+el8.2.1+6815+1c792dc8.x86_64
qemu-kvm-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64
Nvidia host drivers (Tesla M60): NVIDIA-vGPU-rhel-8.2-450.36.01.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Browse Webadmin -> click on VM name -> host devices tab -> manage vGPU, select an Nvidia instance and click the "save" button.
2. Run the VM.

Actual results:
The VM fails to run.

Expected results:
The VM should run with the attached vGPU device.

Additional info:
vdsm.log and engine.log attached
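For reference, a listing like the vgpu_instances1.sh output in the description can be approximated by reading the mediated-device sysfs tree. The sketch below is not the reporter's script (its contents are not shown in this report). It assumes the kernel's standard /sys/class/mdev_bus/<device>/mdev_supported_types/<type>/ layout with "name" and "description" attribute files, and list_mdev_types is a hypothetical helper name.

```python
import os


def list_mdev_types(root="/sys/class/mdev_bus"):
    """Return one dict per supported mdev type found under root.

    Assumes the layout <root>/<device>/mdev_supported_types/<type>/{name,description}.
    Missing attribute files are reported as None instead of raising.
    """
    results = []
    if not os.path.isdir(root):
        # No mdev-capable parent devices are registered (or the path is wrong).
        return results
    for device in sorted(os.listdir(root)):
        types_dir = os.path.join(root, device, "mdev_supported_types")
        if not os.path.isdir(types_dir):
            continue
        for mdev_type in sorted(os.listdir(types_dir)):
            entry = {"device": device, "mdev_type": mdev_type}
            for attr in ("name", "description"):
                path = os.path.join(types_dir, mdev_type, attr)
                try:
                    with open(path) as f:
                        entry[attr] = f.read().strip()
                except OSError:
                    entry[attr] = None
            results.append(entry)
    return results
```

Calling list_mdev_types() on the host and printing each entry's mdev_type, description and name should give output comparable to the listing above, provided the vGPU host driver has registered its types.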
Created attachment 1699422: VM QEMU log
Created attachment 1699423: engine.log
Created attachment 1699424: vdsm.log
The problem is that the vfio_mdev module is not loaded in the initramfs, as discussed in Bug 1846343.
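A quick way to confirm on an affected host whether vfio_mdev is loaded in the running kernel is to look for it in /proc/modules. The following is a minimal sketch under that assumption; the function names are hypothetical, and it only inspects the running system, not the contents of the initramfs image itself.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text.

    Each line of /proc/modules starts with the module name followed by
    whitespace-separated fields (size, use count, dependents, state, address).
    """
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def is_module_loaded(name, proc_modules_text):
    # The kernel lists module names with underscores, so normalise dashes.
    return name.replace("-", "_") in loaded_modules(proc_modules_text)


def read_proc_modules(path="/proc/modules"):
    """Read the live module list from the running kernel."""
    with open(path) as f:
        return f.read()
```

On the host, is_module_loaded("vfio_mdev", read_proc_modules()) returning False would be consistent with the missing iommu_group path in the libvirt error above.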
Setting severity to medium because a workaround is possible (see the related bug).
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Additional information that was discussed elsewhere: NVIDIA drivers for RHEL 8.3 will be released after the planned release date of 4.4.3 (thanks Nisim for checking that). The platform fix for this bug was not backported to RHEL 8.2. The implication is that we cannot verify this bz in the 4.4.3 time-frame and NVIDIA vGPU won't work in 4.5 cluster level until the aforementioned drivers are released. However, it should keep working in 4.4.3 with 4.4 cluster level (RHEL 8.2 hosts) + the proposed workaround (https://bugzilla.redhat.com/show_bug.cgi?id=1846343#c18 or https://bugzilla.redhat.com/show_bug.cgi?id=1846343#c24).
See comment 10
Verified:
vdsm-4.40.39-1.el8ev.x86_64
ovirt-engine-4.4.4.3-0.5.el8ev
qemu-kvm-5.1.0-14.module+el8.3.0+8790+80f9c6d8.1.x86_64
libvirt-daemon-6.6.0-7.1.module+el8.3.0+8852+b44fca9f.x86_64

Nvidia drivers for host and VM: grid12.0_beta
host: NVIDIA-vGPU-rhel-8.3-460.26.x86_64
VM: NVIDIA-Linux-x86_64-460.26-grid.run
This bugzilla is included in oVirt 4.4.4 release, published on December 21st 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.