Bug 1852718 - vGPU: VM failed to run with mdev_type instance
Summary: vGPU: VM failed to run with mdev_type instance
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.4.1.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.4.4
Target Release: ---
Assignee: Milan Zamazal
QA Contact: Nisim Simsolo
URL:
Whiteboard:
Depends On: 1846343 1852433 1877675
Blocks:
 
Reported: 2020-07-01 07:02 UTC by Nisim Simsolo
Modified: 2020-12-21 12:36 UTC (History)
9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-21 12:36:19 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.4+
aoconnor: blocker-


Attachments (Terms of Use)
VM QEMU log (31.68 KB, text/plain)
2020-07-01 07:03 UTC, Nisim Simsolo
no flags Details
engine.log (6.59 MB, text/plain)
2020-07-01 07:04 UTC, Nisim Simsolo
no flags Details
vdsm.log (3.52 MB, text/plain)
2020-07-01 07:05 UTC, Nisim Simsolo
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 5250621 0 None None None 2020-08-17 20:09:51 UTC

Description Nisim Simsolo 2020-07-01 07:02:54 UTC
Description of problem:
See https://bugzilla.redhat.com/show_bug.cgi?id=1846343 for more details.

After adding an Nvidia vGPU instance using WebAdmin -> VM -> Host Devices tab -> Manage vGPU button, 
or using Edit VM -> Custom Properties -> mdev_type, 
the VM fails to run with the following vdsm.log errors:
 
2020-06-11 15:04:14,007+0300 ERROR (vm/6099c96f) [virt.vm] (vmId='6099c96f-d79d-47ae-b39f-9489bc552cf0') The vm start process failed (vm:871)
Traceback (most recent call last):
.
.
libvirt.libvirtError: internal error: Process exited prior to exec: libvirt:  error : failed to access '/sys/bus/mdev/devices/e1f27070-b062-4ea3-a689-89e37a56f677/iommu_group': No such file or directory

2020-06-11 15:04:18,533+0300 ERROR (jsonrpc/1) [root] Couldn't parse NVDIMM device data (hostdev:755)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/hostdev.py", line 753, in list_nvdimms
    data = json.loads(output)
  File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
--------------------------
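The NVDIMM error in the traceback above is what `json.loads` raises on empty input: vdsm ran a helper to list NVDIMM devices, got no JSON back, and passed the empty string to the decoder. A minimal reproduction of the failure mode (not vdsm's actual code; `parse_nvdimm_output` is a hypothetical defensive variant):

```python
import json

def parse_nvdimm_output(output):
    """Hypothetical defensive variant: treat empty helper output as
    'no NVDIMM devices' instead of feeding it to the JSON decoder."""
    if not output.strip():
        return []
    return json.loads(output)

# json.loads("") fails exactly like the traceback above:
try:
    json.loads("")
except json.JSONDecodeError as exc:
    print(exc)  # Expecting value: line 1 column 1 (char 0)
```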

The Nvidia vGPU drivers are installed and the Nvidia service is running.
It is also possible to see the vGPU instances on the host, for example:
# /home/nsimsolo/vgpu_instances1.sh 
mdev_type: nvidia-11 --- description: num_heads=2, frl_config=45, framebuffer=512M, max_resolution=2560x1600, max_instance=16 --- name: GRID M60-0B
mdev_type: nvidia-12 --- description: num_heads=2, frl_config=60, framebuffer=512M, max_resolution=2560x1600, max_instance=16 --- name: GRID M60-0Q
mdev_type: nvidia-13 --- description: num_heads=1, frl_config=60, framebuffer=1024M, max_resolution=1280x1024, max_instance=8 --- name: GRID M60-1A
mdev_type: nvidia-14 --- description: num_heads=4, frl_config=45, framebuffer=1024M, max_resolution=5120x2880, max_instance=8 --- name: GRID M60-1B
----------------
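The contents of vgpu_instances1.sh are not shown here; as a sketch, a listing like the one above can be produced by walking the sysfs mdev hierarchy. The `base` parameter is only there so the function can be exercised against a fake tree; on a real host the default sysfs path applies:

```python
import os

def list_mdev_types(base="/sys/class/mdev_bus"):
    """Sketch: list the mdev types each PCI device advertises in sysfs.

    Each supported type directory carries 'description' and 'name'
    attribute files, which is where the fields printed above come from.
    """
    lines = []
    if not os.path.isdir(base):
        return lines
    for dev in sorted(os.listdir(base)):
        types_dir = os.path.join(base, dev, "mdev_supported_types")
        if not os.path.isdir(types_dir):
            continue
        for mdev_type in sorted(os.listdir(types_dir)):
            tdir = os.path.join(types_dir, mdev_type)
            with open(os.path.join(tdir, "description")) as f:
                desc = f.read().strip()
            with open(os.path.join(tdir, "name")) as f:
                name = f.read().strip()
            lines.append("mdev_type: %s --- description: %s --- name: %s"
                         % (mdev_type, desc, name))
    return lines
```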

This issue is not related to the emulated machine type (it occurred on both pc-i440fx and Q35).

Version-Release number of selected component (if applicable):
ovirt-engine-4.4.1.2-0.10.el8ev
vdsm-4.40.19-1.el8ev.x86_64
libvirt-daemon-6.0.0-22.module+el8.2.1+6815+1c792dc8.x86_64
qemu-kvm-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64
Nvidia host drivers (Tesla M60): NVIDIA-vGPU-rhel-8.2-450.36.01.x86_64

How reproducible:
100%

Steps to Reproduce:
1. In WebAdmin, click on the VM name -> Host Devices tab -> Manage vGPU, select an Nvidia instance and click the "Save" button.
2. Run VM

Actual results:
The VM fails to run.

Expected results:
VM should run with attached vGPU device.

Additional info:
vdsm.log and engine.log attached

Comment 1 Nisim Simsolo 2020-07-01 07:03:59 UTC
Created attachment 1699422 [details]
VM QEMU log

Comment 2 Nisim Simsolo 2020-07-01 07:04:37 UTC
Created attachment 1699423 [details]
engine.log

Comment 3 Nisim Simsolo 2020-07-01 07:05:13 UTC
Created attachment 1699424 [details]
vdsm.log

Comment 4 Milan Zamazal 2020-07-01 07:43:55 UTC
The problem is that the vfio_mdev module is not loaded in the initramfs, as discussed in Bug 1846343.
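One way to see this root cause from the host: when the mdev core (loaded together with vfio_mdev on these hosts) is not registered, the mdev bus is absent from sysfs, so device links such as the iommu_group path in the libvirt error above cannot exist. A small check sketch, not part of any oVirt tooling; the `sysfs_root` parameter is only there so the check can be exercised against a fake tree:

```python
import os

def mdev_bus_present(sysfs_root="/sys"):
    """Return True if the kernel has registered the mdev bus.

    /sys/bus/mdev only appears once the mdev core is loaded, so its
    absence is consistent with the 'failed to access .../iommu_group'
    error libvirt reported above.
    """
    return os.path.isdir(os.path.join(sysfs_root, "bus", "mdev"))

if not mdev_bus_present():
    print("mdev bus missing; vfio_mdev likely not loaded")
```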

Comment 5 Michal Skrivanek 2020-07-01 11:09:30 UTC
Setting severity to medium because a workaround is available (see the related bug).

Comment 9 RHEL Program Management 2020-10-28 15:31:18 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 10 Arik 2020-10-28 16:04:28 UTC
Additional information that was discussed elsewhere:
NVIDIA drivers for RHEL 8.3 will be released after the planned release date of 4.4.3 (thanks Nisim for checking that).
The platform fix for this bug was not backported to RHEL 8.2.

The implication is that we cannot verify this bz in the 4.4.3 time frame, and NVIDIA vGPU won't work at the 4.5 cluster level until the aforementioned drivers are released.
However, it should keep working in 4.4.3 with 4.4 cluster level (RHEL 8.2 hosts) + the proposed workaround (https://bugzilla.redhat.com/show_bug.cgi?id=1846343#c18 or https://bugzilla.redhat.com/show_bug.cgi?id=1846343#c24).

Comment 11 Arik 2020-11-15 08:55:10 UTC
See comment 10

Comment 12 Nisim Simsolo 2020-12-08 13:13:49 UTC
Verified:
vdsm-4.40.39-1.el8ev.x86_64
ovirt-engine-4.4.4.3-0.5.el8ev
qemu-kvm-5.1.0-14.module+el8.3.0+8790+80f9c6d8.1.x86_64
libvirt-daemon-6.6.0-7.1.module+el8.3.0+8852+b44fca9f.x86_64
Nvidia drivers for host and VM: grid12.0_beta
host: NVIDIA-vGPU-rhel-8.3-460.26.x86_64 
VM: NVIDIA-Linux-x86_64-460.26-grid.run

Comment 13 Sandro Bonazzola 2020-12-21 12:36:19 UTC
This bug is included in the oVirt 4.4.4 release, published on December 21st, 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

