Bug 1905417 - vGPU: VM failed to run with mdev_type instance (java NPE in engine.log)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.4.4.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.4
Target Release: 4.4.4.4
Assignee: Liran Rotenberg
QA Contact: Nisim Simsolo
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-12-08 09:59 UTC by Nisim Simsolo
Modified: 2021-05-10 02:12 UTC (History)

Fixed In Version: ovirt-engine-4.4.4.4
Clone Of:
Environment:
Last Closed: 2021-01-12 16:23:55 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.4+
pm-rhel: planning_ack+
ahadas: devel_ack+
pm-rhel: testing_ack+


Attachments
engine.log (54.29 KB, application/x-xz)
2020-12-08 10:13 UTC, Nisim Simsolo
vdsm.log (623.05 KB, application/x-xz)
2020-12-08 10:14 UTC, Nisim Simsolo
VM QEMU log (30.62 KB, text/plain)
2020-12-08 10:15 UTC, Nisim Simsolo


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 6026511 | 0 | None | None | None | 2021-05-10 02:12:15 UTC
oVirt gerrit | 112568 | 0 | master | MERGED | vdsbroker: fix NPE on mdev device | 2021-02-16 18:07:31 UTC

Description Nisim Simsolo 2020-12-08 09:59:56 UTC
Description of problem:
When running a VM with a vGPU instance, powering the VM off and then running it again causes the VM to fail to start, with a Java NPE in engine.log:
2020-12-08 11:53:40,254+02 ERROR [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-47374) [] Failed to create VM: java.lang.NullPointerException

- A workaround for this issue: edit the VM -> disable "pinned to host" -> save changes, then edit the VM again -> pin the VM to the host -> save changes, and run the VM again.

Version-Release number of selected component (if applicable):
ovirt-engine-4.4.4.3-0.5.el8ev
qemu-kvm-5.1.0-14.module+el8.3.0+8790+80f9c6d8.1.x86_64
vdsm-4.40.39-1.el8ev.x86_64
libvirt-daemon-6.6.0-7.1.module+el8.3.0+8852+b44fca9f.x86_64
Nvidia drivers for host and VM: grid12.0_beta
host: NVIDIA-vGPU-rhel-8.3-460.26.x86_64 
VM: NVIDIA-Linux-x86_64-460.26-grid.run

How reproducible:
inconsistently

Steps to Reproduce:
1. Run VM with vGPU instance, install Nvidia drivers on the VM.
2. Power off VM and run VM again.

Actual results:
VM failed to run.

Expected results:
VM should run with Nvidia instance.

Additional info:
vdsm.log and engine.log (2020-12-08 11:53:40,254+02 ERROR) attached.

Comment 1 Nisim Simsolo 2020-12-08 10:13:38 UTC
Created attachment 1737561
engine.log

Comment 2 Nisim Simsolo 2020-12-08 10:14:00 UTC
Created attachment 1737562
vdsm.log

Comment 3 Nisim Simsolo 2020-12-08 10:15:28 UTC
Created attachment 1737563
VM QEMU log

Comment 4 Liran Rotenberg 2020-12-08 13:34:13 UTC
The VM failed to start due to an NPE while getting the host device capability.

When we write the max memory value to the domain, we calculate the total NVDIMM size, if any NVDIMM exists.
To do that, we go over the host devices and look for NVDIMM ones:
(VmInfoBuildUtils::getNvdimmTotalSize)
if (hostDevice.getCapability().equals("nvdimm"))

But if the capability is null, we get that NPE.
From the engine log, the problematic device looks like `hostdev0`, which is the MDEV host device.

The question is whether it is OK for an MDEV device to have no capability, or whether some kernel modules not being loaded could cause this.
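
To illustrate the failure mode suspected in this comment, here is a minimal, self-contained sketch. It does not use the engine's classes: `CapabilityCheck` and its nested `Device` are hypothetical stand-ins that model only a nullable capability string. It shows that calling `equals` on a null capability throws, while calling `equals` on the string constant does not.

```java
class CapabilityCheck {
    // Hypothetical stand-in for a host device; only the capability is modeled.
    static final class Device {
        private final String capability;

        Device(String capability) {
            this.capability = capability;
        }

        String getCapability() {
            return capability;
        }
    }

    // Same shape as the check quoted above: dereferences the capability,
    // so it throws NullPointerException when the capability is null.
    static boolean isNvdimmUnsafe(Device device) {
        return device.getCapability().equals("nvdimm");
    }

    // Null-safe variant: equals is invoked on the constant, so a null
    // capability simply compares as "not nvdimm".
    static boolean isNvdimmSafe(Device device) {
        return "nvdimm".equals(device.getCapability());
    }

    public static void main(String[] args) {
        Device noCapability = new Device(null);
        System.out.println("safe check: " + isNvdimmSafe(noCapability));
        try {
            isNvdimmUnsafe(noCapability);
        } catch (NullPointerException e) {
            System.out.println("unsafe check threw NullPointerException");
        }
    }
}
```

Note that this only guards against a null capability; it does not help if the device object itself is null.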

Comment 5 Liran Rotenberg 2020-12-09 09:10:08 UTC
Thanks to Nisim, I could debug his environment.

The problem is that the mdev is a host device that the engine doesn't know about.
It is among the VM devices, so when we do:
HostDevice hostDevice = hostDevicesSupplier.get().get(device.getDevice());
hostDevice will be null.

We should check that the device exists before checking whether it is an NVDIMM.
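
A rough sketch of the check proposed here, under stated assumptions: this is not the merged patch, and `NvdimmSizeSketch`, its nested `HostDevice`, the `sizeBytes` field, and the device names other than `hostdev0` are all hypothetical. The engine's host-device lookup is modeled as a plain `Map` from device name to host device, where a VM device the engine does not know about (such as the mdev) has no entry.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NvdimmSizeSketch {
    // Hypothetical stand-in for a host device known to the engine.
    static final class HostDevice {
        final String capability;
        final long sizeBytes;

        HostDevice(String capability, long sizeBytes) {
            this.capability = capability;
            this.sizeBytes = sizeBytes;
        }
    }

    // Sums the sizes of NVDIMM host devices attached to the VM.
    // VM devices with no matching host device entry are skipped.
    static long totalNvdimmSize(List<String> vmDeviceNames,
                                Map<String, HostDevice> hostDevices) {
        long total = 0;
        for (String name : vmDeviceNames) {
            HostDevice hostDevice = hostDevices.get(name);
            if (hostDevice == null) {
                // e.g. an mdev device: present on the VM, unknown as a host device
                continue;
            }
            if ("nvdimm".equals(hostDevice.capability)) {
                total += hostDevice.sizeBytes;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, HostDevice> known = new HashMap<>();
        known.put("nvdimm0", new HostDevice("nvdimm", 1024L));
        // "hostdev0" has no entry in the map, mirroring the mdev case.
        System.out.println(totalNvdimmSize(Arrays.asList("hostdev0", "nvdimm0"), known));
    }
}
```

Checking for the missing device first means the lookup result is never dereferenced when it is null, so the VM can start even with an mdev among its devices.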

Comment 6 Nisim Simsolo 2020-12-28 09:56:49 UTC
Verified:
ovirt-engine-4.4.4.5-0.10.el8ev
vdsm-4.40.40-1.el8ev.x86_64
qemu-kvm-5.1.0-14.module+el8.3.0+8790+80f9c6d8.1.x86_64
libvirt-daemon-6.6.0-7.1.module+el8.3.0+8852+b44fca9f.x86_64

Nvidia drivers for host and VM: grid12.0_beta
host: NVIDIA-vGPU-rhel-8.3-460.26.x86_64 
VM: NVIDIA-Linux-x86_64-460.26-grid.run

Comment 7 Sandro Bonazzola 2021-01-12 16:23:55 UTC
This bug is included in the oVirt 4.4.4 release, published on December 21st, 2020.

Since the problem described in this bug report should be resolved in the oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

