Bug 1905417

Summary: vGPU: VM failed to run with mdev_type instance (java NPE in engine.log)
Product: [oVirt] ovirt-engine Reporter: Nisim Simsolo <nsimsolo>
Component: BLL.Virt Assignee: Liran Rotenberg <lrotenbe>
Status: CLOSED CURRENTRELEASE QA Contact: Nisim Simsolo <nsimsolo>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.4.4.3 CC: ahadas, bugs, gveitmic, lrotenbe, nsimsolo
Target Milestone: ovirt-4.4.4 Flags: pm-rhel: ovirt-4.4+
pm-rhel: planning_ack+
ahadas: devel_ack+
pm-rhel: testing_ack+
Target Release: 4.4.4.4   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovirt-engine-4.4.4.4 Doc Type: Bug Fix
Doc Text:
Previously, running a VM that has an MDEV device resulted in a NullPointerException. Now the VM boots as expected without any error.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-01-12 16:23:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
engine.log none
vdsm.log none
VM QEMU log none

Description Nisim Simsolo 2020-12-08 09:59:56 UTC
Description of problem:
When running a VM with a vGPU instance, powering the VM off and then running it again fails, with a Java NPE in engine.log:
2020-12-08 11:53:40,254+02 ERROR [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-47374) [] Failed to create VM: java.lang.NullPointerException

- A workaround for this issue is to edit the VM -> disable pinning to host -> save changes, then edit the VM -> pin the VM to the host -> save changes, and run the VM again.

Version-Release number of selected component (if applicable):
ovirt-engine-4.4.4.3-0.5.el8ev
qemu-kvm-5.1.0-14.module+el8.3.0+8790+80f9c6d8.1.x86_64
vdsm-4.40.39-1.el8ev.x86_64
libvirt-daemon-6.6.0-7.1.module+el8.3.0+8852+b44fca9f.x86_64
Nvidia drivers for host and VM: grid12.0_beta
host: NVIDIA-vGPU-rhel-8.3-460.26.x86_64 
VM: NVIDIA-Linux-x86_64-460.26-grid.run

How reproducible:
inconsistently

Steps to Reproduce:
1. Run VM with vGPU instance, install Nvidia drivers on the VM.
2. Power off VM and run VM again.

Actual results:
VM failed to run.

Expected results:
VM should run with Nvidia instance.

Additional info:
vdsm.log and engine.log (2020-12-08 11:53:40,254+02 ERROR) attached.

Comment 1 Nisim Simsolo 2020-12-08 10:13:38 UTC
Created attachment 1737561 [details]
engine.log

Comment 2 Nisim Simsolo 2020-12-08 10:14:00 UTC
Created attachment 1737562 [details]
vdsm.log

Comment 3 Nisim Simsolo 2020-12-08 10:15:28 UTC
Created attachment 1737563 [details]
VM QEMU log

Comment 4 Liran Rotenberg 2020-12-08 13:34:13 UTC
The VM failed to start due to an NPE when getting the host device capability.

When we write the max memory value to the domain, we calculate the total NVDIMM size, if any NVDIMM exists.
For that we go over the host devices and search for NVDIMM devices:
(VmInfoBuildUtils::getNvdimmTotalSize)
if (hostDevice.getCapability().equals("nvdimm"))

But if the capability is null, we get that NPE.
From the engine log it looks like the problematic one is `hostdev0`, which is the MDEV host device.

The question is whether it is OK for an MDEV device to have no capability, or whether some kernel modules are not loaded, which could cause this.
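
The failure mode described in this comment can be illustrated with a minimal, self-contained sketch. `FakeHostDevice` is a hypothetical stand-in, not the real oVirt entity; only the `getCapability()` accessor name is taken from the quoted check. It shows that calling `equals` on a null capability throws, while using the constant as the receiver does not.

```java
// Hypothetical stand-in for the engine's host device entity (assumption:
// only getCapability() is modeled here).
class FakeHostDevice {
    private final String capability;

    FakeHostDevice(String capability) {
        this.capability = capability;
    }

    String getCapability() {
        return capability;
    }
}

class CapabilityCheckSketch {
    // Same shape as the quoted check: throws NullPointerException
    // when getCapability() returns null.
    static boolean isNvdimmUnsafe(FakeHostDevice hostDevice) {
        return hostDevice.getCapability().equals("nvdimm");
    }

    // Null-safe variant: the string constant is the receiver of equals(),
    // so a null capability simply yields false.
    static boolean isNvdimmSafe(FakeHostDevice hostDevice) {
        return "nvdimm".equals(hostDevice.getCapability());
    }

    public static void main(String[] args) {
        FakeHostDevice noCapability = new FakeHostDevice(null);
        System.out.println("safe check: " + isNvdimmSafe(noCapability));
        try {
            isNvdimmUnsafe(noCapability);
        } catch (NullPointerException e) {
            System.out.println("unsafe check threw NullPointerException");
        }
    }
}
```

Note that the null-safe comparison only helps when the device object itself is non-null.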

Comment 5 Liran Rotenberg 2020-12-09 09:10:08 UTC
Thanks to Nisim I could debug his environment.

The problem is that the mdev is a host device that the engine does not know about.
It is listed in the VM devices, so when we do:
HostDevice hostDevice = hostDevicesSupplier.get().get(device.getDevice());
The hostDevice will be null.

We should check that the device exists before checking whether it is nvdimm.
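
A minimal sketch of such a guard, under a simplified model, might look like the following. `StubHostDevice`, `StubVmDevice`, the `getSizeBytes()` accessor, the device names, and the shape of `nvdimmTotalSize` are all hypothetical stand-ins chosen for illustration; this is not the engine's actual code or the actual patch. The point is only that the map lookup returns null for a device the engine has no record of, so the null check has to come before the capability check.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a host device known to the engine.
class StubHostDevice {
    private final String capability;
    private final long sizeBytes;

    StubHostDevice(String capability, long sizeBytes) {
        this.capability = capability;
        this.sizeBytes = sizeBytes;
    }

    String getCapability() { return capability; }
    long getSizeBytes() { return sizeBytes; }
}

// Hypothetical stand-in for a VM device that refers to a host device by name.
class StubVmDevice {
    private final String device;

    StubVmDevice(String device) { this.device = device; }

    String getDevice() { return device; }
}

class NvdimmSizeSketch {
    // Sums the size of all NVDIMM host devices attached to the VM.
    // Devices missing from the host device map (e.g. an mdev) are skipped.
    static long nvdimmTotalSize(List<StubVmDevice> vmDevices,
            Supplier<Map<String, StubHostDevice>> hostDevicesSupplier) {
        Map<String, StubHostDevice> hostDevices = hostDevicesSupplier.get();
        long total = 0;
        for (StubVmDevice device : vmDevices) {
            StubHostDevice hostDevice = hostDevices.get(device.getDevice());
            // Guard first: the lookup returns null for unknown devices.
            if (hostDevice != null && "nvdimm".equals(hostDevice.getCapability())) {
                total += hostDevice.getSizeBytes();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, StubHostDevice> known = new HashMap<>();
        known.put("nvdimm_example", new StubHostDevice("nvdimm", 1024L));
        known.put("pci_example", new StubHostDevice("pci", 0L));

        List<StubVmDevice> vmDevices = Arrays.asList(
                new StubVmDevice("nvdimm_example"),
                new StubVmDevice("pci_example"),
                new StubVmDevice("mdev_example")); // not present in the map

        System.out.println(nvdimmTotalSize(vmDevices, () -> known));
    }
}
```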

Comment 6 Nisim Simsolo 2020-12-28 09:56:49 UTC
Verified:
ovirt-engine-4.4.4.5-0.10.el8ev
vdsm-4.40.40-1.el8ev.x86_64
qemu-kvm-5.1.0-14.module+el8.3.0+8790+80f9c6d8.1.x86_64
libvirt-daemon-6.6.0-7.1.module+el8.3.0+8852+b44fca9f.x86_64

Nvidia drivers for host and VM: grid12.0_beta
host: NVIDIA-vGPU-rhel-8.3-460.26.x86_64 
VM: NVIDIA-Linux-x86_64-460.26-grid.run

Comment 7 Sandro Bonazzola 2021-01-12 16:23:55 UTC
This bugzilla is included in oVirt 4.4.4 release, published on December 21st 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.