Description of problem:
When a VM with a vGPU instance is powered off and then run again, the VM fails to start with a Java NPE in engine.log:

2020-12-08 11:53:40,254+02 ERROR [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-47374) [] Failed to create VM: java.lang.NullPointerException

Workaround: edit the VM -> disable pinning to host -> save changes; then edit the VM -> pin the VM to the host -> save changes, and run the VM again.

Version-Release number of selected component (if applicable):
ovirt-engine-4.4.4.3-0.5.el8ev
qemu-kvm-5.1.0-14.module+el8.3.0+8790+80f9c6d8.1.x86_64
vdsm-4.40.39-1.el8ev.x86_64
libvirt-daemon-6.6.0-7.1.module+el8.3.0+8852+b44fca9f.x86_64
Nvidia drivers for host and VM: grid12.0_beta
host: NVIDIA-vGPU-rhel-8.3-460.26.x86_64
VM: NVIDIA-Linux-x86_64-460.26-grid.run

How reproducible:
Inconsistently

Steps to Reproduce:
1. Run a VM with a vGPU instance and install the Nvidia drivers on the VM.
2. Power off the VM and run it again.

Actual results:
The VM fails to run.

Expected results:
The VM should run with the Nvidia instance.

Additional info:
vdsm.log and engine.log (see 2020-12-08 11:53:40,254+02 ERROR) are attached.
Created attachment 1737561: engine.log
Created attachment 1737562: vdsm.log
Created attachment 1737563: VM QEMU log
The VM failed to start because of an NPE while getting the host device capability. When we write the max memory value to the domain, we calculate the NVDIMM size, if any NVDIMM exists. To do that, we go over the host devices and look for NVDIMM devices (VmInfoBuildUtils::getNvdimmTotalSize):

if (hostDevice.getCapability().equals("nvdimm"))

If the capability is null, we get that NPE. From the engine log, the problematic device appears to be `hostdev0`, which is the MDEV host device. The question is whether it is OK for an MDEV device to have no capability, or whether this can be caused by some kernel modules not being loaded.
Thanks to Nisim, I could debug his environment. The problem is that the mdev is a host device that the engine does not know about. It appears among the VM devices, so when we do:

HostDevice hostDevice = hostDevicesSupplier.get().get(device.getDevice());

hostDevice is null. We should check that the device exists before checking whether it is an nvdimm.
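For illustration, here is a minimal, self-contained sketch of the kind of guard described above. It is not the actual engine patch: the class `NvdimmSizeSketch`, the simplified `HostDevice` stand-in, its `getSizeMb()` accessor, the method signature, and the device names are all hypothetical and only mirror the shape of the lookup. It checks that the looked-up device exists, and it compares the capability with the constant on the left so that a null capability is tolerated as well.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NvdimmSizeSketch {

    // Hypothetical, simplified stand-in for the engine's HostDevice entity.
    static final class HostDevice {
        private final String capability; // may be null
        private final long sizeMb;       // assumed field, for illustration only

        HostDevice(String capability, long sizeMb) {
            this.capability = capability;
            this.sizeMb = sizeMb;
        }

        String getCapability() { return capability; }
        long getSizeMb() { return sizeMb; }
    }

    // Sums the size of NVDIMM devices attached to the VM. A VM device with no
    // matching host device (as in the mdev case) is skipped rather than
    // dereferenced.
    static long getNvdimmTotalSize(List<String> vmDeviceNames,
                                   Map<String, HostDevice> hostDevices) {
        long total = 0;
        for (String name : vmDeviceNames) {
            HostDevice hostDevice = hostDevices.get(name);
            // Null check first; "nvdimm".equals(...) is also safe when the
            // capability itself is null.
            if (hostDevice != null && "nvdimm".equals(hostDevice.getCapability())) {
                total += hostDevice.getSizeMb();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, HostDevice> hostDevices = new HashMap<>();
        hostDevices.put("nvdimm_example", new HostDevice("nvdimm", 1024));
        hostDevices.put("pci_example", new HostDevice(null, 0));
        // "mdev_example" is deliberately absent from the map, like a device
        // the engine does not know about.
        List<String> vmDevices =
                Arrays.asList("nvdimm_example", "pci_example", "mdev_example");
        System.out.println(getNvdimmTotalSize(vmDevices, hostDevices));
    }
}
```

With the unguarded form `hostDevice.getCapability().equals("nvdimm")`, the same input would throw a NullPointerException on the missing device.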
Verified:
ovirt-engine-4.4.4.5-0.10.el8ev
vdsm-4.40.40-1.el8ev.x86_64
qemu-kvm-5.1.0-14.module+el8.3.0+8790+80f9c6d8.1.x86_64
libvirt-daemon-6.6.0-7.1.module+el8.3.0+8852+b44fca9f.x86_64
Nvidia drivers for host and VM: grid12.0_beta
host: NVIDIA-vGPU-rhel-8.3-460.26.x86_64
VM: NVIDIA-Linux-x86_64-460.26-grid.run
This bugzilla is included in oVirt 4.4.4 release, published on December 21st 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.