Bug 1481007 - vGPU: VMs with mdev_type hook fail to run after RHV upgrade, even if the hook is removed
Status: VERIFIED
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ovirt-4.2.0
Target Release: ---
Assigned To: Arik
QA Contact: meital avital
Depends On:
Blocks: 1486524
 
Reported: 2017-08-13 08:40 EDT by Nisim Simsolo
Modified: 2017-10-03 10:15 EDT (History)
CC: 4 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt-4.2+


Attachments
VM_devices screenshot (188.02 KB, image/png)
2017-08-13 08:47 EDT, Nisim Simsolo
vdsm.log (325.63 KB, application/x-xz)
2017-08-13 08:48 EDT, Nisim Simsolo
engine.log (461.73 KB, application/x-xz)
2017-08-13 08:48 EDT, Nisim Simsolo


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 82211 master MERGED core: currently only managed host devices are monitored 2017-09-27 09:41 EDT
oVirt gerrit 82212 master MERGED core: monitor unmanaged vm host devices 2017-09-27 09:41 EDT
oVirt gerrit 82213 master MERGED core: fix NPE when running vm with mdev device 2017-09-27 10:43 EDT
oVirt gerrit 82214 master MERGED core: refresh vm devices after skipping unmanaged host device 2017-09-27 10:43 EDT

Description Nisim Simsolo 2017-08-13 08:40:18 EDT
Description of problem:
VMs with an mdev_type configuration fail to run after upgrading the RHV setup to ovirt-engine-4.2.0-0.0.master.20170811144920.gita423008.el7.centos (from ovirt-engine-4.2.0-0.0.master.20170728194615.gitec6aa15.el7.centos).
The VM still fails to run even after the mdev_type custom property is removed.
Inspecting the vm_device table in the engine database for the affected VM shows 15 devices with the same mdev uuid, for example:
engine=# select * from vm_device where vm_id='7feae268-6669-4ac4-920f-7177a43d7acd';
(All rows share vm_id = 7feae268-6669-4ac4-920f-7177a43d7acd; the alias, snapshot_id, logical_name and host_device columns are empty throughout and are omitted here; custom_properties is { } for the mdev rows.)

 device_id                            | type     | device | address                                     | spec_params | is_managed | is_plugged | is_readonly | _create_date                  | _update_date
--------------------------------------+----------+--------+---------------------------------------------+-------------+------------+------------+-------------+-------------------------------+-------------------------------
 93934a11-d492-4995-828c-98c61a920772 | graphics | spice  |                                             |             | t          | t          | f           | 2017-07-31 13:35:47.691817+03 |
 b35aaa5e-b29b-4b7e-834c-2cbc63d1e158 | hostdev  | mdev   | {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} | { }         | f          | t          | f           | 2017-07-31 13:55:09.336491+03 | 2017-08-09 16:50:01.459209+03
 bdd3c6da-0230-49b4-bf68-b7dcedcf6a77 | hostdev  | mdev   | {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} | { }         | f          | t          | f           | 2017-08-02 11:16:16.144596+03 | 2017-08-09 16:50:01.459209+03
 0301cf85-d155-4457-b288-a458b1b817e1 | hostdev  | mdev   | {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} | { }         | f          | t          | f           | 2017-07-31 13:49:45.775929+03 | 2017-08-09 16:50:01.459209+03
 080a73c7-a42c-46ac-a758-52f97f9fc3bd | hostdev  | mdev   | {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} | { }         | f          | t          | f           | 2017-07-31 12:04:10.825867+03 | 2017-08-09 16:50:01.459209+03
 0cff3417-4ef3-42eb-b034-b088265cd134 | hostdev  | mdev   | {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} | { }         | f          | t          | f           | 2017-07-31 14:23:12.984532+03 | 2017-08-09 16:50:01.459209+03
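A query along these lines can flag VMs that have accumulated duplicate unmanaged mdev rows (an illustrative sketch based only on the columns shown above, not a query taken from the logs):

-- list VMs with more than one unmanaged mdev row pointing at the same mdev uuid
SELECT vm_id, address, count(*) AS duplicate_rows
FROM vm_device
WHERE device = 'mdev' AND NOT is_managed
GROUP BY vm_id, address
HAVING count(*) > 1;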


Version-Release number of selected component (if applicable):
vdsm-4.20.2-60.git06231e5.el7.centos.x86_64 (upgraded from vdsm-4.20.2-33.gite85019b.el7.centos.x86_64)
libvirt-client-3.2.0-14.el7_4.2.x86_64
qemu-kvm-rhev-2.9.0-14.el7.x86_64
vdsm-hook-vfio-mdev-4.20.2-60.git06231e5.el7.centos.noarch (upgraded from vdsm-4.20.2-33.gite85019b.el7.centos.x86_64)
Nvidia drivers: GRIDSW_5.0 Beta Release KVM Drivers (R384)
engine kernel: kernel-3.10.0-693.el7.x86_64
Host kernel: kernel-3.10.0-693.1.1.el7.x86_64 (upgraded from kernel-3.10.0-693.el7.x86_64)

How reproducible:
100%

Steps to Reproduce:
1. Create a VM with the mdev_type hook, install GRID drivers on the VM, and verify the Nvidia drivers are running properly on that VM's GPU.
2. Upgrade the RHV environment (engine and host).
3. Try to run the VM after the upgrade.
4. Remove the VM hook and try to run the VM again.
5. Run a VM that did not have the mdev_type hook before the upgrade.

Actual results:
3-4. In both cases the VM fails to run.
5. The VM runs properly.

Expected results:
The VM should run successfully in all cases.

Additional info:
engine.log (see the ERROR at 2017-08-13 13:37:41,930+0300) and vdsm.log are attached, along with a screenshot of the many mdev devices sharing the same uuid.
Comment 1 Nisim Simsolo 2017-08-13 08:47 EDT
Created attachment 1312676 [details]
VM_devices screenshot
Comment 2 Nisim Simsolo 2017-08-13 08:48 EDT
Created attachment 1312677 [details]
vdsm.log
Comment 3 Nisim Simsolo 2017-08-13 08:48 EDT
Created attachment 1312678 [details]
engine.log
Comment 5 Arik 2017-08-23 11:32:31 EDT
Something is wrong with the attached vdsm log: I can't find messages related to the VM '7feae268-6669-4ac4-920f-7177a43d7acd'. Anyway, in this case it would be best to look at the system while the failure happens. Can you please try to reproduce it and call me to have a look?
Comment 6 Nisim Simsolo 2017-08-24 04:31:23 EDT
I have an "old" VM with more than one mdev-type device that fails to run. Please contact me when you can.
Comment 7 Arik 2017-08-29 04:40:14 EDT
Nisim, so it doesn't happen anymore on the master branch. Could you verify whether it happens in 4.1?
Thanks.
Comment 8 Nisim Simsolo 2017-09-19 10:25:02 EDT
It happens (an mdev device with the same uuid is added to vm_device) in 2 cases:
1. When removing the mdev_type hook and adding another one.
2. After upgrading the setup from 4.1.5-2 to 4.1.6-4.

In both cases the VM runs properly with the Nvidia instance attached.
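One way to watch the duplication at the database level (an illustrative sketch; substitute the actual VM uuid for the placeholder):

-- list the unmanaged mdev rows for one VM in creation order;
-- a new row after each hook change or upgrade shows the duplication described above
SELECT device_id, address, _create_date, _update_date
FROM vm_device
WHERE vm_id = '<vm-uuid>' AND device = 'mdev' AND NOT is_managed
ORDER BY _create_date;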
Comment 9 Tomas Jelinek 2017-09-25 04:59:55 EDT
Targeting this to 4.2, since in 4.1 the only impact is that the VM devices subtab can show more devices than the VM actually has.
Comment 10 Nisim Simsolo 2017-10-03 10:15:53 EDT
Verification builds: 
ovirt-engine-4.2.0-0.0.master.20171002190603.git3015ada.el7.centos
libvirt-client-3.2.0-14.el7_4.3.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.8.x86_64
vdsm-4.20.3-128.git52f2c60.el7.centos.x86_64
vdsm-hook-vfio-mdev-4.20.3-128.git52f2c60.el7.centos
NVIDIA-Linux-x86_64-384.37-vgpu-kvm

Verification scenario:
1. Run a VM with the mdev_type hook.
2. Upgrade the setup.
3. Verify the VM is still running and that only 1 mdev device is listed under VM -> VM devices.
4. Power off and run the VM again.
5. Verify the VM is running properly with the Nvidia instance.
6. Import a VM with multiple mdev VM devices from the export domain (I exported such a problematic VM to the export domain when this bug was created).
7. Run the VM and verify it is running properly with the Nvidia instance. Browse webadmin -> Virtual Machines -> select the imported VM -> VM devices and verify that only 1 mdev device is now listed (the old duplicates are actually removed); a database-level spot check is sketched below.
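A sketch of that database-level spot check (illustrative; substitute the imported VM's uuid for the placeholder):

-- expect exactly one mdev row for the imported VM after the fix
SELECT device_id, is_managed, _create_date
FROM vm_device
WHERE vm_id = '<imported-vm-uuid>' AND device = 'mdev';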
