Bug 1481007 - vGPU: VMs with mdev_type hook fail to run after RHV upgrade, even if the hook is removed.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ovirt-4.2.0
Target Release: ---
Assignee: Arik
QA Contact: Nisim Simsolo
URL:
Whiteboard:
Depends On:
Blocks: 1486524
 
Reported: 2017-08-13 12:40 UTC by Nisim Simsolo
Modified: 2019-04-28 13:08 UTC
CC List: 4 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-12-20 11:37:05 UTC
oVirt Team: Virt
Embargoed:
rule-engine: ovirt-4.2+


Attachments
VM_devices screenshot (188.02 KB, image/png), 2017-08-13 12:47 UTC, Nisim Simsolo
vdsm.log (325.63 KB, application/x-xz), 2017-08-13 12:48 UTC, Nisim Simsolo
engine.log (461.73 KB, application/x-xz), 2017-08-13 12:48 UTC, Nisim Simsolo


Links
oVirt gerrit 82211 (master, MERGED): core: currently only managed host devices are monitored, 2017-09-27 13:41:50 UTC
oVirt gerrit 82212 (master, MERGED): core: monitor unmanaged vm host devices, 2017-09-27 13:41:53 UTC
oVirt gerrit 82213 (master, MERGED): core: fix NPE when running vm with mdev device, 2017-09-27 14:43:02 UTC
oVirt gerrit 82214 (master, MERGED): core: refresh vm devices after skipping unmanaged host device, 2017-09-27 14:43:04 UTC

Description Nisim Simsolo 2017-08-13 12:40:18 UTC
Description of problem:
VMs with an mdev_type configuration fail to run after upgrading the RHV setup to ovirt-engine-4.2.0-0.0.master.20170811144920.gita423008.el7.centos (from ovirt-engine-4.2.0-0.0.master.20170728194615.gitec6aa15.el7.centos).
Even after removing the mdev_type custom property, the VM still fails to run.
Inspecting the engine database, the vm_device table for the affected VM shows 15 devices with the same mdev UUID, for example:
engine=# select * from vm_device where vm_id='7feae268-6669-4ac4-920f-7177a43d7acd';
 device_id                            | type     | device | address                                     | is_managed | is_plugged | _create_date                  | _update_date
--------------------------------------+----------+--------+---------------------------------------------+------------+------------+-------------------------------+-------------------------------
 93934a11-d492-4995-828c-98c61a920772 | graphics | spice  |                                             | t          | t          | 2017-07-31 13:35:47.691817+03 |
 b35aaa5e-b29b-4b7e-834c-2cbc63d1e158 | hostdev  | mdev   | {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} | f          | t          | 2017-07-31 13:55:09.336491+03 | 2017-08-09 16:50:01.459209+03
 bdd3c6da-0230-49b4-bf68-b7dcedcf6a77 | hostdev  | mdev   | {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} | f          | t          | 2017-08-02 11:16:16.144596+03 | 2017-08-09 16:50:01.459209+03
 0301cf85-d155-4457-b288-a458b1b817e1 | hostdev  | mdev   | {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} | f          | t          | 2017-07-31 13:49:45.775929+03 | 2017-08-09 16:50:01.459209+03
 080a73c7-a42c-46ac-a758-52f97f9fc3bd | hostdev  | mdev   | {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} | f          | t          | 2017-07-31 12:04:10.825867+03 | 2017-08-09 16:50:01.459209+03
 0cff3417-4ef3-42eb-b034-b088265cd134 | hostdev  | mdev   | {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} | f          | t          | 2017-07-31 14:23:12.984532+03 | 2017-08-09 16:50:01.459209+03
(output trimmed: every row belongs to vm_id 7feae268-6669-4ac4-920f-7177a43d7acd; is_readonly is 'f' for all rows, spec_params and custom_properties are empty or '{ }', and alias, snapshot_id, logical_name and host_device are empty)

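A minimal query sketch, assuming only the vm_device columns shown in the dump above, to list every VM that has accumulated more than one unmanaged mdev row for the same mdev UUID:

-- Find VMs carrying duplicate unmanaged mdev host devices
-- (the mdev UUID is stored in the 'address' column, as seen above).
SELECT vm_id,
       address,
       count(*) AS duplicate_rows
FROM   vm_device
WHERE  type = 'hostdev'
  AND  device = 'mdev'
  AND  is_managed = false
GROUP BY vm_id, address
HAVING count(*) > 1;

Any VM returned by such a query is affected by the duplication described in this report.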

Version-Release number of selected component (if applicable):
vdsm-4.20.2-60.git06231e5.el7.centos.x86_64 (upgraded from    vdsm-4.20.2-33.gite85019b.el7.centos.x86_64)
libvirt-client-3.2.0-14.el7_4.2.x86_64
qemu-kvm-rhev-2.9.0-14.el7.x86_64
vdsm-hook-vfio-mdev-4.20.2-60.git06231e5.el7.centos.noarch (upgraded from vdsm-4.20.2-33.gite85019b.el7.centos.x86_64)
Nvidia drivers: GRIDSW_5.0 Beta Release KVM Drivers (R384)
engine kernel: kernel-3.10.0-693.el7.x86_64
Host kernel: kernel-3.10.0-693.1.1.el7.x86_64 (upgraded from kernel-3.10.0-693.el7.x86_64)

How reproducible:
100%

Steps to Reproduce:
1. Create a VM with the mdev_type hook, install the GRID drivers in this VM, and verify the Nvidia drivers are working properly with the VM's GPU.
2. Upgrade the RHV environment (engine and host).
3. Try to run the VM after the upgrade.
4. Remove the VM hook and try to run the VM.
5. Run a VM that did not have the mdev_type hook before the upgrade.

Actual results:
3-4. In both cases the VM fails to run.
5. The VM runs properly.

Expected results:
The VM should not fail to run.

Additional info:
engine.log (see the ERROR at 2017-08-13 13:37:41,930+0300) and vdsm.log are attached, along with a screenshot showing multiple mdev devices with the same UUID.

Comment 1 Nisim Simsolo 2017-08-13 12:47:38 UTC
Created attachment 1312676 [details]
VM_devices screenshot

Comment 2 Nisim Simsolo 2017-08-13 12:48:03 UTC
Created attachment 1312677 [details]
vdsm.log

Comment 3 Nisim Simsolo 2017-08-13 12:48:29 UTC
Created attachment 1312678 [details]
engine.log

Comment 5 Arik 2017-08-23 15:32:31 UTC
Something is wrong with the attached vdsm log: I can't find any messages related to the VM '7feae268-6669-4ac4-920f-7177a43d7acd'. In any case, it would be best to look at the system while it happens. Can you please try to reproduce it and call me to have a look?

Comment 6 Nisim Simsolo 2017-08-24 08:31:23 UTC
I have an "old" VM with more than one mdev-type device that fails to run. Please contact me when you can.

Comment 7 Arik 2017-08-29 08:40:14 UTC
Nisim, so it doesn't happen anymore on the master branch. Could you verify that it happens in 4.1?
Thanks.

Comment 8 Nisim Simsolo 2017-09-19 14:25:02 UTC
It happens (an mdev device is added to vm_device with the same UUID) in two cases:
1. When removing the mdev_type hook and adding another one.
2. After upgrading the setup from 4.1.5-2 to 4.1.6-4.

In both cases the VM runs properly with the Nvidia instance attached.
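A minimal sketch for correlating the duplicates with the two cases above, again assuming only the vm_device columns shown in the description (the VM id is the one from the original dump):

-- Show when each duplicate mdev row was created and last updated,
-- so every extra row can be matched to a hook change or an upgrade.
SELECT device_id,
       _create_date,
       _update_date
FROM   vm_device
WHERE  vm_id = '7feae268-6669-4ac4-920f-7177a43d7acd'
  AND  type = 'hostdev'
  AND  device = 'mdev'
ORDER BY _create_date;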

Comment 9 Tomas Jelinek 2017-09-25 08:59:55 UTC
Targeting this to 4.2, since in 4.1 the only impact is that the VM devices subtab can show more devices than the VM actually has.

Comment 10 Nisim Simsolo 2017-10-03 14:15:53 UTC
Verification builds: 
ovirt-engine-4.2.0-0.0.master.20171002190603.git3015ada.el7.centos
libvirt-client-3.2.0-14.el7_4.3.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.8.x86_64
vdsm-4.20.3-128.git52f2c60.el7.centos.x86_64
vdsm-hook-vfio-mdev-4.20.3-128.git52f2c60.el7.centos
NVIDIA-Linux-x86_64-384.37-vgpu-kvm

Verification scenario:
1. Run a VM with the mdev_type hook.
2. Upgrade the setup.
3. Verify the VM is still running and that only one mdev device is listed under VM -> VM devices.
4. Power off and run the VM again.
5. Verify the VM runs properly with the Nvidia instance.
6. Import a VM with multiple mdev VM devices from the export domain (I exported such a problematic VM to the export domain when this bug was created).
7. Run the VM and verify it runs properly with the Nvidia instance. Browse webadmin -> Virtual Machines -> select the imported VM -> VM devices, and verify only one mdev device is now listed (the old duplicates are actually removed); a database-level version of this check is sketched below.
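A sketch of the database-level check for step 7, assuming the vm_device schema shown in the description; the VM id is a placeholder to be replaced with the imported VM's id:

-- After the fix, the imported VM should keep exactly one mdev row.
SELECT count(*) AS mdev_rows
FROM   vm_device
WHERE  vm_id = '<imported-vm-id>'   -- placeholder
  AND  type = 'hostdev'
  AND  device = 'mdev';
-- Expected result: 1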

Comment 11 Sandro Bonazzola 2017-12-20 11:37:05 UTC
This bug is included in the oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this report should be resolved in that release, the bug has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

