Description of problem:
VMs with an mdev_type configuration fail to run after upgrading the RHV setup to ovirt-engine-4.2.0-0.0.master.20170811144920.gita423008.el7.centos (from ovirt-engine-4.2.0-0.0.master.20170728194615.gitec6aa15.el7.centos). Removing the mdev_type custom property does not help; the VM still fails to run.

Inspecting the engine database, the "vm_device" table for the affected VM shows 15 devices with the same uuid, for example (output trimmed to the populated columns for readability):

engine=# select * from vm_device where vm_id='7feae268-6669-4ac4-920f-7177a43d7acd';

              device_id               |   type   | device |                   address                   | is_managed | is_plugged |         _create_date          |         _update_date
--------------------------------------+----------+--------+---------------------------------------------+------------+------------+-------------------------------+-------------------------------
 93934a11-d492-4995-828c-98c61a920772 | graphics | spice  |                                             | t          | t          | 2017-07-31 13:35:47.691817+03 |
 b35aaa5e-b29b-4b7e-834c-2cbc63d1e158 | hostdev  | mdev   | {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} | f          | t          | 2017-07-31 13:55:09.336491+03 | 2017-08-09 16:50:01.459209+03
 bdd3c6da-0230-49b4-bf68-b7dcedcf6a77 | hostdev  | mdev   | {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} | f          | t          | 2017-08-02 11:16:16.144596+03 | 2017-08-09 16:50:01.459209+03
 0301cf85-d155-4457-b288-a458b1b817e1 | hostdev  | mdev   | {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} | f          | t          | 2017-07-31 13:49:45.775929+03 | 2017-08-09 16:50:01.459209+03
 080a73c7-a42c-46ac-a758-52f97f9fc3bd | hostdev  | mdev   | {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} | f          | t          | 2017-07-31 12:04:10.825867+03 | 2017-08-09 16:50:01.459209+03
 0cff3417-4ef3-42eb-b034-b088265cd134 | hostdev  | mdev   | {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} | f          | t          | 2017-07-31 14:23:12.984532+03 | 2017-08-09 16:50:01.459209+03
 ...

Version-Release number of selected component (if applicable):
vdsm-4.20.2-60.git06231e5.el7.centos.x86_64 (upgraded from vdsm-4.20.2-33.gite85019b.el7.centos.x86_64)
libvirt-client-3.2.0-14.el7_4.2.x86_64
qemu-kvm-rhev-2.9.0-14.el7.x86_64
vdsm-hook-vfio-mdev-4.20.2-60.git06231e5.el7.centos.noarch (upgraded from vdsm-4.20.2-33.gite85019b.el7.centos.x86_64)
Nvidia drivers: GRIDSW_5.0 Beta Release KVM Drivers (R384)
Engine kernel: kernel-3.10.0-693.el7.x86_64
Host kernel: kernel-3.10.0-693.1.1.el7.x86_64 (upgraded from kernel-3.10.0-693.el7.x86_64)

How reproducible:
100%

Steps to Reproduce:
1. Create a VM with the mdev_type hook, install the GRID drivers on it and verify the Nvidia drivers are running properly against the VM's GPU.
2. Upgrade the RHV environment (engine and host).
3. Try to run the VM after the upgrade.
4. Remove the VM hook and try to run the VM.
5. Run a VM that did not have the mdev_type hook before the upgrade.

Actual results:
3-4. In both cases the VM fails to run.
5. The VM runs properly.

Expected results:
The VM should not fail to run.
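For reference, the duplicates can be spotted without eyeballing the full dump. This is only a sketch against the vm_device schema shown above, not an engine-provided tool; run it read-only via psql on the engine database:

  -- Count duplicate mdev host devices per VM; each address holds the mdev uuid.
  SELECT vm_id, address, count(*) AS copies
  FROM vm_device
  WHERE type = 'hostdev' AND device = 'mdev'
  GROUP BY vm_id, address
  HAVING count(*) > 1;

For the VM above this reports 15 copies of the same {uuid=...} address.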
Additional info:
engine.log (see the ERROR at 2017-08-13 13:37:41,930+0300) and vdsm.log are attached, along with a screenshot of the many mdev devices sharing the same uuid.
Created attachment 1312676 [details] VM_devices screenshot
Created attachment 1312677 [details] vdsm.log
Created attachment 1312678 [details] engine.log
Something is wrong with the attached vdsm log: I can't find any messages related to the VM '7feae268-6669-4ac4-920f-7177a43d7acd'. In any case, it would be best to look at the system while the problem happens. Can you please try to reproduce it and call me to have a look?
I have an "old" VM with more than 1 mdev type device that failed to run. Please contact me when you can.
Nisim, so it doesn't happen anymore on the master branch. Could you verify that it happens in 4.1? Thanks.
It happens (an mdev device is added to vm_device with the same uuid) in two cases:
1. When removing the mdev_type hook and adding another one.
2. After upgrading the setup from 4.1.5-2 to 4.1.6-4.
In both cases the VM runs properly with the Nvidia instance attached.
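If anyone needs to clean an affected database by hand before the fix lands, something along these lines should drop the stale rows. This is only a hypothetical sketch based on the vm_device columns shown in the original report, not an official procedure; back up the engine database before running it:

  -- Keep only the newest mdev row per VM; delete every older duplicate.
  DELETE FROM vm_device d
  USING vm_device newer
  WHERE d.type = 'hostdev' AND d.device = 'mdev'
    AND newer.type = 'hostdev' AND newer.device = 'mdev'
    AND newer.vm_id = d.vm_id
    AND newer._create_date > d._create_date;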
Targeting this to 4.2, since in 4.1 the issue is only that the VM devices subtab can show more devices than the VM actually has.
Verification builds:
ovirt-engine-4.2.0-0.0.master.20171002190603.git3015ada.el7.centos
libvirt-client-3.2.0-14.el7_4.3.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.8.x86_64
vdsm-4.20.3-128.git52f2c60.el7.centos.x86_64
vdsm-hook-vfio-mdev-4.20.3-128.git52f2c60.el7.centos
NVIDIA-Linux-x86_64-384.37-vgpu-kvm

Verification scenario:
1. Run a VM with the mdev_type hook.
2. Upgrade the setup.
3. Verify the VM is still running and that only 1 mdev device is listed under VM -> VM devices.
4. Power the VM off and run it again.
5. Verify the VM is running properly with the Nvidia instance.
6. Import a VM with multiple mdev VM devices from an export domain (I exported such a problematic VM to an export domain when this bug was created).
7. Run the VM and verify it is running properly with the Nvidia instance. Browse webadmin -> Virtual Machines -> select the imported VM -> VM devices and verify only 1 mdev device is now listed (the multiple old ones are actually removed); a DB-level version of this check is sketched below.
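The DB-level check for steps 3 and 7 can be approximated with a query like this (again a sketch, assuming the same vm_device schema as in the original report):

  -- After the fix, each VM should have at most one mdev row.
  SELECT vm_id, count(*) AS mdev_devices
  FROM vm_device
  WHERE type = 'hostdev' AND device = 'mdev'
  GROUP BY vm_id
  ORDER BY mdev_devices DESC;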
This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017. Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.