Bug 1481007

Summary: vGPU: VMs with the mdev_type hook fail to run after RHV upgrade, even after the hook is removed.
Product: [oVirt] ovirt-engine
Reporter: Nisim Simsolo <nsimsolo>
Component: BLL.Virt
Assignee: Arik <ahadas>
Status: CLOSED CURRENTRELEASE
QA Contact: Nisim Simsolo <nsimsolo>
Severity: urgent
Docs Contact:
Priority: high
Version: 4.2.0
CC: ahadas, bugs, nsimsolo, tjelinek
Target Milestone: ovirt-4.2.0
Flags: rule-engine: ovirt-4.2+
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-12-20 11:37:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1486524
Attachments:
  VM_devices screenshot (flags: none)
  vdsm.log (flags: none)
  engine.log (flags: none)

Description Nisim Simsolo 2017-08-13 12:40:18 UTC
Description of problem:
VMs with an mdev_type configuration fail to run after upgrading the RHV setup to ovirt-engine-4.2.0-0.0.master.20170811144920.gita423008.el7.centos (from ovirt-engine-4.2.0-0.0.master.20170728194615.gitec6aa15.el7.centos).
Even after removing the mdev_type custom property, the VM still fails to run.
Querying the vm_device table in the engine database for the affected VM shows 15 devices with the same UUID, for example:
engine=# select * from vm_device where vm_id='7feae268-6669-4ac4-920f-7177a43d7acd';

(All rows share vm_id 7feae268-6669-4ac4-920f-7177a43d7acd; the alias, snapshot_id, logical_name and host_device columns are empty and are omitted here for readability.)

 device_id                            | type     | device | address                                     | spec_params | is_managed | is_plugged | is_readonly | _create_date                  | _update_date                  | custom_properties
--------------------------------------+----------+--------+---------------------------------------------+-------------+------------+------------+-------------+-------------------------------+-------------------------------+-------------------
 93934a11-d492-4995-828c-98c61a920772 | graphics | spice  |                                             |             | t          | t          | f           | 2017-07-31 13:35:47.691817+03 |                               |
 b35aaa5e-b29b-4b7e-834c-2cbc63d1e158 | hostdev  | mdev   | {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} | { }         | f          | t          | f           | 2017-07-31 13:55:09.336491+03 | 2017-08-09 16:50:01.459209+03 | { }
 bdd3c6da-0230-49b4-bf68-b7dcedcf6a77 | hostdev  | mdev   | {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} | { }         | f          | t          | f           | 2017-08-02 11:16:16.144596+03 | 2017-08-09 16:50:01.459209+03 | { }
 0301cf85-d155-4457-b288-a458b1b817e1 | hostdev  | mdev   | {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} | { }         | f          | t          | f           | 2017-07-31 13:49:45.775929+03 | 2017-08-09 16:50:01.459209+03 | { }
 080a73c7-a42c-46ac-a758-52f97f9fc3bd | hostdev  | mdev   | {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} | { }         | f          | t          | f           | 2017-07-31 12:04:10.825867+03 | 2017-08-09 16:50:01.459209+03 | { }
 0cff3417-4ef3-42eb-b034-b088265cd134 | hostdev  | mdev   | {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} | { }         | f          | t          | f           | 2017-07-31 14:23:12.984532+03 | 2017-08-09 16:50:01.459209+03 | { }

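For reference, a query along these lines (a sketch, using only the vm_device columns shown above) can be used to spot VMs that have accumulated duplicate mdev host devices:

engine=# SELECT vm_id, address, count(*)
           FROM vm_device
          WHERE type = 'hostdev' AND device = 'mdev'
          GROUP BY vm_id, address
         HAVING count(*) > 1;

On the rows shown above this would return address {uuid=b06ee19e-3368-39c8-9d6f-f88abfb29f8a} with a count of 5 (the full table reportedly held 15 such rows for this VM).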

Version-Release number of selected component (if applicable):
vdsm-4.20.2-60.git06231e5.el7.centos.x86_64 (upgraded from vdsm-4.20.2-33.gite85019b.el7.centos.x86_64)
libvirt-client-3.2.0-14.el7_4.2.x86_64
qemu-kvm-rhev-2.9.0-14.el7.x86_64
vdsm-hook-vfio-mdev-4.20.2-60.git06231e5.el7.centos.noarch (upgraded from vdsm-4.20.2-33.gite85019b.el7.centos.x86_64)
Nvidia drivers: GRIDSW_5.0 Beta Release KVM Drivers (R384)
engine kernel: kernel-3.10.0-693.el7.x86_64
Host kernel: kernel-3.10.0-693.1.1.el7.x86_64 (upgraded from kernel-3.10.0-693.el7.x86_64)

How reproducible:
100%

Steps to Reproduce:
1. Create a VM with the mdev_type hook (the custom-property setup is sketched after this list), install the GRID drivers on this VM and verify the Nvidia drivers are running properly on that VM's GPU.
2. Upgrade the RHV environment (engine and host).
3. Try to run the VM after the upgrade.
4. Remove the VM hook and try to run the VM.
5. Run a VM that did not have the mdev_type hook before the upgrade.
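
A sketch of the mdev_type custom-property setup assumed in step 1 (the regex value and the example type are illustrative; depending on the engine version, engine-config may require --cver and preserving any existing UserDefinedVMProperties value):

  # On the engine: allow the mdev_type custom property, then restart the engine
  engine-config -s UserDefinedVMProperties='mdev_type=^.*$'
  systemctl restart ovirt-engine

  # On the host: list the mdev types the GPU exposes and pick one (e.g. nvidia-22)
  ls /sys/class/mdev_bus/*/mdev_supported_types

The chosen type is then set as the mdev_type custom property on the VM (Edit VM -> Custom Properties) in webadmin.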

Actual results:
3-4. In both cases the VM fails to run.
5. The VM runs properly.

Expected results:
The VM should not fail to run.

Additional info:
engine.log (error at 2017-08-13 13:37:41,930+0300) and vdsm.log are attached; see also the attached screenshot showing many mdev devices with the same UUID.

Comment 1 Nisim Simsolo 2017-08-13 12:47:38 UTC
Created attachment 1312676 [details]
VM_devices screenshot

Comment 2 Nisim Simsolo 2017-08-13 12:48:03 UTC
Created attachment 1312677 [details]
vdsm.log

Comment 3 Nisim Simsolo 2017-08-13 12:48:29 UTC
Created attachment 1312678 [details]
engine.log

Comment 5 Arik 2017-08-23 15:32:31 UTC
Something is wrong with the vdsm log attached. I can't find messages related to the VM '7feae268-6669-4ac4-920f-7177a43d7acd'. Anyway, in this case, it would be best to look at the system while it happens. Can you please try to reproduce it and call me to have a look?

Comment 6 Nisim Simsolo 2017-08-24 08:31:23 UTC
I have an "old" VM with more than 1 mdev type device that failed to run. Please contact me when you can.

Comment 7 Arik 2017-08-29 08:40:14 UTC
Nisim, so it doesn't happen anymore on the master branch. Could you verify that it happens in 4.1?
Thanks.

Comment 8 Nisim Simsolo 2017-09-19 14:25:02 UTC
It happens (an mdev device with the same UUID is added to the VM devices) in two cases:
1. When removing mdev_type hook and adding another one.
2. After upgrading setup from 4.1.5-2 to 4.1.6-4

In both cases the VM runs properly with the Nvidia instance attached to the VM.

Comment 9 Tomas Jelinek 2017-09-25 08:59:55 UTC
Targeting this to 4.2, since in 4.1 the issue is only that the VM Devices subtab can show more devices than the VM actually has.

Comment 10 Nisim Simsolo 2017-10-03 14:15:53 UTC
Verification builds: 
ovirt-engine-4.2.0-0.0.master.20171002190603.git3015ada.el7.centos
libvirt-client-3.2.0-14.el7_4.3.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.8.x86_64
vdsm-4.20.3-128.git52f2c60.el7.centos.x86_64
vdsm-hook-vfio-mdev-4.20.3-128.git52f2c60.el7.centos
NVIDIA-Linux-x86_64-384.37-vgpu-kvm

Verification scenario:
1. Run a VM with the mdev_type hook.
2. Upgrade the setup.
3. Verify the VM is still running and that only 1 mdev device is listed under VM -> VM Devices (a host-side check is sketched after this list).
4. Power off and run the VM again.
5. Verify the VM is running properly with the Nvidia instance.
6. Import a VM with multiple mdev VM devices from the export domain (such a problematic VM was exported to the export domain when this bug was created).
7. Run the VM and verify it is running properly with the Nvidia instance. In webadmin, browse Virtual Machines -> select the imported VM -> VM Devices and verify that only 1 mdev device is now listed (the old duplicates are actually removed).
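
A supplementary host-side check (a sketch, assuming the hook instantiates the device through the kernel mdev framework; <VM-name> is a placeholder for the libvirt domain name of the running VM):

  # Mediated devices currently instantiated on the host; expect one per running vGPU VM
  ls /sys/bus/mdev/devices/

  # Count mdev hostdev entries in the running domain XML; expect exactly 1 for this VM
  virsh -r dumpxml <VM-name> | grep -c "type='mdev'"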

Comment 11 Sandro Bonazzola 2017-12-20 11:37:05 UTC
This bug is included in the oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be resolved in the oVirt 4.2.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.