Bug 1542117

Summary: Disk is down after migration of vm from 4.1 to 4.2
Product: [oVirt] vdsm
Reporter: Israel Pinto <ipinto>
Component: Core
Assignee: Francesco Romani <fromani>
Status: CLOSED CURRENTRELEASE
QA Contact: Israel Pinto <ipinto>
Severity: urgent
Docs Contact:
Priority: high
Version: 4.20.15
CC: ahadas, andrewclarkii, bugs, fromani, ipinto, klaas, linux, lveyde, michal.skrivanek, milan.zelenka, mtessun, ratamir
Target Milestone: ovirt-4.2.2
Keywords: Regression
Target Release: ---
Flags: rule-engine: ovirt-4.2+
       mtessun: blocker+
       mtessun: planning_ack+
       rule-engine: devel_ack+
       mavital: testing_ack+
Hardware: All
OS: All
Whiteboard:
Fixed In Version: vdsm v4.20.19
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-03-29 10:57:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1516660
Attachments (all flags: none):
failed_to_start_vm
source_host
destination_host
engine_log
36_engine_log
36_source_host
36_destination_host

Description Israel Pinto 2018-02-05 15:38:15 UTC
Description of problem:
While verifying BZ https://bugzilla.redhat.com/show_bug.cgi?id=1522878,
I migrated a VM from a 4.1 host to a 4.2 host.
The migration succeeded, but the VM disk is down.
If you stop the VM it will not start - no boot disk.

Version-Release number of selected component (if applicable):
Engine:4.2.1.3-0.1.el7
Host 4.1:
OS Version:RHEL - 7.4 - 18.el7
Kernel Version:3.10.0 - 693.17.1.el7.x86_64
KVM Version:2.9.0 - 16.el7_4.13.1
LIBVIRT Version:libvirt-3.2.0-14.el7_4.7
VDSM Version:vdsm-4.20.17-1.el7ev
Host 4.2:
OS Version:RHEL - 7.4 - 18.el7
Kernel Version:3.10.0 - 693.el7.x86_64
KVM Version:2.9.0 - 16.el7_4.14
LIBVIRT Version:libvirt-3.2.0-14.el7_4.9
VDSM Version:vdsm-4.19.45-1.el7ev

How reproducible:
100% 

Steps to Reproduce:
Migrate a VM with a disk from a 4.1 host to a 4.2 host

Actual results:
VM is up on the destination host, but the VM disk is down

Expected results:
VM is up on the destination host and the VM disk is up


Additional info:

Comment 1 Israel Pinto 2018-02-05 15:38:47 UTC
Created attachment 1391582 [details]
failed_to_start_vm

Comment 2 Israel Pinto 2018-02-05 15:42:22 UTC
Created attachment 1391583 [details]
source_host

Comment 3 Israel Pinto 2018-02-05 15:43:55 UTC
Created attachment 1391584 [details]
destination_host

Comment 4 Israel Pinto 2018-02-05 15:44:34 UTC
Created attachment 1391585 [details]
engine_log

Comment 5 Israel Pinto 2018-02-05 15:54:01 UTC
logs:
migration correlate-id: 9999a851-f40b-40a9-a283-b176f5f748ee
 
rose07_source_host/vdsm.log:2018-02-05 15:49:33,119+0200 INFO  (jsonrpc/1) [vdsm.api] START migrate(params={u'incomingLimit': 2, u'src': u'rose07.qa.lab.tlv.redhat.com', u'dstqemu': u'10.35.160.167', u'autoConverge': u'true', u'tunneled': u'false', u'enableGuestEvents': True, u'dst': u'puma43.scl.lab.tlv.redhat.com:54321', u'convergenceSchedule': {u'init': [{u'params': [u'100'], u'name': u'setDowntime'}], u'stalling': [{u'action': {u'params': [u'150'], u'name': u'setDowntime'}, u'limit': 1}, {u'action': {u'params': [u'200'], u'name': u'setDowntime'}, u'limit': 2}, {u'action': {u'params': [u'300'], u'name': u'setDowntime'}, u'limit': 3}, {u'action': {u'params': [u'400'], u'name': u'setDowntime'}, u'limit': 4}, {u'action': {u'params': [u'500'], u'name': u'setDowntime'}, u'limit': 6}, {u'action': {u'params': [], u'name': u'abort'}, u'limit': -1}]}, u'vmId': u'8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5', u'abortOnError': u'true', u'outgoingLimit': 2, u'compressed': u'false', u'maxBandwidth': 500, u'method': u'online', 'mode': 'remote'}) from=::ffff:10.35.161.176,47068, flow_id=9999a851-f40b-40a9-a283-b176f5f748ee (api:46)
rose07_source_host/vdsm.log:2018-02-05 15:49:33,120+0200 INFO  (jsonrpc/1) [vdsm.api] FINISH migrate return={'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0} from=::ffff:10.35.161.176,47068, flow_id=9999a851-f40b-40a9-a283-b176f5f748ee (api:52)
engine.log:2018-02-05 15:49:32,674+02 INFO  [org.ovirt.engine.core.bll.MigrateVmCommand] (default task-2) [9999a851-f40b-40a9-a283-b176f5f748ee] Lock Acquired to object 'EngineLock:{exclusiveLocks='[8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5=VM]', sharedLocks=''}'
engine.log:2018-02-05 15:49:32,943+02 INFO  [org.ovirt.engine.core.bll.MigrateVmCommand] (EE-ManagedThreadFactory-engine-Thread-163118) [9999a851-f40b-40a9-a283-b176f5f748ee] Running command: MigrateVmCommand internal: false. Entities affected :  ID: 8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5 Type: VMAction group MIGRATE_VM with role type USER
engine.log:2018-02-05 15:49:33,112+02 INFO  [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163118) [9999a851-f40b-40a9-a283-b176f5f748ee] START, MigrateVDSCommand( MigrateVDSCommandParameters:{hostId='44a279fe-164e-4219-898d-bf81be54f84d', vmId='8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5', srcHost='rose07.qa.lab.tlv.redhat.com', dstVdsId='3a711cc3-51eb-4a81-9b0d-7e9b90749808', dstHost='puma43.scl.lab.tlv.redhat.com:54321', migrationMethod='ONLINE', tunnelMigration='false', migrationDowntime='0', autoConverge='true', migrateCompressed='false', consoleAddress='null', maxBandwidth='500', enableGuestEvents='true', maxIncomingMigrations='2', maxOutgoingMigrations='2', convergenceSchedule='[init=[{name=setDowntime, params=[100]}], stalling=[{limit=1, action={name=setDowntime, params=[150]}}, {limit=2, action={name=setDowntime, params=[200]}}, {limit=3, action={name=setDowntime, params=[300]}}, {limit=4, action={name=setDowntime, params=[400]}}, {limit=6, action={name=setDowntime, params=[500]}}, {limit=-1, action={name=abort, params=[]}}]]', dstQemu='10.35.160.167'}), log id: 197a2e78
engine.log:2018-02-05 15:49:33,115+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163118) [9999a851-f40b-40a9-a283-b176f5f748ee] START, MigrateBrokerVDSCommand(HostName = rose_07, MigrateVDSCommandParameters:{hostId='44a279fe-164e-4219-898d-bf81be54f84d', vmId='8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5', srcHost='rose07.qa.lab.tlv.redhat.com', dstVdsId='3a711cc3-51eb-4a81-9b0d-7e9b90749808', dstHost='puma43.scl.lab.tlv.redhat.com:54321', migrationMethod='ONLINE', tunnelMigration='false', migrationDowntime='0', autoConverge='true', migrateCompressed='false', consoleAddress='null', maxBandwidth='500', enableGuestEvents='true', maxIncomingMigrations='2', maxOutgoingMigrations='2', convergenceSchedule='[init=[{name=setDowntime, params=[100]}], stalling=[{limit=1, action={name=setDowntime, params=[150]}}, {limit=2, action={name=setDowntime, params=[200]}}, {limit=3, action={name=setDowntime, params=[300]}}, {limit=4, action={name=setDowntime, params=[400]}}, {limit=6, action={name=setDowntime, params=[500]}}, {limit=-1, action={name=abort, params=[]}}]]', dstQemu='10.35.160.167'}), log id: 5366872b
engine.log:2018-02-05 15:49:33,120+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163118) [9999a851-f40b-40a9-a283-b176f5f748ee] FINISH, MigrateBrokerVDSCommand, log id: 5366872b
engine.log:2018-02-05 15:49:33,127+02 INFO  [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163118) [9999a851-f40b-40a9-a283-b176f5f748ee] FINISH, MigrateVDSCommand, return: MigratingFrom, log id: 197a2e78
engine.log:2018-02-05 15:49:33,165+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-163118) [9999a851-f40b-40a9-a283-b176f5f748ee] EVENT_ID: VM_MIGRATION_START(62), Migration started (VM: test_migration_bz, Source: rose_07, Destination: host_mixed_2, User: admin@internal-authz).

Comment 6 Michal Skrivanek 2018-02-06 10:55:49 UTC
Arik, those several messages starting with

2018-02-05 15:49:45,377+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmDevicesMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-65) [] VM '8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5' managed non pluggable device was removed unexpectedly from libvirt: 'VmDevice:{id='VmDeviceId:{deviceId='08ace43b-f172-4c99-8051-3b69389b7a36', vmId='8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5'}', device='virtio', type='RNG', specParams='[source=urandom]', address='', managed='true', plugged='false', readOnly='false', deviceAlias='rng0', customProperties='[]', snapshotId='null', logicalName='null', hostDevice=''}'

are not nice. Why do we still have those?

Comment 7 Michal Skrivanek 2018-02-06 11:06:16 UTC
Israel,
1) what do you mean by "vm disk is down" ?

2) I see that after the migration and VM shutdown you started it at
2018-02-05 15:53:58,955+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-8) [fdb012e0-833a-40ea-82a5-e07e98e97fb3] EVENT_ID: USER_STARTED_VM(153), VM test_migration_bz was started by admin@internal-authz (Host: rose_07).

and it started up just fine. I do not see any failure in engine.log corresponding to your attached screenshot.

Please clarify

Comment 8 Michal Skrivanek 2018-02-06 11:08:00 UTC
Arik, also please comment on the seemingly bogus

2018-02-05 15:56:00,768+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [] VM '8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5'(test_migration_bz) was unexpectedly detected as 'MigratingTo' on VDS '3a711cc3-51eb-4a81-9b0d-7e9b90749808'(host_mixed_2) (expected on '44a279fe-164e-4219-898d-bf81be54f84d')

at the very end of the engine.log. We shouldn't print that when we migrate.

Comment 9 Israel Pinto 2018-02-06 11:37:15 UTC
(In reply to Michal Skrivanek from comment #7)
> Israel,
> 1) what do you mean by "vm disk is down" ?
> 
> 2) I see that after the migration and VM shutdown you started it at
> 2018-02-05 15:53:58,955+02 INFO 
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (default task-8) [fdb012e0-833a-40ea-82a5-
> e07e98e97fb3] EVENT_ID: USER_STARTED_VM(153), VM test_migration_bz was
> started by admin@internal-authz (Host: rose_07).
> 
> and it started up just fine. I do not see any failure in engine.log
> corresponding to your attached screenshot.
> 
> Please clarify

Yes, I activated the disk and then the VM started; see the screenshot with the "not bootable disk" message.
The VM disk was not active after the migration completed.
I also did not find any hint in the logs.
Francesco Romani told me there was a BZ about starting a VM where the disk is not active (down), but I did not manage to find it; maybe it is the same problem.

Comment 10 Michal Skrivanek 2018-02-06 12:07:33 UTC
Please do not remove other needinfos

Comment 11 Israel Pinto 2018-02-06 12:24:27 UTC
(In reply to Michal Skrivanek from comment #10)
> Please do not remove other needinfos

I did not remove them; maybe it's a Bugzilla issue.

Comment 12 Arik 2018-02-06 13:41:54 UTC
(In reply to Michal Skrivanek from comment #6)
> Arik, those several messages starting with
> 
> 2018-02-05 15:49:45,377+02 ERROR
> [org.ovirt.engine.core.vdsbroker.monitoring.VmDevicesMonitoring]
> (EE-ManagedThreadFactory-engineScheduled-Thread-
> 65) [] VM '8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5' managed non pluggable
> device was removed unexpectedly from libvirt: 'VmDevice:{id='VmDeviceId:{de
> viceId='08ace43b-f172-4c99-8051-3b69389b7a36',
> vmId='8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5'}', device='virtio', type='RNG',
> specParams='[source=ura
> ndom]', address='', managed='true', plugged='false', readOnly='false',
> deviceAlias='rng0', customProperties='[]', snapshotId='null', logicalName='
> null', hostDevice=''}'
> 
> are not nice. Why do we still have those?

Well, I agree they are not that nice but they serve their purpose as an indicator of possible issues. In this case, I wonder why the RNG device disappeared. Maybe we should update the XML we receive in the devices monitoring for now because of the recent changes that increase the probability of issues.

Comment 13 Arik 2018-02-06 13:42:31 UTC
> Well, I agree they are not that nice but they serve their purpose as an
> indicator of possible issues. In this case, I wonder why the RNG device
> disappeared. Maybe we should update the XML we receive in the devices

Maybe we should log*

> monitoring for now because of the recent changes that increase the
> probability of issues.

Comment 14 Arik 2018-02-06 13:54:01 UTC
(In reply to Michal Skrivanek from comment #8)
> at the very end of the engine.log. We shouldn't print that when we migrate.

We can skip that log in this particular case, but that would complicate the code.
I would rather keep it that way and ask those who read the log to think of it as: "the engine just got a report on this VM from a host that is different from the one the VM is supposed to run on" (which is a correct and general message), with the explanation given right afterwards: "but that's ok since the VM is migrating to that host" (VM .. is migrating to VDSM ... ignoring it in the refresh until migration is done).

Comment 15 Michal Skrivanek 2018-02-07 08:08:32 UTC
(In reply to Arik from comment #14)
> (In reply to Michal Skrivanek from comment #8)
> > at the very end of the engine.log. We shouldn't print that when we migrate.
> 
> We can skip that log in this particular case but that would complicate the
> code.
> I would rather keep it that way and ask those that read the log to think of
> it as: "the engine just got a report on this VM from a host that is
> different than the one the VM is supposed to run on" (which is a correct and
> general message) and then an explanation is given right afterwards "but
> that's ok since the VM is migrating to that host" (VM .. is migrating to
> VDSM ... ignoring it in the refresh until migration is done).

I would prefer the code to handle that. Logs really need to be concise. We need to get rid of all the noise and misleading junk.

Comment 16 Michal Skrivanek 2018-02-07 08:09:12 UTC
(In reply to Israel Pinto from comment #11)
> (In reply to Michal Skrivanek from comment #10)
> > Please do not remove other needinfos
> 
> I did not remove them, maybe Bugzilla issue

It's not a Bugzilla issue, it's your update ;-) You need to be careful when there are multiple needinfos.

Comment 17 Francesco Romani 2018-02-07 09:06:16 UTC
We discovered the cause of the disappearing devices: the backward compatibility of Vdsm 4.2 running in 4.1 clusters as a 4.1 host is not complete:
1. the deviceId WAS lost during migration - should be fixed by http://gerrit.ovirt.org/87213
2. the deviceId is NOT stored anywhere, so it is going to be lost if Vdsm is restarted -> needs another fix
3. not strictly needed for this BZ, but extremely helpful to reduce the chance of future bugs: we need to clean up the flows here and clearly distinguish the backward-compatible flows.

Comment 18 Francesco Romani 2018-02-12 07:55:35 UTC
We merged to master all the patches which fix the bug. The other attached patches are proactive improvements to the backward compatibility.

Comment 19 Francesco Romani 2018-02-12 09:33:33 UTC
actually a Vdsm bug, not Engine.

Comment 20 Israel Pinto 2018-02-12 09:41:00 UTC
Created attachment 1394824 [details]
36_engine_log

Comment 21 Israel Pinto 2018-02-12 09:41:38 UTC
Created attachment 1394825 [details]
36_source_host

Comment 22 Israel Pinto 2018-02-12 09:42:06 UTC
Created attachment 1394826 [details]
36_destination_host

Comment 23 Francesco Romani 2018-02-12 09:45:00 UTC
(In reply to Israel Pinto from comment #21)
> Created attachment 1394825 [details]
> 36_source_host

The destination side is 4.20.17, so it is the same bug, fixed by https://gerrit.ovirt.org/#/c/87250/.

The next 4.20.z should contain all the needed fixes.

Comment 24 Israel Pinto 2018-02-12 09:49:57 UTC
The problem also happens with a 3.6 engine
Engine Version: 3.6.12.3-0.1.el6
Destination host:
OS Version:RHEL - 7.4 - 18.el7
Kernel Version:3.10.0 - 693.17.1.el7.x86_64
KVM Version:2.9.0 - 16.el7_4.14
LIBVIRT Version:libvirt-3.2.0-14.el7_4.9
VDSM Version:vdsm-4.20.17-1.el7ev

Source host:
OS Version:RHEL - 7.4 - 18.el7
Kernel Version:3.10.0 - 693.17.1.el7.x86_64
KVM Version:2.6.0 - 28.el7_3.15
LIBVIRT Version:libvirt-3.2.0-14.el7_4.9
VDSM Version:vdsm-4.17.43-1.el7ev


See the attached logs

Comment 25 Francesco Romani 2018-02-14 09:49:08 UTC
no doc_text needed, should Just Work

Comment 26 Francesco Romani 2018-02-14 11:20:55 UTC
all patches merged to master, backport in progress.

Comment 27 Francesco Romani 2018-02-14 12:14:02 UTC
Nope, still POST: first round of patches was merged on 4.2 branch, but another (and last) round is due, and this will likely miss the 4.20.18 tag, but should totally make it in time for 4.20.19 - thus 4.2.2 GA

Comment 28 Israel Pinto 2018-02-22 14:41:13 UTC
Verified with:
Engine version:4.2.2.1-0.1.el7
Host 4.2:
OS Version:RHEL - 7.5 - 6.el7
Kernel Version:3.10.0 - 855.el7.x86_64
KVM Version:2.9.0 - 16.el7_4.13.1
LIBVIRT Version:libvirt-3.9.0-13.el7
VDSM Version:vdsm-4.20.19-1.el7ev
Host 4.1:
OS Version:RHEL - 7.5 - 6.el7
Kernel Version:3.10.0 - 851.el7.x86_64
KVM Version:2.10.0 - 21.el7
LIBVIRT Version:libvirt-3.9.0-13.el7
VDSM Version:vdsm-4.19.46-1.el7ev

Steps:
1. Start VM on 4.1 host
2. Migrate VM to 4.2 Host

VM with snapshot
VM with RNG (urandom)
VM with RNG (hwrng) 
VM with hotplug memory and CPU 
VM with spice + 4 monitors
VM with VNC
VM in paused state
Headless VM 
VM with Direct LUN 
VM with disk ISCSI 

All passed

Comment 29 Andy Clark 2018-03-05 12:53:32 UTC
It looks like our case. My colleagues have found this workaround:
1. Connect to the engine DB:

-bash-4.2$ psql 
psql (9.2.23, server 9.5.9)
WARNING: psql version 9.2, server version 9.5.
         Some psql features might not work.
Type "help" for help.

postgres=# \l
                                             List of databases
         Name         |        Owner         | Encoding |   Collate   |    Ctype    |   Access privileges   
----------------------+----------------------+----------+-------------+-------------+-----------------------
 engine               | engine               | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 ovirt_engine_history | ovirt_engine_history | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 ovirt_engine_reports | ovirt_engine_reports | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 postgres             | postgres             | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 template0            | postgres             | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
                      |                      |          |             |             | postgres=CTc/postgres
 template1            | postgres             | UTF8     | en_US.UTF-8 | en_US.UTF-8 | postgres=CTc/postgres+
                      |                      |          |             |             | =c/postgres
(6 rows)

postgres=# \connect engine 
psql (9.2.23, server 9.5.9)
WARNING: psql version 9.2, server version 9.5.
         Some psql features might not work.
You are now connected to database "engine" as user "postgres".
engine=#

2. Now query the information about unmanaged devices from the vm_device table:

engine=# select * from vm_device where type in ('disk', 'video', 'balloon', 'interface') and vm_id = 'ef79dedb-8405-4172-9d1d-1841d230f9fd' and is_managed = 'f';

3. Shut down your VM from the oVirt web interface.
4. Delete all unmanaged devices from the vm_device table:
engine=# delete from vm_device where type in ('disk', 'video', 'balloon', 'interface') and vm_id = 'ef79dedb-8405-4172-9d1d-1841d230f9fd' and is_managed = 'f';
5. Start your VM from the oVirt web interface. It should boot fine now.
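
A slightly safer way to apply steps 2 and 4 is to run them inside one transaction, so the deletion can be rolled back if the inspection does not show the expected rows. This is only a sketch built from the statements above (the explicit column names are assumed to match what the "select *" in step 2 returns); replace the vm_id with the UUID of your affected VM:

engine=# BEGIN;
engine=# -- step 2: inspect the unmanaged device records first
engine=# select device_id, type, device, is_managed from vm_device where type in ('disk', 'video', 'balloon', 'interface') and vm_id = 'ef79dedb-8405-4172-9d1d-1841d230f9fd' and is_managed = 'f';
engine=# -- step 4: if those rows are the duplicated unmanaged leftovers, remove them
engine=# delete from vm_device where type in ('disk', 'video', 'balloon', 'interface') and vm_id = 'ef79dedb-8405-4172-9d1d-1841d230f9fd' and is_managed = 'f';
engine=# -- COMMIT only if the DELETE row count matches the SELECT above; otherwise ROLLBACK
engine=# COMMIT;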

Comment 30 Arik 2018-03-05 16:43:25 UTC
(In reply to Andy Clark from comment #29)
I doubt those steps would fix this bug:
1. disk/video/balloon/interfaces devices are never created as unmanaged devices
2. without attaching the unplugged disks to the VM, the VM is not supposed to boot (unless you configured another bootable device like a network interface) and even if it boots, it won't have its disks.

Comment 31 Andy Clark 2018-03-06 05:50:04 UTC
> I doubt those steps would fix this bug:

They should not. It is just a workaround. The fix should be in the code, I presume.

> 1. disk/video/balloon/interfaces devices are never created as unmanaged devices

It is true, but we are not talking about how they were created, but about what happened to them after migration.

> 2. without attaching the unplugged disks to the VM, the VM is not supposed to boot (unless you configured another bootable device like a network interface) and even if it boots, it won't have its disks.

It is also true, but we do not actually detach them. It is about the records in the DB, which are duplicated and remain in an unmanaged state.

Comment 32 Arik 2018-03-06 07:43:05 UTC
(In reply to Andy Clark from comment #31)
> > 1. disk/video/balloon/interfaces devices are never created as unmanaged devices
> 
> It is true, but we not talking about how they were created, but about what
> happened with them after migration.

Right, I meant that on the engine side those devices are never supposed to be created as unmanaged devices - whether upon starting the VM or migrating it. For instance, when VDSM reports a disk that cannot be correlated with one of the devices the engine knows about, this device is not added as an unmanaged device but rather ignored [1].

> 
> > 2. without attaching the unplugged disks to the VM, the VM is not supposed to boot (unless you configured another bootable device like a network interface) and even if it boots, it won't have its disks.
> 
> It is also true, but we do not actually detach them. It is about records in
> db, witch is duplicated and stayed in unmanaged state.

That's not accurate - it is true that the disk is not actually unplugged from the running VM, but on the engine side the disk is actually detached from the VM. We hold a relationship between each disk and the VM(s) that use it - when the disk's device is shown as unplugged, that relation has been updated so that the disk is detached from the VM and therefore won't be part of the 'hardware' of the VM the next time it is started (unless you activate it).

[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/libvirt/VmDevicesConverter.java#L432
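
For reference, the plugged/unplugged state described above can be inspected read-only on the engine DB side. This is a hedged sketch: it assumes the vm_device table also exposes an is_plugged column next to the is_managed column used in the workaround from comment 29; verify the column names against your engine schema before relying on it.

engine=# -- read-only: show whether the engine considers each disk device of the VM plugged (active) or unplugged (deactivated)
engine=# select device_id, device, is_managed, is_plugged from vm_device where vm_id = '8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5' and type = 'disk';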

Comment 33 Sandro Bonazzola 2018-03-29 10:57:19 UTC
This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 34 Chris Adams 2018-04-16 17:31:34 UTC
I just updated a 4.2.1 cluster with this issue to 4.2.2, and the VMs from before 4.2 are still in the same state - network unplugged and disk deactivated.

Comment 35 Israel Pinto 2018-04-17 06:31:31 UTC
(In reply to Chris Adams from comment #34)
> I just updated a 4.2.1 cluster with this issue to 4.2.2, and the VMs from
> before 4.2 are still in the same state - network unplugged and disk
> deactivated.

Hi Chris,
Can you share logs?

Francesco, should we file a new BZ?

Comment 36 Francesco Romani 2018-04-17 06:32:57 UTC
(In reply to Israel Pinto from comment #35)
> (In reply to Chris Adams from comment #34)
> > I just updated a 4.2.1 cluster with this issue to 4.2.2, and the VMs from
> > before 4.2 are still in the same state - network unplugged and disk
> > deactivated.
> 
> Hi Chris,
> Can you share logs.
> 
> Francesco, should we file new BZ?

yes please - but with the logs

Comment 37 Israel Pinto 2018-04-17 06:37:08 UTC
(In reply to Francesco Romani from comment #36)
> (In reply to Israel Pinto from comment #35)
> > (In reply to Chris Adams from comment #34)
> > > I just updated a 4.2.1 cluster with this issue to 4.2.2, and the VMs from
> > > before 4.2 are still in the same state - network unplugged and disk
> > > deactivated.
> > 
> > Hi Chris,
> > Can you share logs.
> > 
> > Francesco, should we file new BZ?
> 
> yes please - but with the logs

Hi Chris,
Can you share logs?

Comment 38 Chris Adams 2018-04-17 12:42:13 UTC
Which logs specifically would you like? Do you want them on this bug, or should I go ahead and create a new one?

Comment 39 Francesco Romani 2018-04-17 13:07:59 UTC
(In reply to Chris Adams from comment #38)
> Which logs specifically would you like? Do you want them on this bug, or
> should I go ahead and create a new one?

Please file a new bug. I'm afraid the VMs will need manual fixing, as outlined by Arik previously (e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1542117#c32)

AFAIK there is no way to automatically fix the issue. The good news is that nothing should be lost, it's just that Engine lost track of which device belonged to which VM.

Let's discuss any needed logs in the new BZ; depending on the specific problem reported, it is possible we will need no new logs at all.

Comment 40 Chris Adams 2018-04-17 13:17:24 UTC
Before I open a new bug - I guess if the issue is not expected to be automatically fixed, then I misunderstood. My dev system was upgraded from 4.1 to 4.2.0, 4.2.1, and then 4.2.2, and I haven't rebooted all the VMs. Is the fix supposed to handle future 4.1.x->4.2.2 (or higher) upgrades correctly, or am I going to have to manually fix each VM?

Comment 41 Francesco Romani 2018-04-17 13:53:37 UTC
(In reply to Chris Adams from comment #40)
> Before I open a new bug - I guess if the issue is not expected to be
> automatically fixed, then I misunderstood. My dev system was upgraded from
> 4.1 to 4.2.0, 4.2.1, and then 4.2.2, and I haven't rebooted all the VMs. Is
> the fix supposed to handle future 4.1.x->4.2.2 (or higher) upgrades
> correctly, or am I going to have to manually fix each VM?

Vdsm is the component that caused the devices to be disassociated from their VMs (it failed to report some key info to Engine). Unfortunately Vdsm cannot automatically fix this. So either Engine can, or people will need to fix this manually - or create some script to do that. Arik, any insight?

Comment 42 Arik 2018-04-23 11:56:54 UTC
(In reply to Francesco Romani from comment #41)
> Vdsm is the component that caused the device to be disassociated from their
> VMs (it failed to report some key info to Engine). Unfortunately Vdsm cannot
> automatically fix this. So either Engine can, or people will need to fix
> this manually - or create some script to do that. Arik, any insight?

I agree with the above - at the moment, VMs whose devices were correlated with the third mechanism but had incorrect UUIDs need to be fixed manually or with a script (relatively simple script btw).

We currently have 3 mechanisms for correlating the devices reported by VDSM with those that appear in the database:

1. By user-aliases - that's the best approach but it only applies to VMs that were started in recent versions of oVirt and recent versions of libvirt.

2. By device properties - that's what we use when user-aliases are not available in 4.2 clusters (e.g., on Centos 7.4).

3. By device UUIDs - that is intended for cluster levels lower than 4.2. It assumes VDSM reports devices with the UUIDs that were assigned by the engine.

We have to use the third mechanism for VDSM <= 4.1 (which doesn't support dumpxmls).
However, it may be possible to improve the second mechanism and then, on cluster versions < 4.2, try dumpxmls and use it when it is supported. It may be possible to fix 'corrupted' VMs that way.
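
As a starting point for such a script, a read-only query modeled on the workaround in comment 29 can enumerate the VMs that still carry unmanaged device rows of the affected types, so an operator can review them and then apply the per-VM cleanup after shutting each VM down. This is an illustrative sketch only, not the script referred to above, and it modifies nothing:

engine=# -- list VMs that still have unmanaged disk/video/balloon/interface device rows, grouped per VM and type
engine=# select vm_id, type, count(*) as unmanaged_rows from vm_device where type in ('disk', 'video', 'balloon', 'interface') and is_managed = 'f' group by vm_id, type order by vm_id, type;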