Bug 1542117
Summary: Disk is down after migration of vm from 4.1 to 4.2
Product: [oVirt] vdsm
Component: Core
Version: 4.20.15
Hardware: All
OS: All
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: high
Reporter: Israel Pinto <ipinto>
Assignee: Francesco Romani <fromani>
QA Contact: Israel Pinto <ipinto>
CC: ahadas, andrewclarkii, bugs, fromani, ipinto, klaas, linux, lveyde, michal.skrivanek, milan.zelenka, mtessun, ratamir
Target Milestone: ovirt-4.2.2
Target Release: ---
Keywords: Regression
Flags: rule-engine: ovirt-4.2+, mtessun: blocker+, mtessun: planning_ack+, rule-engine: devel_ack+, mavital: testing_ack+
Fixed In Version: vdsm v4.20.19
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2018-03-29 10:57:19 UTC
Type: Bug
Regression: ---
oVirt Team: Virt
Bug Blocks: 1516660
Description
Israel Pinto
2018-02-05 15:38:15 UTC
Created attachment 1391582 [details]: failed_to_start_vm
Created attachment 1391583 [details]: source_host
Created attachment 1391584 [details]: destination_host
Created attachment 1391585 [details]: engine_log
logs: migration correlate-id: 9999a851-f40b-40a9-a283-b176f5f748ee

rose07_source_host/vdsm.log:2018-02-05 15:49:33,119+0200 INFO (jsonrpc/1) [vdsm.api] START migrate(params={u'incomingLimit': 2, u'src': u'rose07.qa.lab.tlv.redhat.com', u'dstqemu': u'10.35.160.167', u'autoConverge': u'true', u'tunneled': u'false', u'enableGuestEvents': True, u'dst': u'puma43.scl.lab.tlv.redhat.com:54321', u'convergenceSchedule': {u'init': [{u'params': [u'100'], u'name': u'setDowntime'}], u'stalling': [{u'action': {u'params': [u'150'], u'name': u'setDowntime'}, u'limit': 1}, {u'action': {u'params': [u'200'], u'name': u'setDowntime'}, u'limit': 2}, {u'action': {u'params': [u'300'], u'name': u'setDowntime'}, u'limit': 3}, {u'action': {u'params': [u'400'], u'name': u'setDowntime'}, u'limit': 4}, {u'action': {u'params': [u'500'], u'name': u'setDowntime'}, u'limit': 6}, {u'action': {u'params': [], u'name': u'abort'}, u'limit': -1}]}, u'vmId': u'8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5', u'abortOnError': u'true', u'outgoingLimit': 2, u'compressed': u'false', u'maxBandwidth': 500, u'method': u'online', 'mode': 'remote'}) from=::ffff:10.35.161.176,47068, flow_id=9999a851-f40b-40a9-a283-b176f5f748ee (api:46)

rose07_source_host/vdsm.log:2018-02-05 15:49:33,120+0200 INFO (jsonrpc/1) [vdsm.api] FINISH migrate return={'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0} from=::ffff:10.35.161.176,47068, flow_id=9999a851-f40b-40a9-a283-b176f5f748ee (api:52)

engine.log:2018-02-05 15:49:32,674+02 INFO [org.ovirt.engine.core.bll.MigrateVmCommand] (default task-2) [9999a851-f40b-40a9-a283-b176f5f748ee] Lock Acquired to object 'EngineLock:{exclusiveLocks='[8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5=VM]', sharedLocks=''}'

engine.log:2018-02-05 15:49:32,943+02 INFO [org.ovirt.engine.core.bll.MigrateVmCommand] (EE-ManagedThreadFactory-engine-Thread-163118) [9999a851-f40b-40a9-a283-b176f5f748ee] Running command: MigrateVmCommand internal: false. Entities affected : ID: 8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5 Type: VMAction group MIGRATE_VM with role type USER

engine.log:2018-02-05 15:49:33,112+02 INFO [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163118) [9999a851-f40b-40a9-a283-b176f5f748ee] START, MigrateVDSCommand( MigrateVDSCommandParameters:{hostId='44a279fe-164e-4219-898d-bf81be54f84d', vmId='8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5', srcHost='rose07.qa.lab.tlv.redhat.com', dstVdsId='3a711cc3-51eb-4a81-9b0d-7e9b90749808', dstHost='puma43.scl.lab.tlv.redhat.com:54321', migrationMethod='ONLINE', tunnelMigration='false', migrationDowntime='0', autoConverge='true', migrateCompressed='false', consoleAddress='null', maxBandwidth='500', enableGuestEvents='true', maxIncomingMigrations='2', maxOutgoingMigrations='2', convergenceSchedule='[init=[{name=setDowntime, params=[100]}], stalling=[{limit=1, action={name=setDowntime, params=[150]}}, {limit=2, action={name=setDowntime, params=[200]}}, {limit=3, action={name=setDowntime, params=[300]}}, {limit=4, action={name=setDowntime, params=[400]}}, {limit=6, action={name=setDowntime, params=[500]}}, {limit=-1, action={name=abort, params=[]}}]]', dstQemu='10.35.160.167'}), log id: 197a2e78

engine.log:2018-02-05 15:49:33,115+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163118) [9999a851-f40b-40a9-a283-b176f5f748ee] START, MigrateBrokerVDSCommand(HostName = rose_07, MigrateVDSCommandParameters:{hostId='44a279fe-164e-4219-898d-bf81be54f84d', vmId='8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5', srcHost='rose07.qa.lab.tlv.redhat.com', dstVdsId='3a711cc3-51eb-4a81-9b0d-7e9b90749808', dstHost='puma43.scl.lab.tlv.redhat.com:54321', migrationMethod='ONLINE', tunnelMigration='false', migrationDowntime='0', autoConverge='true', migrateCompressed='false', consoleAddress='null', maxBandwidth='500', enableGuestEvents='true', maxIncomingMigrations='2', maxOutgoingMigrations='2', convergenceSchedule='[init=[{name=setDowntime, params=[100]}], stalling=[{limit=1, action={name=setDowntime, params=[150]}}, {limit=2, action={name=setDowntime, params=[200]}}, {limit=3, action={name=setDowntime, params=[300]}}, {limit=4, action={name=setDowntime, params=[400]}}, {limit=6, action={name=setDowntime, params=[500]}}, {limit=-1, action={name=abort, params=[]}}]]', dstQemu='10.35.160.167'}), log id: 5366872b

engine.log:2018-02-05 15:49:33,120+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163118) [9999a851-f40b-40a9-a283-b176f5f748ee] FINISH, MigrateBrokerVDSCommand, log id: 5366872b

engine.log:2018-02-05 15:49:33,127+02 INFO [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163118) [9999a851-f40b-40a9-a283-b176f5f748ee] FINISH, MigrateVDSCommand, return: MigratingFrom, log id: 197a2e78

engine.log:2018-02-05 15:49:33,165+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-163118) [9999a851-f40b-40a9-a283-b176f5f748ee] EVENT_ID: VM_MIGRATION_START(62), Migration started (VM: test_migration_bz, Source: rose_07, Destination: host_mixed_2, User: admin@internal-authz).

Michal Skrivanek (comment #6):
Arik, those several messages starting with

2018-02-05 15:49:45,377+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmDevicesMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-65) [] VM '8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5' managed non pluggable device was removed unexpectedly from libvirt: 'VmDevice:{id='VmDeviceId:{deviceId='08ace43b-f172-4c99-8051-3b69389b7a36', vmId='8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5'}', device='virtio', type='RNG', specParams='[source=urandom]', address='', managed='true', plugged='false', readOnly='false', deviceAlias='rng0', customProperties='[]', snapshotId='null', logicalName='null', hostDevice=''}'

are not nice. Why do we still have those?
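As an aside on the migrate parameters in the logs above: the convergenceSchedule is a list of stalling limits, each mapping to a progressively larger setDowntime, with limit -1 meaning abort. A minimal sketch of how such a schedule can be read follows; the helper function is invented for illustration and is not vdsm code.

```python
# The schedule below is transcribed from the START migrate log entry above.
schedule = {
    "init": [{"name": "setDowntime", "params": ["100"]}],
    "stalling": [
        {"limit": 1, "action": {"name": "setDowntime", "params": ["150"]}},
        {"limit": 2, "action": {"name": "setDowntime", "params": ["200"]}},
        {"limit": 3, "action": {"name": "setDowntime", "params": ["300"]}},
        {"limit": 4, "action": {"name": "setDowntime", "params": ["400"]}},
        {"limit": 6, "action": {"name": "setDowntime", "params": ["500"]}},
        {"limit": -1, "action": {"name": "abort", "params": []}},
    ],
}

def action_for_stall_count(schedule, stalls):
    """Pick the action for a given number of stalled migration iterations.

    Hypothetical helper: walks the stalling steps in order and returns the
    first one whose limit covers the stall count; the limit=-1 entry acts
    as the catch-all abort.
    """
    for step in schedule["stalling"]:
        if step["limit"] == -1 or stalls <= step["limit"]:
            return step["action"]
    return {"name": "abort", "params": []}
```

Read this way, after two stalled iterations the schedule raises the allowed downtime to 200 ms, and past six stalls it aborts the migration.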
Michal Skrivanek (comment #7):
Israel,
1) what do you mean by "vm disk is down"?
2) I see that after the migration and VM shutdown you started it at

2018-02-05 15:53:58,955+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-8) [fdb012e0-833a-40ea-82a5-e07e98e97fb3] EVENT_ID: USER_STARTED_VM(153), VM test_migration_bz was started by admin@internal-authz (Host: rose_07).

and it started up just fine. I do not see any failure in engine.log corresponding to your attached screenshot. Please clarify.

Michal Skrivanek (comment #8):
Arik, also please comment on the seemingly bogus

2018-02-05 15:56:00,768+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [] VM '8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5'(test_migration_bz) was unexpectedly detected as 'MigratingTo' on VDS '3a711cc3-51eb-4a81-9b0d-7e9b90749808'(host_mixed_2) (expected on '44a279fe-164e-4219-898d-bf81be54f84d')

at the very end of the engine.log. We shouldn't print that when we migrate.

(In reply to Michal Skrivanek from comment #7)
> Israel,
> 1) what do you mean by "vm disk is down"?
>
> 2) I see that after the migration and VM shutdown you started it, and it
> started up just fine. I do not see any failure in engine.log corresponding
> to your attached screenshot.
>
> Please clarify

Yes, I activated the disk and then the VM started; see the screenshot with the "not bootable disk" message. The VM's disk was not active after the migration finished. I also did not find any hint in the logs. Francesco Romani told me there was a BZ about starting a VM with the disk not active (down), but I did not manage to find it; maybe it is the same problem.
Please do not remove other needinfos

(In reply to Michal Skrivanek from comment #10)
> Please do not remove other needinfos

I did not remove them; maybe a Bugzilla issue.

(In reply to Michal Skrivanek from comment #6)
> Arik, those several messages starting with
>
> 2018-02-05 15:49:45,377+02 ERROR
> [org.ovirt.engine.core.vdsbroker.monitoring.VmDevicesMonitoring]
> (EE-ManagedThreadFactory-engineScheduled-Thread-65) [] VM
> '8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5' managed non pluggable device was
> removed unexpectedly from libvirt: 'VmDevice:{id='VmDeviceId:{deviceId='08ace43b-f172-4c99-8051-3b69389b7a36',
> vmId='8fd0d8e4-44b6-40a7-8ee6-a24d5dda1ca5'}', device='virtio', type='RNG',
> specParams='[source=urandom]', address='', managed='true', plugged='false',
> readOnly='false', deviceAlias='rng0', customProperties='[]',
> snapshotId='null', logicalName='null', hostDevice=''}'
>
> are not nice. Why do we still have those?

Well, I agree they are not that nice, but they serve their purpose as an indicator of possible issues. In this case, I wonder why the RNG device disappeared. Maybe we should update the XML we receive in the devices monitoring for now because of the recent changes that increase the probability of issues.

> Well, I agree they are not that nice but they serve their purpose as an
> indicator of possible issues. In this case, I wonder why the RNG device
> disappeared. Maybe we should update the XML we receive in the devices

Maybe we should log*

> monitoring for now because of the recent changes that increase the
> probability of issues.

(In reply to Michal Skrivanek from comment #8)
> at the very end of the engine.log. We shouldn't print that when we migrate.

We can skip that log in this particular case, but that would complicate the code.
I would rather keep it that way and ask those who read the log to think of it as: "the engine just got a report on this VM from a host different from the one the VM is supposed to run on" (which is a correct and general message), with an explanation given right afterwards: "but that's ok since the VM is migrating to that host" (VM .. is migrating to VDSM ... ignoring it in the refresh until migration is done).

(In reply to Arik from comment #14)
> (In reply to Michal Skrivanek from comment #8)
> > at the very end of the engine.log. We shouldn't print that when we migrate.
>
> We can skip that log in this particular case but that would complicate the
> code. I would rather keep it that way and ask those that read the log to
> think of it as: "the engine just got a report on this VM from a host that is
> different than the one the VM is supposed to run on" (which is a correct and
> general message) and then an explanation is given right afterwards "but
> that's ok since the VM is migrating to that host" (VM .. is migrating to
> VDSM ... ignoring it in the refresh until migration is done).

I would prefer the code to handle that. Logs really need to be concise. We need to get rid of all the noise and misleading junk.

(In reply to Israel Pinto from comment #11)
> (In reply to Michal Skrivanek from comment #10)
> > Please do not remove other needinfos
>
> I did not remove them, maybe Bugzilla issue

It's not a Bugzilla issue, it's your update ;-) You need to be careful when there are multiple needinfos.

We discovered the cause of the disappearing devices: the backward compatibility of Vdsm 4.2 running in 4.1 clusters as a 4.1 host is not complete:
1. the deviceId WAS lost during migration - should be fixed by http://gerrit.ovirt.org/87213
2. the deviceId is NOT stored anywhere, so it is going to be lost if Vdsm is restarted -> needs another fix
3.
not strictly needed by this BZ but extremely helpful to reduce the chance of future bugs: we need to clean up the flows here and clearly distinguish the backward-compatible flows.

We merged all the patches in master which fix the bug. The other patches attached are proactive, to improve the backward compatibility.

Actually a Vdsm bug, not Engine.

Created attachment 1394824 [details]
36_engine_log

Created attachment 1394825 [details]: 36_source_host
Created attachment 1394826 [details]: 36_destination_host
(In reply to Israel Pinto from comment #21)
> Created attachment 1394825 [details]
> 36_source_host

The destination side is 4.20.17, so it is the same bug, fixed by https://gerrit.ovirt.org/#/c/87250/. The next 4.20.z should contain all the needed fixes.

The problem also happened with a 3.6 engine.
Engine Version: 3.6.12.3-0.1.el6
Destination host:
OS Version: RHEL - 7.4 - 18.el7
Kernel Version: 3.10.0 - 693.17.1.el7.x86_64
KVM Version: 2.9.0 - 16.el7_4.14
LIBVIRT Version: libvirt-3.2.0-14.el7_4.9
VDSM Version: vdsm-4.20.17-1.el7ev
Source host:
OS Version: RHEL - 7.4 - 18.el7
Kernel Version: 3.10.0 - 693.17.1.el7.x86_64
KVM Version: 2.6.0 - 28.el7_3.15
LIBVIRT Version: libvirt-3.2.0-14.el7_4.9
VDSM Version: vdsm-4.17.43-1.el7ev
See logs attached.

No doc_text needed; it should Just Work.

All patches merged to master, backport in progress.

Nope, still POST: the first round of patches was merged on the 4.2 branch, but another (and last) round is due. That will likely miss the 4.20.18 tag, but should make it in time for 4.20.19 - thus 4.2.2 GA.

Verified with:
Engine version: 4.2.2.1-0.1.el7
Host 4.2:
OS Version: RHEL - 7.5 - 6.el7
Kernel Version: 3.10.0 - 855.el7.x86_64
KVM Version: 2.9.0 - 16.el7_4.13.1
LIBVIRT Version: libvirt-3.9.0-13.el7
VDSM Version: vdsm-4.20.19-1.el7ev
Host 4.1:
OS Version: RHEL - 7.5 - 6.el7
Kernel Version: 3.10.0 - 851.el7.x86_64
KVM Version: 2.10.0 - 21.el7
LIBVIRT Version: libvirt-3.9.0-13.el7
VDSM Version: vdsm-4.19.46-1.el7ev
Steps:
1. Start VM on 4.1 host
2. Migrate VM to 4.2 host
Tested with:
- VM with snapshot
- VM with RNG (urandom)
- VM with RNG (hwrng)
- VM with hotplugged memory and CPU
- VM with SPICE + 4 monitors
- VM with VNC
- VM in paused state
- Headless VM
- VM with direct LUN
- VM with iSCSI disk
All pass.

It looks like our case. My colleagues have found this workaround:
1. Connect to the engine db:
-bash-4.2$ psql
psql (9.2.23, server 9.5.9)
WARNING: psql version 9.2, server version 9.5. Some psql features might not work.
Type "help" for help.
postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
engine | engine | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
ovirt_engine_history | ovirt_engine_history | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
ovirt_engine_reports | ovirt_engine_reports | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres, postgres=CTc/postgres
template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | postgres=CTc/postgres, =c/postgres
(6 rows)
postgres=# \connect engine
psql (9.2.23, server 9.5.9)
WARNING: psql version 9.2, server version 9.5. Some psql features might not work.
You are now connected to database "engine" as user "postgres".
engine=#
2. Take the information about unmanaged devices from the vm_device table:
engine=# select * from vm_device where type in ('disk', 'video', 'balloon', 'interface') and vm_id = 'ef79dedb-8405-4172-9d1d-1841d230f9fd' and is_managed = 'f';
3. Shut down your vm from the oVirt web interface.
4. Delete all unmanaged devices from the vm_device table:
engine=# delete from vm_device where type in ('disk', 'video', 'balloon', 'interface') and vm_id = 'ef79dedb-8405-4172-9d1d-1841d230f9fd' and is_managed = 'f';
5. Start your vm from the oVirt web interface. It should boot fine now.

(In reply to Andy Clark from comment #29)
I doubt those steps would fix this bug:
1. disk/video/balloon/interface devices are never created as unmanaged devices
2. without attaching the unplugged disks to the VM, the VM is not supposed to boot (unless you configured another bootable device like a network interface), and even if it boots, it won't have its disks.

> I doubt those steps would fix this bug:

They should not. It is just a workaround; fixes should be in the code, I presume.

> 1.
disk/video/balloon/interface devices are never created as unmanaged devices

It is true, but we are not talking about how they were created; we are talking about what happened to them after migration.

> 2. without attaching the unplugged disks to the VM, the VM is not supposed to boot (unless you configured another bootable device like a network interface) and even if it boots, it won't have its disks.

It is also true, but we do not actually detach them. It is about the records in the db, which are duplicated and stay in an unmanaged state.

(In reply to Andy Clark from comment #31)
> > 1. disk/video/balloon/interface devices are never created as unmanaged devices
>
> It is true, but we are not talking about how they were created, but about
> what happened with them after migration.

Right, I meant that on the engine side those devices are never supposed to be created as unmanaged devices - either upon starting the VM or migrating it. For instance, when VDSM reports a disk that cannot be correlated with one of the devices the engine knows about, this device is not added as an unmanaged device but rather ignored [1].

> > 2. without attaching the unplugged disks to the VM, the VM is not supposed to boot (unless you configured another bootable device like a network interface) and even if it boots, it won't have its disks.
>
> It is also true, but we do not actually detach them. It is about the records
> in the db, which are duplicated and stay in an unmanaged state.

That's not accurate - it is true that the disk is not actually unplugged from the running VM, but on the engine side the disk is actually detached from the VM. We hold a relationship between each disk and the VM(s) that use it; when the disk's device is shown as unplugged, that relation is updated so the disk is detached from the VM and therefore won't be part of the 'hardware' of the VM the next time it is started (unless you activate it).
[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/libvirt/VmDevicesConverter.java#L432

This bugzilla is included in the oVirt 4.2.2 release, published on March 28th 2018. Since the problem described in this bug report should be resolved in the oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.

I just updated a 4.2.1 cluster with this issue to 4.2.2, and the VMs from before 4.2 are still in the same state - network unplugged and disk deactivated.

(In reply to Chris Adams from comment #34)
> I just updated a 4.2.1 cluster with this issue to 4.2.2, and the VMs from
> before 4.2 are still in the same state - network unplugged and disk
> deactivated.

Hi Chris, can you share logs?

Francesco, should we file a new BZ?

(In reply to Israel Pinto from comment #35)
> Hi Chris,
> Can you share logs?
>
> Francesco, should we file a new BZ?

Yes please - but with the logs.

Which logs specifically would you like? Do you want them on this bug, or should I go ahead and create a new one?

(In reply to Chris Adams from comment #38)
> Which logs specifically would you like? Do you want them on this bug, or
> should I go ahead and create a new one?
Please file a new bug. I'm afraid the VMs will need manual fixing as outlined by Arik previously (e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1542117#c32). AFAIK there is no way to automatically fix the issue. The good news is that nothing should be lost; it's just that Engine lost track of which device belonged to which VM. Let's talk in the new BZ about any needed logs; depending on the specific problem reported, it is possible we need no new logs.

Before I open a new bug - I guess if the issue is not expected to be automatically fixed, then I misunderstood. My dev system was upgraded from 4.1 to 4.2.0, 4.2.1, and then 4.2.2, and I haven't rebooted all the VMs. Is the fix supposed to handle future 4.1.x -> 4.2.2 (or higher) upgrades correctly, or am I going to have to manually fix each VM?

(In reply to Chris Adams from comment #40)
> Before I open a new bug - I guess if the issue is not expected to be
> automatically fixed, then I misunderstood. My dev system was upgraded from
> 4.1 to 4.2.0, 4.2.1, and then 4.2.2, and I haven't rebooted all the VMs. Is
> the fix supposed to handle future 4.1.x -> 4.2.2 (or higher) upgrades
> correctly, or am I going to have to manually fix each VM?

Vdsm is the component that caused the devices to be disassociated from their VMs (it failed to report some key info to Engine). Unfortunately Vdsm cannot automatically fix this. So either Engine can, or people will need to fix this manually - or create some script to do that. Arik, any insight?
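A "script to do that", building on the comment #29 workaround, might look like the sketch below. It only generates the SQL for review rather than executing anything; the vm_device table and column names are taken from that comment, the function name is invented, and the whole thing is a hypothetical aid, not a supported fix - shut the VM down and back up the engine database first.

```python
# Generate review-then-run SQL for the duplicate unmanaged device rows of
# one VM (hypothetical helper based on the workaround in comment #29).
DEVICE_TYPES = ("disk", "video", "balloon", "interface")

def cleanup_sql(vm_id):
    """Return (select, delete) statements for one VM's unmanaged device rows."""
    types = ", ".join("'%s'" % t for t in DEVICE_TYPES)
    where = ("type IN (%s) AND vm_id = '%s' AND is_managed = 'f'"
             % (types, vm_id))
    # Inspect first; run the DELETE only after reviewing the SELECT output.
    return ("SELECT * FROM vm_device WHERE %s;" % where,
            "DELETE FROM vm_device WHERE %s;" % where)

select_stmt, delete_stmt = cleanup_sql("ef79dedb-8405-4172-9d1d-1841d230f9fd")
```

Feeding the generated statements to psql against the engine database would then mirror steps 2 and 4 of the workaround.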
I agree with the above - at the moment, VMs whose devices were correlated with the third mechanism but had incorrect UUIDs need to be fixed manually or with a (relatively simple) script.

We currently have three mechanisms for correlating the devices reported by VDSM with those that appear in the database:
1. By user-aliases - the best approach, but it only applies to VMs that were started with recent versions of oVirt and recent versions of libvirt.
2. By device properties - what we use when user-aliases are not available in 4.2 clusters (e.g., on CentOS 7.4).
3. By device UUIDs - intended for cluster levels lower than 4.2. It assumes VDSM reports devices with the UUIDs that were assigned by the engine.

We have to use the third mechanism for VDSM <= 4.1 (which does not support dumpxmls). However, it may be possible to improve the second mechanism and then, on cluster versions < 4.2, to try dumpxmls and use it if it is supported. It may be possible to fix 'corrupted' VMs that way.
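The three mechanisms can be pictured as a fallback chain, roughly like the sketch below; every name, the dict shapes, and the version check are invented for illustration and do not match the actual VmDevicesConverter code.

```python
def correlate(reported, known_devices, cluster_version):
    """Match one libvirt-reported device to a device the engine knows about.

    Illustrative fallback chain only, not the real engine logic:
    user-alias first, then device properties, then engine-assigned UUID.
    """
    # 1. By user-alias (best; needs recent oVirt and libvirt versions).
    alias = reported.get("user_alias")
    if alias is not None:
        for dev in known_devices:
            if dev.get("user_alias") == alias:
                return dev
    # 2. By device properties (4.2 clusters without user-aliases).
    if cluster_version >= (4, 2):
        for dev in known_devices:
            if (dev["type"], dev.get("address")) == \
                    (reported["type"], reported.get("address")):
                return dev
    # 3. By engine-assigned UUID (cluster level < 4.2): assumes VDSM still
    #    reports the UUIDs the engine assigned - the assumption broken here.
    for dev in known_devices:
        if dev["id"] == reported.get("id"):
            return dev
    return None  # unmatched: ignored rather than added as unmanaged
```

When the UUID step fails, as in this bug, the function returns None and the reported device is simply ignored, which matches how the devices here ended up disassociated from their VMs.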