Description of problem:
When trying to live-migrate a VM from a 4.2 host back to a 4.1 host, the migration fails and the VM crashes with the error:

ERROR (jsonrpc/4) [virt.vm] (vmId='...') Alias not found for device type balloon during migration at destination host (vm:4631)

Version-Release number of selected component (if applicable):
source: vdsm-4.20.9-1.el7.centos.x86_64
destination: vdsm-4.19.31-1.el7.centos.x86_64

How reproducible:

Steps to Reproduce:
1. Update a 4.1 host to 4.2
2. Live migrate a VM from an existing 4.1 host to the 4.2 host
3. Live migrate the VM back to the 4.1 host

Actual results:
VM crashes on the second migration

Expected results:
VM migrates fine both times, as both hosts are using the same cluster version so the VM definitions must match.
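To make the failure mode concrete, here is a minimal hypothetical sketch in Python (not the actual vdsm code; function and field names are illustrative only) of why a device entry that arrives without a libvirt alias can make the destination host abort the incoming migration:

# Hypothetical sketch of the failure mode reported above; NOT vdsm's code.
def find_alias(conf_devices, device_type):
    """Return the libvirt alias recorded for the first device of the given
    type, or raise if the incoming configuration carries none."""
    for dev in conf_devices:
        if dev.get('type') == device_type and dev.get('alias'):
            return dev['alias']
    raise LookupError(
        'Alias not found for device type %s during migration at '
        'destination host' % device_type)

# Example: the device list sent by the source is missing the balloon alias,
# so the destination fails while setting the VM up.
devices = [
    {'type': 'balloon', 'specParams': {'model': 'virtio'}},   # no 'alias'
    {'type': 'rng', 'alias': 'rng0', 'specParams': {'source': 'urandom'}},
]
find_alias(devices, 'balloon')   # raises LookupError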
Possible mis-initialization of rng specParams on incoming migration (and possibly create) in 4.2 using the vmconf format. That is needed for migration to <4.2 hosts later on.

Missing logs for the VM migration in step 2, and the engine version and log. Evgheni, can you add those?

Workaround is to remove the virtio-rng device from the VM (requires a restart), or to upgrade the destination host to 4.2. Decreasing Sev.
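For illustration only, a hedged sketch of the kind of normalization implied above (hypothetical helper and field names, not the actual vdsm patch): the rng device's specParams would have to be (re)initialized on the incoming side so that a later outgoing migration to a <4.2 host can serialize them in the legacy format.

# Hedged sketch only; names and mapping are assumptions, not vdsm's code.
_RNG_SOURCE_BY_PATH = {
    '/dev/urandom': 'urandom',
    '/dev/random': 'random',
    '/dev/hwrng': 'hwrng',
}

def normalize_rng_spec_params(dev, backend_path='/dev/urandom'):
    """Ensure a virtio-rng device dict carries legacy-style specParams so
    a later outgoing migration to a <4.2 host can serialize them."""
    spec = dev.setdefault('specParams', {})
    if 'source' not in spec:
        spec['source'] = _RNG_SOURCE_BY_PATH.get(backend_path, 'urandom')
    return dev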
(In reply to Michal Skrivanek from comment #4)
> Possible mis-initialization of rng specparam on incoming migration (and
> possibly create) in 4.2 using vmconf format. That is needed for migration to
> <4.2 hosts later on.
>
> Missing logs for VM migration in step 2 and engine version and log, Evgheni,
> can you add that?
>
> Workaround is to remove virtio-rng device from VM (requires restart), or
> upgrade the destination host to 4.2. Decreasing Sev.

I don't think either workaround is valid. IIRC, both require downtime.
Engine is 4.1.7, cluster compatibility level is 4.1. Will provide the logs ASAP. In any case, I believe VMs crashing during live migration is a bug; this shouldn't happen. Worst case, the migration should fail with the VM remaining on the source host.
(In reply to Yaniv Kaul from comment #5)
> (In reply to Michal Skrivanek from comment #4)
> > Possible mis-initialization of rng specparam on incoming migration (and
> > possibly create) in 4.2 using vmconf format. That is needed for migration to
> > <4.2 hosts later on.
> >
> > Missing logs for VM migration in step 2 and engine version and log, Evgheni,
> > can you add that?
> >
> > Workaround is to remove virtio-rng device from VM (requires restart), or
> > upgrade the destination host to 4.2. Decreasing Sev.
>
> I don't think either workarounds are valid. IIRC, both require downtime.

I disagree. Urgent severity means existing workloads are affected in a significant way without any workaround. Disabling migrations while completing the relatively simple host upgrade on the rest of the cluster sounds feasible enough to me and doesn't require downtime.
(In reply to Evgheni Dereveanchin from comment #6)
> Engine is 4.1.7, cluster compatibility level is 4.1. Will provide the logs
> ASAP. In any case, I believe VMs crashing during live migration is a bug,
> this shouldn't happen. Worst case the migration should fail with the VM
> remaining on source host.

Are you saying that on that migration attempt the source VM crashes? IIUC, you "only" experience a failed migration while the VM still continues to run (at least somewhere).
Regarding #10 - what if there's a scheduling policy enabled on the cluster that will auto-migrate VMs to load balance the cluster? Or ovirt-optimizer, if that still exists? I, for example, ran into this issue after hitting bz1522878: an attempt to evacuate VMs caused unexpected crashes.

As for the migration, the VM is running on the destination but not visible to VDSM. I'm not sure what happens if I try to start it; hopefully VM reservations will kick in and not let the VM start (then it'll be stuck in "down" state on the engine). Alternatively, the VM will start somewhere else and disk corruption will probably occur.
Evgheni, could you also please provide earlier engine.log? It seems tracking of that VM is also not entirely correct, so it's interesting to follow the logs from the original VM start time. Was it started on -03 originally? Where did the engine think it was running when you triggered the migration - can you confirm that at that time the expected original host was _not_ running that VM?
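As an aside, one low-level way to check which host actually has the VM's qemu process, independent of what vdsm or the engine report - a small sketch assuming libvirt-python is installed on the host (a read-only connection may still require credentials depending on the host's libvirt configuration):

import libvirt

def running_domains(uri='qemu:///system'):
    """List the names of libvirt domains currently active on this host."""
    conn = libvirt.openReadOnly(uri)
    try:
        return [dom.name() for dom in conn.listAllDomains() if dom.isActive()]
    finally:
        conn.close()

if __name__ == '__main__':
    print(running_domains())

Running this on each candidate host shows where the domain is really active, even when it is reported as down in the web UI.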
The test VM was created today, then started on ovirt-srv01 (4.2 - started there automatically as it has fewer VMs). It was then successfully migrated to ovirt-srv02. I then migrated it back to ovirt-srv01, which succeeded but once again triggered bz1522878, with VDSM restarting on ovirt-srv01. Then there were several unsuccessful migration attempts until eventually the VM went into "down" state on the engine.
(In reply to Evgheni Dereveanchin from comment #12)
> As for the migration, the VM is running on destination but not visible to
> VDSM

I can confirm that I've seen a situation when the VM was reported as down in the web UI, while it was running on the destination. So it should be reproducible.
Migrating a VM without a Random Generator seems to work without problems. From what I recall from bz1337101, RNG is enabled by default on all VMs created after 4.0.

As for VMs running in libvirt but invisible to VDSM, I have this situation right now and I'm unsure if it's the same bug or a different one. Please advise if any more data is needed from my side to troubleshoot that.
Patch applied to the affected host. Migrating a VM after a VDSM restart succeeds now; however, in the admin portal the VM shows up as having no graphical console after migration. Will open a new bug for that after more tests, as the issue described in this BZ is now fixed.
Status:
Engine: Software Version: 4.2.1-0.0.master.20171211205712.git7b1f4d1.el7.centos
vdsm 4.1: vdsm-4.19.42-1.el7ev.x86_64
vdsm 4.2: vdsm-4.20.9-30.gite026991.el7.centos

Steps (as provided by Milan):
1. Start the VM on 4.1 with the configuration described in the case
2. Migrate it to 4.2
3. Restart Vdsm on 4.2
4. Migrate the VM back to 4.1

Cases and Results:
1. Run all the below VMs on the 4.2 host and set the host to maintenance - PASS
2. Migrate VM with snapshot from 4.1 to 4.2 - PASS
3. Migrate VM with snapshot from 4.2 to 4.1 - PASS
4. Migrate VM with RNG (urandom) from 4.1 to 4.2 - PASS
5. Migrate VM with RNG (urandom) from 4.1 to 4.2 - PASS
6. Migrate VM with RNG (hwrng) from 4.1 to 4.2 - PASS
7. Migrate VM with RNG (hwrng) from 4.2 to 4.1 - PASS
8. Migrate VM with hotplugged memory and CPU from 4.1 to 4.2 - PASS
9. Migrate VM with hotplugged memory and CPU from 4.2 to 4.1 - PASS
10. Migrate VM with SPICE from 4.1 to 4.2 - PASS
11. Migrate VM with SPICE from 4.2 to 4.1 - PASS
12. Migrate VM with VNC from 4.1 to 4.2 - PASS
13. Migrate VM with VNC from 4.2 to 4.1 - PASS
14. Migrate paused VM from 4.2 to 4.1 - PASS
15. Migrate paused VM from 4.1 to 4.2 - PASS
16. Migrate VM with SR-IOV NIC configured from 4.1 to 4.2 - PASS - reported by mburman
17. Migrate VM with SR-IOV NIC configured from 4.2 to 4.1 - PASS - reported by mburman
18. Migrate VM with Direct LUN based disk from 4.1 to 4.2 - Failed to start VM, BZ 1524941
19. Migrate VM with Direct LUN based disk from 4.2 to 4.1 - Failed to start VM, BZ 1524941
20. Migrate headless VM from 4.2 to 4.1 - PASS
21. Migrate headless VM from 4.1 to 4.2 - PASS
22. Migrate smartcard VM from 4.1 to 4.2 - PASS
23. Migrate smartcard VM from 4.1 to 4.2 - PASS
Verified with: https://bugzilla.redhat.com/show_bug.cgi?id=1522901#c19
This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017. Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.