+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1660071 +++
======================================================================

Created attachment 1515044 [details]
logs

Description of problem:
Regression in migration of a VM that starts in pause mode: the migration took 11 hours.

Version-Release number of selected component (if applicable):
Engine Version: 4.2.8-0.1.el7ev
OS Version: RHEL - 7.6 - 4.el7
LIBVIRT Version: libvirt-4.5.0-10.el7_6.3
VDSM Version: vdsm-4.20.43-1.el7

How reproducible:
100%

Steps to Reproduce:
1. Create a VM and set it to start in pause mode
2. Run the VM
3. Migrate the VM

Actual results:
Migration gets stuck and eventually finishes after 11 hours.

Expected results:
Migration completes fine.

Additional info:

(Originally by pchocron)
The attached vdsm logs are from a different time period; they do not include the migration. The migration started at 2018-12-16 17:13:44,196+02 and ended at 2018-12-17 05:13:56,950+02, but the vdsm logs on the source host start at 2018-12-17 13:01:11,258+02 and on the destination host at 2018-12-17 14:01:01,646+02.

(Originally by Andrej Krejcir)
Re-targeting to 4.3.1 since it is missing a patch, an acked blocker flag, or both (Originally by Ryan Barry)
Closing, since this has not been observed in months (Originally by Ryan Barry)
Created attachment 1551164 [details] Source logs right timeframe (Originally by Ryan Barry)
Created attachment 1551165 [details] Destination logs right timeframe (Originally by Ryan Barry)
Andrej, Milan, ideas?

I was able to log into the environment to check (and get the right vdsm logs). The VM is up, and the vdsm logs from the destination host show that the appropriate JSON-RPC event is fired, but it is never received by the engine, so the VM is stuck in migration even though it is actually up. Kicking vdsm to refresh VMs unstuck the engine, but it looks racy.

Polina, how did you reproduce manually?

(Originally by Ryan Barry)
I reproduced it manually. The VM was stuck in migration for a long period. (Originally by Polina Agranat)
Actually, here is the event note:

Migration completed (VM: golden_env_mixed_virtio_1_0, Source: host_mixed_3, Destination: host_mixed_2, Duration: 6 hours 24 minutes 19 seconds, Total: 6 hours 24 minutes 27 seconds, Actual downtime: 5ms)

(Originally by Polina Agranat)
Also, I'm reproducing it right now in the same environment.

Event ID 62: Migration started (VM: golden_env_mixed_virtio_1_0, Source: host_mixed_3, Destination: host_mixed_2, User: admin@internal-authz).

The migration is stuck.

(Originally by Polina Agranat)
I can reproduce the bug. Vdsm doesn't detect the end of the migration on the destination because it waits for a resume event from libvirt on the destination, which never arrives when the VM is paused (this may be a libvirt change and the cause of the regression). That means Vdsm keeps reporting the Migration Destination status to Engine.

A fix may be a bit delicate, since libvirt apparently doesn't send a specific migration-finished event in such a case. It sends more general events, and we should be careful not to confuse the end of migration with something else.

(Originally by Milan Zamazal)
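For illustration only, a minimal libvirt-python sketch of a lifecycle-event callback on the destination host, assuming the behaviour described above: a running VM ends an incoming migration with a RESUMED/MIGRATED event, while a paused VM only produces more generic SUSPENDED events. This is not the actual Vdsm code; the handler and logging are hypothetical.

    import libvirt

    def lifecycle_cb(conn, dom, event, detail, opaque):
        # Normal case: a running VM resumes on the destination -> migration done.
        if (event == libvirt.VIR_DOMAIN_EVENT_RESUMED and
                detail == libvirt.VIR_DOMAIN_EVENT_RESUMED_MIGRATED):
            print("migration finished (resumed):", dom.name())
        # Paused VM: no resume event arrives; only a generic suspended event
        # may be seen, so care is needed not to treat an unrelated pause as
        # the end of the migration (the ambiguity mentioned above).
        elif event == libvirt.VIR_DOMAIN_EVENT_SUSPENDED:
            print("suspended event, detail=%d:" % detail, dom.name())

    libvirt.virEventRegisterDefaultImpl()      # built-in event loop
    conn = libvirt.open("qemu:///system")
    conn.domainEventRegisterAny(
        None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE, lifecycle_cb, None)

    while True:                                # dispatch libvirt events
        libvirt.virEventRunDefaultImpl()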
Still reproducible? (Originally by Ryan Barry)
Created attachment 1623746 [details]
migration of paused VM

Hi Ryan, it is still reproducible on ovirt-engine-4.3.6.6-0.1.el7.noarch. I attach the logs migration_of_paused_vm.tar.gz, where you can see the migration of golden_env_mixed_virtio_5 took 12 hours and then succeeded. Sometimes, with such a configuration, the migration fails and the VM is down.

2019-10-08 21:38:11,005+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [92795e81-bcdf-4bd7-849f-18ee7ae5ccdc] EVENT_ID: VM_MIGRATION_START(62), Migration started (VM: golden_env_mixed_virtio_5, Source: host_mixed_1, Destination: host_mixed_3, User: admin@internal-authz).
2019-10-08 21:38:16,019+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-10) [] Handing over VM '838d7e75-125b-40e7-8c54-c99efd71f5b5'(golden_env_mixed_virtio_5) to Host '9fc15ca8-8494-4cb4-bc56-22ad80e341f4'. Setting VM to status 'MigratingTo'
2019-10-09 09:38:12,497+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-4) [] VM '838d7e75-125b-40e7-8c54-c99efd71f5b5'(golden_env_mixed_virtio_5) moved from 'MigratingTo' --> 'Paused'

(Originally by Polina Agranat)
Verified on 4.4: http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-23 by running virt/rhevmtests.compute.virt.reg_vms.vm_power_actions_test.TestSuspendVM.test_migrate_suspend_vm
https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/rhv-4.4-ge-runner-compute/116/console

In 4.3 the bug still happens.

(Originally by Polina Agranat)
Created attachment 1683361 [details] logs
Verification on http://bob-dr.lab.eng.brq.redhat.com/builds/4.3/rhv-4.3.10-2: the first migration passes; the second completes with a QEMU crash.

Event in engine: Migration failed: Lost connection with qemu process (VM: pause_vm, Source: host_mixed_3).

In the attached engine.log:

2020-04-30 19:31:18,515+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [] EVENT_ID: VM_MIGRATION_ABORT(64), Migration failed: Lost connection with qemu process (VM: pause_vm, Source: host_mixed_3).
I can reproduce it on both 4.3 and 4.4; QEMU stops on the second migration:

qemu-kvm: Failed to load virtio_pci/modern_queue_state:avail
qemu-kvm: Failed to load virtio_pci/modern_state:vqs
qemu-kvm: Failed to load virtio/extra_state:extra_state
qemu-kvm: Failed to load virtio-rng:virtio
qemu-kvm: error while loading state for instance 0x0 of device '0000:00:01.2:00.0/virtio-rng'
qemu-kvm: load of migration failed: Input/output error
shutting down, reason=crashed

I can also reproduce it using virsh with a reasonably minimal domain XML; I'll file a QEMU bug.
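For reference, a rough libvirt-python sketch of the reproduction flow (start a domain paused, then migrate it twice without ever resuming it). The domain name, host URIs, and migration flags are placeholders and may need adjusting for a given environment; this is an illustration of the flow, not the exact reproducer used.

    import libvirt

    # Placeholders: adjust for the actual hosts and domain under test.
    SRC_URI = "qemu+ssh://source-host/system"
    DST_URI = "qemu+ssh://dest-host/system"
    DOMAIN = "pause_vm"

    src = libvirt.open(SRC_URI)
    dom = src.lookupByName(DOMAIN)

    # Start the VM paused (equivalent of "start in pause mode").
    dom.createWithFlags(libvirt.VIR_DOMAIN_START_PAUSED)

    # First migration of the still-paused VM: expected to succeed after the fix.
    flags = libvirt.VIR_MIGRATE_PEER2PEER | libvirt.VIR_MIGRATE_PERSIST_DEST
    dom.migrateToURI3(DST_URI, {}, flags)

    # Second migration, still without ever resuming the VM: this is where
    # QEMU fails to load the virtio-rng state on the receiving side and crashes.
    dst = libvirt.open(DST_URI)
    dom2 = dst.lookupByName(DOMAIN)
    dom2.migrateToURI3(SRC_URI, {}, flags)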
To summarize the current state:

- The problem with paused VMs not being able to migrate at all was already fixed a couple of months ago.

- The remaining problem is that a VM that was started as paused and never switched to running can't be migrated more than once (until it is switched to running). This is a QEMU bug, which is not very easy to fix. Given that initially paused VMs that have never been switched to running are usually not migrated multiple times, and that an easy workaround is to simply power off the VM and start it again, this is not an urgent issue.

Should we split the bug to handle the two issues separately, or should we keep it as is (and retarget it etc. accordingly)?
(In reply to Milan Zamazal from comment #39)
> To summarize the current state:
>
> - The problem with paused VMs not being able to migrate at all was already
> fixed a couple of months ago.
>
> - The remaining problem is that a VM that was started as paused and never
> switched to running can't be migrated more than once (until it is switched
> to running). This is a QEMU bug, which is not very easy to fix. Given that
> initially paused VMs that have never been switched to running are usually
> not migrated multiple times, and that an easy workaround is to simply power
> off the VM and start it again, this is not an urgent issue.
>
> Should we split the bug to handle the two issues separately, or should we
> keep it as is (and retarget it etc. accordingly)?

I moved the double-migration QEMU problem to a separate bug, https://bugzilla.redhat.com/1887391, so it doesn't block the general fix of paused VM migrations.
Verified on ovirt-engine-4.3.11.4-0.1.el7.noarch, vdsm-4.30.50-1.el7ev.x86_64.

The first migration of a paused VM works. The second migration fails, but we have a separate bug for that: https://bugzilla.redhat.com/1887391. So, I can verify this.
This shipped in https://access.redhat.com/errata/RHBA-2020:5217
Needs to be verified in a 4.4 build. Reopening.
eh, TM mess. This has been tested in comment #41 and shipped in 4.3.11 via https://access.redhat.com/errata/RHBA-2020:4113
(In reply to Lukas Svaty from comment #43)
> Needs to be verified in a 4.4 build. Reopening.

And bz 1660071 was verified on 4.4.