Bug 1660071
Summary: | Regression in Migration of VM that starts in pause mode: took 11 hours | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Pedut <pchocron> | ||||||||||
Component: | vdsm | Assignee: | Milan Zamazal <mzamazal> | ||||||||||
Status: | CLOSED ERRATA | QA Contact: | Polina <pagranat> | ||||||||||
Severity: | medium | Docs Contact: | |||||||||||
Priority: | medium | ||||||||||||
Version: | 4.2.8 | CC: | akrejcir, lsurette, mavital, michal.skrivanek, mtessun, mzamazal, pagranat, srevivo, ycui | ||||||||||
Target Milestone: | ovirt-4.4.0 | Keywords: | Automation, Regression, Reopened, ZStream | ||||||||||
Target Release: | 4.4.0 | ||||||||||||
Hardware: | x86_64 | ||||||||||||
OS: | Linux | ||||||||||||
Whiteboard: | Automation, Regression | ||||||||||||
Fixed In Version: | vdsm-4.40.5 | Doc Type: | Bug Fix | ||||||||||
Doc Text: |
Previously, when migrating a paused virtual machine, the Red Hat Virtualization Manager did not always recognize that the migration completed. With this update, the Manager immediately recognizes when migration is complete.
|
Story Points: | --- | ||||||||||
Clone Of: | |||||||||||||
: | 1819721 (view as bug list) | Environment: | |||||||||||
Last Closed: | 2020-08-04 13:16:51 UTC | Type: | Bug | ||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||
Documentation: | --- | CRM: | |||||||||||
Verified Versions: | Category: | --- | |||||||||||
oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
Embargoed: | |||||||||||||
Bug Depends On: | |||||||||||||
Bug Blocks: | 1819721 | ||||||||||||
Attachments: |
|
The attached vdsm logs are from a different time period, they do not include the migration. The migration started at 2018-12-16 17:13:44,196+02 and ended at 2018-12-17 05:13:56,950+02. But the vdsm logs on the source host start at 2018-12-17 13:01:11,258+02 and on the destination host at 2018-12-17 14:01:01,646+02. Re-targeting to 4.3.1 since it is missing a patch, an acked blocker flag, or both Closing, since this has not been observed in months Created attachment 1551164 [details]
Source logs right timeframe
Created attachment 1551165 [details]
Destination logs right timeframe
Andrej, Milan, ideas? I was able to log into the enviroment to check (and get the right vdsm logs). The VM is up, and the vdsm logs from the destination host fire the appropriate JSONRPC event, but it's never received by the engine, so the VM is stuck in migration. It is actually up. Kicking vdsm to refresh VMs unstuck the engine, but it looks racey Polina, how did you reproduce manually? I reproduced it manually . The VM was stuck in migration a long period actually here is the event note: Migration completed (VM: golden_env_mixed_virtio_1_0, Source: host_mixed_3, Destination: host_mixed_2, Duration: 6 hours 24 minutes 19 seconds, Total: 6 hours 24 minutes 27 seconds, Actual downtime: 5ms) and also, I'm reproducing it just now in the same env. Event ID 62 : Migration started (VM: golden_env_mixed_virtio_1_0, Source: host_mixed_3, Destination: host_mixed_2, User: admin@internal-authz). migration is stuck. I can reproduce the bug. Vdsm doesn't detect end of the migration on the destination because it waits for a resume event from libvirt on the destination, which doesn't come in case the VM is paused (this may be a libvirt change and the cause of the regression). That means Vdsm keeps reporting Migration Destination status to Engine. A fix may be a bit delicate, since libvirt apparently doesn't send a specific migration-finished event in such a case. It sends more general events and we should be cautious not to confuse end of migration with something else. Still reproducible? Created attachment 1623746 [details]
migration of paused VM
Hi Ryan, it is still reproducible on ovirt-engine-4.3.6.6-0.1.el7.noarch
I attach the logs migration_of_paused_vm.tar.gz where you could see migration took 12 hr for golden_env_mixed_virtio_5, then succeeded.
Sometimes for such configuration migration fails and the VM is down.
2019-10-08 21:38:11,005+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [92795e81-bcdf-4bd7-849f-18ee7ae5ccdc] EVENT_ID: VM_MIGRATION_START(62), Migration started (VM: golden_env_mixed_virtio_5, Source: host_mixed_1, Destination: host_mixed_3, User: admin@internal-authz).
2019-10-08 21:38:16,019+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-10) [] Handing over VM '838d7e75-125b-40e7-8c54-c99efd71f5b5'(golden_env_mixed_virtio_5) to Host '9fc15ca8-8494-4cb4-bc56-22ad80e341f4'. Setting VM to status 'MigratingTo'
2019-10-09 09:38:12,497+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-4) [] VM '838d7e75-125b-40e7-8c54-c99efd71f5b5'(golden_env_mixed_virtio_5) moved from 'MigratingTo' --> 'Paused'
verified on 4.4.http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-23 by running virt/rhevmtests.compute.virt.reg_vms.vm_power_actions_test.TestSuspendVM.test_migrate_suspend_vm https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/rhv-4.4-ge-runner-compute/116/console I 4.3 the bug still happens Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:3247 |
Created attachment 1515044 [details] logs Description of problem: Regression in Migration of VM that starts in pause mode: took 11 hours. Version-Release number of selected component (if applicable): Engine Version: 4.2.8-0.1.el7ev OS Version: RHEL - 7.6 - 4.el7 LIBVIRT Version: libvirt-4.5.0-10.el7_6.3 VDSM Version: vdsm-4.20.43-1.el7 How reproducible: 100% Steps to Reproduce: 1. Create VM and set start in pause mode 2. Run VM 3. Migrate VM Actual results: Migration stuck and eventually finished after 11 hours. Expected results: Migration completed fine. Additional info: