Bug 1660071 - Regression in Migration of VM that starts in pause mode: took 11 hours
Summary: Regression in Migration of VM that starts in pause mode: took 11 hours
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.2.8
Hardware: x86_64
OS: Linux
medium
medium
Target Milestone: ovirt-4.4.0
: 4.4.0
Assignee: Milan Zamazal
QA Contact: Polina
URL:
Whiteboard: Automation, Regression
Depends On:
Blocks: 1819721
TreeView+ depends on / blocked
 
Reported: 2018-12-17 12:55 UTC by Pedut
Modified: 2020-08-04 13:17 UTC (History)
9 users (show)

Fixed In Version: vdsm-4.40.5
Doc Type: Bug Fix
Doc Text:
Previously, when migrating a paused virtual machine, the Red Hat Virtualization Manager did not always recognize that the migration completed. With this update, the Manager immediately recognizes when migration is complete.
Clone Of:
: 1819721 (view as bug list)
Environment:
Last Closed: 2020-08-04 13:16:51 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
logs (953.74 KB, application/x-xz)
2018-12-17 12:55 UTC, Pedut
no flags Details
Source logs right timeframe (455.01 KB, application/x-xz)
2019-04-02 22:02 UTC, Ryan Barry
no flags Details
Destination logs right timeframe (447.02 KB, application/x-xz)
2019-04-02 22:03 UTC, Ryan Barry
no flags Details
migration of paused VM (1.04 MB, application/gzip)
2019-10-09 08:38 UTC, Polina
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2020:3247 0 None None None 2020-08-04 13:17:21 UTC
oVirt gerrit 106477 0 master MERGED virt: Handle incoming migrations of paused VMs 2021-02-01 08:13:46 UTC

Description Pedut 2018-12-17 12:55:26 UTC
Created attachment 1515044 [details]
logs

Description of problem:
Regression in Migration of VM that starts in pause mode: took 11 hours.

Version-Release number of selected component (if applicable):
Engine Version: 4.2.8-0.1.el7ev
OS Version: RHEL - 7.6 - 4.el7
LIBVIRT Version: libvirt-4.5.0-10.el7_6.3
VDSM Version: vdsm-4.20.43-1.el7

How reproducible:
100%

Steps to Reproduce:
1. Create VM and set start in pause mode
2. Run VM
3. Migrate VM

Actual results:
Migration stuck and eventually finished after 11 hours.

Expected results:
Migration completed fine.

Additional info:

Comment 2 Andrej Krejcir 2019-01-02 10:52:18 UTC
The attached vdsm logs are from a different time period, they do not include the migration.

The migration started at 2018-12-16 17:13:44,196+02 and ended at 2018-12-17 05:13:56,950+02.
But the vdsm logs on the source host start at 2018-12-17 13:01:11,258+02 and on the destination host at 2018-12-17 14:01:01,646+02.

Comment 3 Ryan Barry 2019-01-21 14:54:27 UTC
Re-targeting to 4.3.1 since it is missing a patch, an acked blocker flag, or both

Comment 6 Ryan Barry 2019-03-19 11:40:42 UTC
Closing, since this has not been observed in months

Comment 9 Ryan Barry 2019-04-02 22:02:57 UTC
Created attachment 1551164 [details]
Source logs right timeframe

Comment 10 Ryan Barry 2019-04-02 22:03:31 UTC
Created attachment 1551165 [details]
Destination logs right timeframe

Comment 11 Ryan Barry 2019-04-02 22:11:58 UTC
Andrej, Milan, ideas?

I was able to log into the enviroment to check (and get the right vdsm logs). The VM is up, and the vdsm logs from the destination host fire the appropriate JSONRPC event, but it's never received by the engine, so the VM is stuck in migration.

It is actually up. Kicking vdsm to refresh VMs unstuck the engine, but it looks racey 

Polina, how did you reproduce manually?

Comment 12 Polina 2019-04-03 06:35:49 UTC
I reproduced it manually . The VM was stuck in migration a long period

Comment 13 Polina 2019-04-03 06:38:29 UTC
actually here is the event note:
Migration completed (VM: golden_env_mixed_virtio_1_0, Source: host_mixed_3, Destination: host_mixed_2, Duration: 6 hours 24 minutes 19 seconds, Total: 6 hours 24 minutes 27 seconds, Actual downtime: 5ms)

Comment 14 Polina 2019-04-03 06:42:50 UTC
and also, I'm reproducing it just now in the same env.
Event ID 62 : Migration started (VM: golden_env_mixed_virtio_1_0, Source: host_mixed_3, Destination: host_mixed_2, User: admin@internal-authz). 

migration is stuck.

Comment 15 Milan Zamazal 2019-04-03 09:55:02 UTC
I can reproduce the bug. Vdsm doesn't detect end of the migration on the destination because it waits for a resume event from libvirt on the destination, which doesn't come in case the VM is paused (this may be a libvirt change and the cause of the regression). That means Vdsm keeps reporting Migration Destination status to Engine.

A fix may be a bit delicate, since libvirt apparently doesn't send a specific migration-finished event in such a case. It sends more general events and we should be cautious not to confuse end of migration with something else.

Comment 16 Ryan Barry 2019-10-08 18:16:37 UTC
Still reproducible?

Comment 17 Polina 2019-10-09 08:38:34 UTC
Created attachment 1623746 [details]
migration of paused VM

Hi Ryan, it is still reproducible on ovirt-engine-4.3.6.6-0.1.el7.noarch

I attach the logs migration_of_paused_vm.tar.gz where you could see migration took 12 hr for golden_env_mixed_virtio_5, then succeeded.
Sometimes for such configuration migration fails and the VM is down.

2019-10-08 21:38:11,005+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [92795e81-bcdf-4bd7-849f-18ee7ae5ccdc] EVENT_ID: VM_MIGRATION_START(62), Migration started (VM: golden_env_mixed_virtio_5, Source: host_mixed_1, Destination: host_mixed_3, User: admin@internal-authz).
2019-10-08 21:38:16,019+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-10) [] Handing over VM '838d7e75-125b-40e7-8c54-c99efd71f5b5'(golden_env_mixed_virtio_5) to Host '9fc15ca8-8494-4cb4-bc56-22ad80e341f4'. Setting VM to status 'MigratingTo'


2019-10-09 09:38:12,497+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-4) [] VM '838d7e75-125b-40e7-8c54-c99efd71f5b5'(golden_env_mixed_virtio_5) moved from 'MigratingTo' --> 'Paused'

Comment 18 Polina 2020-03-18 16:18:16 UTC
verified on 4.4.http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-23

by running virt/rhevmtests.compute.virt.reg_vms.vm_power_actions_test.TestSuspendVM.test_migrate_suspend_vm

https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/rhv-4.4-ge-runner-compute/116/console

I 4.3 the bug still happens

Comment 25 errata-xmlrpc 2020-08-04 13:16:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247


Note You need to log in before you can comment on or make changes to this bug.