Bug 1660071

Summary: Regression in Migration of VM that starts in pause mode: took 11 hours
Product: Red Hat Enterprise Virtualization Manager Reporter: Pedut <pchocron>
Component: vdsmAssignee: Milan Zamazal <mzamazal>
Status: CLOSED ERRATA QA Contact: Polina <pagranat>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.2.8CC: akrejcir, lsurette, mavital, michal.skrivanek, mtessun, mzamazal, pagranat, srevivo, ycui
Target Milestone: ovirt-4.4.0Keywords: Automation, Regression, Reopened, ZStream
Target Release: 4.4.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: Automation, Regression
Fixed In Version: vdsm-4.40.5 Doc Type: Bug Fix
Doc Text:
Previously, when migrating a paused virtual machine, the Red Hat Virtualization Manager did not always recognize that the migration completed. With this update, the Manager immediately recognizes when migration is complete.
Story Points: ---
Clone Of:
: 1819721 (view as bug list) Environment:
Last Closed: 2020-08-04 13:16:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1819721    
Attachments:
Description Flags
logs
none
Source logs right timeframe
none
Destination logs right timeframe
none
migration of paused VM none

Description Pedut 2018-12-17 12:55:26 UTC
Created attachment 1515044 [details]
logs

Description of problem:
Regression in Migration of VM that starts in pause mode: took 11 hours.

Version-Release number of selected component (if applicable):
Engine Version: 4.2.8-0.1.el7ev
OS Version: RHEL - 7.6 - 4.el7
LIBVIRT Version: libvirt-4.5.0-10.el7_6.3
VDSM Version: vdsm-4.20.43-1.el7

How reproducible:
100%

Steps to Reproduce:
1. Create VM and set start in pause mode
2. Run VM
3. Migrate VM

Actual results:
Migration stuck and eventually finished after 11 hours.

Expected results:
Migration completed fine.

Additional info:

Comment 2 Andrej Krejcir 2019-01-02 10:52:18 UTC
The attached vdsm logs are from a different time period, they do not include the migration.

The migration started at 2018-12-16 17:13:44,196+02 and ended at 2018-12-17 05:13:56,950+02.
But the vdsm logs on the source host start at 2018-12-17 13:01:11,258+02 and on the destination host at 2018-12-17 14:01:01,646+02.

Comment 3 Ryan Barry 2019-01-21 14:54:27 UTC
Re-targeting to 4.3.1 since it is missing a patch, an acked blocker flag, or both

Comment 6 Ryan Barry 2019-03-19 11:40:42 UTC
Closing, since this has not been observed in months

Comment 9 Ryan Barry 2019-04-02 22:02:57 UTC
Created attachment 1551164 [details]
Source logs right timeframe

Comment 10 Ryan Barry 2019-04-02 22:03:31 UTC
Created attachment 1551165 [details]
Destination logs right timeframe

Comment 11 Ryan Barry 2019-04-02 22:11:58 UTC
Andrej, Milan, ideas?

I was able to log into the enviroment to check (and get the right vdsm logs). The VM is up, and the vdsm logs from the destination host fire the appropriate JSONRPC event, but it's never received by the engine, so the VM is stuck in migration.

It is actually up. Kicking vdsm to refresh VMs unstuck the engine, but it looks racey 

Polina, how did you reproduce manually?

Comment 12 Polina 2019-04-03 06:35:49 UTC
I reproduced it manually . The VM was stuck in migration a long period

Comment 13 Polina 2019-04-03 06:38:29 UTC
actually here is the event note:
Migration completed (VM: golden_env_mixed_virtio_1_0, Source: host_mixed_3, Destination: host_mixed_2, Duration: 6 hours 24 minutes 19 seconds, Total: 6 hours 24 minutes 27 seconds, Actual downtime: 5ms)

Comment 14 Polina 2019-04-03 06:42:50 UTC
and also, I'm reproducing it just now in the same env.
Event ID 62 : Migration started (VM: golden_env_mixed_virtio_1_0, Source: host_mixed_3, Destination: host_mixed_2, User: admin@internal-authz). 

migration is stuck.

Comment 15 Milan Zamazal 2019-04-03 09:55:02 UTC
I can reproduce the bug. Vdsm doesn't detect end of the migration on the destination because it waits for a resume event from libvirt on the destination, which doesn't come in case the VM is paused (this may be a libvirt change and the cause of the regression). That means Vdsm keeps reporting Migration Destination status to Engine.

A fix may be a bit delicate, since libvirt apparently doesn't send a specific migration-finished event in such a case. It sends more general events and we should be cautious not to confuse end of migration with something else.

Comment 16 Ryan Barry 2019-10-08 18:16:37 UTC
Still reproducible?

Comment 17 Polina 2019-10-09 08:38:34 UTC
Created attachment 1623746 [details]
migration of paused VM

Hi Ryan, it is still reproducible on ovirt-engine-4.3.6.6-0.1.el7.noarch

I attach the logs migration_of_paused_vm.tar.gz where you could see migration took 12 hr for golden_env_mixed_virtio_5, then succeeded.
Sometimes for such configuration migration fails and the VM is down.

2019-10-08 21:38:11,005+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [92795e81-bcdf-4bd7-849f-18ee7ae5ccdc] EVENT_ID: VM_MIGRATION_START(62), Migration started (VM: golden_env_mixed_virtio_5, Source: host_mixed_1, Destination: host_mixed_3, User: admin@internal-authz).
2019-10-08 21:38:16,019+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-10) [] Handing over VM '838d7e75-125b-40e7-8c54-c99efd71f5b5'(golden_env_mixed_virtio_5) to Host '9fc15ca8-8494-4cb4-bc56-22ad80e341f4'. Setting VM to status 'MigratingTo'


2019-10-09 09:38:12,497+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-4) [] VM '838d7e75-125b-40e7-8c54-c99efd71f5b5'(golden_env_mixed_virtio_5) moved from 'MigratingTo' --> 'Paused'

Comment 18 Polina 2020-03-18 16:18:16 UTC
verified on 4.4.http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-23

by running virt/rhevmtests.compute.virt.reg_vms.vm_power_actions_test.TestSuspendVM.test_migrate_suspend_vm

https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/rhv-4.4-ge-runner-compute/116/console

I 4.3 the bug still happens

Comment 25 errata-xmlrpc 2020-08-04 13:16:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247