Bug 1819721 - Regression in Migration of VM that starts in pause mode: took 11 hours
Summary: Regression in Migration of VM that starts in pause mode: took 11 hours
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.2.8
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ovirt-4.4.3
Assignee: Milan Zamazal
QA Contact: Polina
URL:
Whiteboard: Automation, Regression
Depends On: 1660071 1831103
Blocks:
 
Reported: 2020-04-01 11:33 UTC by RHV bug bot
Modified: 2020-11-26 13:48 UTC (History)
13 users

Fixed In Version: vdsm-4.30.45
Doc Type: Bug Fix
Doc Text:
Previously, when migrating a paused virtual machine, the migration status was sometimes shown as still running long after it was completed. In this release, the migration completion is recognized immediately and the status is reported correctly.
Clone Of: 1660071
Environment:
Last Closed: 2020-11-26 13:45:16 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
logs (967.86 KB, application/gzip)
2020-04-30 16:48 UTC, Polina


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 106477 0 master MERGED virt: Handle incoming migrations of paused VMs 2021-02-01 08:07:46 UTC
oVirt gerrit 108179 0 ovirt-4.3 MERGED virt: Handle incoming migrations of paused VMs 2021-02-01 08:07:46 UTC

Description RHV bug bot 2020-04-01 11:33:14 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1660071 +++
======================================================================

Created attachment 1515044 [details]
logs

Description of problem:
Regression in Migration of VM that starts in pause mode: took 11 hours.

Version-Release number of selected component (if applicable):
Engine Version: 4.2.8-0.1.el7ev
OS Version: RHEL - 7.6 - 4.el7
LIBVIRT Version: libvirt-4.5.0-10.el7_6.3
VDSM Version: vdsm-4.20.43-1.el7

How reproducible:
100%

Steps to Reproduce:
1. Create VM and set start in pause mode
2. Run VM
3. Migrate VM

Actual results:
Migration stuck and eventually finished after 11 hours.

Expected results:
Migration completed fine.

Additional info:

(Originally by pchocron)

Comment 2 RHV bug bot 2020-04-01 11:33:45 UTC
The attached vdsm logs are from a different time period, they do not include the migration.

The migration started at 2018-12-16 17:13:44,196+02 and ended at 2018-12-17 05:13:56,950+02.
But the vdsm logs on the source host start at 2018-12-17 13:01:11,258+02 and on the destination host at 2018-12-17 14:01:01,646+02.

(Originally by Andrej Krejcir)

Comment 5 RHV bug bot 2020-04-01 11:33:54 UTC
The attached vdsm logs are from a different time period, they do not include the migration.

The migration started at 2018-12-16 17:13:44,196+02 and ended at 2018-12-17 05:13:56,950+02.
But the vdsm logs on the source host start at 2018-12-17 13:01:11,258+02 and on the destination host at 2018-12-17 14:01:01,646+02.

(Originally by Andrej Krejcir)

Comment 6 RHV bug bot 2020-04-01 11:33:58 UTC
Re-targeting to 4.3.1 since it is missing a patch, an acked blocker flag, or both

(Originally by Ryan Barry)

Comment 9 RHV bug bot 2020-04-01 11:34:21 UTC
Closing, since this has not been observed in months

(Originally by Ryan Barry)

Comment 12 RHV bug bot 2020-04-01 11:34:35 UTC
Created attachment 1551164 [details]
Source logs right timeframe

(Originally by Ryan Barry)

Comment 13 RHV bug bot 2020-04-01 11:34:42 UTC
Created attachment 1551165 [details]
Destination logs right timeframe

(Originally by Ryan Barry)

Comment 14 RHV bug bot 2020-04-01 11:34:45 UTC
Andrej, Milan, ideas?

I was able to log into the environment to check (and get the right vdsm logs). The VM is up, and the vdsm logs from the destination host show the appropriate JSONRPC event being fired, but it is never received by the engine, so the VM is stuck in migration.

It is actually up. Kicking vdsm to refresh VMs unstuck the engine, but it looks racy.

Polina, how did you reproduce manually?

(Originally by Ryan Barry)

Comment 15 RHV bug bot 2020-04-01 11:34:48 UTC
I reproduced it manually. The VM was stuck in migration for a long period.

(Originally by Polina Agranat)

Comment 16 RHV bug bot 2020-04-01 11:34:55 UTC
actually here is the event note:
Migration completed (VM: golden_env_mixed_virtio_1_0, Source: host_mixed_3, Destination: host_mixed_2, Duration: 6 hours 24 minutes 19 seconds, Total: 6 hours 24 minutes 27 seconds, Actual downtime: 5ms)

(Originally by Polina Agranat)

Comment 17 RHV bug bot 2020-04-01 11:35:04 UTC
and also, I'm reproducing it just now in the same env.
Event ID 62 : Migration started (VM: golden_env_mixed_virtio_1_0, Source: host_mixed_3, Destination: host_mixed_2, User: admin@internal-authz). 

migration is stuck.

(Originally by Polina Agranat)

Comment 18 RHV bug bot 2020-04-01 11:35:07 UTC
I can reproduce the bug. Vdsm doesn't detect the end of the migration on the destination because it waits for a resume event from libvirt on the destination, which never comes when the VM is paused (this may be a libvirt change and the cause of the regression). As a result, Vdsm keeps reporting the Migration Destination status to Engine.

A fix may be a bit delicate, since libvirt apparently doesn't send a specific migration-finished event in such a case. It sends more general events, and we should be careful not to confuse the end of migration with something else.

(Originally by Milan Zamazal)
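[Editor's note] The eventual fix ("virt: Handle incoming migrations of paused VMs", gerrit 106477) has to recognize completion on the destination without waiting for a resume event. A minimal sketch of that decision logic, not the actual vdsm code: the libvirt lifecycle constants are inlined with their upstream values, but exactly which SUSPENDED detail libvirt sends for an incoming paused migration is an assumption of this sketch.

```python
# libvirt lifecycle event constants (values as defined in libvirt's API)
VIR_DOMAIN_EVENT_SUSPENDED = 3
VIR_DOMAIN_EVENT_RESUMED = 4
VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED = 1   # detail for SUSPENDED events
VIR_DOMAIN_EVENT_RESUMED_MIGRATED = 1     # detail for RESUMED events

def incoming_migration_finished(event, detail):
    """Return True if a lifecycle event on the destination host marks
    the end of an incoming migration.

    A running VM resumes on the destination, so libvirt emits
    RESUMED/MIGRATED. A paused VM never resumes (per comment 18 the
    resume event simply doesn't come), so a migration-related SUSPENDED
    event has to be treated as completion instead; which detail libvirt
    actually reports in that case is an assumption here.
    """
    if event == VIR_DOMAIN_EVENT_RESUMED:
        return detail == VIR_DOMAIN_EVENT_RESUMED_MIGRATED
    if event == VIR_DOMAIN_EVENT_SUSPENDED:
        return detail == VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED
    return False
```

The key design point is that only the lifecycle event callback changes; the engine-facing status reporting stays as-is and simply stops saying "Migration Destination" once this predicate fires.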

Comment 19 RHV bug bot 2020-04-01 11:35:10 UTC
Still reproducible?

(Originally by Ryan Barry)

Comment 20 RHV bug bot 2020-04-01 11:35:13 UTC
Created attachment 1623746 [details]
migration of paused VM

Hi Ryan, it is still reproducible on ovirt-engine-4.3.6.6-0.1.el7.noarch

I'm attaching the logs migration_of_paused_vm.tar.gz, where you can see the migration took 12 hours for golden_env_mixed_virtio_5 and then succeeded.
Sometimes with this configuration the migration fails and the VM goes down.

2019-10-08 21:38:11,005+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [92795e81-bcdf-4bd7-849f-18ee7ae5ccdc] EVENT_ID: VM_MIGRATION_START(62), Migration started (VM: golden_env_mixed_virtio_5, Source: host_mixed_1, Destination: host_mixed_3, User: admin@internal-authz).
2019-10-08 21:38:16,019+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-10) [] Handing over VM '838d7e75-125b-40e7-8c54-c99efd71f5b5'(golden_env_mixed_virtio_5) to Host '9fc15ca8-8494-4cb4-bc56-22ad80e341f4'. Setting VM to status 'MigratingTo'


2019-10-09 09:38:12,497+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-4) [] VM '838d7e75-125b-40e7-8c54-c99efd71f5b5'(golden_env_mixed_virtio_5) moved from 'MigratingTo' --> 'Paused'

(Originally by Polina Agranat)
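[Editor's note] The two engine.log lines above bracket the stuck interval; a quick sanity check of the duration with Python's datetime, using timestamps copied from the log excerpt (both carry the same +03 offset, so it cancels out and is dropped here):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"

# Timestamps from the engine.log excerpt above, offsets dropped since
# both are +03 and cancel in the subtraction.
started = datetime.strptime("2019-10-08 21:38:11,005", FMT)
paused = datetime.strptime("2019-10-09 09:38:12,497", FMT)

print(paused - started)  # 12:00:01.492000
```

So the VM sat in MigratingTo for almost exactly 12 hours before the Paused transition was finally noticed.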

Comment 21 RHV bug bot 2020-04-01 11:35:20 UTC
verified on 4.4, http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-23

by running virt/rhevmtests.compute.virt.reg_vms.vm_power_actions_test.TestSuspendVM.test_migrate_suspend_vm

https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/rhv-4.4-ge-runner-compute/116/console

In 4.3 the bug still happens.

(Originally by Polina Agranat)

Comment 30 Polina 2020-04-30 16:48:50 UTC
Created attachment 1683361 [details]
logs

Comment 31 Polina 2020-04-30 16:49:25 UTC
verification on http://bob-dr.lab.eng.brq.redhat.com/builds/4.3/rhv-4.3.10-2

The first migration passes; the second completes with a QEMU crash.

Event in engine: Migration failed: Lost connection with qemu process (VM: pause_vm, Source: host_mixed_3).

In the attached engine.log:

2020-04-30 19:31:18,515+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [] EVENT_ID: VM_MIGRATION_ABORT(64), Migration failed: Lost connection with qemu process (VM: pause_vm, Source: host_mixed_3).

Comment 32 Milan Zamazal 2020-05-04 13:26:37 UTC
I can reproduce it on both 4.3 and 4.4; QEMU stops on the second migration:

  qemu-kvm: Failed to load virtio_pci/modern_queue_state:avail
  qemu-kvm: Failed to load virtio_pci/modern_state:vqs
  qemu-kvm: Failed to load virtio/extra_state:extra_state
  qemu-kvm: Failed to load virtio-rng:virtio
  qemu-kvm: error while loading state for instance 0x0 of device '0000:00:01.2:00.0/virtio-rng'
  qemu-kvm: load of migration failed: Input/output error
  shutting down, reason=crashed

I can also reproduce it using virsh with a reasonably minimal domain XML; I'll file a QEMU bug.

Comment 39 Milan Zamazal 2020-06-29 14:04:47 UTC
To summarize the current state:

- The problem with paused VMs not being able to migrate at all was already fixed a couple of months ago.

- The remaining problem is that a VM that was started paused and never switched to running can't be migrated more than once (until it is switched to running). This is a QEMU bug, and it is not easy to fix. Given that initially paused VMs that have never been switched to running are usually not migrated multiple times, and that an easy workaround is to simply power the VM off and start it again, this is not an urgent issue.

Should we split the bug to handle the two issues separately or should we keep it as is (and retarget it etc. accordingly)?

Comment 40 Milan Zamazal 2020-10-12 10:46:39 UTC
(In reply to Milan Zamazal from comment #39)
> To summarize the current state:
> 
> - The problem with paused VMs not being able to migrate at all has already
> been fixed a couple month ago.
> 
> - The remaining problem is that a VM that has been started as paused and
> never switched to running can't be migrated more than once (until it's
> switched to running). This is a QEMU bug, which is not very easy to fix.
> Given that initially paused VMs that have been never switched to running are
> usually not migrated multiple times and that an easy workaround for such a
> situation is to simply power off the VM and start it again, this is not an
> urgent issue.
> 
> Should we split the bug to handle the two issues separately or should we
> keep it as is (and retarget it etc. accordingly)?

I moved the double migration QEMU problem to a separate bug, https://bugzilla.redhat.com/1887391, to not block the general fix of paused VM migrations.

Comment 41 Polina 2020-10-14 09:01:40 UTC
verified on 
ovirt-engine-4.3.11.4-0.1.el7.noarch
vdsm-4.30.50-1.el7ev.x86_64

The first migration of a paused VM works. The second migration fails, but we have a separate bug for that, https://bugzilla.redhat.com/1887391. So, I can verify this.

Comment 42 Michal Skrivanek 2020-11-26 12:09:49 UTC
This shipped in https://access.redhat.com/errata/RHBA-2020:5217

Comment 43 Lukas Svaty 2020-11-26 12:17:39 UTC
Needs to be verified in a 4.4 build. Reopening.

Comment 44 Michal Skrivanek 2020-11-26 13:43:58 UTC
eh, TM mess.

This has been tested in comment #41 and shipped in 4.3.11 via https://access.redhat.com/errata/RHBA-2020:4113

Comment 45 Arik 2020-11-26 13:48:22 UTC
(In reply to Lukas Svaty from comment #43)
> Needs to be verified in 4.4. build. reopening

And bz 1660071 was verified on 4.4

