Bug 1688159 - Migration aborted, probably due to stalling
Summary: Migration aborted, probably due to stalling
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.2.7
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.4.1
Target Release: 4.3.1
Assignee: Milan Zamazal
QA Contact: Polina
URL:
Whiteboard:
Depends On:
Blocks: 1711792
Reported: 2019-03-13 10:05 UTC by Nils Koenig
Modified: 2023-03-24 14:37 UTC (History)
CC List: 18 users

Fixed In Version: vdsm-4.30.15
Doc Type: Bug Fix
Doc Text:
Previously, when a virtual machine migration entered post-copy mode and remained in that mode for a long time, the migration sometimes failed and the migrated virtual machine was powered off. In this release, post-copy migrations are maintained to completion.
Clone Of:
: 1711792 (view as bug list)
Environment:
Last Closed: 2020-08-04 13:26:25 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-


Attachments
- Log, VM config and screenshots (1.21 MB, application/zip), 2019-03-13 10:05 UTC, Nils Koenig
- Screenshot of Error Messages (93.53 KB, image/png), 2019-03-18 10:46 UTC, Nils Koenig
- vdsm.log during migration (1.25 MB, application/zip), 2019-03-18 10:47 UTC, Nils Koenig
- lu0540 vdsm.log (451.45 KB, application/x-xz), 2019-03-18 16:30 UTC, Nils Koenig
- ovirt engine log (1.57 MB, application/gzip), 2019-03-18 16:30 UTC, Nils Koenig
- lu0540 log files with debug logging on during Post-copy migration failure (3.41 MB, application/x-xz), 2019-03-20 11:14 UTC, Nils Koenig
- lu0541 log files with debug logging on during Post-copy migration failure (11.37 MB, application/x-xz), 2019-03-20 11:14 UTC, Nils Koenig
- rhvm engine.log during Post-copy migration failure (281.20 KB, application/x-xz), 2019-03-20 11:15 UTC, Nils Koenig
- Debug log files during migrating with "Suspend Workload" policy (14.74 MB, application/x-xz), 2019-03-21 15:31 UTC, Nils Koenig


Links
- Red Hat Knowledge Base (Solution) 4028881: Why VM gets crashed in the post-copy phase during live migration (last updated 2019-04-02 14:33:02 UTC)
- Red Hat Product Errata RHEA-2020:3246 (last updated 2020-08-04 13:26:51 UTC)
- oVirt gerrit 98792, MERGED: virt: Use libvirt memory iteration info in migrations (last updated 2021-01-09 07:28:33 UTC)
- oVirt gerrit 98793, MERGED: virt: Reset post-copy flag at the beginning of a migration (last updated 2021-01-09 07:27:58 UTC)
- oVirt gerrit 98794, MERGED: virt: Don't finish incoming migration when switching to post-copy (last updated 2021-01-09 07:27:58 UTC)
- oVirt gerrit 98795, MERGED: virt: Detect incoming post-copy migration in recovery (last updated 2021-01-09 07:28:33 UTC)
- oVirt gerrit 99098, MERGED: virt: Use libvirt memory iteration info in migrations (last updated 2021-01-09 07:27:56 UTC)
- oVirt gerrit 99099, MERGED: virt: Reset post-copy flag at the beginning of a migration (last updated 2021-01-09 07:27:53 UTC)
- oVirt gerrit 99100, MERGED: virt: Don't finish incoming migration when switching to post-copy (last updated 2021-01-09 07:27:55 UTC)
- oVirt gerrit 99101, MERGED: virt: Detect incoming post-copy migration in recovery (last updated 2021-01-09 07:27:58 UTC)

Description Nils Koenig 2019-03-13 10:05:20 UTC
Created attachment 1543575 [details]
Log, VM config and screenshots

Description of problem:
Migrating a VM under load failed a few times.

How reproducible:
The test scenario is migrating a VM running a SAP HANA PBO workload (a memory- and CPU-intensive load). Out of twelve migrations, two failed and succeeded after retrying.
The test scenario explicitly requires migrating the VM under load.

Log shows: 

2019-03-12 18:42:32,448+0100 WARN  (migmon/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') Migration stalling: remaining (32143MiB) > lowmark (0MiB). (migration:837)
2019-03-12 18:42:32,448+0100 WARN  (migmon/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') Aborting migration (migration:902)

Does the abort happen because the memory change rate is too high?
Migration network is 10G. 

Could adjusting migration_max_time_per_gib_mem or migration_progress_timeout help? Please advise.
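For reference, if tuning these knobs turns out to be needed, they would normally be adjusted on the hosts in /etc/vdsm/vdsm.conf and vdsmd restarted afterwards. A minimal sketch, assuming the options live in the [vars] section; the values below are examples only, not recommendations:

[vars]
# Seconds without migration progress before the migration is aborted.
migration_progress_timeout = 300
# Maximum allowed migration time, in seconds per GiB of guest memory.
migration_max_time_per_gib_mem = 128

Choosing a suitable migration policy in Engine (see the following comments) is usually preferable to per-host tuning like this.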

Comment 1 Nils Koenig 2019-03-13 12:19:22 UTC
Apparently the way to handle this would be migration profiles (https://ovirt.org/develop/release-management/features/virt/migration-enhancements.html). Is there a way to disable or modify the stall timeout?

Comment 2 Ryan Barry 2019-03-14 00:50:02 UTC
It's very likely to be a failure to converge due to change rate. Have you tried postcopy migrations?

Comment 3 Nils Koenig 2019-03-14 08:34:07 UTC
@Ryan Barry

I haven't been able to change the migration policy successfully.
The documentation I found does not explain how to do so: https://access.redhat.com/solutions/3143541

Here is what I tried using the ovirt python API:

# change policy
suspend_workload_policy=types.MigrationPolicy()
suspend_workload_policy.id="80554327-0569-496b-bdeb-fcbbf52b827c"
vm_to_migrate.migration.policy=suspend_workload_policy

But it seems to have no effect, from vdsm.log:

2019-03-13 20:08:17,658+0100 INFO  (jsonrpc/2) [api.virt] START migrate(params={u'incomingLimit': 2, u'src': u'lu0540.wdf.sap.corp', u'dstqemu': u'10.76.34.34', u'autoConverge': u'true', u'tunneled': u'false', u'enableGuestEvents': True, u'dst': u'lu0541.wdf.sap.corp:54321', u'convergenceSchedule': {u'init': [{u'params': [u'100'], u'name': u'setDowntime'}], u'stalling': [{u'action': {u'params': [u'150'], u'name': u'setDowntime'}, u'limit': 1}, {u'action': {u'params': [u'200'], u'name': u'setDowntime'}, u'limit': 2}, {u'action': {u'params': [u'300'], u'name': u'setDowntime'}, u'limit': 3}, {u'action': {u'params': [u'400'], u'name': u'setDowntime'}, u'limit': 4}, {u'action': {u'params': [u'500'], u'name': u'setDowntime'}, u'limit': 6}, {u'action': {u'params': [], u'name': u'abort'}, u'limit': -1}]}, u'vmId': u'8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7', u'abortOnError': u'true', u'outgoingLimit': 2, u'compressed': u'false', u'maxBandwidth': 5000, u'method': u'online'}) from=::ffff:10.20.90.165,59716, flow_id=3c9b1a2c-9dc7-41e8-b8b9-862b55d4b71a, vmId=8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7 (api:46)

Where can I find guidance on how to change the migration policy?
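For reference, a likely reason the snippet above has no effect is that it only modifies the VM object in local memory; the change has to be sent back to Engine with an update call, and only then does the next migration use the new policy. A minimal sketch with the oVirt Python SDK v4 (connection details are placeholders; the policy UUID is the one from the snippet above):

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=lu0569')[0]
vm_service = vms_service.vm_service(vm.id)

# Persist the migration policy on the VM; without this update() call the
# policy assigned to the local object is never sent to Engine.
vm_service.update(
    types.Vm(
        migration=types.MigrationOptions(
            policy=types.MigrationPolicy(
                id='80554327-0569-496b-bdeb-fcbbf52b827c',
            ),
        ),
    ),
)
connection.close()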

Comment 4 Ryan Barry 2019-03-14 22:00:02 UTC
Hi Nils -

If you change:

stalling:
  - action
    - setDowntime...

To 'postcopy', you can give it a try. It's been a tech preview since 4.1, but is stable. There is a risk if the network goes down during the migration, which is why it's still tech preview, but it will almost always converge.
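For illustration, based on the convergenceSchedule structure visible in the vdsm log in Comment 3, the post-copy variant would end the stalling list with a 'postcopy' action instead of going straight to 'abort'. This is a sketch of the expected shape only; Engine builds the structure itself when the "Post-copy migration" policy is selected:

# Sketch of a convergence schedule whose last resort is switching to
# post-copy rather than aborting; shape inferred from the log above.
convergence_schedule = {
    'init': [{'name': 'setDowntime', 'params': ['100']}],
    'stalling': [
        {'limit': 1,  'action': {'name': 'setDowntime', 'params': ['150']}},
        {'limit': 2,  'action': {'name': 'setDowntime', 'params': ['200']}},
        {'limit': -1, 'action': {'name': 'postcopy', 'params': []}},
    ],
}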

Comment 5 Nils Koenig 2019-03-18 10:46:22 UTC
Created attachment 1545183 [details]
Screenshot of Error Messages

Error messages shown in the VM event log for migration policies "Suspend workload if needed" and "Post-copy".

Comment 6 Nils Koenig 2019-03-18 10:47:21 UTC
Created attachment 1545184 [details]
vdsm.log during migration

Comment 7 Nils Koenig 2019-03-18 10:47:34 UTC
I've tried with "Suspend workload if needed" and "Post-copy migration" and both fail.

With "Suspend workload if needed" the VM has been migrated a couple of times back and forth but it fails with the following error:

2019-03-14 18:29:14,814+0100 INFO  (jsonrpc/1) [api.virt] START migrate(params={u'incomingLimit': 1, u'src': u'lu0541.wdf.sap.corp', u'dstqemu': u'10.76.34.32', u'autoConverge': u'true', u'tunneled': u'false', u'enableGuestEvents': True, u'dst': u'lu0540.wdf.sap.corp:54321', u'convergenceSchedule': {u'init': [{u'params': [u'100'], u'name': u'setDowntime'}], u'stalling': [{u'action': {u'params': [u'150'], u'name': u'setDowntime'}, u'limit': 1}, {u'action': {u'params': [u'200'], u'name': u'setDowntime'}, u'limit': 2}, {u'action': {u'params': [u'300'], u'name': u'setDowntime'}, u'limit': 3}, {u'action': {u'params': [u'400'], u'name': u'setDowntime'}, u'limit': 4}, {u'action': {u'params': [u'500'], u'name': u'setDowntime'}, u'limit': 6}, {u'action': {u'params': [u'5000'], u'name': u'setDowntime'}, u'limit': -1}, {u'action': {u'params': [], u'name': u'abort'}, u'limit': -1}]}, u'vmId': u'8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7', u'abortOnError': u'true', u'outgoingLimit': 1, u'compressed': u'true', u'maxBandwidth': 5000, u'method': u'online'}) from=::ffff:10.20.90.165,52794, flow_id=9b2b8655-3ccd-4d7a-994b-7019e6bf0f33, vmId=8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7 (api:46)
2019-03-14 18:29:14,815+0100 INFO  (jsonrpc/1) [api.virt] FINISH migrate return={'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0} from=::ffff:10.20.90.165,52794, flow_id=9b2b8655-3ccd-4d7a-994b-7019e6bf0f33, vmId=8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7 (api:52)
2019-03-14 18:29:14,815+0100 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call VM.migrate succeeded in 0.00 seconds (__init__:573)
2019-03-14 18:29:14,892+0100 INFO  (migsrc/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') Creation of destination VM took: 0 seconds (migration:473)

2019-03-14 18:29:14,893+0100 ERROR (migsrc/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') migration destination error: Virtual machine already exists (migration:290)


Using "Post-copy migration":
Migration seems to fail at 19:05:03, according to the VM event log (see screenshot).
In vdsm.log (attached), the migration appears to keep running, e.g.:

2019-03-14 20:00:51,761+0100 INFO  (migmon/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') Migration Progress: 4412 seconds elapsed, 69% of data processed, total data: 524360MB, processed data: 392508MB, remaining data: 164858MB, transfer speed 799MBps, zero pages: 81266950MB, compressed: 0MB, dirty rate: 791924, memory iteration: 5 (migration:867)

but it does not finish or advance.

Comment 8 Ryan Barry 2019-03-18 11:18:53 UTC
Milan, any ideas here?

Comment 9 Milan Zamazal 2019-03-18 15:44:53 UTC
We may be dealing with multiple issues here.

Regarding "Suspend workload if needed" from Comment 7, the migration is attempted several seconds after the VM has been migrated in the opposite direction. The message "Virtual machine already exists" most likely means that VM cleanup hasn't been finished on the original host yet. I think Engine shouldn't initiate migration in such a case, but it's a non-fatal problem, it should start working after a short while.

Regarding "Post-copy migration", the stalling migration switches to post-copy, the VM gets suspended as expected and remaining data is being copied for a while. But then the progress gets stuck and Engine reports the given error at the same time. Apparently something happened on the destination or in Engine.

Nils, could you please provide the corresponding Vdsm log from the other host + the corresponding Engine log so that we can look further into what happened?

Comment 10 Nils Koenig 2019-03-18 16:28:44 UTC
I agree, those are two issues.

The qemu process is indeed still running:

[root@lu0541 ~]# ps aux | grep lu0569
qemu     184271 54.7  0.0 539747592 248600 ?    Sl   Mar14 3119:14 /usr/libexec/qemu-kvm -name guest=lu0569,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-8-lu0569/master-key.aes -machine pc-i440fx-rhel7.5.0,accel=kvm,usb=off,dump-guest-core=off -cpu host,rdtscp=on,invtsc=on,tsc-frequency=2494140000,host-cache-info=on,l3-cache=off -m 524288 -realtime mlock=off -smp 216,sockets=4,cores=27,threads=2 -object iothread,id=iothread1 -mem-prealloc -mem-path /dev/hugepages/libvirt/qemu/8-lu0569 -numa node,nodeid=0,cpus=0-215,mem=524288 -uuid 8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.6-4.el7,serial=4c4c4544-0033-3510-8050-c7c04f595232,uuid=8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7 -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=38,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device virtio-scsi-pci,iothread=iothread1,id=ua-ee65335b-0773-4900-8277-e4f4a551c13f,bus=pci.0,addr=0x3 -device virtio-serial-pci,id=ua-3000a4a7-c10c-4e95-90a5-adc3e597e543,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,id=drive-ua-bf8317f4-db0f-4f75-909f-ae288ec106ba,werror=report,rerror=report,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ua-bf8317f4-db0f-4f75-909f-ae288ec106ba,id=ua-bf8317f4-db0f-4f75-909f-ae288ec106ba -drive file=/rhev/data-center/mnt/blockSD/944da68a-ae00-4e82-9273-84456e3fd17c/images/c3931013-1c03-4dd1-981e-99032844b934/4b9fae5b-127e-4438-9f29-13bf52f7d9b7,format=raw,if=none,id=drive-ua-c3931013-1c03-4dd1-981e-99032844b934,serial=c3931013-1c03-4dd1-981e-99032844b934,werror=stop,rerror=stop,cache=none,aio=native -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x6,drive=drive-ua-c3931013-1c03-4dd1-981e-99032844b934,id=ua-c3931013-1c03-4dd1-981e-99032844b934,bootindex=1,write-cache=on -drive file=/dev/mapper/36000d31005771c000000000000000007,format=raw,if=none,id=drive-ua-4a475e31-b19d-4d18-a65a-4f8e306c06d6,werror=stop,rerror=stop,cache=none,aio=native -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-4a475e31-b19d-4d18-a65a-4f8e306c06d6,id=ua-4a475e31-b19d-4d18-a65a-4f8e306c06d6,write-cache=on -drive file=/dev/mapper/36000d31005771c000000000000000006,format=raw,if=none,id=drive-ua-49029933-f73c-41ea-b1d6-de4c09dedc7a,werror=stop,rerror=stop,cache=none,aio=native -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x8,drive=drive-ua-49029933-f73c-41ea-b1d6-de4c09dedc7a,id=ua-49029933-f73c-41ea-b1d6-de4c09dedc7a,write-cache=on -drive file=/dev/mapper/36000d31005771c000000000000000017,format=raw,if=none,id=drive-ua-f5d29ab8-7816-4223-89db-0bf831a4d032,werror=stop,rerror=stop,cache=none,aio=native -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x9,drive=drive-ua-f5d29ab8-7816-4223-89db-0bf831a4d032,id=ua-f5d29ab8-7816-4223-89db-0bf831a4d032,write-cache=on -drive file=/dev/shm/vhostmd0,format=raw,if=none,id=drive-virtio-disk701,readonly=on -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk701,id=virtio-disk701 -netdev tap,fds=39:40:41:36,id=hostua-75c0c780-2642-47ff-b70f-439315b1e493,vhost=on,vhostfds=42:43:44:45 -device virtio-net-pci,mq=on,vectors=10,host_mtu=1500,netdev=hostua-75c0c780-2642-47ff-b70f-439315b1e493,id=ua-75c0c780-2642-47ff-b70f-439315b1e493,mac=00:1a:4a:16:01:01,bus=pci.0,addr=0xc -chardev socket,id=charchannel0,fd=46,server,nowait 
-device virtserialport,bus=ua-3000a4a7-c10c-4e95-90a5-adc3e597e543.0,nr=1,chardev=charchannel0,id=channel0,name=ovirt-guest-agent.0 -chardev socket,id=charchannel1,fd=47,server,nowait -device virtserialport,bus=ua-3000a4a7-c10c-4e95-90a5-adc3e597e543.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -vnc 10.76.34.34:0,password -k de -device qxl-vga,id=ua-a1893d17-db06-4250-b46e-b5ab0e0f91a8,ram_size=67108864,vram_size=8388608,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -incoming defer -object rng-random,id=objua-9adbf954-a5b0-4533-9458-8edec55820a4,filename=/dev/urandom -device virtio-rng-pci,rng=objua-9adbf954-a5b0-4533-9458-8edec55820a4,id=ua-9adbf954-a5b0-4533-9458-8edec55820a4,bus=pci.0,addr=0xa -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on


But it can't be reached via network or console; it appears offline in RHVM.

Please find the requested logs attached.

Comment 11 Nils Koenig 2019-03-18 16:30:07 UTC
Created attachment 1545379 [details]
lu0540 vdsm.log

Comment 12 Nils Koenig 2019-03-18 16:30:36 UTC
Created attachment 1545380 [details]
ovirt engine log

Comment 13 Milan Zamazal 2019-03-19 12:06:25 UTC
Thank you Nils for the logs. The problem with post-copy is this:

2019-03-14 19:04:19,249+0100 ERROR (vm/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') The vm start process failed (vm:949)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 893, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4162, in _completeIncomingMigration
    self._attachLibvirtDomainAfterMigration(finished, timeout)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4227, in _attachLibvirtDomainAfterMigration
    self._sync_metadata()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5253, in _sync_metadata
    self._md_desc.dump(self._dom)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/metadata.py", line 514, in dump
    0)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 106, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
2019-03-14 19:04:19,249+0100 INFO  (vm/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') Changed state to Down: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params) (code=1) (vm:1689)

When the VM is started on the destination after switching to post-copy and while still migrating, Vdsm attempts to save its metadata, which fails with the error above in libvirt. No idea why. I can't see any notice in the libvirt documentation that writing metadata can't be performed while the post-copy migration phase is still running. So it looks like a libvirt bug.

Nils, is the post-copy problem reproducible? Could you please try to reproduce it with libvirt debug log enabled? And could you please also check the state of the VM on the destination when it happens, e.g. with `sudo virsh -r list --all`?

Comment 14 Milan Zamazal 2019-03-19 12:10:47 UTC
(In reply to Milan Zamazal from comment #13)

> Nils, is the post-copy problem reproducible? Could you please try to
> reproduce it with libvirt debug log enabled?

I mean with libvirt debug log enabled on the destination. And please enable also Vdsm debug log on both the ends.
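For reference, a common way to enable libvirt debug logging on the hosts is via /etc/libvirt/libvirtd.conf, followed by a libvirtd restart. A minimal sketch; the filter list is an example and can be adjusted:

# /etc/libvirt/libvirtd.conf -- restart libvirtd afterwards
log_filters="1:qemu 1:libvirt 4:object 4:json 4:event 1:util"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"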

Comment 15 Milan Zamazal 2019-03-19 12:28:29 UTC
Another problem is with handling the VM after the failure in post-copy. The error is caught by a generic exception handler in Vm._startUnderlyingVm, which sets the VM to DOWN status. I can't see what the actual state of the VM is, but since it remains paused without any progress on the source, I assume it's no longer running on the destination (it's either paused or down). Blindly assuming that a VM is in DOWN status after it failed to start may not be a good idea; the VM may already be running.

Destroy attempts follow, but libvirt apparently fails to destroy the VM with VIR_ERR_SYSTEM_ERROR. That means the VM remains, from the RHV point of view, in DOWN status on the destination and in PAUSED status on the source forever. I don't think we can do much about that, since libvirt doesn't cooperate.

There are also reports in vdsm.log about stuck executor workers, which additionally indicates unhealthy libvirt state.

Comment 16 Milan Zamazal 2019-03-19 12:59:41 UTC
Regarding the problem that Engine permits migration to the original host while the VM remnant is still there: Normally (at least in 4.3), Engine calls Destroy on the migration source immediately after the migration is completed. I can see that in the reported case there is no Destroy call, even though the VM_MIGRATION_DONE event has been received. I don't have any idea why.

Comment 17 Nils Koenig 2019-03-20 11:14:16 UTC
Created attachment 1546015 [details]
lu0540 log files with debug logging on during Post-copy migration failure

Comment 18 Nils Koenig 2019-03-20 11:14:55 UTC
Created attachment 1546018 [details]
lu0541 Log files with debug logging on during Post-copy migration failure

Comment 19 Nils Koenig 2019-03-20 11:15:43 UTC
Created attachment 1546019 [details]
rhvm engine.log during Post-copy migration failure

Comment 20 Nils Koenig 2019-03-20 11:19:32 UTC
Please find attached the new log files with debug logging turned on.
The first 4 post-copy migrations were successful; then it failed with the error:

Mar 20, 2019, 11:49:09 AM VM lu0569 is down with error. Exit message: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params).

Comment 21 Milan Zamazal 2019-03-20 17:10:30 UTC
It seems the problem with post-copy is as follows: When a migration switches to post-copy and the VM gets UP on the destination, the libvirt migration job is still running. While the migration job is running, other common libvirt jobs, such as writing metadata or some VM info retrievals, are blocked. libvirt doesn't reject those jobs immediately but waits for the running job to finish, with a 30-second timeout. If the post-copy phase takes more than 30 seconds, the jobs time out and the erroneous situation described in Comment 13 and Comment 15 happens. The problem has apparently been hidden so far because there were no migrations with a long post-copy phase.

I'm going to check the hypothesis on my setup. If I can confirm it, it should be fixed in Vdsm. The solution would be to postpone writing metadata and starting periodic operations on the VM until the migration completely finishes.
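To illustrate the idea (this is not the actual Vdsm patch, just a minimal hedged sketch with libvirt-python), the destination side can check libvirt's job state and postpone metadata writes while a migration job, including its post-copy phase, is still active:

import time
import libvirt

def wait_for_migration_job(dom, poll_interval=2):
    # Block until no libvirt job (e.g. the incoming migration, including
    # its post-copy phase) is active on the domain anymore.
    while dom.jobInfo()[0] != libvirt.VIR_DOMAIN_JOB_NONE:
        time.sleep(poll_interval)

# Usage sketch: write metadata only once the job is gone, so the write
# cannot run into the 30-second job-wait timeout described above.
# conn = libvirt.open('qemu:///system')
# dom = conn.lookupByName('lu0569')
# wait_for_migration_job(dom)
# ... sync metadata / start periodic operations here ...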

Comment 22 Nils Koenig 2019-03-21 15:31:50 UTC
Created attachment 1546590 [details]
Debug log files during migrating with "Suspend Workload" policy

Mar 21, 2019, 3:28:19 PM Migration failed  (VM: lu0569, Source: lu0540, Destination: lu0541).

Comment 23 Nils Koenig 2019-03-21 15:38:14 UTC
Thank you for investigating the issue, Milan.

As stated in the beginning, there is also a bug when migrating with the "Suspend Workload" policy. Please find the logs attached. Note that there are quite a lot of successful migrations before the migration fails:

Mar 21, 2019, 3:28:19 PM Migration failed  (VM: lu0569, Source: lu0540, Destination: lu0541).

The VM is under heavy CPU and memory load during migration.

Comment 24 Michal Skrivanek 2019-03-21 18:00:11 UTC
Nils, that Suspend case, was the migration back to the original host triggered manually? Was it intentional or what was the reason?

Comment 25 Milan Zamazal 2019-03-22 12:32:34 UTC
After playing with the post-copy scenario on my setup, I've found several problems:

- I reproduced the problem from Comment 21 with libvirt unable to perform metadata sync etc. on the destination while the migration is still in post-copy. It can be addressed in Vdsm; a patch is in preparation.

- If Vdsm is restarted on the destination while a post-copy migration is running there, it doesn't detect that the VM is migrating and in post-copy. It should be easy to fix; I'll make a patch.

- libvirt doesn't send the VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY event on the source when switching to post-copy. That means Vdsm doesn't detect that it's in a post-copy phase and doesn't announce the post-copy phase to Engine. I think it's a regression in libvirt; I'm going to send a bug report on libvirt.

- When a post-copy migration fails due to the problem from Comment 21 and the VM gets killed on the destination, the VM gets resumed (no, not restarted) on the source host. (This doesn't happen in Nils's case, since the VM can't be destroyed there due to some busy device.) That looks wrong and like a libvirt/QEMU regression. I'll ask the libvirt guys about it.

- In case the VM gets resumed on the source after attempting to switch to post-copy, it runs happily there and can be migrated again. However, Vdsm hasn't cleared its post-copy flag, so it thinks the VM has been in post-copy since the beginning of the migration and doesn't apply the selected migration policy. This is trivial to fix; a patch is in preparation.

- Memory iteration detection in Vdsm may not be 100% reliable. Since we don't have to support old libvirt versions anymore, we can get info about memory iterations from libvirt and abandon the guesswork (see the sketch at the end of this comment). This is not a regression, but since we are already at it, it should be improved; a patch is in preparation.

(Nils, I'll look at the "Suspend Workload" problems once I handle the post-copy problems, they are too many at the moment.)
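For illustration of the last point, libvirt exposes the iteration counter directly in the migration job statistics, so the information can be read from there instead of being guessed from progress data. A minimal hedged sketch with libvirt-python (field availability depends on the libvirt version):

import libvirt

def current_memory_iteration(dom):
    # Return the memory iteration reported by libvirt for the running
    # migration job, or None if the field is not available.
    stats = dom.jobStats()  # virDomainGetJobStats()
    return stats.get('memory_iteration')

# Usage sketch:
# conn = libvirt.open('qemu:///system')
# dom = conn.lookupByName('lu0569')
# print(current_memory_iteration(dom))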

Comment 26 Milan Zamazal 2019-03-25 15:29:26 UTC
(In reply to Milan Zamazal from comment #25)

> - libvirt doesn't send VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY event on the
> source when switching to post-copy. That means Vdsm doesn't detect that it's
> in a post-copy phase and doesn't announce the post-copy phase to Engine. I
> think it's a regression in libvirt, I'm going to send a bug report on
> libvirt.
> 
> - When a post-copy migration fails due to the problem from Comment 21 and
> the VM gets killed on the destination, the VM gets resumed (no, not
> restarted) on the source host. (This doesn't happen in Nils's case, since
> the VM can't be destroyed there due to some busy device.) That looks wrong
> and as a libvirt/QEMU regression. I'll ask libvirt guys about it.

These libvirt problems have already been fixed in one of the recent libvirt downstream updates: https://bugzilla.redhat.com/1647365

So only the Vdsm post-copy fixes remain; patches for all of them are posted.

Comment 27 Nils Koenig 2019-03-25 17:02:03 UTC
(In reply to Michal Skrivanek from comment #24)
> Nils, that Suspend case, was the migration back to the original host
> triggered manually? Was it intentional or what was the reason?

The migration is triggered intentionally. For the SAP certification I am currently doing, it is a requirement to migrate the VM 10+ times back and forth while running a high-workload benchmark.
I use the Python API to initiate the migration by calling:

vm_to_migrate_service = vms_service.vm_service(vm_to_migrate.id)
vm_to_migrate_service.migrate(force=True, host=destination)

Comment 28 Nils Koenig 2019-03-26 14:01:08 UTC
> 
> So only Vdsm post-copy fixes remain, patches for all of them are posted.
>

Milan, does that mean all patches are already available?

Comment 29 Milan Zamazal 2019-03-26 16:48:46 UTC
All patches for the post-copy problems I discovered are posted.

I looked into the logs you posted in Comment 22. The migration failed there because it couldn't converge. The migration couldn't succeed despite increasing migration downtimes, and the convergence schedule eventually canceled it. That can happen with busy VMs. And since the Vdsm log looks fine to me, I think there is no bug there.

Comment 30 Nils Koenig 2019-04-01 14:43:48 UTC
As far as I can see, the target release for these patches is 4.3.
May I ask you to backport this to 4.2? In case of objections, please contact Arne Arnold (aarnold).

Comment 31 Nils Koenig 2019-04-02 07:55:19 UTC
(In reply to Milan Zamazal from comment #29)
> All patches for the post-copy problems I discovered are posted.
> 
> I looked into the logs you posted in Comment 22. The migration failed there
> because it couldn't converge. The migration couldn't succeeded despite
> increasing migration downtimes and the convergence schedule has eventually
> canceled it. That can happen with busy VMs. And since the Vdsm log looks
> fine to me, I think there is no bug there.

Is there a way to increase the timeout for canceling the migration?

Would this also happen when migrating a busy VM with post-copy, i.e. could a timeout cancel the migration?

Please see also my Comment #30 regarding a backport.

Comment 32 Milan Zamazal 2019-04-02 09:58:40 UTC
(In reply to Nils Koenig from comment #31)

> Is there a way to increase the timeout for canceling the migration?

It's possible to tweak the migration schedule in the database, but I don't know whether it is supported.

To be precise, the migration wasn't canceled due to a timeout. Several iterations had been attempted, with increasing downtime, and since the last attempt with 5 seconds downtime had also failed, the migration was aborted.

Maybe modifying "Suspended workload" schedule, such as permitting one or two more attempts with 5 seconds downtime to work around temporary spikes, would help. I'm not sure whether we would like that or whether only post-copy is recommended in such a case. It may be also worth to consider whether network capacity is sufficient if it can't transfer enough data even with the quite long downtime.

> Would this also happen when migrating a busy VM with post-copy,
> that a timeout will cancel the migration?

No. Once migration enters post-copy phase, it can't be canceled and it is guaranteed to converge (with the fixes posted here).

(In theory, the 6 hour general migration timeout could be hit before entering post-copy, but it shouldn't happen with any reasonable hardware setup.)

> Please see also my Comment #30 regarding a backport.

I'm aware of it, but I'm not someone who can decide it. Ryan, Michal, Martin?

Comment 33 Polina 2019-05-07 13:21:51 UTC
The problem was reproduced on vdsm-4.30.12-2.git3ab3415.el7.x86_64 and the fix is verified on vdsm-4.40.0-208.git002018c.el7.x86_64.
Steps to reproduce:
1. On the two hosts, allocate about 50-60% of memory and consume about 50% of CPU.
2. Configure the VM with the post-copy migration policy (I also configured the cluster with this policy).
3. Run the VM.
4. Slow down the network on one of the hosts with the tc command:
tc qdisc add dev ovirtmgmt root netem delay 300ms
   To remove the rule:
tc qdisc del dev ovirtmgmt root

5. On the VM, allocate memory with Milan's script:

import sys

# Amount of memory to keep dirty, in MiB, taken from the command line.
mb = int(sys.argv[1])
# Each list slot holds an 8-byte reference, so mb * 1024 * 128 slots
# occupy roughly mb MiB.
size = mb * 1024 * 128
l = [0] * size

# Keep rewriting elements with a large stride so that memory pages are
# dirtied continuously during the migration.
i = 0
while True:
    l[i] = (l[i] + 1) % 1024
    i = (i + 4095) % size

6. Migrate the VM

Result: the migration is very slow and enters post-copy mode.
Seen in vdsm.log:
2019-05-07 13:49:07,466+0300 INFO  (libvirt/events) [virt.vm] (vmId='1eec678f-5d79-4bf9-b563-71ad57fad61c') Migration switched to post-copy mode (vm:5559)
No failure.

Before the fix we faced "VM golden_env_mixed_virtio_0 is down with error. Exit message: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)."

Comment 34 RHV bug bot 2019-05-16 15:29:33 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.3.z': '?'}', ]

For more info please contact: rhv-devops

Comment 37 Robert McSwain 2019-08-26 13:21:11 UTC
Out of curiosity for a customer who's experiencing this issue, is there any current workaround while we await a final release of this patch? Thanks!

Comment 38 Polina 2019-08-26 18:14:22 UTC
The issue was tested and verified on engine-4.3.4 (https://bugzilla.redhat.com/show_bug.cgi?id=1711792#c35) and also on upstream 4.4.0 (the upstream_verified flag is set).
Ryan, could you please answer about the final release? Is there something that must be completed from the QE side?

Comment 39 Ryan Barry 2019-08-26 18:16:20 UTC
No, nothing. Robert, see the linked bug. This shipped in 4.3.4

Comment 40 Daniel Gur 2019-08-28 13:13:48 UTC
sync2jira

Comment 41 Daniel Gur 2019-08-28 13:18:01 UTC
sync2jira

Comment 42 RHV bug bot 2019-10-22 17:25:39 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 43 RHV bug bot 2019-10-22 17:39:24 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 44 RHV bug bot 2019-10-22 17:46:37 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 45 RHV bug bot 2019-10-22 18:02:26 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 46 RHV bug bot 2019-11-19 11:52:46 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 47 RHV bug bot 2019-11-19 12:02:49 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 50 RedHat Israel QE 2019-12-13 11:31:51 UTC
QE verification bot: the bug was verified upstream

Comment 51 RHV bug bot 2019-12-13 13:17:12 UTC
WARN: Bug status (VERIFIED) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 52 RHV bug bot 2019-12-20 17:46:26 UTC
WARN: Bug status (VERIFIED) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 53 RHV bug bot 2020-01-08 14:49:57 UTC
WARN: Bug status (VERIFIED) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 54 RHV bug bot 2020-01-08 15:19:38 UTC
WARN: Bug status (VERIFIED) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 55 RHV bug bot 2020-01-24 19:51:42 UTC
WARN: Bug status (VERIFIED) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 60 errata-xmlrpc 2020-08-04 13:26:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3246

