Bug 1711792 - [downstream clone - 4.3.4] Migration aborted, probably due to stalling
Summary: [downstream clone - 4.3.4] Migration aborted, probably due to stalling
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.2.7
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.3.4
: 4.3.1
Assignee: Milan Zamazal
QA Contact: Polina
URL:
Whiteboard:
Depends On: 1688159
Blocks:
 
Reported: 2019-05-20 06:45 UTC by RHV bug bot
Modified: 2020-08-03 15:32 UTC (History)
16 users

Fixed In Version: vdsm-4.30.16
Doc Type: Bug Fix
Doc Text:
When a virtual machine migration entered post-copy mode and remained in that mode for a long time, the migration sometimes failed and the migrated virtual machine was powered off. In this release, post-copy migrations are maintained to completion.
Clone Of: 1688159
Environment:
Last Closed: 2019-06-20 14:48:41 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4028881 0 Troubleshoot None Why VM gets crashed in the post-copy phase during live migration 2019-05-20 06:46:50 UTC
Red Hat Product Errata RHBA-2019:1567 0 None None None 2019-06-20 14:48:49 UTC
oVirt gerrit 98792 0 master MERGED virt: Use libvirt memory iteration info in migrations 2020-05-29 20:27:44 UTC
oVirt gerrit 98793 0 master MERGED virt: Reset post-copy flag at the beginning of a migration 2020-05-29 20:27:44 UTC
oVirt gerrit 98794 0 master MERGED virt: Don't finish incoming migration when switching to post-copy 2020-05-29 20:27:44 UTC
oVirt gerrit 98795 0 master MERGED virt: Detect incoming post-copy migration in recovery 2020-05-29 20:27:43 UTC
oVirt gerrit 99098 0 ovirt-4.3 MERGED virt: Use libvirt memory iteration info in migrations 2020-05-29 20:27:43 UTC
oVirt gerrit 99099 0 ovirt-4.3 MERGED virt: Reset post-copy flag at the beginning of a migration 2020-05-29 20:27:43 UTC
oVirt gerrit 99100 0 ovirt-4.3 MERGED virt: Don't finish incoming migration when switching to post-copy 2020-05-29 20:27:43 UTC
oVirt gerrit 99101 0 ovirt-4.3 MERGED virt: Detect incoming post-copy migration in recovery 2020-05-29 20:27:43 UTC

Description RHV bug bot 2019-05-20 06:45:38 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1688159 +++
======================================================================

Created attachment 1543575 [details]
Log, VM config and screenshots

Description of problem:
Migrating a VM under load failed a few times.

How reproducible:
The test scenario is migrating a VM running a SAP HANA PBO workload (memory- and CPU-intensive load). Out of twelve migrations, two failed and then succeeded after retrying.
The test scenario explicitly requires migrating the VM under load.

Log shows: 

2019-03-12 18:42:32,448+0100 WARN  (migmon/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') Migration stalling: remaining (32143MiB) > lowmark (0MiB). (migration:837)
2019-03-12 18:42:32,448+0100 WARN  (migmon/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') Aborting migration (migration:902)

Does the abort happen because the memory change rate is too high?
The migration network is 10G.

Could adjusting migration_max_time_per_gib_mem or migration_progress_timeout help? Please advise.
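
For reference, a minimal sketch of how such Vdsm options are typically overridden on a host, assuming they live in the [vars] section of /etc/vdsm/vdsm.conf (or a drop-in under /etc/vdsm/vdsm.conf.d/) and that vdsmd is restarted afterwards. The values below are placeholders, not recommendations, and the Engine-driven convergence schedule may still take precedence:

[vars]
# Placeholder values for illustration only.
migration_max_time_per_gib_mem = 128
migration_progress_timeout = 300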

(Originally by Nils Koenig)

Comment 1 RHV bug bot 2019-05-20 06:45:40 UTC
Apparently the way to handle this would be migration profiles (https://ovirt.org/develop/release-management/features/virt/migration-enhancements.html). Is there a way to disable or modify the stall timeout?

(Originally by Nils Koenig)

Comment 2 RHV bug bot 2019-05-20 06:45:42 UTC
It's very likely to be a failure to converge due to change rate. Have you tried postcopy migrations?

(Originally by Ryan Barry)

Comment 3 RHV bug bot 2019-05-20 06:45:44 UTC
@Ryan Barry

I haven't been able to successfully change the migration policy.
The documentation I found does not explain how to do so: https://access.redhat.com/solutions/3143541

Here is what I tried using the ovirt python API:

# change policy
suspend_workload_policy=types.MigrationPolicy()
suspend_workload_policy.id="80554327-0569-496b-bdeb-fcbbf52b827c"
vm_to_migrate.migration.policy=suspend_workload_policy

But it seems to have no effect, from vdsm.log:

2019-03-13 20:08:17,658+0100 INFO  (jsonrpc/2) [api.virt] START migrate(params={u'incomingLimit': 2, u'src': u'lu0540.wdf.sap.corp', u'dstqemu': u'10.76.34.34', u'autoConverge': u'true', u'tunneled': u'false', u'enableGuestEvents': True, u'dst': u'lu0541.wdf.sap.corp:54321', u'convergenceSchedule': {u'init': [{u'params': [u'100'], u'name': u'setDowntime'}], u'stalling': [{u'action': {u'params': [u'150'], u'name': u'setDowntime'}, u'limit': 1}, {u'action': {u'params': [u'200'], u'name': u'setDowntime'}, u'limit': 2}, {u'action': {u'params': [u'300'], u'name': u'setDowntime'}, u'limit': 3}, {u'action': {u'params': [u'400'], u'name': u'setDowntime'}, u'limit': 4}, {u'action': {u'params': [u'500'], u'name': u'setDowntime'}, u'limit': 6}, {u'action': {u'params': [], u'name': u'abort'}, u'limit': -1}]}, u'vmId': u'8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7', u'abortOnError': u'true', u'outgoingLimit': 2, u'compressed': u'false', u'maxBandwidth': 5000, u'method': u'online'}) from=::ffff:10.20.90.165,59716, flow_id=3c9b1a2c-9dc7-41e8-b8b9-862b55d4b71a, vmId=8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7 (api:46)

Where can I find guidance on how to change the migration policy?
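
For reference, a hedged sketch (assuming the ovirt-engine-sdk4 Python bindings) of how a policy change is usually pushed back to Engine; one possible reason the snippet above had no effect is that attributes set on the locally fetched VM object are not sent to the server until update() is called on the VM service. Connection details below are placeholders; the policy id is the one from the snippet above.

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details.
connection = sdk.Connection(url='https://engine.example.com/ovirt-engine/api',
                            username='admin@internal', password='***',
                            insecure=True)
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=lu0569')[0]
vm_service = vms_service.vm_service(vm.id)

# "Suspend workload if needed" policy id, taken from the snippet above.
vm_service.update(
    types.Vm(
        migration=types.MigrationOptions(
            policy=types.MigrationPolicy(
                id='80554327-0569-496b-bdeb-fcbbf52b827c'))))
connection.close()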

(Originally by Nils Koenig)

Comment 4 RHV bug bot 2019-05-20 06:45:45 UTC
Hi Nils -

If you change:

stalling:
  - action
    - setDowntime...

To 'postcopy', you can give it a try. It's been a tech preview since 4.1, but is stable. There is a risk if the network goes down during the migration, which is why it's still tech preview, but it will almost always converge.
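
For illustration only, based on the convergenceSchedule format visible in the vdsm.log line in comment 3: with a post-copy policy, the tail of the 'stalling' list switches to a 'postcopy' action instead of going straight to 'abort'. A sketch of the structure (not copied from a real policy):

convergence_schedule = {
    'init': [{'name': 'setDowntime', 'params': ['100']}],
    'stalling': [
        {'limit': 1, 'action': {'name': 'setDowntime', 'params': ['150']}},
        {'limit': 2, 'action': {'name': 'setDowntime', 'params': ['200']}},
        # ... further setDowntime steps ...
        {'limit': -1, 'action': {'name': 'postcopy', 'params': []}},
        {'limit': -1, 'action': {'name': 'abort', 'params': []}},
    ],
}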

(Originally by Ryan Barry)

Comment 5 RHV bug bot 2019-05-20 06:45:47 UTC
Created attachment 1545183 [details]
Screenshot of Error Messages

Error messages shown in the VM event log for the migration policies "Suspend workload if needed" and "Post-copy".

(Originally by Nils Koenig)

Comment 6 RHV bug bot 2019-05-20 06:45:49 UTC
Created attachment 1545184 [details]
vdsm.log during migration

(Originally by Nils Koenig)

Comment 7 RHV bug bot 2019-05-20 06:45:52 UTC
I've tried with "Suspend workload if needed" and "Post-copy migration" and both fail.

With "Suspend workload if needed" the VM has been migrated a couple of times back and forth but it fails with the following error:

2019-03-14 18:29:14,814+0100 INFO  (jsonrpc/1) [api.virt] START migrate(params={u'incomingLimit': 1, u'src': u'lu0541.wdf.sap.corp', u'dstqemu': u'10.76.34.32', u'autoConverge': u'true', u'tunneled': u'false', u'enableGuestEvents': True, u'dst': u'lu0540.wdf.sap.corp:54321', u'convergenceSchedule': {u'init': [{u'params': [u'100'], u'name': u'setDowntime'}], u'stalling': [{u'action': {u'params': [u'150'], u'name': u'setDowntime'}, u'limit': 1}, {u'action': {u'params': [u'200'], u'name': u'setDowntime'}, u'limit': 2}, {u'action': {u'params': [u'300'], u'name': u'setDowntime'}, u'limit': 3}, {u'action': {u'params': [u'400'], u'name': u'setDowntime'}, u'limit': 4}, {u'action': {u'params': [u'500'], u'name': u'setDowntime'}, u'limit': 6}, {u'action': {u'params': [u'5000'], u'name': u'setDowntime'}, u'limit': -1}, {u'action': {u'params': [], u'name': u'abort'}, u'limit': -1}]}, u'vmId': u'8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7', u'abortOnError': u'true', u'outgoingLimit': 1, u'compressed': u'true', u'maxBandwidth': 5000, u'method': u'online'}) from=::ffff:10.20.90.165,52794, flow_id=9b2b8655-3ccd-4d7a-994b-7019e6bf0f33, vmId=8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7 (api:46)
2019-03-14 18:29:14,815+0100 INFO  (jsonrpc/1) [api.virt] FINISH migrate return={'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0} from=::ffff:10.20.90.165,52794, flow_id=9b2b8655-3ccd-4d7a-994b-7019e6bf0f33, vmId=8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7 (api:52)
2019-03-14 18:29:14,815+0100 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call VM.migrate succeeded in 0.00 seconds (__init__:573)
2019-03-14 18:29:14,892+0100 INFO  (migsrc/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') Creation of destination VM took: 0 seconds (migration:473)

2019-03-14 18:29:14,893+0100 ERROR (migsrc/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') migration destination error: Virtual machine already exists (migration:290)


Using "Post-copy migration":
The migration seems to fail according to the VM event log (see screenshot) at 19:05:03.
In vdsm.log (attached) the migration seems to keep running, e.g.

2019-03-14 20:00:51,761+0100 INFO  (migmon/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') Migration Progress: 4412 seconds elapsed, 69% of data processed, total data: 524360MB, processed data: 392508MB, remaining data: 164858MB, transfer speed 799MBps, zero pages: 81266950MB, compressed: 0MB, dirty rate: 791924, memory iteration: 5 (migration:867)

but it does not finish or advance.

(Originally by Nils Koenig)

Comment 8 RHV bug bot 2019-05-20 06:45:54 UTC
Milan, any ideas here?

(Originally by Ryan Barry)

Comment 9 RHV bug bot 2019-05-20 06:45:56 UTC
We may be dealing with multiple issues here.

Regarding "Suspend workload if needed" from Comment 7, the migration is attempted several seconds after the VM has been migrated in the opposite direction. The message "Virtual machine already exists" most likely means that VM cleanup hasn't been finished on the original host yet. I think Engine shouldn't initiate migration in such a case, but it's a non-fatal problem, it should start working after a short while.

Regarding "Post-copy migration", the stalling migration switches to post-copy, the VM gets suspended as expected and remaining data is being copied for a while. But then the progress gets stuck and Engine reports the given error at the same time. Apparently something happened on the destination or in Engine.

Nils, could you please provide the corresponding Vdsm log from the other host + the corresponding Engine log so that we can look further into what happened?

(Originally by Milan Zamazal)

Comment 10 RHV bug bot 2019-05-20 06:45:58 UTC
I agree, those are two issues.

The qemu process is indeed still running:

[root@lu0541 ~]# ps aux | grep lu0569
qemu     184271 54.7  0.0 539747592 248600 ?    Sl   Mar14 3119:14 /usr/libexec/qemu-kvm -name guest=lu0569,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-8-lu0569/master-key.aes -machine pc-i440fx-rhel7.5.0,accel=kvm,usb=off,dump-guest-core=off -cpu host,rdtscp=on,invtsc=on,tsc-frequency=2494140000,host-cache-info=on,l3-cache=off -m 524288 -realtime mlock=off -smp 216,sockets=4,cores=27,threads=2 -object iothread,id=iothread1 -mem-prealloc -mem-path /dev/hugepages/libvirt/qemu/8-lu0569 -numa node,nodeid=0,cpus=0-215,mem=524288 -uuid 8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.6-4.el7,serial=4c4c4544-0033-3510-8050-c7c04f595232,uuid=8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7 -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=38,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device virtio-scsi-pci,iothread=iothread1,id=ua-ee65335b-0773-4900-8277-e4f4a551c13f,bus=pci.0,addr=0x3 -device virtio-serial-pci,id=ua-3000a4a7-c10c-4e95-90a5-adc3e597e543,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,id=drive-ua-bf8317f4-db0f-4f75-909f-ae288ec106ba,werror=report,rerror=report,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ua-bf8317f4-db0f-4f75-909f-ae288ec106ba,id=ua-bf8317f4-db0f-4f75-909f-ae288ec106ba -drive file=/rhev/data-center/mnt/blockSD/944da68a-ae00-4e82-9273-84456e3fd17c/images/c3931013-1c03-4dd1-981e-99032844b934/4b9fae5b-127e-4438-9f29-13bf52f7d9b7,format=raw,if=none,id=drive-ua-c3931013-1c03-4dd1-981e-99032844b934,serial=c3931013-1c03-4dd1-981e-99032844b934,werror=stop,rerror=stop,cache=none,aio=native -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x6,drive=drive-ua-c3931013-1c03-4dd1-981e-99032844b934,id=ua-c3931013-1c03-4dd1-981e-99032844b934,bootindex=1,write-cache=on -drive file=/dev/mapper/36000d31005771c000000000000000007,format=raw,if=none,id=drive-ua-4a475e31-b19d-4d18-a65a-4f8e306c06d6,werror=stop,rerror=stop,cache=none,aio=native -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-4a475e31-b19d-4d18-a65a-4f8e306c06d6,id=ua-4a475e31-b19d-4d18-a65a-4f8e306c06d6,write-cache=on -drive file=/dev/mapper/36000d31005771c000000000000000006,format=raw,if=none,id=drive-ua-49029933-f73c-41ea-b1d6-de4c09dedc7a,werror=stop,rerror=stop,cache=none,aio=native -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x8,drive=drive-ua-49029933-f73c-41ea-b1d6-de4c09dedc7a,id=ua-49029933-f73c-41ea-b1d6-de4c09dedc7a,write-cache=on -drive file=/dev/mapper/36000d31005771c000000000000000017,format=raw,if=none,id=drive-ua-f5d29ab8-7816-4223-89db-0bf831a4d032,werror=stop,rerror=stop,cache=none,aio=native -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x9,drive=drive-ua-f5d29ab8-7816-4223-89db-0bf831a4d032,id=ua-f5d29ab8-7816-4223-89db-0bf831a4d032,write-cache=on -drive file=/dev/shm/vhostmd0,format=raw,if=none,id=drive-virtio-disk701,readonly=on -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk701,id=virtio-disk701 -netdev tap,fds=39:40:41:36,id=hostua-75c0c780-2642-47ff-b70f-439315b1e493,vhost=on,vhostfds=42:43:44:45 -device virtio-net-pci,mq=on,vectors=10,host_mtu=1500,netdev=hostua-75c0c780-2642-47ff-b70f-439315b1e493,id=ua-75c0c780-2642-47ff-b70f-439315b1e493,mac=00:1a:4a:16:01:01,bus=pci.0,addr=0xc -chardev socket,id=charchannel0,fd=46,server,nowait 
-device virtserialport,bus=ua-3000a4a7-c10c-4e95-90a5-adc3e597e543.0,nr=1,chardev=charchannel0,id=channel0,name=ovirt-guest-agent.0 -chardev socket,id=charchannel1,fd=47,server,nowait -device virtserialport,bus=ua-3000a4a7-c10c-4e95-90a5-adc3e597e543.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -vnc 10.76.34.34:0,password -k de -device qxl-vga,id=ua-a1893d17-db06-4250-b46e-b5ab0e0f91a8,ram_size=67108864,vram_size=8388608,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -incoming defer -object rng-random,id=objua-9adbf954-a5b0-4533-9458-8edec55820a4,filename=/dev/urandom -device virtio-rng-pci,rng=objua-9adbf954-a5b0-4533-9458-8edec55820a4,id=ua-9adbf954-a5b0-4533-9458-8edec55820a4,bus=pci.0,addr=0xa -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on


But it can't be reached via network or console; it appears offline in RHVM.

Please find the requested logs attached.

(Originally by Nils Koenig)

Comment 11 RHV bug bot 2019-05-20 06:46:00 UTC
Created attachment 1545379 [details]
lu0540 vdsm.log

(Originally by Nils Koenig)

Comment 12 RHV bug bot 2019-05-20 06:46:02 UTC
Created attachment 1545380 [details]
ovirt engine log

(Originally by Nils Koenig)

Comment 13 RHV bug bot 2019-05-20 06:46:04 UTC
Thank you Nils for the logs. The problem with post-copy is this:

2019-03-14 19:04:19,249+0100 ERROR (vm/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') The vm start process failed (vm:949)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 893, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4162, in _completeIncomingMigration
    self._attachLibvirtDomainAfterMigration(finished, timeout)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4227, in _attachLibvirtDomainAfterMigration
    self._sync_metadata()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5253, in _sync_metadata
    self._md_desc.dump(self._dom)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/metadata.py", line 514, in dump
    0)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 106, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
2019-03-14 19:04:19,249+0100 INFO  (vm/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') Changed state to Down: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params) (code=1) (vm:1689)

When the VM is started on the destination after switching to post-copy and while the migration is still running, Vdsm attempts to save its metadata, which fails in libvirt with the error above. No idea why. I can't see any notice in the libvirt documentation that writing metadata can't be performed while the post-copy migration phase is still running. So it looks like a libvirt bug.

Nils, is the post-copy problem reproducible? Could you please try to reproduce it with libvirt debug log enabled? And could you please also check the state of the VM on the destination when it happens, e.g. with `sudo virsh -r list --all`?

(Originally by Milan Zamazal)

Comment 14 RHV bug bot 2019-05-20 06:46:06 UTC
(In reply to Milan Zamazal from comment #13)

> Nils, is the post-copy problem reproducible? Could you please try to
> reproduce it with libvirt debug log enabled?

I mean with libvirt debug log enabled on the destination. And please enable also Vdsm debug log on both the ends.

(Originally by Milan Zamazal)

Comment 15 RHV bug bot 2019-05-20 06:46:07 UTC
Another problem is with handling the VM after the failure in post-copy. The error is caught by a generic exception handler in Vm._startUnderlyingVm, which sets the VM to DOWN status. I can't see what the actual state of the VM is, but since it remains paused without any progress on the source, I assume it's no longer running on the destination (it's either paused or down). Blindly assuming that a VM is in DOWN status after it failed to start may not be a good idea; the VM may already be running.

Destroy attempts follow, but libvirt apparently fails to destroy the VM with VIR_ERR_SYSTEM_ERROR. That means the VM remains, from the RHV point of view, in DOWN status on the destination and in PAUSED status on the source forever. I don't think we can do much about that, since libvirt doesn't cooperate.

There are also reports in vdsm.log about stuck executor workers, which additionally indicates an unhealthy libvirt state.

(Originally by Milan Zamazal)

Comment 16 RHV bug bot 2019-05-20 06:46:09 UTC
Regarding the problem that Engine permits migration to the original host while the VM remnant is still there: Normally (at least in 4.3), Engine calls Destroy on the migration source immediately after the migration is completed. I can see that in the reported case there is no Destroy call, even though a VM_MIGRATION_DONE event has been received. I don't have any idea why.

(Originally by Milan Zamazal)

Comment 17 RHV bug bot 2019-05-20 06:46:11 UTC
Created attachment 1546015 [details]
lu0540 log files with debug logging on during Post-copy migration failure

(Originally by Nils Koenig)

Comment 18 RHV bug bot 2019-05-20 06:46:13 UTC
Created attachment 1546018 [details]
lu0541 Log files with debug logging on during Post-copy migration failure

(Originally by Nils Koenig)

Comment 19 RHV bug bot 2019-05-20 06:46:15 UTC
Created attachment 1546019 [details]
rhvm engine.log during Post-copy migration failure

(Originally by Nils Koenig)

Comment 20 RHV bug bot 2019-05-20 06:46:17 UTC
Please find new log files with debug logging turned on attached.
The first 4 post-copy migrations were successful, then it fails with the error:

Mar 20, 2019, 11:49:09 AM VM lu0569 is down with error. Exit message: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params).

(Originally by Nils Koenig)

Comment 21 RHV bug bot 2019-05-20 06:46:18 UTC
It seems the problem with post-copy is as follows: When a migration switches to post-copy and the VM gets UP on the destination, the libvirt migration job is still running. While the migration job is running, other common libvirt jobs, such as writing metadata or some VM info retrievals, are blocked. libvirt doesn't reject those jobs immediately but waits for the running job to finish, with a 30-second timeout. If the post-copy phase takes more than 30 seconds, those jobs time out and the erroneous situation described in Comment 13 and Comment 15 happens. The problem has apparently been hidden so far since there were no migrations with a long post-copy phase.

I'm going to check the hypothesis on my setup. If I can confirm it, then it should be fixed in Vdsm. The solution would be to postpone writing metadata and starting periodic operations on the VM until the migration has completely finished.
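
As a rough illustration of that idea only (this is not the actual Vdsm patch), a hypothetical helper using the libvirt-python bindings could simply poll until libvirt reports that no job is active on the domain before any metadata writes or periodic operations are started:

import time

import libvirt


def wait_for_migration_job_end(dom, timeout=3600, poll_interval=2):
    # Hypothetical helper: block until the migration job on 'dom' has
    # finished, so that later calls (metadata writes, stats queries)
    # no longer compete with it for the domain job lock.
    deadline = time.time() + timeout
    while time.time() < deadline:
        # jobInfo()[0] is the job type; VIR_DOMAIN_JOB_NONE means no job.
        if dom.jobInfo()[0] == libvirt.VIR_DOMAIN_JOB_NONE:
            return True
        time.sleep(poll_interval)
    return False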

(Originally by Milan Zamazal)

Comment 22 RHV bug bot 2019-05-20 06:46:20 UTC
Created attachment 1546590 [details]
Debug log files during migrating with "Suspend Workload" policy

Mar 21, 2019, 3:28:19 PM Migration failed  (VM: lu0569, Source: lu0540, Destination: lu0541).

(Originally by Nils Koenig)

Comment 23 RHV bug bot 2019-05-20 06:46:22 UTC
Thank you for investigating the issue, Milan.

As stated in the beginning, there is also a bug when migrating with the "Suspend Workload" policy. Please find the logs attached. Note that there are quite a lot of successful migrations until the migration fails:

Mar 21, 2019, 3:28:19 PM Migration failed  (VM: lu0569, Source: lu0540, Destination: lu0541).

The VM is under heavy CPU and memory load during migration.

(Originally by Nils Koenig)

Comment 24 RHV bug bot 2019-05-20 06:46:24 UTC
Nils, that Suspend case, was the migration back to the original host triggered manually? Was it intentional or what was the reason?

(Originally by michal.skrivanek)

Comment 25 RHV bug bot 2019-05-20 06:46:25 UTC
After playing with the post-copy scenario on my setup, I've found several problems:

- I reproduced the problem from Comment 21 with libvirt unable to perform metadata sync etc. on the destination when migration is still in post-copy. It can be addressed in Vdsm, a patch is in preparation.

- If Vdsm is restarted on the destination while post-copy migration is running there, it doesn't detect that the VM is migrating and in post-copy. It should be fixable easily, I'll make a patch.

- libvirt doesn't send VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY event on the source when switching to post-copy. That means Vdsm doesn't detect that it's in a post-copy phase and doesn't announce the post-copy phase to Engine. I think it's a regression in libvirt, I'm going to send a bug report on libvirt.

- When a post-copy migration fails due to the problem from Comment 21 and the VM gets killed on the destination, the VM gets resumed (no, not restarted) on the source host. (This doesn't happen in Nils's case, since the VM can't be destroyed there due to some busy device.) That looks wrong and like a libvirt/QEMU regression. I'll ask libvirt guys about it.

- In case the VM gets resumed on the source after attempting to switch to post-copy, it runs happily there and can be migrated again. However, Vdsm hasn't cleared its post-copy flag, so it thinks the VM has been in post-copy since the beginning of the migration and doesn't apply the selected migration policy. This is trivial to fix, a patch is in preparation.

- Memory iteration detection in Vdsm may not be 100% reliable. Since we don't have to support old libvirt versions anymore, we can get info about memory iterations from libvirt and abandon the guesswork. This is not a regression but as we are already on it, it should be improved; a patch is in preparation.
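
A small sketch of that last point (the exact field name is an assumption based on the libvirt domain job statistics API, not taken from the actual Vdsm patch):

def memory_iteration(dom):
    # 'dom' is a libvirt.virDomain. jobStats() returns the domain job
    # statistics as a dict; the 'memory_iteration' key name is an
    # assumption here.
    return dom.jobStats().get('memory_iteration')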

(Nils, I'll look at the "Suspend Workload" problems once I handle the post-copy problems, they are too many at the moment.)

(Originally by Milan Zamazal)

Comment 26 RHV bug bot 2019-05-20 06:46:27 UTC
(In reply to Milan Zamazal from comment #25)

> - libvirt doesn't send VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY event on the
> source when switching to post-copy. That means Vdsm doesn't detect that it's
> in a post-copy phase and doesn't announce the post-copy phase to Engine. I
> think it's a regression in libvirt, I'm going to send a bug report on
> libvirt.
> 
> - When a post-copy migration fails due to the problem from Comment 21 and
> the VM gets killed on the destination, the VM gets resumed (no, not
> restarted) on the source host. (This doesn't happen in Nils's case, since
> the VM can't be destroyed there due to some busy device.) That looks wrong
> and as a libvirt/QEMU regression. I'll ask libvirt guys about it.

These libvirt problems have already been fixed in one of the recent libvirt downstream updates: https://bugzilla.redhat.com/1647365

So only Vdsm post-copy fixes remain, patches for all of them are posted.

(Originally by Milan Zamazal)

Comment 27 RHV bug bot 2019-05-20 06:46:29 UTC
(In reply to Michal Skrivanek from comment #24)
> Nils, that Suspend case, was the migration back to the original host
> triggered manually? Was it intentional or what was the reason?

The migration is triggered intentionally. For the SAP certification I am currently doing, it is a requirement to migrate the VM 10+ times back and forth while running a high-workload benchmark.
I use the python API to initiate the migration by calling:

vm_to_migrate_service = vms_service.vm_service(vm_to_migrate.id)
vm_to_migrate_service.migrate(force=True, host=destination)

(Originally by Nils Koenig)

Comment 28 RHV bug bot 2019-05-20 06:46:31 UTC
> 
> So only Vdsm post-copy fixes remain, patches for all of them are posted.
>

Milan, does that mean all patches are already available?

(Originally by Nils Koenig)

Comment 29 RHV bug bot 2019-05-20 06:46:33 UTC
All patches for the post-copy problems I discovered are posted.

I looked into the logs you posted in Comment 22. The migration failed there because it couldn't converge. The migration couldn't succeed despite increasing migration downtimes, and the convergence schedule eventually canceled it. That can happen with busy VMs. And since the Vdsm log looks fine to me, I think there is no bug there.

(Originally by Milan Zamazal)

Comment 30 RHV bug bot 2019-05-20 06:46:35 UTC
As far as I can see, the target release for these patches is 4.3.
May I ask you to backport this to 4.2? In case of objections, please contact Arne Arnold (aarnold).

(Originally by Nils Koenig)

Comment 31 RHV bug bot 2019-05-20 06:46:37 UTC
(In reply to Milan Zamazal from comment #29)
> All patches for the post-copy problems I discovered are posted.
> 
> I looked into the logs you posted in Comment 22. The migration failed there
> because it couldn't converge. The migration couldn't succeeded despite
> increasing migration downtimes and the convergence schedule has eventually
> canceled it. That can happen with busy VMs. And since the Vdsm log looks
> fine to me, I think there is no bug there.

Is there a way to increase the timeout for canceling the migration?

Would this also happen when migrating a busy VM with post-copy,
that a timeout will cancel the migration?

Please see also my Comment #30 regarding a backport.

(Originally by Nils Koenig)

Comment 32 RHV bug bot 2019-05-20 06:46:38 UTC
(In reply to Nils Koenig from comment #31)

> Is there a way to increase the timeout for canceling the migration?

It's possible to tweak the migration schedule in the database, but I don't know whether it is supported.

To be precise, the migration wasn't canceled due to a timeout. Several iterations had been attempted, with increasing downtime, and since the last attempt with 5 seconds downtime had also failed, the migration was aborted.

Maybe modifying the "Suspend workload" schedule, such as permitting one or two more attempts with 5 seconds of downtime to work around temporary spikes, would help. I'm not sure whether we would want that or whether only post-copy is recommended in such a case. It may also be worth considering whether the network capacity is sufficient if it can't transfer enough data even with the quite long downtime.

> Would this also happen when migrating a busy VM with post-copy,
> that a timeout will cancel the migration?

No. Once migration enters post-copy phase, it can't be canceled and it is guaranteed to converge (with the fixes posted here).

(In theory, the 6 hour general migration timeout could be hit before entering post-copy, but it shouldn't happen with any reasonable hardware setup.)

> Please see also my Comment #30 regarding a backport.

I'm aware of it, but I'm not someone who can decide it. Ryan, Michal, Martin?

(Originally by Milan Zamazal)

Comment 33 RHV bug bot 2019-05-20 06:46:40 UTC
The problem was reproduced on vdsm-4.30.12-2.git3ab3415.el7.x86_64 and the fix is verified on vdsm-4.40.0-208.git002018c.el7.x86_64.
Steps to reproduce:
1. On two hosts, allocate about 50-60% of memory and consume about 50% of CPU.
2. Configure the VM with the post-copy migration policy (I also configured the cluster with this policy).
3. Run the VM.
4. Slow down the network on one of the hosts with the tc command:
tc qdisc add dev ovirtmgmt root netem delay 300ms
   To remove the rule:
tc qdisc del dev ovirtmgmt root

5. On the VM, allocate memory with Milan's script:

# Usage: python <this script> <megabytes to allocate>
import sys

mb = int(sys.argv[1])
size = mb * 1024 * 128   # list elements; the list's pointer array alone is about <mb> MiB
l = [0] * size

# Keep dirtying entries spread across different memory pages forever,
# so the VM has a constantly high memory change rate during migration.
i = 0
while True:
    l[i] = (l[i] + 1) % 1024
    i = (i + 4095) % size

6. Migrate the VM

Result: the migration is very slow and enters post-copy mode, as seen in vdsm.log:
2019-05-07 13:49:07,466+0300 INFO  (libvirt/events) [virt.vm] (vmId='1eec678f-5d79-4bf9-b563-71ad57fad61c') Migration switched to post-copy mode (vm:5559)
No failure.

Before the fix we faced "VM golden_env_mixed_virtio_0 is down with error. Exit message: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)."

(Originally by Polina Agranat)

Comment 34 RHV bug bot 2019-05-20 06:46:42 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.3.z': '?'}', ]

For more info please contact: rhv-devops

(Originally by rhv-bugzilla-bot)

Comment 35 Polina 2019-05-28 10:45:54 UTC
verified on ovirt-engine-4.3.4.1-0.1.el7.noarch . The same verification description as for https://bugzilla.redhat.com/show_bug.cgi?id=1711792#c33.

Comment 39 errata-xmlrpc 2019-06-20 14:48:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1567

Comment 40 Daniel Gur 2019-08-28 13:11:23 UTC
sync2jira

Comment 41 Daniel Gur 2019-08-28 13:15:35 UTC
sync2jira

