+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1688159 +++
======================================================================

Created attachment 1543575 [details]
Log, VM config and screenshots

Description of problem:
Migrating a VM under load failed a few times.

How reproducible:
The test scenario is migrating a VM running a SAP HANA PBO workload (a memory and CPU intensive load). Out of twelve migrations, two failed and worked after retrying. The test scenario explicitly requires migrating the VM under load.

Log shows:

2019-03-12 18:42:32,448+0100 WARN (migmon/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') Migration stalling: remaining (32143MiB) > lowmark (0MiB). (migration:837)
2019-03-12 18:42:32,448+0100 WARN (migmon/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') Aborting migration (migration:902)

Does the abort happen due to a too high memory change rate? The migration network is 10G. Could adjusting migration_max_time_per_gib_mem or migration_progress_timeout help? Please advise.

(Originally by Nils Koenig)
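For reference, the two options mentioned above are Vdsm-side tunables. A minimal sketch of how they could be raised on the hosts, assuming they live in the [vars] section of /etc/vdsm/vdsm.conf and that vdsmd is restarted afterwards; the values below are purely illustrative, not recommendations:

    [vars]
    # Allow more migration time per GiB of guest memory before the migration
    # is considered to be taking too long (assumed default is lower).
    migration_max_time_per_gib_mem = 128
    # Allow a longer period without progress before the migration is treated
    # as stalled and aborted (seconds; assumed default is lower).
    migration_progress_timeout = 300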
Apparently the way to handle this would be migration profiles (https://ovirt.org/develop/release-management/features/virt/migration-enhancements.html). Is there a way to disable or modify the stall timeout? (Originally by Nils Koenig)
It's very likely to be a failure to converge due to change rate. Have you tried postcopy migrations? (Originally by Ryan Barry)
@Ryan Barry
Haven't been able to successfully change the migration policy. The documentation I find does not tell how to do so: https://access.redhat.com/solutions/3143541

Here is what I tried using the ovirt python API:

    # change policy
    suspend_workload_policy = types.MigrationPolicy()
    suspend_workload_policy.id = "80554327-0569-496b-bdeb-fcbbf52b827c"
    vm_to_migrate.migration.policy = suspend_workload_policy

But it seems to have no effect, from vdsm.log:

2019-03-13 20:08:17,658+0100 INFO (jsonrpc/2) [api.virt] START migrate(params={u'incomingLimit': 2, u'src': u'lu0540.wdf.sap.corp', u'dstqemu': u'10.76.34.34', u'autoConverge': u'true', u'tunneled': u'false', u'enableGuestEvents': True, u'dst': u'lu0541.wdf.sap.corp:54321', u'convergenceSchedule': {u'init': [{u'params': [u'100'], u'name': u'setDowntime'}], u'stalling': [{u'action': {u'params': [u'150'], u'name': u'setDowntime'}, u'limit': 1}, {u'action': {u'params': [u'200'], u'name': u'setDowntime'}, u'limit': 2}, {u'action': {u'params': [u'300'], u'name': u'setDowntime'}, u'limit': 3}, {u'action': {u'params': [u'400'], u'name': u'setDowntime'}, u'limit': 4}, {u'action': {u'params': [u'500'], u'name': u'setDowntime'}, u'limit': 6}, {u'action': {u'params': [], u'name': u'abort'}, u'limit': -1}]}, u'vmId': u'8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7', u'abortOnError': u'true', u'outgoingLimit': 2, u'compressed': u'false', u'maxBandwidth': 5000, u'method': u'online'}) from=::ffff:10.20.90.165,59716, flow_id=3c9b1a2c-9dc7-41e8-b8b9-862b55d4b71a, vmId=8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7 (api:46)

Where can I find guidance on how to change the migration policy?

(Originally by Nils Koenig)
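One likely reason the snippet above has no effect is that it only modifies the VM object in local memory; nothing is sent to the engine. A minimal sketch of persisting the policy with the oVirt Python SDK, assuming a connection like the one below (URL, credentials and the VM name search are placeholders; the policy id is the one from the snippet above):

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    # Assumed engine URL/credentials -- replace with the real ones.
    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='...',
        ca_file='ca.pem',
    )
    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search='name=lu0569')[0]
    vm_service = vms_service.vm_service(vm.id)

    # Send the change to the engine; assigning to the local object alone
    # does not persist anything.
    vm_service.update(
        types.Vm(
            migration=types.MigrationOptions(
                policy=types.MigrationPolicy(
                    id='80554327-0569-496b-bdeb-fcbbf52b827c',
                ),
            ),
        ),
    )
    connection.close()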
Hi Nils -

If you change the stalling action in the convergence schedule from 'setDowntime' to 'postcopy', you can give it a try. It's been a tech preview since 4.1, but it is stable. There is a risk if the network goes down during the migration, which is why it's still tech preview, but it will almost always converge.

(Originally by Ryan Barry)
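For illustration, a sketch of what the convergenceSchedule passed to Vdsm might look like with such a post-copy step; the structure mirrors the schedules visible in the vdsm.log excerpts above, while the 'postcopy' action name and the exact limits are assumptions based on this suggestion:

    convergence_schedule = {
        'init': [{'name': 'setDowntime', 'params': ['100']}],
        'stalling': [
            {'limit': 1, 'action': {'name': 'setDowntime', 'params': ['150']}},
            {'limit': 2, 'action': {'name': 'setDowntime', 'params': ['200']}},
            # switch to post-copy instead of aborting when stalling persists
            {'limit': -1, 'action': {'name': 'postcopy', 'params': []}},
            {'limit': -1, 'action': {'name': 'abort', 'params': []}},
        ],
    }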
Created attachment 1545183 [details]
Screenshot of Error Messages

Error messages shown in the VM event log for the migration policies "Suspend workload if needed" and "Post-copy".

(Originally by Nils Koenig)
Created attachment 1545184 [details] vdsm.log during migration (Originally by Nils Koenig)
I've tried with "Suspend workload if needed" and "Post-copy migration" and both fail. With "Suspend workload if needed" the VM has been migrated a couple of times back and forth but it fails with the following error: 2019-03-14 18:29:14,814+0100 INFO (jsonrpc/1) [api.virt] START migrate(params={u'incomingLimit': 1, u'src': u'lu0541.wdf.sap.corp', u'dstqemu': u'10.76.34.32', u'autoConverge': u'true', u'tunneled': u'false', u'enableGuestEvents': True, u'dst': u'lu0540.wdf.sap.corp:54321', u'convergenceSchedule': {u'init': [{u'params': [u'100'], u'name': u'setDowntime'}], u'stalling': [{u'action': {u'params': [u'150'], u'name': u'setDowntime'}, u'limit': 1}, {u'action': {u'params': [u'200'], u'name': u'setDowntime'}, u'limit': 2}, {u'action': {u'params': [u'300'], u'name': u'setDowntime'}, u'limit': 3}, {u'action': {u'params': [u'400'], u'name': u'setDowntime'}, u'limit': 4}, {u'action': {u'params': [u'500'], u'name': u'setDowntime'}, u'limit': 6}, {u'action': {u'params': [u'5000'], u'name': u'setDowntime'}, u'limit': -1}, {u'action': {u'params': [], u'name': u'abort'}, u'limit': -1}]}, u'vmId': u'8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7', u'abortOnError': u'true', u'outgoingLimit': 1, u'compressed': u'true', u'maxBandwidth': 5000, u'method': u'online'}) from=::ffff:10.20.90.165,52794, flow_id=9b2b8655-3ccd-4d7a-994b-7019e6bf0f33, vmId=8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7 (api:46) 2019-03-14 18:29:14,815+0100 INFO (jsonrpc/1) [api.virt] FINISH migrate return={'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0} from=::ffff:10.20.90.165,52794, flow_id=9b2b8655-3ccd-4d7a-994b-7019e6bf0f33, vmId=8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7 (api:52) 2019-03-14 18:29:14,815+0100 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call VM.migrate succeeded in 0.00 seconds (__init__:573) 2019-03-14 18:29:14,892+0100 INFO (migsrc/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') Creation of destination VM took: 0 seconds (migration:473) 2019-03-14 18:29:14,893+0100 ERROR (migsrc/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') migration destination error: Virtual machine already exists (migration:290) Using "Post-copy migration": Migration seems to fail according to VM event log (see screenshot) at 19:05:03. In vdsm.log (attached) the migration seems to continue running e.g. 2019-03-14 20:00:51,761+0100 INFO (migmon/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') Migration Progress: 4412 seconds elapsed, 69% of data processed, total data: 524360MB, processed data: 392508MB, remaining data: 164858MB, transfer speed 799MBps, zero pages: 81266950MB, compressed: 0MB, dirty rate: 791924, memory iteration: 5 (migration:867) but it does not finish or advance. (Originally by Nils Koenig)
Milan, any ideas here? (Originally by Ryan Barry)
We may be dealing with multiple issues here.

Regarding "Suspend workload if needed" from Comment 7, the migration is attempted several seconds after the VM has been migrated in the opposite direction. The message "Virtual machine already exists" most likely means that VM cleanup hasn't been finished on the original host yet. I think Engine shouldn't initiate migration in such a case, but it's a non-fatal problem; it should start working after a short while.

Regarding "Post-copy migration", the stalling migration switches to post-copy, the VM gets suspended as expected and remaining data is being copied for a while. But then the progress gets stuck and Engine reports the given error at the same time. Apparently something happened on the destination or in Engine.

Nils, could you please provide the corresponding Vdsm log from the other host + the corresponding Engine log so that we can look further into what happened?

(Originally by Milan Zamazal)
I agree, those are two issues. The quemu process is indeed still running: [root@lu0541 ~]# ps aux | grep lu0569 qemu 184271 54.7 0.0 539747592 248600 ? Sl Mar14 3119:14 /usr/libexec/qemu-kvm -name guest=lu0569,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-8-lu0569/master-key.aes -machine pc-i440fx-rhel7.5.0,accel=kvm,usb=off,dump-guest-core=off -cpu host,rdtscp=on,invtsc=on,tsc-frequency=2494140000,host-cache-info=on,l3-cache=off -m 524288 -realtime mlock=off -smp 216,sockets=4,cores=27,threads=2 -object iothread,id=iothread1 -mem-prealloc -mem-path /dev/hugepages/libvirt/qemu/8-lu0569 -numa node,nodeid=0,cpus=0-215,mem=524288 -uuid 8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.6-4.el7,serial=4c4c4544-0033-3510-8050-c7c04f595232,uuid=8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7 -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=38,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device virtio-scsi-pci,iothread=iothread1,id=ua-ee65335b-0773-4900-8277-e4f4a551c13f,bus=pci.0,addr=0x3 -device virtio-serial-pci,id=ua-3000a4a7-c10c-4e95-90a5-adc3e597e543,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,id=drive-ua-bf8317f4-db0f-4f75-909f-ae288ec106ba,werror=report,rerror=report,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ua-bf8317f4-db0f-4f75-909f-ae288ec106ba,id=ua-bf8317f4-db0f-4f75-909f-ae288ec106ba -drive file=/rhev/data-center/mnt/blockSD/944da68a-ae00-4e82-9273-84456e3fd17c/images/c3931013-1c03-4dd1-981e-99032844b934/4b9fae5b-127e-4438-9f29-13bf52f7d9b7,format=raw,if=none,id=drive-ua-c3931013-1c03-4dd1-981e-99032844b934,serial=c3931013-1c03-4dd1-981e-99032844b934,werror=stop,rerror=stop,cache=none,aio=native -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x6,drive=drive-ua-c3931013-1c03-4dd1-981e-99032844b934,id=ua-c3931013-1c03-4dd1-981e-99032844b934,bootindex=1,write-cache=on -drive file=/dev/mapper/36000d31005771c000000000000000007,format=raw,if=none,id=drive-ua-4a475e31-b19d-4d18-a65a-4f8e306c06d6,werror=stop,rerror=stop,cache=none,aio=native -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-4a475e31-b19d-4d18-a65a-4f8e306c06d6,id=ua-4a475e31-b19d-4d18-a65a-4f8e306c06d6,write-cache=on -drive file=/dev/mapper/36000d31005771c000000000000000006,format=raw,if=none,id=drive-ua-49029933-f73c-41ea-b1d6-de4c09dedc7a,werror=stop,rerror=stop,cache=none,aio=native -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x8,drive=drive-ua-49029933-f73c-41ea-b1d6-de4c09dedc7a,id=ua-49029933-f73c-41ea-b1d6-de4c09dedc7a,write-cache=on -drive file=/dev/mapper/36000d31005771c000000000000000017,format=raw,if=none,id=drive-ua-f5d29ab8-7816-4223-89db-0bf831a4d032,werror=stop,rerror=stop,cache=none,aio=native -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x9,drive=drive-ua-f5d29ab8-7816-4223-89db-0bf831a4d032,id=ua-f5d29ab8-7816-4223-89db-0bf831a4d032,write-cache=on -drive file=/dev/shm/vhostmd0,format=raw,if=none,id=drive-virtio-disk701,readonly=on -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk701,id=virtio-disk701 -netdev tap,fds=39:40:41:36,id=hostua-75c0c780-2642-47ff-b70f-439315b1e493,vhost=on,vhostfds=42:43:44:45 -device 
virtio-net-pci,mq=on,vectors=10,host_mtu=1500,netdev=hostua-75c0c780-2642-47ff-b70f-439315b1e493,id=ua-75c0c780-2642-47ff-b70f-439315b1e493,mac=00:1a:4a:16:01:01,bus=pci.0,addr=0xc -chardev socket,id=charchannel0,fd=46,server,nowait -device virtserialport,bus=ua-3000a4a7-c10c-4e95-90a5-adc3e597e543.0,nr=1,chardev=charchannel0,id=channel0,name=ovirt-guest-agent.0 -chardev socket,id=charchannel1,fd=47,server,nowait -device virtserialport,bus=ua-3000a4a7-c10c-4e95-90a5-adc3e597e543.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -vnc 10.76.34.34:0,password -k de -device qxl-vga,id=ua-a1893d17-db06-4250-b46e-b5ab0e0f91a8,ram_size=67108864,vram_size=8388608,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -incoming defer -object rng-random,id=objua-9adbf954-a5b0-4533-9458-8edec55820a4,filename=/dev/urandom -device virtio-rng-pci,rng=objua-9adbf954-a5b0-4533-9458-8edec55820a4,id=ua-9adbf954-a5b0-4533-9458-8edec55820a4,bus=pci.0,addr=0xa -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on

But it can't be reached via network or console - it appears offline in RHVM.

Please find the requested logs attached.

(Originally by Nils Koenig)
Created attachment 1545379 [details] lu0540 vdsm.log (Originally by Nils Koenig)
Created attachment 1545380 [details] ovirt engine log (Originally by Nils Koenig)
Thank you Nils for the logs. The problem with post-copy is this:

2019-03-14 19:04:19,249+0100 ERROR (vm/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') The vm start process failed (vm:949)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 893, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4162, in _completeIncomingMigration
    self._attachLibvirtDomainAfterMigration(finished, timeout)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4227, in _attachLibvirtDomainAfterMigration
    self._sync_metadata()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5253, in _sync_metadata
    self._md_desc.dump(self._dom)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/metadata.py", line 514, in dump
    0)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 106, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
2019-03-14 19:04:19,249+0100 INFO (vm/8a6b627f) [virt.vm] (vmId='8a6b627f-bcf5-4ba2-b4f5-01d1bd306eb7') Changed state to Down: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params) (code=1) (vm:1689)

When the VM is started on the destination after switching to post-copy and while still migrating, Vdsm attempts to save its metadata, which fails with the error above in libvirt. No idea why. I can't see any notice in the libvirt documentation that writing metadata can't be performed while the post-copy migration phase is still running. So it looks like a libvirt bug.

Nils, is the post-copy problem reproducible? Could you please try to reproduce it with libvirt debug log enabled? And could you please also check the state of the VM on the destination when it happens, e.g. with `sudo virsh -r list --all`?

(Originally by Milan Zamazal)
(In reply to Milan Zamazal from comment #13)
> Nils, is the post-copy problem reproducible? Could you please try to
> reproduce it with libvirt debug log enabled?

I mean with libvirt debug log enabled on the destination. And please also enable Vdsm debug log on both ends.

(Originally by Milan Zamazal)
Another problem is with handling the VM after the failure in post-copy. The error is caught by a generic exception handler in Vm._startUnderlyingVm, which sets the VM to DOWN status. I can't see what the actual state of the VM is, but since it remains paused without any progress on the source, I assume it's no longer running on the destination (it's either paused or down). Blindly assuming that a VM is in DOWN status after it failed to start may not be a good idea; the VM may already be running.

Destroy attempts follow, but libvirt apparently fails to destroy the VM with VIR_ERR_SYSTEM_ERROR. That means the VM remains, from the RHV point of view, in DOWN status on the destination and in PAUSED status on the source forever. I don't think we can do much about that, since libvirt doesn't cooperate. There are also reports in vdsm.log about stuck executor workers, which additionally indicates an unhealthy libvirt state.

(Originally by Milan Zamazal)
Regarding the problem that Engine permits migration to the original host while the VM remnant is still there: Normally (at least in 4.3), Engine calls Destroy on the migration source immediately after the migration is completed. I can see that in the reported case there is no Destroy call, even though the VM_MIGRATION_DONE event has been received. I don't have any idea why.

(Originally by Milan Zamazal)
Created attachment 1546015 [details] lu0540 log files with debug logging on during Post-copy migration failure (Originally by Nils Koenig)
Created attachment 1546018 [details] lu0541 Log files with debug logging on during Post-copy migration failure (Originally by Nils Koenig)
Created attachment 1546019 [details] rhvm engine.log during Post-copy migration failure (Originally by Nils Koenig)
Please find the new log files with debug logging turned on attached. The first 4 post-copy migrations were successful, then it fails with the error:

Mar 20, 2019, 11:49:09 AM
VM lu0569 is down with error. Exit message: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params).

(Originally by Nils Koenig)
It seems the problem with post-copy is as follows: When a migration switches to post-copy and the VM gets UP on the destination, the libvirt migration job is still running. While the migration job is running, other common libvirt jobs, such as writing metadata or some VM info retrievals, are blocked. libvirt doesn't reject those jobs immediately but waits for the running job to finish, with a 30-second timeout. If the post-copy phase takes more than 30 seconds, the jobs time out and the erroneous situation described in Comment 13 and Comment 15 happens. The problem has apparently been hidden so far since there were no migrations with a long post-copy phase.

I'm going to check the hypothesis on my setup. If I can confirm it, then it should be fixed in Vdsm. The solution would be to wait with metadata writing and with starting periodic operations on the VM until the migration completely finishes.

(Originally by Milan Zamazal)
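To make the proposed fix concrete, here is a minimal sketch using the libvirt Python bindings (not the actual Vdsm code; the function name, polling interval and metadata namespace are assumptions for illustration): defer the metadata write until libvirt reports no active job on the domain, instead of racing against the still-running post-copy migration job and hitting the 30-second job-wait timeout.

    import time
    import libvirt

    # Namespace/key used for the VM metadata element -- assumed values.
    OVIRT_VM_URI = 'http://ovirt.org/vm/1.0'
    OVIRT_VM_KEY = 'ovirt-vm'

    def sync_metadata_when_idle(dom, metadata_xml, timeout=3600):
        # Wait until the (post-copy) migration job has finished before writing
        # metadata, so the call doesn't block behind the migration job and time out.
        deadline = time.time() + timeout
        while time.time() < deadline:
            stats = dom.jobStats()
            if stats.get('type', libvirt.VIR_DOMAIN_JOB_NONE) == libvirt.VIR_DOMAIN_JOB_NONE:
                dom.setMetadata(libvirt.VIR_DOMAIN_METADATA_ELEMENT,
                                metadata_xml, OVIRT_VM_KEY, OVIRT_VM_URI, 0)
                return
            time.sleep(5)
        raise RuntimeError('migration job still running, metadata not written')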
Created attachment 1546590 [details]
Debug log files during migration with the "Suspend Workload" policy

Mar 21, 2019, 3:28:19 PM
Migration failed (VM: lu0569, Source: lu0540, Destination: lu0541).

(Originally by Nils Koenig)
Thank you for investigating the issue, Milan.

As stated in the beginning, there is also a bug when migrating with the "Suspend Workload" policy. Please find the logs attached. Note that there are quite a lot of successful migrations until the migration fails:

Mar 21, 2019, 3:28:19 PM
Migration failed (VM: lu0569, Source: lu0540, Destination: lu0541).

The VM is under heavy CPU and memory load during migration.

(Originally by Nils Koenig)
Nils, that Suspend case, was the migration back to the original host triggered manually? Was it intentional or what was the reason? (Originally by michal.skrivanek)
After playing with the post-copy scenario on my setup, I've found several problems:

- I reproduced the problem from Comment 21 with libvirt unable to perform metadata sync etc. on the destination when the migration is still in post-copy. It can be addressed in Vdsm, a patch is in preparation.

- If Vdsm is restarted on the destination while a post-copy migration is running there, it doesn't detect that the VM is migrating and in post-copy. It should be fixable easily, I'll make a patch.

- libvirt doesn't send the VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY event on the source when switching to post-copy. That means Vdsm doesn't detect that it's in a post-copy phase and doesn't announce the post-copy phase to Engine. I think it's a regression in libvirt, I'm going to send a bug report on libvirt.

- When a post-copy migration fails due to the problem from Comment 21 and the VM gets killed on the destination, the VM gets resumed (no, not restarted) on the source host. (This doesn't happen in Nils's case, since the VM can't be destroyed there due to some busy device.) That looks wrong and like a libvirt/QEMU regression. I'll ask the libvirt guys about it.

- In case the VM gets resumed on the source after attempting to switch to post-copy, it runs happily there and can be migrated again. However, Vdsm hasn't cleared its post-copy flag, so it thinks the VM is in post-copy from the beginning of the migration and doesn't apply the selected migration policy. This is trivial to fix, a patch is in preparation.

- Memory iteration detection in Vdsm may not be 100% reliable. Since we don't have to support old libvirt versions anymore, we can get info about memory iterations from libvirt and abandon the guesswork. This is not a regression, but as we are already on it, it should be improved; a patch is in preparation.

(Nils, I'll look at the "Suspend Workload" problems once I handle the post-copy problems, they are too many at the moment.)

(Originally by Milan Zamazal)
(In reply to Milan Zamazal from comment #25)
> - libvirt doesn't send VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY event on the
> source when switching to post-copy. That means Vdsm doesn't detect that it's
> in a post-copy phase and doesn't announce the post-copy phase to Engine. I
> think it's a regression in libvirt, I'm going to send a bug report on
> libvirt.
>
> - When a post-copy migration fails due to the problem from Comment 21 and
> the VM gets killed on the destination, the VM gets resumed (no, not
> restarted) on the source host. (This doesn't happen in Nils's case, since
> the VM can't be destroyed there due to some busy device.) That looks wrong
> and as a libvirt/QEMU regression. I'll ask libvirt guys about it.

These libvirt problems have already been fixed in one of the recent libvirt downstream updates: https://bugzilla.redhat.com/1647365

So only Vdsm post-copy fixes remain, patches for all of them are posted.

(Originally by Milan Zamazal)
(In reply to Michal Skrivanek from comment #24)
> Nils, that Suspend case, was the migration back to the original host
> triggered manually? Was it intentional or what was the reason?

The migration is triggered intentionally. For the SAP certification I am currently doing, it is a requirement to migrate the VM 10+ times back and forth while running a high workload benchmark.

I use the python API to initiate the migration via calling:

    vm_to_migrate_service = vms_service.vm_service(vm_to_migrate.id)
    vm_to_migrate_service.migrate(force=True, host=destination)

(Originally by Nils Koenig)
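Given the earlier "Virtual machine already exists" failure, which suggests the reverse migration was started before cleanup on the previous source host had finished, a small sketch of waiting for the VM to settle between the back-and-forth migrations may help; the helper name, polling interval and timeout are made up for illustration and reuse the service objects from the snippet above:

    import time
    import ovirtsdk4.types as types

    def wait_until_settled(vm_service, expected_host_id, timeout=600):
        # Poll until the VM reports UP on the host it was just migrated to,
        # so the next migration is not issued while the old source host is
        # still cleaning up the previous run.
        deadline = time.time() + timeout
        while time.time() < deadline:
            vm = vm_service.get()
            if (vm.status == types.VmStatus.UP
                    and vm.host is not None
                    and vm.host.id == expected_host_id):
                return
            time.sleep(10)
        raise RuntimeError('VM did not settle on host %s in time' % expected_host_id)

    # Usage after each migration request:
    # vm_to_migrate_service.migrate(force=True, host=destination)
    # wait_until_settled(vm_to_migrate_service, destination.id)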
> So only Vdsm post-copy fixes remain, patches for all of them are posted.

Milan, does that mean all patches are already available?

(Originally by Nils Koenig)
All patches for the post-copy problems I discovered are posted.

I looked into the logs you posted in Comment 22. The migration failed there because it couldn't converge. The migration couldn't succeed despite increasing migration downtimes, and the convergence schedule eventually canceled it. That can happen with busy VMs. And since the Vdsm log looks fine to me, I think there is no bug there.

(Originally by Milan Zamazal)
As far as I can see, the target release for these patches is 4.3. May I ask you to backport this to 4.2? In case of objections, please contact Arne Arnold (aarnold).

(Originally by Nils Koenig)
(In reply to Milan Zamazal from comment #29)
> All patches for the post-copy problems I discovered are posted.
>
> I looked into the logs you posted in Comment 22. The migration failed there
> because it couldn't converge. The migration couldn't succeed despite
> increasing migration downtimes, and the convergence schedule eventually
> canceled it. That can happen with busy VMs. And since the Vdsm log looks
> fine to me, I think there is no bug there.

Is there a way to increase the timeout for canceling the migration? Would this also happen when migrating a busy VM with post-copy, i.e. would a timeout cancel the migration?

Please see also my Comment #30 regarding a backport.

(Originally by Nils Koenig)
(In reply to Nils Koenig from comment #31)
> Is there a way to increase the timeout for canceling the migration?

It's possible to tweak the migration schedule in the database, but I don't know whether it is supported.

To be precise, the migration wasn't canceled due to a timeout. Several iterations had been attempted, with increasing downtime, and since the last attempt with 5 seconds downtime had also failed, the migration was aborted. Maybe modifying the "Suspend workload" schedule, such as permitting one or two more attempts with 5 seconds downtime to work around temporary spikes, would help. I'm not sure whether we would like that or whether only post-copy is recommended in such a case. It may also be worth considering whether the network capacity is sufficient if it can't transfer enough data even with the quite long downtime.

> Would this also happen when migrating a busy VM with post-copy,
> that a timeout will cancel the migration?

No. Once a migration enters the post-copy phase, it can't be canceled and it is guaranteed to converge (with the fixes posted here). (In theory, the 6 hour general migration timeout could be hit before entering post-copy, but it shouldn't happen with any reasonable hardware setup.)

> Please see also my Comment #30 regarding a backport.

I'm aware of it, but I'm not someone who can decide it. Ryan, Michal, Martin?

(Originally by Milan Zamazal)
The problem was reproduced on vdsm-4.30.12-2.git3ab3415.el7.x86_64 and the fix is verified on vdsm-4.40.0-208.git002018c.el7.x86_64.

Steps to reproduce:
1. On two hosts allocate about 50-60% of memory and consume about 50% CPU.
2. Run a VM configured with the post-copy migration policy (I also configured the cluster with this policy).
3. Run the VM.
4. Slow down the network on one of the hosts with the tc command:
       tc qdisc add dev ovirtmgmt root netem delay 300ms
   To remove the rule:
       tc qdisc del dev ovirtmgmt root
5. On the VM allocate the memory with Milan's script:
       import sys

       mb = int(sys.argv[1])
       size = mb * 1024 * 128
       l = [0] * size
       i = 0
       while True:
           l[i] = (l[i] + 1) % 1024
           i = (i + 4095) % size
6. Migrate the VM.

Result: the migration is very slow and enters post-copy mode, as seen in vdsm.log:

2019-05-07 13:49:07,466+0300 INFO (libvirt/events) [virt.vm] (vmId='1eec678f-5d79-4bf9-b563-71ad57fad61c') Migration switched to post-copy mode (vm:5559)

No failure. Before the fix we faced "VM golden_env_mixed_virtio_0 is down with error. Exit message: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)."

(Originally by Polina Agranat)
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.3.z': '?'}', ] For more info please contact: rhv-devops

(Originally by rhv-bugzilla-bot)
Verified on ovirt-engine-4.3.4.1-0.1.el7.noarch. The same verification description as for https://bugzilla.redhat.com/show_bug.cgi?id=1711792#c33.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1567