Description of problem:

A third VM, in this case the HostedEngine VM, fails to migrate to another host in the cluster because of `Incoming migration limit exceeded`.

Source Host:
=============
migsrc/f8e8c21a::INFO::2019-01-08 17:02:50,936::migration::473::virt.vm::(_startUnderlyingMigration) (vmId='f8e8c21a-b715-4a9f-a168-e4d080f5792d') Creation of destination VM took: 0 seconds
migsrc/f8e8c21a::ERROR::2019-01-08 17:02:50,937::migration::290::virt.vm::(_recover) (vmId='f8e8c21a-b715-4a9f-a168-e4d080f5792d') migration destination error: Fatal error during migration

Destination Host:
==================
vm/f8e8c21a::DEBUG::2019-01-08 17:02:51,037::vm::861::virt.vm::(_startUnderlyingVm) (vmId='f8e8c21a-b715-4a9f-a168-e4d080f5792d') Start
vm/f8e8c21a::DEBUG::2019-01-08 17:02:51,037::vm::864::virt.vm::(_startUnderlyingVm) (vmId='f8e8c21a-b715-4a9f-a168-e4d080f5792d') Acquiring incoming migration semaphore.
jsonrpc/1::DEBUG::2019-01-08 17:02:51,037::api::135::api::(method) FINISH create response={'status': {'message': 'Incoming migration limit exceeded', 'code': 82}}
jsonrpc/1::INFO::2019-01-08 17:02:51,037::api::52::api.virt::(method) FINISH create return={'status': {'message': 'Incoming migration limit exceeded', 'code': 82}} from=::ffff:xx.xx.xx.xx,60308, vmId=f8e8c21a-b715-4a9f-a168-e4d080f5792d
jsonrpc/1::DEBUG::2019-01-08 17:02:51,037::API::540::vds::(migrationCreate) Migration create - Failed
jsonrpc/1::DEBUG::2019-01-08 17:02:51,037::API::546::vds::(migrationCreate) Returning backwards compatible migration error code
jsonrpc/1::DEBUG::2019-01-08 17:02:51,038::api::135::api::(method) FINISH migrationCreate response={'status': {'message': 'Fatal error during migration', 'code': 12}}
jsonrpc/1::INFO::2019-01-08 17:02:51,038::api::52::api.virt::(method) FINISH migrationCreate return={'status': {'message': 'Fatal error during migration', 'code': 12}} from=::ffff:xx.xx.xx.xx,60308, vmId=f8e8c21a-b715-4a9f-a168-e4d080f5792d

Version-Release number of selected component (if applicable):
rhvm-4.2.7.5-0.1.el7ev.noarch
vdsm-4.20.43-1.el7ev.x86_64

The cluster is set with the 'minimal downtime' migration policy.

How reproducible:
100% in the user's environment

Steps to Reproduce:
1. Have the HE VM running together with 2 other VMs on the same host
2. Place the host into maintenance mode and let the VMs migrate to the second host
3. The HE VM is added to the re-run treatment on the same host where it was running, and it won't migrate automatically

Actual results:
The HE VM has to be migrated manually after the first 2 VMs get started on the second host

Expected results:
The third migration should be queued and triggered later
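For context, a minimal sketch of how the destination side enforces this limit, assuming a vdsm-style bounded semaphore sized by a max-incoming-migrations setting. The names and the helper start_incoming_vm are illustrative, not the actual vdsm identifiers:

```python
import threading

# Cap on concurrent incoming migrations (illustrative value; vdsm reads
# its limit from configuration).
MAX_INCOMING_MIGRATIONS = 2
incoming_migrations = threading.BoundedSemaphore(MAX_INCOMING_MIGRATIONS)

def migration_create(vm_id):
    # Non-blocking acquire: when the limit is already reached, the
    # destination fails fast instead of queueing the request -- the
    # code-82 response visible in the log above.
    if not incoming_migrations.acquire(blocking=False):
        return {'status': {'message': 'Incoming migration limit exceeded',
                           'code': 82}}
    try:
        # The slot is held for the duration of the incoming migration.
        start_incoming_vm(vm_id)  # hypothetical helper
        return {'status': {'message': 'Done', 'code': 0}}
    finally:
        incoming_migrations.release()
```

As the log shows, migrationCreate then translates code 82 into the backwards-compatible code 12 ('Fatal error during migration') that the source host reports.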
It commonly takes several rounds of retries to migrate everything off a host. Why was it done manually? Would it not retry at all for a long time, or were they just impatient?
The first two migrations go to host *101 (the log you included), but the HostedEngine VM goes to host *102. It is queued on the source host for 10s (until 17:02:50,843) because of those first two migrations; then it proceeds, and apparently at that point there are too many incoming migrations on *102. Please include logs from there as well, but it looks likely that the host is just busy, so the behavior is as expected. The decision to migrate the HE VM is not made by the engine but possibly by the hosted-engine-ha-broker (no logs attached). That could be the reason why the migration is not retriggered.
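To illustrate the queueing behaviour described above: the source side, unlike the destination, blocks until an outgoing slot frees up. A minimal sketch, assuming a max-outgoing-migrations-style cap of 2; names are illustrative:

```python
import threading

MAX_OUTGOING_MIGRATIONS = 2
outgoing_migrations = threading.Semaphore(MAX_OUTGOING_MIGRATIONS)

def start_underlying_migration(vm_id):
    # Blocking acquire: the third migration waits here (the ~10s queue
    # on the source host) while the first two are still in flight.
    with outgoing_migrations:
        # Once a slot frees up, the migration proceeds -- but the chosen
        # destination may itself be at its incoming limit by then,
        # which is what the failure in the logs suggests.
        migrate_to_destination(vm_id)  # hypothetical helper
```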
Martin, what's the logic in the broker to handle a failed migration on local maintenance? I also thought you'd changed that to let the engine handle it, but apparently in this case it's still the broker calling vdsm, and then the scheduling of migrations does not work as well as it would if invoked via the engine.
The broker only tries once; the admin has to resolve it when it fails. We wanted to change it, but it was de-prioritized and not finished.
Thanks for the confirmation. The engine will take care of it, but the whole sweep of the host is done only at a 5-minute interval. Once maintenance is requested, we build a list of VMs and migrate all of them; if anything fails (for regular migrations that is after the built-in retries; for the HE VM after a single failure), then in 5 minutes we go through the new list of remaining VMs again and start migrations. In this case the customer apparently did it manually within 1 minute, so the sweep didn't kick in. Either way, it looks to me like it works fine and as designed. Can we close the bug, Javier?
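For clarity, a rough sketch of that sweep logic, assuming a 5-minute pass interval; the callables are placeholders for the actual engine operations:

```python
import time

def maintenance_sweep(list_running_vms, migrate, enter_maintenance,
                      interval=300):
    """Rebuild the list of remaining VMs every `interval` seconds and
    restart migrations until the host is empty."""
    while True:
        remaining = list_running_vms()
        if not remaining:
            enter_maintenance()  # host can finally go to Maintenance
            return
        for vm in remaining:
            # Regular VMs also get per-migration retries inside
            # migrate(); anything still running after this pass is
            # picked up again on the next one.
            migrate(vm)
        time.sleep(interval)
```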
Thanks for the comments, Michal. Let me pass this to the user and ask them to give it a try and wait to see what behaviour they get. Leaving the NI on my side.
aah, AFAICT the migration won't be rerun anyway, since it wasn't triggered by the engine and hence there is no failed command from the engine's perspective, so resourceManager.rerunFailedCommand() does nothing. Arik, does that sound right?
(In reply to Michal Skrivanek from comment #16)
> aah, AFAICT the migration won't be rerun anyway, since it wasn't triggered
> by the engine and hence there is no failed command from the engine's
> perspective, so resourceManager.rerunFailedCommand() does nothing. Arik,
> does that sound right?

Right, the engine won't try to migrate the VM but relies on the HA agent to do that [1], and therefore the engine won't trigger rerun attempts.

[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/MaintenanceVdsCommand.java#L139-L148
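In other words, the branch linked in [1] amounts to something like the following (a Python paraphrase for illustration, not the actual Java implementation):

```python
def migrate_vms_on_maintenance(vms):
    for vm in vms:
        if vm.is_hosted_engine:
            # The engine skips the HE VM and relies on the HA agent to
            # migrate it, so no engine-side command exists for
            # resourceManager.rerunFailedCommand() to retry on failure.
            continue
        schedule_migration(vm)  # hypothetical engine migration command
```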
Hi Simone, this would fall into ovirt-hosted-engine-ha plans. I see 2 possible solutions: either plan/finish comment #8 (call the engine API) or implement retry logic similar to the engine's within the agent.
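For the second option, a minimal sketch of agent-side retry logic, with made-up attempt and delay values; migrate_once is a hypothetical callable returning True on success:

```python
import time

def migrate_he_vm_with_retry(migrate_once, attempts=3, delay=60):
    # Retry the HE VM migration a bounded number of times instead of
    # the current single attempt.
    for _ in range(attempts):
        if migrate_once():
            return True
        time.sleep(delay)
    return False  # fall back to manual handling, as today
```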
Re-targeting to 4.3.1 since this BZ has not been proposed as a blocker for 4.3.0. If you think this bug should block 4.3.0, please re-target and set the blocker flag.
Moving to 4.3.2, as this has not been identified as a blocker for 4.3.1.
Not able to reproduce on our systems; we'll keep investigating this. Not blocking 4.3.3.
(In reply to Michal Skrivanek from comment #18)
> Hi Simone, this would fall into ovirt-hosted-engine-ha plans. I see 2
> possible solutions: either plan/finish comment #8 (call the engine API) or
> implement retry logic similar to the engine's within the agent.

On the ovirt-ha-agent side we don't have API credentials; on the other hand, the engine is already able to control hosted-engine VM migration. The issue arises because ovirt-ha-agent and the engine act almost at the same time on the engine VM.
NFS deployment on these components:
rhvm-appliance.x86_64 2:4.4-20200123.0.el8ev
rhv-4.4.0
sanlock-3.8.0-2.el8.x86_64
qemu-kvm-4.2.0-12.module+el8.2.0+5858+afd073bc.x86_64
vdsm-4.40.5-1.el8ev.x86_64
libvirt-client-6.0.0-7.module+el8.2.0+5869+c23fe68b.x86_64
ovirt-hosted-engine-setup-2.4.2-2.el8ev.noarch
ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch
Linux 4.18.0-183.el8.x86_64 #1 SMP Sun Feb 23 20:50:47 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 Beta (Ootpa)

Engine is:
Red Hat Enterprise Linux Server release 7.8 Beta (Maipo)
Linux 3.10.0-1123.el7.x86_64 #1 SMP Tue Jan 14 03:44:38 EST 2020 x86_64 x86_64 x86_64 GNU/Linux

Result - the engine's VM successfully migrated away after the ha-host had been placed into maintenance. The source ha-host was the SPM, and it moved to the destination ha-host. I followed the reproduction steps 4 times back and forth and the bug didn't reproduce. Moving to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:3247