+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1664479 +++
======================================================================

Description of problem:

A third VM, in this case the HostedEngine VM, fails to migrate to another host in the cluster because of `Incoming migration limit exceeded`.

Source Host:
=============
migsrc/f8e8c21a::INFO::2019-01-08 17:02:50,936::migration::473::virt.vm::(_startUnderlyingMigration) (vmId='f8e8c21a-b715-4a9f-a168-e4d080f5792d') Creation of destination VM took: 0 seconds
migsrc/f8e8c21a::ERROR::2019-01-08 17:02:50,937::migration::290::virt.vm::(_recover) (vmId='f8e8c21a-b715-4a9f-a168-e4d080f5792d') migration destination error: Fatal error during migration

Destination Host:
==================
vm/f8e8c21a::DEBUG::2019-01-08 17:02:51,037::vm::861::virt.vm::(_startUnderlyingVm) (vmId='f8e8c21a-b715-4a9f-a168-e4d080f5792d') Start
vm/f8e8c21a::DEBUG::2019-01-08 17:02:51,037::vm::864::virt.vm::(_startUnderlyingVm) (vmId='f8e8c21a-b715-4a9f-a168-e4d080f5792d') Acquiring incoming migration semaphore.
jsonrpc/1::DEBUG::2019-01-08 17:02:51,037::api::135::api::(method) FINISH create response={'status': {'message': 'Incoming migration limit exceeded', 'code': 82}}
jsonrpc/1::INFO::2019-01-08 17:02:51,037::api::52::api.virt::(method) FINISH create return={'status': {'message': 'Incoming migration limit exceeded', 'code': 82}} from=::ffff:xx.xx.xx.xx,60308, vmId=f8e8c21a-b715-4a9f-a168-e4d080f5792d
jsonrpc/1::DEBUG::2019-01-08 17:02:51,037::API::540::vds::(migrationCreate) Migration create - Failed
jsonrpc/1::DEBUG::2019-01-08 17:02:51,037::API::546::vds::(migrationCreate) Returning backwards compatible migration error code
jsonrpc/1::DEBUG::2019-01-08 17:02:51,038::api::135::api::(method) FINISH migrationCreate response={'status': {'message': 'Fatal error during migration', 'code': 12}}
jsonrpc/1::INFO::2019-01-08 17:02:51,038::api::52::api.virt::(method) FINISH migrationCreate return={'status': {'message': 'Fatal error during migration', 'code': 12}} from=::ffff:xx.xx.xx.xx,60308, vmId=f8e8c21a-b715-4a9f-a168-e4d080f5792d

Version-Release number of selected component (if applicable):
rhvm-4.2.7.5-0.1.el7ev.noarch
vdsm-4.20.43-1.el7ev.x86_64

The cluster is set with the 'minimal downtime' migration policy.

How reproducible:
100% in the user's environment

Steps to Reproduce:
1. Have the HE VM running together with two other VMs on the same host
2. Place the host into maintenance mode and let the VMs migrate to the second host
3. The HE VM is added to the re-run treatment on the same host where it was running, and it does not migrate automatically

Actual results:
The HE VM has to be migrated manually after the first two VMs start on the second host.

Expected results:
The third migration should be queued and triggered later.

(Originally by Javier Coscia)
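The 'Incoming migration limit exceeded' (code 82) error above is the destination vdsm refusing to create the VM because its incoming-migration semaphore is already at its limit. A minimal Python sketch of that pattern, assuming illustrative names (the limit value, migration_create, and start_incoming_vm are not vdsm's actual identifiers):

import threading

# Illustrative limit; real vdsm derives its value from configuration.
INCOMING_LIMIT = 2
incoming_semaphore = threading.BoundedSemaphore(INCOMING_LIMIT)

def start_incoming_vm(vm_id):
    pass  # stand-in for the real incoming-migration path

def migration_create(vm_id):
    """Destination-side VM creation for an incoming migration."""
    # Non-blocking acquire: a third concurrent incoming migration is
    # rejected immediately rather than queued on the destination.
    if not incoming_semaphore.acquire(blocking=False):
        return {'status': {'message': 'Incoming migration limit exceeded',
                           'code': 82}}
    try:
        start_incoming_vm(vm_id)
        return {'status': {'message': 'Done', 'code': 0}}
    finally:
        # Simplified: real code holds the slot for the whole migration.
        incoming_semaphore.release()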
It's common for migrating everything off a host to take several rounds of retries. Why was it done manually? Would it not retry at all for a long time, or were they just impatient? (Originally by michal.skrivanek)
The first two migrations are to host *101 (the log you included), but the HostedEngine VM goes to host *102. It's queued on the source host for 10 seconds (until 17:02:50,843) because of those first two migrations, then it proceeds, and apparently at that time there are too many incoming migrations on *102. Please include logs from there as well, but it looks likely it's just busy, so the behavior is as expected. The decision to migrate the HE VM is not made by the engine but possibly by the hosted-engine-ha-broker (no logs attached). That could be why the migration is not retriggered. (Originally by michal.skrivanek)
Martin, what's the broker's logic for handling a failed migration on local maintenance? I also thought you'd changed that to let the engine handle it, but apparently in this case it's still the broker calling vdsm, and then the scheduling of migrations does not work as well as it would if invoked via the engine. (Originally by michal.skrivanek)
The broker only tries once; the admin has to solve it when it fails. We wanted to change that, but it was de-prioritized and not finished. (Originally by Martin Sivak)
Thanks for the confirmation. The engine will take care of it, but the full sweep of the host is done only at a 5-minute interval. Once maintenance is requested we build a list of VMs and migrate all of them; if anything fails (after the built-in retries for regular migrations, after a single failure for the HE VM), in 5 minutes we go through the new list of remaining VMs again and start migrations. In this case the customer apparently did it manually within 1 minute, so the sweep didn't kick in. Either way, it looks to me like it works fine and as designed. Can we close the bug, Javier? (Originally by michal.skrivanek)
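Conceptually, the sweep described above behaves like the following loop. This is a simplified sketch only: the actual engine implementation is Java and event-driven, and try_migrate plus the host methods are hypothetical stand-ins.

import time

SWEEP_INTERVAL = 300  # seconds; the 5-minute interval mentioned above

def try_migrate(vm):
    pass  # stand-in for starting one migration

def maintenance_sweep(host):
    """Periodically re-scan a host in maintenance and restart migrations."""
    while host.has_running_vms():
        for vm in host.list_running_vms():
            # Regular migrations retry internally; the HE VM fails after a
            # single attempt and is only picked up again on the next sweep.
            try_migrate(vm)
        time.sleep(SWEEP_INTERVAL)
    host.set_status('Maintenance')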
Thanks for the comments, Michal. Let me pass this on to the user and ask them to give it a try and wait to see what behaviour they get. Leaving NI on my side. (Originally by Javier Coscia)
Aah, AFAICT the migration won't be rerun anyway, since it wasn't triggered by the engine and hence there is no failed command from the engine's perspective, so resourceManager.rerunFailedCommand() does nothing. Arik, does that sound right? (Originally by michal.skrivanek)
(In reply to Michal Skrivanek from comment #16)
> Aah, AFAICT the migration won't be rerun anyway, since it wasn't triggered
> by the engine and hence there is no failed command from the engine's
> perspective, so resourceManager.rerunFailedCommand() does nothing. Arik,
> does that sound right?

Right, the engine won't try to migrate the VM but relies on the HA agent to do that [1] - and therefore the engine won't trigger rerun attempts.

[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/MaintenanceVdsCommand.java#L139-L148

(Originally by Arik Hadas)
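In other words, the linked engine code hands the HE VM over to the HA agent rather than migrating it itself, so no failed engine command exists to rerun. A rough Python rendering of that branch (the real code is the Java linked above; migrate_with_reruns is a hypothetical helper):

def migrate_with_reruns(vm):
    pass  # stand-in for an engine-driven migration with rerun handling

def migrate_vms_for_maintenance(vms):
    for vm in vms:
        if vm.is_hosted_engine:
            # Deferred to the HA agent: the engine issues no migration
            # command here, so rerunFailedCommand() has nothing to rerun
            # when the agent's single attempt fails.
            continue
        migrate_with_reruns(vm)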
Hi Simone, this would fall under the ovirt-hosted-engine-ha plans. I see two possible solutions: either plan/finish comment #8 (call the engine API) or implement retry logic similar to the engine's within the agent. (Originally by michal.skrivanek)
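A sketch of the second option, agent-side retry logic. This is illustrative only: broker.migrate_vm, the attempt count, and the backoff are assumptions, not existing ovirt-hosted-engine-ha API.

import time

MAX_ATTEMPTS = 5   # illustrative
BACKOFF = 30       # seconds between attempts; illustrative

def migrate_with_retries(broker, vm_id, dst_host):
    """Retry the HE VM migration instead of giving up after one failure."""
    for _ in range(MAX_ATTEMPTS):
        result = broker.migrate_vm(vm_id, dst_host)  # hypothetical API
        if result['status']['code'] == 0:
            return True
        # e.g. code 82: destination busy with other incoming migrations;
        # wait and try again rather than leaving it to the admin.
        time.sleep(BACKOFF)
    return False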
Re-targeting to 4.3.1 since this BZ has not been proposed as a blocker for 4.3.0. If you think this bug should block 4.3.0, please re-target and set the blocker flag. (Originally by Sandro Bonazzola)
Moving to 4.3.2 as it has not been identified as a blocker for 4.3.1. (Originally by Sandro Bonazzola)
Not able to reproduce on our systems; we'll keep investigating. Not blocking 4.3.3. (Originally by Sandro Bonazzola)
(In reply to Michal Skrivanek from comment #18)
> Hi Simone, this would fall under the ovirt-hosted-engine-ha plans. I see
> two possible solutions: either plan/finish comment #8 (call the engine
> API) or implement retry logic similar to the engine's within the agent.

On the ovirt-ha-agent side we don't have API credentials but, on the other hand, the engine is already able to control hosted-engine VM migration. The issue arises because ovirt-ha-agent and the engine act on the engine VM almost at the same time. (Originally by Simone Tiraboschi)
Works for me on these components:
ovirt-engine-setup-4.3.5.4-0.1.el7.noarch
ovirt-hosted-engine-ha-2.3.3-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.11-1.el7ev.noarch
Linux 3.10.0-1060.el7.x86_64 #1 SMP Mon Jul 1 18:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.7 Beta (Maipo)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:2431