Bug 1721362 - [downstream clone - 4.3.5] Third VM fails to get migrated when host is placed into maintenance mode
Summary: [downstream clone - 4.3.5] Third VM fails to get migrated when host is placed...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.2.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.3.5
Target Release: 4.3.5
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On: 1664479
Blocks:
 
Reported: 2019-06-18 05:55 UTC by RHV bug bot
Modified: 2020-08-03 15:29 UTC (History)
10 users

Fixed In Version: ovirt-hosted-engine-setup-2.3.11
Doc Type: Bug Fix
Doc Text:
When the host running the engine Virtual Machine was set into Maintenance Mode from the engine, the engine Virtual Machine was migrated by ovirt-ha-agent as an indirect effect of Maintenance Mode rather than by the engine itself. In this release, the engine has full control of the migration process.
Clone Of: 1664479
Environment:
Last Closed: 2019-08-12 11:53:28 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2019:2431 0 None None None 2019-08-12 11:53:40 UTC
oVirt gerrit 99404 0 master MERGED Remove the support to start a migration 2020-02-25 14:27:40 UTC
oVirt gerrit 99462 0 master MERGED Avoid entering local mainteance mode if the engine VM is there 2020-02-25 14:27:45 UTC
oVirt gerrit 99523 0 master MERGED he: avoid skipping HE VM migration on maintenance 2020-02-25 14:27:47 UTC
oVirt gerrit 100299 0 ovirt-engine-4.3 MERGED he: avoid skipping HE VM migration on maintenance 2020-02-25 14:27:40 UTC
oVirt gerrit 100300 0 ovirt-hosted-engine-setup-2.3 MERGED Avoid entering local mainteance mode if the engine VM is there 2020-02-25 14:27:47 UTC
oVirt gerrit 100301 0 v2.3.z MERGED Remove the support to start a migration 2020-02-25 14:27:45 UTC

Description RHV bug bot 2019-06-18 05:55:33 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1664479 +++
======================================================================

Description of problem:


A third VM, in this case the HostedEngine VM, fails to migrate to another host in the cluster because of `Incoming migration limit exceeded`


Source Host:
=============
migsrc/f8e8c21a::INFO::2019-01-08 17:02:50,936::migration::473::virt.vm::(_startUnderlyingMigration) (vmId='f8e8c21a-b715-4a9f-a168-e4d080f5792d') Creation of destination VM took: 0 seconds
migsrc/f8e8c21a::ERROR::2019-01-08 17:02:50,937::migration::290::virt.vm::(_recover) (vmId='f8e8c21a-b715-4a9f-a168-e4d080f5792d') migration destination error: Fatal error during migration



Destination Host:
==================
vm/f8e8c21a::DEBUG::2019-01-08 17:02:51,037::vm::861::virt.vm::(_startUnderlyingVm) (vmId='f8e8c21a-b715-4a9f-a168-e4d080f5792d') Start
vm/f8e8c21a::DEBUG::2019-01-08 17:02:51,037::vm::864::virt.vm::(_startUnderlyingVm) (vmId='f8e8c21a-b715-4a9f-a168-e4d080f5792d') Acquiring incoming migration semaphore.
jsonrpc/1::DEBUG::2019-01-08 17:02:51,037::api::135::api::(method) FINISH create response={'status': {'message': 'Incoming migration limit exceeded', 'code': 82}}
jsonrpc/1::INFO::2019-01-08 17:02:51,037::api::52::api.virt::(method) FINISH create return={'status': {'message': 'Incoming migration limit exceeded', 'code': 82}} from=::ffff:xx.xx.xx.xx,60308, vmId=f8e8c21a-b715-4a9f-a168-e4d080f5792d
jsonrpc/1::DEBUG::2019-01-08 17:02:51,037::API::540::vds::(migrationCreate) Migration create - Failed
jsonrpc/1::DEBUG::2019-01-08 17:02:51,037::API::546::vds::(migrationCreate) Returning backwards compatible migration error code
jsonrpc/1::DEBUG::2019-01-08 17:02:51,038::api::135::api::(method) FINISH migrationCreate response={'status': {'message': 'Fatal error during migration', 'code': 12}}
jsonrpc/1::INFO::2019-01-08 17:02:51,038::api::52::api.virt::(method) FINISH migrationCreate return={'status': {'message': 'Fatal error during migration', 'code': 12}} from=::ffff:xx.xx.xx.xx,60308, vmId=f8e8c21a-b715-4a9f-a168-e4d080f5792d
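The destination-host behavior above (an immediate `code 82` refusal rather than queueing) matches a bounded-semaphore admission check on incoming migrations. The following is a toy model for illustration only, with assumed names and an assumed limit, not the actual vdsm code:

```python
import threading

INCOMING_MIGRATION_LIMIT = 2  # assumed value for illustration

class MigrationDestination:
    """Toy model of a destination host admitting incoming migrations."""

    def __init__(self, limit=INCOMING_MIGRATION_LIMIT):
        self._sem = threading.BoundedSemaphore(limit)

    def migration_create(self, vm_id):
        # Non-blocking acquire: refuse immediately instead of queueing,
        # mirroring the 'Incoming migration limit exceeded' response.
        if not self._sem.acquire(blocking=False):
            return {'status': {'message': 'Incoming migration limit exceeded',
                               'code': 82}}
        return {'status': {'message': 'Done', 'code': 0}}

    def migration_finished(self, vm_id):
        # Free a slot once a migration completes or is aborted.
        self._sem.release()
```

Under this model, a third concurrent create is rejected until one of the first two finishes, which is why the source host is expected to retry later rather than treat the refusal as fatal.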



Version-Release number of selected component (if applicable):


rhvm-4.2.7.5-0.1.el7ev.noarch

vdsm-4.20.43-1.el7ev.x86_64

Cluster is set with 'minimal downtime' migration policy 


How reproducible:

100% in user's environment

Steps to Reproduce:
1. Have the HE VM running with two other VMs on the same host
2. Place the host into maintenance mode and let the VMs migrate to the second host
3. The HE VM is added to the re-run treatment on the same host where it was running and is not migrated automatically



Actual results:

The HE VM has to be migrated manually after the first two VMs are started on the second host

Expected results:

The third migration should be queued and triggered later

(Originally by Javier Coscia)

Comment 5 RHV bug bot 2019-06-18 05:55:42 UTC
It commonly takes several rounds of tries to migrate everything off a host. Why was it done manually? Would it not retry at all for a long time, or were they just impatient?

(Originally by michal.skrivanek)

Comment 6 RHV bug bot 2019-06-18 05:55:44 UTC
The first two migrations are to the *101 host (the log you included), but the HostedEngine VM goes to host *102. It's queued on the source host for 10s (until 17:02:50,843) because of those first two migrations; then it proceeds, and apparently at that time there are too many incoming migrations on *102. Please include logs from there as well, but it looks likely it's just busy. So the behavior is as expected.

The decision to migrate the HE VM is not made by the engine but possibly by the hosted-engine-ha-broker (no logs attached). That could be the reason why the migration is not retriggered.

(Originally by michal.skrivanek)

Comment 7 RHV bug bot 2019-06-18 05:55:46 UTC
Martin, what's the logic in the broker to handle a failed migration on local maintenance? I also thought you'd changed that to let the engine handle it, but apparently in this case it's still the broker calling vdsm, and then the scheduling of migrations does not work as well as it would if invoked via the engine.

(Originally by michal.skrivanek)

Comment 8 RHV bug bot 2019-06-18 05:55:47 UTC
Broker only tries once, the admin has to solve it when it fails. We wanted to change it, but it was de-prioritized and not finished.

(Originally by Martin Sivak)

Comment 9 RHV bug bot 2019-06-18 05:55:49 UTC
Thanks for the confirmation. The engine will take care of it, but the whole sweep of the host is done only at a 5-minute interval. Once maintenance is requested we build a list of VMs and migrate all of them; if anything fails (after the retries for regular migrations, after a single failure for the HE VM), in 5 minutes we go through the new list of remaining VMs again and start migrations. In this case the customer apparently did it manually within 1 minute, so it didn't kick in.
Either way, it looks to me like it works fine and as designed.
Can we close the bug, Javier?

(Originally by michal.skrivanek)
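The sweep described in comment 9 amounts to a periodic retry loop over whatever VMs are still on the host. A rough sketch under assumed names (`try_migrate` and `still_on_host` stand in for the engine's scheduling and inventory calls; this is not the actual ovirt-engine logic):

```python
import time

SWEEP_INTERVAL = 300  # seconds; the 5-minute interval from comment 9

def maintenance_sweep(host, try_migrate, still_on_host, max_rounds=10,
                      sleep=time.sleep):
    """Repeatedly attempt to migrate every VM left on a host in maintenance.

    try_migrate(vm) and still_on_host(host) are assumed callbacks; a real
    implementation would consult the engine's database and scheduler.
    """
    for _ in range(max_rounds):
        remaining = still_on_host(host)
        if not remaining:
            return True  # host is empty, maintenance can complete
        for vm in remaining:
            try_migrate(vm)  # failures are simply retried on the next sweep
        sleep(SWEEP_INTERVAL)
    return False
```

The key property is that each round rebuilds the list of remaining VMs, so a migration that failed earlier (the HE VM here) would be picked up again, provided it is in the engine's list at all.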

Comment 10 RHV bug bot 2019-06-18 05:55:51 UTC
Thanks for the comments, Michal. Let me pass this to the user and ask them to give it a try, and we'll wait to see what behaviour they get. Leaving NI on my side.

(Originally by Javier Coscia)

Comment 16 RHV bug bot 2019-06-18 05:56:01 UTC
aah, AFAICT the migration won't be rerun anyway since it wasn't triggered by the engine; hence there is no failed command from the engine's perspective, and so resourceManager.rerunFailedCommand() does nothing. Arik, does that sound right?

(Originally by michal.skrivanek)

Comment 17 RHV bug bot 2019-06-18 05:56:03 UTC
(In reply to Michal Skrivanek from comment #16)
> aah, AFAICT the migration won't be rerun anyway since it wasn't triggered
> by the engine; hence there is no failed command from the engine's
> perspective, and so resourceManager.rerunFailedCommand() does nothing.
> Arik, does that sound right?

Right, the engine won't try to migrate the VM but relies on the HA agent to do that [1], and therefore the engine won't trigger rerun attempts

[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/MaintenanceVdsCommand.java#L139-L148

(Originally by Arik Hadas)
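Comments 16 and 17 boil down to one invariant: the engine only reruns migrations it recorded as its own commands, so a broker-initiated migration leaves nothing to rerun. A toy illustration of that bookkeeping (hypothetical names, not the real ovirt-engine classes):

```python
class ResourceManager:
    """Toy model of why rerunFailedCommand() is a no-op for the HE VM.

    Only engine-initiated migrations register a command; a migration
    triggered externally by ovirt-ha-agent never does, so a failure
    leaves no command record to rerun.
    """

    def __init__(self):
        self._commands = {}  # vm_id -> command state, engine-initiated only

    def start_migration(self, vm_id):
        # The engine records a command when it starts a migration itself.
        self._commands[vm_id] = {'vm': vm_id, 'reruns': 0}

    def rerun_failed_command(self, vm_id):
        cmd = self._commands.get(vm_id)
        if cmd is None:
            return False  # not engine-initiated: silently does nothing
        cmd['reruns'] += 1
        return True
```

This matches the fix direction in the linked patches: give the engine full control of the HE VM migration so its normal rerun machinery applies.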

Comment 18 RHV bug bot 2019-06-18 05:56:05 UTC
Hi Simone, this would fall into the ovirt-hosted-engine-ha plans. I see two possible solutions: either plan/finish comment #8 (call the engine API) or implement retry logic within the agent, similar to the engine's.

(Originally by michal.skrivanek)

Comment 19 RHV bug bot 2019-06-18 05:56:07 UTC
Re-targeting to 4.3.1 since this BZ has not been proposed as a blocker for 4.3.0.
If you think this bug should block 4.3.0, please re-target and set the blocker flag.

(Originally by Sandro Bonazzola)

Comment 21 RHV bug bot 2019-06-18 05:56:11 UTC
Moving to 4.3.2 as this was not identified as a blocker for 4.3.1.

(Originally by Sandro Bonazzola)

Comment 22 RHV bug bot 2019-06-18 05:56:12 UTC
Not able to reproduce on our systems; we'll keep investigating. Not blocking 4.3.3.

(Originally by Sandro Bonazzola)

Comment 23 RHV bug bot 2019-06-18 05:56:14 UTC
(In reply to Michal Skrivanek from comment #18)
> Hi Simone, this would fall into the ovirt-hosted-engine-ha plans. I see
> two possible solutions: either plan/finish comment #8 (call the engine
> API) or implement retry logic within the agent, similar to the engine's.

On the ovirt-ha-agent side we don't have API credentials; on the other hand, the engine is already able to control hosted-engine VM migration.
The issue arises because ovirt-ha-agent and the engine act on the engine VM at almost the same time.

(Originally by Simone Tiraboschi)

Comment 29 Nikolai Sednev 2019-07-15 14:30:45 UTC
Works for me on these components:
ovirt-engine-setup-4.3.5.4-0.1.el7.noarch
ovirt-hosted-engine-ha-2.3.3-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.11-1.el7ev.noarch
Linux 3.10.0-1060.el7.x86_64 #1 SMP Mon Jul 1 18:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.7 Beta (Maipo)

Comment 33 errata-xmlrpc 2019-08-12 11:53:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2431

Comment 34 Daniel Gur 2019-08-28 13:14:15 UTC
sync2jira

Comment 35 Daniel Gur 2019-08-28 13:18:32 UTC
sync2jira

