Bug 1664479 - Third VM fails to get migrated when host is placed into maintenance mode
Summary: Third VM fails to get migrated when host is placed into maintenance mode
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.2.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.4.0
Target Release: 4.4.0
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On: 1795672
Blocks: 1721362 1726988
 
Reported: 2019-01-08 22:59 UTC by Javier Coscia
Modified: 2020-11-13 06:08 UTC
CC: 9 users

Fixed In Version: ovirt-hosted-engine-setup-2.4.0, ovirt-hosted-engine-ha-2.4.0
Doc Type: Bug Fix
Doc Text:
When you use the engine ("Master") to set the high-availability host running the engine virtual machine (VM) to maintenance mode, the ovirt-ha-agent migrates the engine VM to another host. Previously, in specific cases, such as when these VMs had an old compatibility version, this type of migration failed. The current release fixes this problem.
Clone Of:
: 1721362 (view as bug list)
Environment:
Last Closed: 2020-08-04 13:16:51 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-




Links
System                 | ID             | Private | Priority | Status | Summary                                                         | Last Updated
Red Hat Product Errata | RHSA-2020:3247 | 0       | None     | None   | None                                                            | 2020-08-04 13:17:21 UTC
oVirt gerrit           | 99404          | 0       | None     | MERGED | Remove the support to start a migration                         | 2021-02-21 03:26:44 UTC
oVirt gerrit           | 99462          | 0       | None     | MERGED | Avoid entering local mainteance mode if the engine VM is there  | 2021-02-21 03:26:44 UTC
oVirt gerrit           | 99523          | 0       | None     | MERGED | he: avoid skipping HE VM migration on maintenance               | 2021-02-21 03:26:45 UTC
oVirt gerrit           | 100299         | 0       | None     | MERGED | he: avoid skipping HE VM migration on maintenance               | 2021-02-21 03:26:44 UTC
oVirt gerrit           | 100300         | 0       | None     | MERGED | Avoid entering local mainteance mode if the engine VM is there  | 2021-02-21 03:26:44 UTC
oVirt gerrit           | 100301         | 0       | None     | MERGED | Remove the support to start a migration                         | 2021-02-21 03:26:44 UTC

Description Javier Coscia 2019-01-08 22:59:50 UTC
Description of problem:


A third VM, in this case the HostedEngine VM, fails to migrate to another host in the cluster because of `Incoming migration limit exceeded`


Source Host:
=============
migsrc/f8e8c21a::INFO::2019-01-08 17:02:50,936::migration::473::virt.vm::(_startUnderlyingMigration) (vmId='f8e8c21a-b715-4a9f-a168-e4d080f5792d') Creation of destination VM took: 0 seconds
migsrc/f8e8c21a::ERROR::2019-01-08 17:02:50,937::migration::290::virt.vm::(_recover) (vmId='f8e8c21a-b715-4a9f-a168-e4d080f5792d') migration destination error: Fatal error during migration



Destination Host:
==================
vm/f8e8c21a::DEBUG::2019-01-08 17:02:51,037::vm::861::virt.vm::(_startUnderlyingVm) (vmId='f8e8c21a-b715-4a9f-a168-e4d080f5792d') Start
vm/f8e8c21a::DEBUG::2019-01-08 17:02:51,037::vm::864::virt.vm::(_startUnderlyingVm) (vmId='f8e8c21a-b715-4a9f-a168-e4d080f5792d') Acquiring incoming migration semaphore.
jsonrpc/1::DEBUG::2019-01-08 17:02:51,037::api::135::api::(method) FINISH create response={'status': {'message': 'Incoming migration limit exceeded', 'code': 82}}
jsonrpc/1::INFO::2019-01-08 17:02:51,037::api::52::api.virt::(method) FINISH create return={'status': {'message': 'Incoming migration limit exceeded', 'code': 82}} from=::ffff:xx.xx.xx.xx,60308, vmId=f8e8c21a-b715-4a9f-a168-e4d080f5792d
jsonrpc/1::DEBUG::2019-01-08 17:02:51,037::API::540::vds::(migrationCreate) Migration create - Failed
jsonrpc/1::DEBUG::2019-01-08 17:02:51,037::API::546::vds::(migrationCreate) Returning backwards compatible migration error code
jsonrpc/1::DEBUG::2019-01-08 17:02:51,038::api::135::api::(method) FINISH migrationCreate response={'status': {'message': 'Fatal error during migration', 'code': 12}}
jsonrpc/1::INFO::2019-01-08 17:02:51,038::api::52::api.virt::(method) FINISH migrationCreate return={'status': {'message': 'Fatal error during migration', 'code': 12}} from=::ffff:xx.xx.xx.xx,60308, vmId=f8e8c21a-b715-4a9f-a168-e4d080f5792d
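
The destination-side failure above is what happens when the host is already handling its maximum number of concurrent incoming migrations. A minimal Python sketch of that pattern follows; the limit value, function names, and helpers are assumptions for illustration only, not VDSM's actual code:

import threading

INCOMING_MIGRATION_LIMIT = 2  # assumed value for the sketch; the real limit is configurable

_incoming_sem = threading.BoundedSemaphore(INCOMING_MIGRATION_LIMIT)

def create_incoming_vm(vm_id):
    # Stub standing in for the real destination-side VM creation.
    pass

def migration_create(vm_id):
    # Non-blocking acquire: when the limit is already reached, the request is
    # rejected immediately (code 82 in the log above) instead of being queued.
    if not _incoming_sem.acquire(blocking=False):
        return {'status': {'code': 82, 'message': 'Incoming migration limit exceeded'}}
    try:
        create_incoming_vm(vm_id)
        return {'status': {'code': 0, 'message': 'Done'}}
    finally:
        # Simplification: a real implementation would hold the slot until the
        # migration actually completes, not just until the create call returns.
        _incoming_sem.release()

The destination then maps that rejection to the backwards-compatible 'Fatal error during migration' (code 12), which is the error the source host logs.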



Version-Release number of selected component (if applicable):


rhvm-4.2.7.5-0.1.el7ev.noarch

vdsm-4.20.43-1.el7ev.x86_64

The cluster is set with the 'minimal downtime' migration policy


How reproducible:

100% in user's environment

Steps to Reproduce:
1. Have the HE VM running together with 2 other VMs on the same host
2. Place the host into maintenance mode and let the VMs migrate to the second host
3. The HE VM is added to the re-run treatment on the same host where it was running and does not migrate automatically



Actual results:

The HE VM has to be migrated manually after the first 2 VMs are started on the second host

Expected results:

The third migration should be queued and triggered later

Comment 5 Michal Skrivanek 2019-01-09 05:59:37 UTC
It’s common that it takes several rounds of retries to migrate everything off a host. Why was it done manually? Would it not retry at all for a long time, or were they just impatient?

Comment 6 Michal Skrivanek 2019-01-09 08:04:24 UTC
The first two migrations go to host *101 (the log you included), but the HostedEngine VM goes to host *102. It is queued on the source host for 10s (until 17:02:50,843) because of those first two migrations; then it proceeds, and apparently at that time there are too many incoming migrations on *102. Please include logs from there as well, but it looks likely that it is simply busy, so the behavior is as expected.

The decision to migrate the HE VM is not made by the engine but possibly by the hosted-engine-ha-broker (no logs attached). That could be a reason why the migration is not retriggered.

Comment 7 Michal Skrivanek 2019-01-09 08:13:05 UTC
Martin, what's the logic in the broker for handling a failed migration on local maintenance? I also thought you had changed that to let the engine handle it, but apparently in this case it's still the broker calling vdsm, and then the scheduling of migrations does not work as well as it would if invoked via the engine.

Comment 8 Martin Sivák 2019-01-09 08:20:29 UTC
Broker only tries once, the admin has to solve it when it fails. We wanted to change it, but it was de-prioritized and not finished.

Comment 9 Michal Skrivanek 2019-01-09 09:43:54 UTC
Thanks for the confirmation. The engine will take care of it, but the whole sweep of the host is done only at a 5-minute interval. Once maintenance is requested we build a list of VMs and migrate all of them; if anything fails (for regular migrations including the retries, for the HE VM after a single failure), in 5 minutes we go through the new list of remaining VMs again and start migrations. In this case the customer apparently did it manually within 1 minute, so the sweep didn't kick in.
Either way, it looks to me like it works fine and as designed.
Can we close the bug, Javier?
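
For reference, a rough sketch of the sweep described above; the actual engine code is Java, so this is only an illustrative Python outline with assumed names:

import time

SWEEP_INTERVAL = 300  # the 5-minute interval mentioned above

def maintenance_sweep(host, list_vms_on_host, migrate_vm, host_is_empty):
    # Rebuild the list of VMs still on the host every round and ask for their
    # migration again; individual failures are simply picked up on the next sweep.
    while not host_is_empty(host):
        for vm in list_vms_on_host(host):
            try:
                migrate_vm(vm)
            except Exception:
                pass  # retried on the next sweep
        time.sleep(SWEEP_INTERVAL)

In this bug the HE VM migration was triggered by the HA agent rather than by the engine, so it never entered this retry loop (see comments 16 and 17 below).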

Comment 10 Javier Coscia 2019-01-09 13:15:00 UTC
Thanks for the comments, Michal. Let me pass this to the user, ask them to give it a try, and wait to see what behaviour they get. Leaving NI on my side.

Comment 16 Michal Skrivanek 2019-01-16 08:37:07 UTC
aah, AFAICT the migration won't be rerun anyway since it wasn't triggered by the engine, hence there is no failed command from the engine's perspective, and so resourceManager.rerunFailedCommand() does nothing. Arik, does that sound right?

Comment 17 Arik 2019-01-16 08:52:14 UTC
(In reply to Michal Skrivanek from comment #16)
> aah, AFAICT the migration won't be rerun anyway since it wasn't triggered by
> the engine, hence there is no failed command from the engine's perspective,
> and so resourceManager.rerunFailedCommand() does nothing. Arik, does that
> sound right?

Right, the engine won't try to migrate the VM but relies on the HA agent to do that [1], and therefore the engine won't trigger rerun attempts.

[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/MaintenanceVdsCommand.java#L139-L148

Comment 18 Michal Skrivanek 2019-01-16 12:21:31 UTC
Hi Simone, this would fall into the ovirt-hosted-engine-ha plans. I see 2 possible solutions: either plan/finish comment #8 (call the engine API) or implement retry logic similar to the engine's within the agent.
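
For reference, a rough Python sketch of the second option (retry logic inside the agent); function names and parameters here are illustrative assumptions, not the actual ovirt-hosted-engine-ha API:

import time

def migrate_engine_vm_with_retries(start_migration, migration_succeeded,
                                   retries=5, delay=60):
    # Keep retrying the engine VM migration instead of giving up after a single
    # failure, e.g. when the destination replies "Incoming migration limit
    # exceeded" because it is still busy with other incoming migrations.
    for _ in range(retries):
        start_migration()
        if migration_succeeded():
            return True
        time.sleep(delay)
    return False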

Comment 19 Sandro Bonazzola 2019-01-21 08:28:36 UTC
Re-targeting to 4.3.1 since this BZ has not been proposed as a blocker for 4.3.0.
If you think this bug should block 4.3.0, please re-target and set the blocker flag.

Comment 21 Sandro Bonazzola 2019-02-18 07:54:51 UTC
Moving to 4.3.2 as this was not identified as a blocker for 4.3.1.

Comment 22 Sandro Bonazzola 2019-03-20 08:17:29 UTC
Not able to reproduce on our systems; we'll keep investigating. Not blocking 4.3.3.

Comment 23 Simone Tiraboschi 2019-04-12 15:41:51 UTC
(In reply to Michal Skrivanek from comment #18)
> Hi Simone, this would fall into the ovirt-hosted-engine-ha plans. I see 2
> possible solutions: either plan/finish comment #8 (call the engine API) or
> implement retry logic similar to the engine's within the agent.

On the ovirt-ha-agent side we don't have API credentials but, on the other hand, the engine is already able to control the hosted-engine VM migration.
The issue arises because ovirt-ha-agent and the engine act on the engine VM almost at the same time.
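
For reference, the direction eventually taken by the linked gerrit patches follows from this: the agent stops starting the migration itself and avoids entering local maintenance while the engine VM is still running locally, leaving the migration to the engine. A minimal illustrative sketch (names are assumptions, not the actual agent code):

def should_enter_local_maintenance(engine_vm_is_running_here):
    # If the engine VM is still on this host, stay out of local maintenance and
    # let the engine migrate its own VM first; otherwise proceed as usual.
    return not engine_vm_is_running_here()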

Comment 29 Daniel Gur 2019-08-28 13:12:38 UTC
sync2jira

Comment 30 Daniel Gur 2019-08-28 13:16:50 UTC
sync2jira

Comment 33 RHV bug bot 2019-12-13 13:16:08 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 34 RHV bug bot 2019-12-20 17:45:39 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 35 RHV bug bot 2020-01-08 14:48:26 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 36 RHV bug bot 2020-01-08 15:13:55 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 37 RHV bug bot 2020-01-24 19:50:27 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 38 Nikolai Sednev 2020-03-01 14:21:30 UTC
NFS deployment on these components:
rhvm-appliance.x86_64 2:4.4-20200123.0.el8ev rhv-4.4.0                                               
sanlock-3.8.0-2.el8.x86_64
qemu-kvm-4.2.0-12.module+el8.2.0+5858+afd073bc.x86_64
vdsm-4.40.5-1.el8ev.x86_64
libvirt-client-6.0.0-7.module+el8.2.0+5869+c23fe68b.x86_64
ovirt-hosted-engine-setup-2.4.2-2.el8ev.noarch
ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch
Linux 4.18.0-183.el8.x86_64 #1 SMP Sun Feb 23 20:50:47 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 Beta (Ootpa)

Engine is:
Red Hat Enterprise Linux Server release 7.8 Beta (Maipo)
Linux 3.10.0-1123.el7.x86_64 #1 SMP Tue Jan 14 03:44:38 EST 2020 x86_64 x86_64 x86_64 GNU/Linux

Result: the engine VM successfully migrated away after the ha-host had been placed into maintenance.
The source ha-host was the SPM and it moved to the destination ha-host.
I followed the reproduction steps 4 times back and forth and the bug did not reproduce.
Moving to verified.

Comment 44 errata-xmlrpc 2020-08-04 13:16:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247

