Bug 1097341 - The start time for 'migration_max_time_per_gib_mem' appears to be calculated too early.
Summary: The start time for 'migration_max_time_per_gib_mem' appears to be calculated ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 3.3.3
Assignee: Vinzenz Feenstra [evilissimo]
QA Contact: meital avital
URL:
Whiteboard: virt
Depends On: 1090109 1097332
Blocks:
TreeView+ depends on / blocked
 
Reported: 2014-05-13 15:14 UTC by Chris Pelland
Modified: 2019-04-28 09:25 UTC (History)
16 users

Fixed In Version: vdsm-4.13.2-0.16.el6ev
Doc Type: Bug Fix
Doc Text:
* Previously, migration start time was captured at the start of the MigrationSourceThread process. This meant that the migration would fail if the virtual machine had to wait a long time to acquire the migration semaphore. Now, the migration start time is captured when migration begins.
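The change can be illustrated with a minimal sketch (hypothetical code; names are illustrative and vdsm's actual implementation differs): the start time is taken only after the migration semaphore is acquired, so time spent queued behind other migrations no longer counts against the timeout.

```python
import threading
import time

# Hypothetical sketch of the fix; not vdsm's actual code.
migration_semaphore = threading.Semaphore(2)  # cap on concurrent outgoing migrations

def migrate(vm_id, work_seconds=0.0):
    # Before the fix, start_time was captured roughly here, at the start of
    # the MigrationSourceThread, so time spent waiting for the semaphore was
    # charged against the migration timeout.
    with migration_semaphore:
        # After the fix, the clock starts only once the semaphore is held
        # and the migration actually begins.
        start_time = time.monotonic()
        time.sleep(work_seconds)  # stands in for the real migration work
        return time.monotonic() - start_time
```

With this ordering, a VM that waits a long time for the semaphore still gets its full timeout budget once its migration starts.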
Clone Of: 1097332
Environment:
Last Closed: 2014-05-27 08:57:42 UTC
oVirt Team: ---
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
vdsm.log (753.86 KB, application/x-gzip)
2014-05-20 09:45 UTC, Eldad Marciano


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 869483 0 None None None Never
Red Hat Knowledge Base (Solution) 873473 0 None None None Never
Red Hat Product Errata RHBA-2014:0548 0 normal SHIPPED_LIVE vdsm 3.3.3 bug fix update 2014-05-27 12:56:53 UTC
oVirt gerrit 27135 0 None None None Never
oVirt gerrit 27637 0 None MERGED virt: Capture migration start time after the semaphore was acquired Never

Comment 5 Eldad Marciano 2014-05-20 09:40:16 UTC
Reproduced the bug 100% of the time.

As described above, without the fix the 'migration_max_time_per_gib_mem' timeout is computed once for all VMs that vdsm should migrate.

With the fix, vdsm computes the timeout per VM being migrated.

To force the timeout calculation to fail, we measured how long it takes to migrate a single idle VM with 1 GB RAM (~9 seconds), then set 'migration_max_time_per_gib_mem' to 15 (9 seconds plus a buffer) so that migration would fail without the fix.
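The timeout scaling in this test setup can be sketched as follows (a simplification under the assumption that the setting means seconds allowed per GiB of guest RAM; the exact vdsm formula may differ):

```python
def migration_timeout(mem_mib, max_time_per_gib_mem=15):
    # Allowed migration time in seconds, scaled by guest memory:
    # 'migration_max_time_per_gib_mem' seconds for each GiB of RAM.
    # Simplified assumption, not vdsm's exact calculation.
    return max_time_per_gib_mem * mem_mib // 1024
```

Under this assumption, a 1 GiB idle VM (which took ~9 seconds to migrate here) gets a 15-second budget with the tuned setting, leaving only a small buffer before the migration is aborted.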


Using the same use case:
- migrating 6 VMs
- without the fix, 3 of them fail to migrate
- with the fix, all of them migrate

see the logs on failure (attached whole log):
Thread-131::DEBUG::2014-05-20 09:01:37,459::vm::377::vm.Vm::(_startUnderlyingMigration) vmId=`69da39b5-8633-4d2d-b469-54550bf67fef`::starting migration to qemu+tls://host27-rack06.scale.openstack.engineering.redhat.com/system with miguri tcp://host27-rack06.scale.openstack.engineering.redhat.com

Thread-133::DEBUG::2014-05-20 09:01:37,501::vm::377::vm.Vm::(_startUnderlyingMigration) vmId=`24bc780a-1dfc-4d8a-94ca-0bc65fa9b76b`::starting migration to qemu+tls://host27-rack06.scale.openstack.engineering.redhat.com/system with miguri tcp://host27-rack06.scale.openstack.engineering.redhat.com

Thread-136::DEBUG::2014-05-20 09:01:38,111::vm::377::vm.Vm::(_startUnderlyingMigration) vmId=`a722fcca-8224-450c-9261-70ae94b5711d`::starting migration to qemu+tls://host27-rack06.scale.openstack.engineering.redhat.com/system with miguri tcp://host27-rack06.scale.openstack.engineering.redhat.com

Thread-142::DEBUG::2014-05-20 09:01:53,905::vm::377::vm.Vm::(_startUnderlyingMigration) vmId=`51a42fb1-569f-4fa5-b306-91e24bcfedf9`::starting migration to qemu+tls://host27-rack06.scale.openstack.engineering.redhat.com/system with miguri tcp://host27-rack06.scale.openstack.engineering.redhat.com

Thread-150::DEBUG::2014-05-20 09:01:57,344::vm::377::vm.Vm::(_startUnderlyingMigration) vmId=`ffab1eb0-fdb8-46d3-a39e-73c8a76adf58`::starting migration to qemu+tls://host27-rack06.scale.openstack.engineering.redhat.com/system with miguri tcp://host27-rack06.scale.openstack.engineering.redhat.com

Thread-146::DEBUG::2014-05-20 09:01:57,352::vm::377::vm.Vm::(_startUnderlyingMigration) vmId=`fbca5fc9-c163-419d-8992-62cdc3d61fe6`::starting migration to qemu+tls://host27-rack06.scale.openstack.engineering.redhat.com/system with miguri tcp://host27-rack06.scale.openstack.engineering.redhat.com



Here we can see the timeout (computed in total, across all of the VMs) expiring:
Thread-156::WARNING::2014-05-20 09:02:03,909::vm::805::vm.Vm::(run) vmId=`51a42fb1-569f-4fa5-b306-91e24bcfedf9`::The migration took 26 seconds which is exceeding the configured maximum time for migrations of 15 seconds. The migration will be aborted.

Thread-158::WARNING::2014-05-20 09:02:07,362::vm::805::vm.Vm::(run) vmId=`ffab1eb0-fdb8-46d3-a39e-73c8a76adf58`::The migration took 27 seconds which is exceeding the configured maximum time for migrations of 15 seconds. The migration will be aborted.

Thread-160::WARNING::2014-05-20 09:02:07,369::vm::805::vm.Vm::(run) vmId=`fbca5fc9-c163-419d-8992-62cdc3d61fe6`::The migration took 28 seconds which is exceeding the configured maximum time for migrations of 15 seconds. The migration will be aborted.

Comment 6 Eldad Marciano 2014-05-20 09:45:00 UTC
Created attachment 897521 [details]
vdsm.log

Comment 7 Eldad Marciano 2014-05-20 14:47:26 UTC
- build 36.4 installed
- reduced the timeout in order to reproduce the problem for 1 GB RAM
- bug fixed
- migration time did not expire; multiple migrations passed
- verified the code fix in vm.py

Comment 9 errata-xmlrpc 2014-05-27 08:57:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0548.html

