970711 – [RFE] Report downtime for each live migration

Bug 970711 - [RFE] Report downtime for each live migration

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	RFEs
Sub Component:
Version:	3.1.4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	ovirt-3.6.0-rc
Target Release:	3.6.0
Assignee:	Shahar Havivi
QA Contact:	Israel Pinto
Docs Contact:
URL:
Whiteboard:
Depends On:	1063486 1063724 1138570 1162588 1208772 1213434
Blocks:

Reported:	2013-06-04 16:51 UTC by Julio Entrena Perez
Modified:	2020-04-15 14:08 UTC
CC List:	12 users
Fixed In Version:
Doc Type:	Release Note
Doc Text:	With this release, the downtime during a virtual machine migration is reported. This is the duration of the handover needed to transfer execution from the source host to the destination host (the last phase of migration). Note: as part of this enhancement, stricter clock synchronization is enforced between the Manager and hosts. Previously, an alert was raised when a host was 5 minutes off the Manager time; now the threshold is 100 ms. The reason is that accurate downtime reporting requires the source and destination hosts to have the same clock time. This may cause many new alerts in environments that are not configured properly. The configuration option (used in engine-config) has changed from 'HostTimeDriftInSec' to 'HostTimeDriftInMS'.
Clone Of:
Environment:
Last Closed:	2016-03-09 20:31:29 UTC
oVirt Team:	Virt
Target Upstream Version:
Embargoed:
Flags:	istein: needinfo+ sherold: Triaged+

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	406563	None	None	None	Never
Red Hat Product Errata	RHEA-2016:0376	normal	SHIPPED_LIVE	Red Hat Enterprise Virtualization Manager 3.6.0	2016-03-10 01:20:52 UTC
oVirt gerrit	37075	master	ABANDONED	RFE: Report downtime for each live migration	Never
oVirt gerrit	38057	master	ABANDONED	RFE: Report downtime for each live migration	Never
oVirt gerrit	38405	master	ABANDONED	Set default sync time between hosts to 100ms	Never
oVirt gerrit	40100	master	MERGED	RFE: Report downtime for each live migration	Never
oVirt gerrit	40103	master	ABANDONED	RFE: Report downtime for each live migration	Never
oVirt gerrit	41415	master	MERGED	Report downtime for each live migration	Never

Comment 1 Michal Skrivanek 2013-06-12 05:23:19 UTC

libvirt provides "expected downtime" as part of job statistics, would that be enough? It's not the exact number though.

Comment 4 Julio Entrena Perez 2013-06-12 09:34:24 UTC

(In reply to Michal Skrivanek from comment #1)
> libvirt provides "expected downtime" as part of job statistics, would that
> be enough? It's not the exact number though.

No, this request is for the webadmin portal to report, in the Events section, the downtime incurred by each live migration.

Comment 5 Michal Skrivanek 2013-07-03 10:40:40 UTC

(In reply to Julio Entrena Perez from comment #4)
> (In reply to Michal Skrivanek from comment #1)
> > libvirt provides "expected downtime" as part of job statistics, would that
> > be enough? It's not the exact number though.
> 
> No, this request is for the webadmin portal to report, in the Events section,
> the downtime incurred by each live migration.
The need to see that in the portal is understood.
Correlating timestamps from the source and destination hosts would be difficult. We poll for task status periodically, so we can use the last polled value as a very close estimate. In most cases this should correspond to the real downtime.

Other possibility is to report it afterwards.
If we do it in RHEV-M, it may still be misleading if the source and destination host times differ. IMHO libvirt/qemu should provide such a value if it needs to be really exact.
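
A minimal Python sketch of the "use the last polled value" idea from this comment. The sample dictionaries and the key `expected_downtime_ms` are hypothetical stand-ins for whatever the periodic status poll returns; they are not the actual libvirt or VDSM field names.

```python
from typing import Iterable, Optional


def last_expected_downtime_ms(samples: Iterable[dict]) -> Optional[int]:
    """Return the most recent expected-downtime value seen while polling.

    Each sample is a hypothetical dict produced by one status poll.
    Polls that did not report a value are skipped.
    """
    last = None
    for sample in samples:
        value = sample.get("expected_downtime_ms")  # hypothetical key
        if value is not None:
            last = value
    return last


# Example: four polls during one migration, one of them without a value.
polled = [
    {"expected_downtime_ms": 300},
    {"expected_downtime_ms": 120},
    {},
    {"expected_downtime_ms": 95},
]
estimate = last_expected_downtime_ms(polled)
```

This only yields an estimate, as the comment says; an exact figure would have to come from libvirt/qemu itself.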

Comment 6 Julio Entrena Perez 2013-07-03 10:51:38 UTC

(In reply to Michal Skrivanek from comment #5)

> Other possibility is to report it afterwards.

That's indeed what the customer expects: downtime reported after live migration completion.

Currently RHEV-M webadmin portal reports the following in the Events section after a successful live migration:

Migration complete (VM: vm_name, Source Host: host_name)

They expect to see:

Migration complete (VM: vm_name, Source Host: host_name, Downtime xxx ms)
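
The requested event text could be produced along these lines. This is only an illustrative Python sketch of the message format quoted above; the function name and parameters are hypothetical and do not reflect the actual RHEV-M audit-log code.

```python
from typing import Optional


def migration_complete_message(
    vm_name: str, source_host: str, downtime_ms: Optional[int] = None
) -> str:
    """Build the Events-section text for a finished live migration.

    Without a downtime value the current message is produced; with one,
    the downtime is appended in the format requested in this comment.
    """
    details = f"VM: {vm_name}, Source Host: {source_host}"
    if downtime_ms is not None:
        details += f", Downtime {downtime_ms} ms"
    return f"Migration complete ({details})"
```

For example, `migration_complete_message("vm_name", "host_name", 120)` would give `Migration complete (VM: vm_name, Source Host: host_name, Downtime 120 ms)`.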

Comment 7 Shahar Havivi 2013-07-03 13:16:00 UTC

posted at: http://gerrit.ovirt.org/#/c/16399

Comment 8 Julio Entrena Perez 2013-07-12 12:36:58 UTC

(In reply to Shahar Havivi from comment #7)
> posted at: http://gerrit.ovirt.org/#/c/16399

Is this measuring the time elapsed between the VM being suspended on the source host and being resumed on the destination host?

The proposed patch seems to measure the duration of the entire live migration.

This request is to report the *downtime* experienced by the VM during the live migration, that is, the amount of time the VM is not running on either host, or in other words, the time between the "Suspended" event on the source host and the "Resumed" event on the destination host.
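
To make the definition concrete, here is a small Python sketch that computes downtime from the two event timestamps. The function name and the example timestamps are made up for illustration, and the result is only meaningful if both hosts' clocks are synchronized.

```python
from datetime import datetime, timedelta, timezone


def migration_downtime_ms(suspended_at: datetime, resumed_at: datetime) -> float:
    """Milliseconds between 'Suspended' on the source and 'Resumed' on the destination."""
    if resumed_at < suspended_at:
        # Usually a sign of clock skew between the two hosts.
        raise ValueError("resume timestamp precedes suspend timestamp")
    return (resumed_at - suspended_at) / timedelta(milliseconds=1)


# Illustrative timestamps only: suspended at .250 s, resumed at .730 s.
suspended = datetime(2013, 7, 12, 12, 0, 0, 250000, tzinfo=timezone.utc)
resumed = datetime(2013, 7, 12, 12, 0, 0, 730000, tzinfo=timezone.utc)
downtime = migration_downtime_ms(suspended, resumed)  # 480.0
```

Note that this differs from the total migration duration, which would be measured from the start of the migration rather than from the "Suspended" event.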

Comment 9 Shahar Havivi 2013-07-14 07:35:17 UTC

(In reply to Julio Entrena Perez from comment #8)
You are right,
There will be a different patch for this bug.

This patch may still be posted because it gives the user additional information about how long the migration took.

Comment 10 Julio Entrena Perez 2013-07-15 08:47:14 UTC

(In reply to Shahar Havivi from comment #9)
> (In reply to Julio Entrena Perez from comment #8)
> You are right,
> There will be a different patch for this bug.
Thanks for clarifying this.
> 
> This patch may still be posted because it gives the user additional
> information about how long the migration took.
Thanks Shahar, the customer would also welcome RHEV-M reporting the duration of the entire live migration, in addition to the downtime incurred during it.

Comment 12 Arthur Berezin 2014-01-30 16:48:27 UTC

Scott, is this scoped for 3.5?

Comment 16 Michal Skrivanek 2015-03-04 11:03:48 UTC

One more thing: we should ensure the hosts' clocks are in sync. Currently we alert when the drift is 300 s; that's too much, we need something like 100 ms.

Comment 17 Michal Skrivanek 2015-03-05 14:59:54 UTC

setting Release note flag since we must mention the change of time drift tolerance from 5 mins to 100ms

Comment 18 Omer Frenkel 2015-03-08 11:58:19 UTC

(In reply to Michal Skrivanek from comment #17)
> setting Release note flag since we must mention the change of time drift
> tolerance from 5 mins to 100ms

maybe also worth noting that accordingly, the name of the configuration option changed from
HostTimeDriftInSec
to
HostTimeDriftInMS

(when using engine-config)
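
A rough Python sketch of the tighter drift check discussed here. The constant mirrors the 100 ms value and the option name mentioned in this thread; the functions and timestamps are hypothetical and are not the engine's actual implementation.

```python
from datetime import datetime, timedelta, timezone

# Tolerance discussed in this bug (option name per comment 18); the old
# HostTimeDriftInSec alert threshold was 300 seconds.
HOST_TIME_DRIFT_IN_MS = 100


def drift_ms(engine_time: datetime, host_time: datetime) -> float:
    """Absolute clock difference between the Manager and a host, in ms."""
    return abs(host_time - engine_time) / timedelta(milliseconds=1)


def exceeds_drift(
    engine_time: datetime,
    host_time: datetime,
    tolerance_ms: float = HOST_TIME_DRIFT_IN_MS,
) -> bool:
    """True when the host clock is further from the Manager than allowed."""
    return drift_ms(engine_time, host_time) > tolerance_ms


# Illustrative values: a host 150 ms ahead passes the old 300 s limit
# but trips the new 100 ms one.
engine_now = datetime(2015, 3, 8, 12, 0, 0, tzinfo=timezone.utc)
host_now = engine_now + timedelta(milliseconds=150)
alert_new = exceeds_drift(engine_now, host_now)
alert_old = exceeds_drift(engine_now, host_now, 300 * 1000)
```

A skew of this size matters because it would be added to, or subtracted from, any downtime computed from timestamps taken on two different hosts.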

Comment 19 Michal Skrivanek 2015-04-21 07:21:22 UTC

See the enhancement in libvirt reporting (bug 1213434); it should provide more accurate numbers.

Comment 21 Max Kovgan 2015-06-28 14:12:29 UTC

ovirt-3.6.0-3 release

Comment 22 Israel Pinto 2015-12-10 13:18:34 UTC

Verified with:
Setup:
RHEVM Version: 3.6.1.2-0.1.el6
vdsm: vdsm-4.17.13-1.el7ev
libvirt: libvirt-1.2.17-13.el7_2.2

Test cases according to the Polarion test case.

Results: PASS

Comment 24 errata-xmlrpc 2016-03-09 20:31:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0376.html