Bug 970711

Summary:	[RFE] Report downtime for each live migration
Product:	Red Hat Enterprise Virtualization Manager	Reporter:	Julio Entrena Perez <jentrena>
Component:	RFEs	Assignee:	Shahar Havivi <shavivi>
Status:	CLOSED ERRATA	QA Contact:	Israel Pinto <ipinto>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	3.1.4	CC:	iheim, ipinto, istein, jentrena, lpeer, michal.skrivanek, mtessun, nbarcet, pdwyer, rbalakri, shavivi, sherold
Target Milestone:	ovirt-3.6.0-rc	Keywords:	FutureFeature, Improvement
Target Release:	3.6.0	Flags:	istein: needinfo+ sherold: Triaged+
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Release Note
Doc Text:	With this release, the downtime during a virtual machine migration is reported. This is the duration of the handover time needed to transfer the execution from the source host to the destination host (the last phase of migration). Note: as part of this enhancement, stricter clock synchronization is enforced between the Manager and hosts. Previously, there was an alert when the host was 5 minutes off the Manager time; now it is 100 ms. The reason is that for accurate downtime reporting the source and destination hosts must have the same clock time. This may cause a lot of new alerts in environments which are not configured properly. The configuration option (used in engine-config) has changed from 'HostTimeDriftInSec' to 'HostTimeDriftInMS'.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-03-09 20:31:29 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	Virt	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1063486, 1063724, 1138570, 1162588, 1208772, 1213434
Bug Blocks:

Comment 1 Michal Skrivanek 2013-06-12 05:23:19 UTC

libvirt provides "expected downtime" as part of job statistics, would that be enough? It's not the exact number though.

Comment 4 Julio Entrena Perez 2013-06-12 09:34:24 UTC

(In reply to Michal Skrivanek from comment #1)
> libvirt provides "expected downtime" as part of job statistics, would that
> be enough? It's not the exact number though.

No, this request is for webadmin portal to report in the Events section the incurred downtime by each live migration.

Comment 5 Michal Skrivanek 2013-07-03 10:40:40 UTC

(In reply to Julio Entrena Perez from comment #4)
> (In reply to Michal Skrivanek from comment #1)
> > libvirt provides "expected downtime" as part of job statistics, would that
> > be enough? It's not the exact number though.
> 
> No, this request is for webadmin portal to report in the Events section the
> incurred downtime by each live migration.
The need to see that in the portal is understood.
Correlating timestamps from the source and destination hosts would be difficult. We poll for task status periodically, so we can use the last polled value as a really close estimate. In most cases this should correspond to the real downtime.

Other possibility is to report it afterwards.
If we do it in RHEV-M, it may still be misleading if the source and destination host times differ. IMHO libvirt/qemu should provide such a value if it needs to be really exact.

Comment 6 Julio Entrena Perez 2013-07-03 10:51:38 UTC

(In reply to Michal Skrivanek from comment #5)

> Other possibility is to report it afterwards.

That's indeed what the customer expects: downtime reported after live migration completion.

Currently RHEV-M webadmin portal reports the following in the Events section after a successful live migration:

Migration complete (VM: vm_name, Source Host: host_name)

They expect to see:

Migration complete (VM: vm_name, Source Host: host_name, Downtime xxx ms)

Comment 7 Shahar Havivi 2013-07-03 13:16:00 UTC

posted at: http://gerrit.ovirt.org/#/c/16399

Comment 8 Julio Entrena Perez 2013-07-12 12:36:58 UTC

(In reply to Shahar Havivi from comment #7)
> posted at: http://gerrit.ovirt.org/#/c/16399

Is this measuring the time elapsed between the VM being suspended on the source host and the VM being resumed on the destination host?

The proposed patch seems to measure the duration of the entire live migration.

This request is to report the *downtime* experienced by the VM during the live migration, that is the amount of time the VM is not running in any of the hosts, or in other words, the amount of time between the "Suspended" event in source host and the "Resumed" event in destination host.
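
As an illustration of the definition above, here is a minimal Python sketch, not taken from any proposed patch. It assumes the two event timestamps are already available as timezone-aware values and that both hosts' clocks agree. The function names and the sample timestamps are hypothetical; the message format follows the one requested in comment 6.

```python
from datetime import datetime, timedelta, timezone


def downtime_ms(suspended_at: datetime, resumed_at: datetime) -> float:
    """Milliseconds between "Suspended" on the source and "Resumed" on the destination."""
    if resumed_at < suspended_at:
        # A negative value can only come from bad input or unsynchronized clocks.
        raise ValueError("resume precedes suspend; host clocks are probably out of sync")
    return (resumed_at - suspended_at) / timedelta(milliseconds=1)


def completion_message(vm: str, source_host: str, ms: float) -> str:
    """Build the event text in the format requested in comment 6."""
    return f"Migration complete (VM: {vm}, Source Host: {source_host}, Downtime {ms:.0f} ms)"


# Made-up example timestamps, both in UTC.
suspended = datetime(2013, 7, 12, 12, 0, 0, 250_000, tzinfo=timezone.utc)
resumed = datetime(2013, 7, 12, 12, 0, 0, 730_000, tzinfo=timezone.utc)
print(completion_message("vm_name", "host_name", downtime_ms(suspended, resumed)))
# prints: Migration complete (VM: vm_name, Source Host: host_name, Downtime 480 ms)
```

The duration of the whole migration would be a different interval (migration start to migration end) and is not what this computes.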

Comment 9 Shahar Havivi 2013-07-14 07:35:17 UTC

(In reply to Julio Entrena Perez from comment #8)
You are right,
There will be different patch for this bug.

This patch may still be posted because it gives the user additional info about how long the migration took.

Comment 10 Julio Entrena Perez 2013-07-15 08:47:14 UTC

(In reply to Shahar Havivi from comment #9)
> (In reply to Julio Entrena Perez from comment #8)
> You are right,
> There will be different patch for this bug.
Thanks for clarifying this.
> 
> This patch may still be posted because it gives the user additional info
> about how long the migration took.
Thanks Shahar, the customer would welcome RHEV-M reporting the duration of the entire live migration too, in addition to the downtime incurred during it.

Comment 12 Arthur Berezin 2014-01-30 16:48:27 UTC

Scott, is this scoped for 3.5?

Comment 16 Michal Skrivanek 2015-03-04 11:03:48 UTC

One more thing: we should ensure the hosts' clocks are in sync. Currently we alert when the drift is 300s; that's too much, we need something like 100ms.
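
To show why the tolerance matters, here is a rough Python sketch. It is not the engine's actual check; the function names, offsets, and sample values are made up for illustration. It assumes the downtime is computed from a suspend timestamp taken on the source host and a resume timestamp taken on the destination host, so any offset between the two clocks goes straight into the reported number.

```python
OLD_MAX_DRIFT_MS = 300 * 1000  # current alert threshold: 300 s
NEW_MAX_DRIFT_MS = 100         # suggested threshold: 100 ms


def drift_alert(engine_clock_ms: int, host_clock_ms: int, max_drift_ms: int) -> bool:
    """True when the host clock differs from the engine clock by more than the limit."""
    return abs(engine_clock_ms - host_clock_ms) > max_drift_ms


def measured_downtime_ms(true_downtime_ms: int, src_offset_ms: int, dst_offset_ms: int) -> int:
    """Downtime as seen when each host stamps its own event with its own clock.

    Offsets are each host's clock error relative to true time (hypothetical values).
    """
    return true_downtime_ms + dst_offset_ms - src_offset_ms


# Destination clock runs 2 s ahead; source clock is correct.
print(measured_downtime_ms(200, src_offset_ms=0, dst_offset_ms=2000))  # 2200
print(drift_alert(0, 2000, OLD_MAX_DRIFT_MS))  # False: no alert with the 300 s limit
print(drift_alert(0, 2000, NEW_MAX_DRIFT_MS))  # True: alert with the 100 ms limit
```

With the 300 s limit, a 2 s offset goes unnoticed while inflating a 200 ms downtime to 2200 ms in this example.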

Comment 17 Michal Skrivanek 2015-03-05 14:59:54 UTC

setting Release note flag since we must mention the change of time drift tolerance from 5 mins to 100ms

Comment 18 Omer Frenkel 2015-03-08 11:58:19 UTC

(In reply to Michal Skrivanek from comment #17)
> setting Release note flag since we must mention the change of time drift
> tolerance from 5 mins to 100ms

maybe also worth noting that accordingly, the name of the configuration option changed from
HostTimeDriftInSec
to
HostTimeDriftInMS

(when using engine-config)

Comment 19 Michal Skrivanek 2015-04-21 07:21:22 UTC

See the enhancement in libvirt reporting (bug 1213434); it should provide more accurate numbers.

Comment 21 Max Kovgan 2015-06-28 14:12:29 UTC

ovirt-3.6.0-3 release

Comment 22 Israel Pinto 2015-12-10 13:18:34 UTC

Verified with:
Setup:
RHEVM Version: 3.6.1.2-0.1.el6
vdsm: vdsm-4.17.13-1.el7ev
libvirt: libvirt-1.2.17-13.el7_2.2

Test cases according to Polarion test case.

Results: PASS

Comment 24 errata-xmlrpc 2016-03-09 20:31:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0376.html