2082164 – Migration progress timeout expects absolute progress

Summary: Migration progress timeout expects absolute progress

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Container Native Virtualization (CNV)
Classification:	Red Hat
Component:	Virtualization
Sub Component:
Version:	4.8.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Jed Lejosne
QA Contact:	Denys Shchedrivyi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-05-05 14:03 UTC by Jed Lejosne
Modified:	2023-11-13 08:17 UTC
CC List:	3 users
Fixed In Version:	hco-bundle-registry-container-v4.11.0-315 virt-launcher-container-v4.11.0-55
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-09-14 19:31:44 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	kubevirt kubevirt pull 7654	None	Merged	Show migration progress timeout in actual seconds	2022-05-05 14:16:07 UTC
Red Hat Issue Tracker	CNV-17996	None	None	None	2023-11-13 08:17:47 UTC
Red Hat Product Errata	RHSA-2022:6526	None	None	None	2022-09-14 19:32:00 UTC

Description Jed Lejosne 2022-05-05 14:03:03 UTC

Description of problem:
The migration progress timeout is there to ensure that migration packets keep getting transferred from source to target.
If no activity happens for the defined amount of time (2.5 minutes by default), the migration is cancelled.

However, the current implementation expects the remaining data counter to make absolute progress within that time. By "absolute progress", I mean dropping below its lowest value so far. If the remaining data goes up, which can happen for various reasons, then subsequent progress does not count until the value falls back below that all-time low.

This is unreasonable in many scenarios; the worst case is a very active VM with lots of RAM and a slow network.

Instead, we should expect relative progress, resetting the timer every time the remaining data goes down from one poll to the next. That will effectively ensure data is flowing, without worrying about eventual convergence, which is ensured by other mechanisms.
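
To make the difference concrete, here is a minimal Go sketch of the two policies. It is not the KubeVirt code: the function name `stalls`, the sample values, and the idea of counting the timeout in polls rather than in seconds are all assumptions made for illustration.

```go
package main

import "fmt"

// stalls reports whether a progress timeout would fire for the given
// samples of remaining data, one sample per poll. The timeout fires after
// timeoutPolls consecutive polls without progress. (Hypothetical helper.)
//
// relative == false: progress means a new all-time low (absolute progress).
// relative == true:  progress means lower than the previous poll.
func stalls(samples []uint64, timeoutPolls int, relative bool) bool {
	if len(samples) == 0 {
		return false
	}
	lowest := samples[0]   // lowest value seen so far
	previous := samples[0] // value seen at the previous poll
	idle := 0              // consecutive polls without progress
	for _, remaining := range samples[1:] {
		var progressed bool
		if relative {
			progressed = remaining < previous
		} else {
			progressed = remaining < lowest
		}
		if remaining < lowest {
			lowest = remaining
		}
		previous = remaining
		if progressed {
			idle = 0
		} else {
			idle++
		}
		if idle >= timeoutPolls {
			return true
		}
	}
	return false
}

func main() {
	// Remaining data jumps up once (for example, the guest dirties a lot of
	// memory), then decreases steadily again.
	samples := []uint64{1000, 900, 2000, 1900, 1800, 1700, 1600}
	fmt.Println("absolute:", stalls(samples, 3, false))
	fmt.Println("relative:", stalls(samples, 3, true))
}
```

With these made-up samples, the absolute policy reports a stall even though data keeps flowing after the jump, while the relative policy does not.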

Comment 1 Jed Lejosne 2022-05-05 14:06:04 UTC

Upstream PR linked.
As indicated by the (incomplete) PR title, it also fixes the error message when hitting the timeout, which used to report a nanoseconds value as seconds.

Comment 3 Denys Shchedrivyi 2022-05-25 13:22:14 UTC

Verified on CNV v4.11.0-334

Comment 5 errata-xmlrpc 2022-09-14 19:31:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6526
