Bug 2082164 - Migration progress timeout expects absolute progress
Summary: Migration progress timeout expects absolute progress
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.8.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Jed Lejosne
QA Contact: Denys Shchedrivyi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-05-05 14:03 UTC by Jed Lejosne
Modified: 2023-11-13 08:17 UTC (History)

Fixed In Version: hco-bundle-registry-container-v4.11.0-315 virt-launcher-container-v4.11.0-55
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-14 19:31:44 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | kubevirt kubevirt pull 7654 | 0 | None | Merged | Show migration progress timeout in actual seconds | 2022-05-05 14:16:07 UTC
Red Hat Issue Tracker | CNV-17996 | 0 | None | None | None | 2023-11-13 08:17:47 UTC
Red Hat Product Errata | RHSA-2022:6526 | 0 | None | None | None | 2022-09-14 19:32:00 UTC

Description Jed Lejosne 2022-05-05 14:03:03 UTC
Description of problem:
The migration progress timeout is there to ensure that migration packets keep getting transferred from source to target.
If no activity happens for the defined amount of time (2.5 minutes by default), the migration is cancelled.

However, the current implementation expects the remaining data counter to make absolute progress within that time. By "absolute progress", I mean dropping below its lowest value seen so far. If the remaining data goes up, which can happen for various reasons, then subsequent progress does not count until the value drops back below that previous minimum.

This is unreasonable in many scenarios, the worst case being a very active VM with lots of RAM and a slow network.

Instead, we should expect relative progress, resetting the timer every time the remaining data goes down from one poll to the next. That will effectively ensure data is flowing, without worrying about eventual convergence, which is ensured by other mechanisms.
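
To illustrate the difference between the two policies, here is a minimal Python sketch. It is not the KubeVirt code: the class names, the `poll` method, and the sample values are all hypothetical, and the only figure taken from this report is the 150-second (2.5 minute) default timeout. The samples model a counter that reaches a low of 100, jumps to 1000, and then decreases steadily.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AbsoluteProgressWatchdog:
    """Hypothetical model of the old behavior: the timer resets only
    when the remaining data reaches a new all-time low."""
    timeout_s: float
    lowest_remaining: Optional[int] = None
    last_progress_at: float = 0.0

    def poll(self, now: float, remaining: int) -> bool:
        """Record one sample; return True if the migration should abort."""
        if self.lowest_remaining is None or remaining < self.lowest_remaining:
            self.lowest_remaining = remaining
            self.last_progress_at = now
        return now - self.last_progress_at > self.timeout_s


@dataclass
class RelativeProgressWatchdog:
    """Hypothetical model of the proposed behavior: the timer resets
    whenever the remaining data is lower than at the previous poll."""
    timeout_s: float
    previous_remaining: Optional[int] = None
    last_progress_at: float = 0.0

    def poll(self, now: float, remaining: int) -> bool:
        """Record one sample; return True if the migration should abort."""
        if self.previous_remaining is None or remaining < self.previous_remaining:
            self.last_progress_at = now
        self.previous_remaining = remaining
        return now - self.last_progress_at > self.timeout_s


def first_abort(watchdog, samples: List[Tuple[float, int]]) -> Optional[float]:
    """Return the time of the first abort decision, or None if there is none."""
    for now, remaining in samples:
        if watchdog.poll(now, remaining):
            return now
    return None


# Made-up samples as (time in seconds, remaining data): a low of 100 at t=0,
# a jump to 1000 at t=10, then a steady decrease of 50 every 10 seconds.
samples = [(0.0, 100), (10.0, 1000)]
samples += [(10.0 + 10.0 * k, 1000 - 50 * k) for k in range(1, 20)]

print(first_abort(AbsoluteProgressWatchdog(150.0), samples))  # 160.0
print(first_abort(RelativeProgressWatchdog(150.0), samples))  # None
```

In this made-up trace the absolute policy aborts at t=160 even though data flowed at every poll after the jump, while the relative policy never fires.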

Comment 1 Jed Lejosne 2022-05-05 14:06:04 UTC
Upstream PR linked.
As indicated by the (incomplete) PR title, it also fixes the error message when hitting the timeout, which used to report a nanoseconds value as seconds.
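
As a rough illustration of that unit mix-up (a Python sketch, not the upstream code; the function name and message wording are hypothetical), a timeout stored in nanoseconds has to be converted before it is reported in seconds:

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def format_timeout_message(timeout_ns: int) -> str:
    """Build a hypothetical abort message from a timeout held in nanoseconds."""
    # Convert before formatting; printing timeout_ns directly would label
    # a nanosecond count as seconds.
    seconds = timeout_ns // NANOSECONDS_PER_SECOND
    return f"migration aborted: no progress for {seconds} seconds"


# The 2.5 minute default is 150 seconds, i.e. 150_000_000_000 nanoseconds.
print(format_timeout_message(150 * NANOSECONDS_PER_SECOND))
```

Without the conversion, the same message would claim 150000000000 seconds for the default timeout.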

Comment 3 Denys Shchedrivyi 2022-05-25 13:22:14 UTC
Verified on CNV v4.11.0-334

Comment 5 errata-xmlrpc 2022-09-14 19:31:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6526
