Bug 2082164

Summary: Migration progress timeout expects absolute progress
Product: Container Native Virtualization (CNV) Reporter: Jed Lejosne <jlejosne>
Component: Virtualization Assignee: Jed Lejosne <jlejosne>
Status: CLOSED ERRATA QA Contact: Denys Shchedrivyi <dshchedr>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.8.4 CC: acardace, cnv-qe-bugs, dshchedr
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: hco-bundle-registry-container-v4.11.0-315 virt-launcher-container-v4.11.0-55
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-09-14 19:31:44 UTC Type: Bug

Description Jed Lejosne 2022-05-05 14:03:03 UTC
Description of problem:
The migration progress timeout is there to ensure that migration packets keep getting transferred from source to target.
If no activity happens for the defined amount of time (2.5 minutes by default), the migration is cancelled.

However, the current implementation expects the remaining data counter to make absolute progress within that time. By "absolute progress", I mean dropping below the lowest value ever observed. If the remaining data goes up, which can happen for various reasons, subsequent progress does not count until the value drops back below that all-time low.

This is unreasonable in many scenarios, the worst case being a very active VM with lots of RAM and a slow network.

Instead, we should expect relative progress, resetting the timer every time the remaining data goes down from one poll to the next. That will effectively ensure data is flowing, without worrying about eventual convergence, which is ensured by other mechanisms.
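
To illustrate the difference between the two checks, here is a minimal Go sketch. It is not the actual KubeVirt code: the type names (absoluteMonitor, relativeMonitor), the firstTimeout helper, the 50-second poll interval and the sample values are all assumptions made up for the example. Only the 150-second (2.5 minute) timeout comes from the description above.

```go
package main

import "fmt"

// absoluteMonitor (hypothetical) resets its timer only when the remaining
// data reaches a new all-time low, as in the behaviour described above.
type absoluteMonitor struct {
	timeout      int64  // seconds allowed without progress
	lastProgress int64  // time (seconds) of the last accepted progress
	lowestEver   uint64 // lowest remaining-data value seen so far
}

func (m *absoluteMonitor) stuck(now int64, remaining uint64) bool {
	if remaining < m.lowestEver {
		m.lowestEver = remaining
		m.lastProgress = now
	}
	return now-m.lastProgress > m.timeout
}

// relativeMonitor (hypothetical) resets its timer whenever the remaining
// data went down since the previous poll, as proposed above.
type relativeMonitor struct {
	timeout      int64
	lastProgress int64
	previous     uint64 // remaining-data value from the previous poll
}

func (m *relativeMonitor) stuck(now int64, remaining uint64) bool {
	if remaining < m.previous {
		m.lastProgress = now
	}
	m.previous = remaining
	return now-m.lastProgress > m.timeout
}

// firstTimeout feeds one sample per poll interval to the given check and
// returns the time in seconds at which it first reports a stuck migration,
// or -1 if it never does.
func firstTimeout(stuck func(int64, uint64) bool, samples []uint64, interval int64) int64 {
	for i, remaining := range samples {
		now := int64(i+1) * interval
		if stuck(now, remaining) {
			return now
		}
	}
	return -1
}

func main() {
	// Remaining data drops to 400, jumps back up to 900 (busy guest),
	// then decreases steadily at every poll.
	samples := []uint64{400, 900, 800, 700, 600, 500, 450, 300}

	abs := &absoluteMonitor{timeout: 150, lowestEver: 1000}
	rel := &relativeMonitor{timeout: 150, previous: 1000}

	fmt.Println("absolute check times out at:", firstTimeout(abs.stuck, samples, 50))
	fmt.Println("relative check times out at:", firstTimeout(rel.stuck, samples, 50))
}
```

With these made-up samples, the absolute check reports a timeout at 250 seconds even though the remaining data is decreasing at every poll by then, while the relative check never fires. The relative check still catches a genuinely stalled transfer, since a counter that stops decreasing never resets the timer.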

Comment 1 Jed Lejosne 2022-05-05 14:06:04 UTC
Upstream PR linked.
As indicated by the (incomplete) PR title, it also fixes the error message emitted when the timeout is hit, which used to report a value in nanoseconds as if it were in seconds.
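
For readers unfamiliar with this class of bug, the Go sketch below shows one common way a nanosecond value ends up labelled as seconds. It is an assumption about how such a message can go wrong, not a copy of the code touched by the PR: the function names, the progressTimeout constant and the message wording are hypothetical. It relies on time.Duration being an integer count of nanoseconds, which the %d verb prints as-is.

```go
package main

import (
	"fmt"
	"time"
)

// progressTimeout mirrors the 2.5 minute default mentioned in the description.
const progressTimeout = 150 * time.Second

// misleadingMessage (hypothetical) formats the duration with %d, which prints
// the underlying nanosecond count while the text claims seconds.
func misleadingMessage(d time.Duration) string {
	return fmt.Sprintf("migration stuck for %d seconds", d)
}

// correctedMessage (hypothetical) converts to whole seconds before formatting.
func correctedMessage(d time.Duration) string {
	return fmt.Sprintf("migration stuck for %d seconds", int64(d/time.Second))
}

func main() {
	fmt.Println(misleadingMessage(progressTimeout))
	fmt.Println(correctedMessage(progressTimeout))
}
```

The first message prints 150000000000 as the number of seconds, the second prints 150.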

Comment 3 Denys Shchedrivyi 2022-05-25 13:22:14 UTC
Verified on CNV v4.11.0-334

Comment 5 errata-xmlrpc 2022-09-14 19:31:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6526