Description of problem:
The migration progress timeout is there to ensure that migration packets keep getting transferred from source to target. If no activity happens for the defined amount of time (2.5 minutes by default), the migration is cancelled.

However, the current implementation expects the remaining data counter to make absolute progress within that time. By "absolute progress", I mean dropping lower than it has ever been before. If the remaining data goes up, which can happen for various reasons, subsequent progress does not count until the value drops back below its all-time low. This is unreasonable in many scenarios, the worst case being a very active VM with lots of RAM and a slow network.

Instead, we should expect relative progress, resetting the timer every time the remaining data goes down from one poll to the next. That effectively ensures data is flowing, without worrying about eventual convergence, which is ensured by other mechanisms.
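To make the difference between the two policies concrete, here is a minimal sketch in Python. It is not the actual implementation: the class, its fields, the 50-second poll interval, and the sample values are all hypothetical and chosen only for illustration. Only the 150-second (2.5-minute) timeout comes from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProgressMonitor:
    """Hypothetical progress watchdog (illustrative names, not the real code)."""
    timeout_s: float = 150.0          # 2.5 minutes, the default mentioned above
    relative: bool = True             # True: compare with previous poll
                                      # False: compare with lowest value ever seen
    _reference: Optional[int] = None  # value the next sample has to beat
    _last_progress_at: float = 0.0    # time of the last sample counted as progress

    def poll(self, now_s: float, remaining: int) -> bool:
        """Record one sample; return True if the migration should be aborted."""
        progressed = self._reference is None or remaining < self._reference
        if progressed:
            self._last_progress_at = now_s
        if self.relative or progressed:
            # relative: the reference always follows the previous sample
            # absolute: the reference only ever ratchets downwards
            self._reference = remaining
        return now_s - self._last_progress_at > self.timeout_s


def first_abort(monitor: ProgressMonitor, samples: List[int],
                poll_interval_s: float = 50.0) -> Optional[float]:
    """Feed samples at a fixed (assumed) interval; return the abort time, if any."""
    for i, remaining in enumerate(samples):
        now_s = i * poll_interval_s
        if monitor.poll(now_s, remaining):
            return now_s
    return None


# Remaining data dips to 400, jumps to 900 (e.g. the guest dirtied memory),
# then decreases steadily without getting back below 400.
BUSY_VM = [1000, 400, 900, 850, 800, 750, 700]
# Remaining data never changes: a genuinely stalled migration.
STALLED = [500, 500, 500, 500, 500, 500]

if __name__ == "__main__":
    print(first_abort(ProgressMonitor(relative=False), BUSY_VM))  # 250.0
    print(first_abort(ProgressMonitor(relative=True), BUSY_VM))   # None
    print(first_abort(ProgressMonitor(relative=True), STALLED))   # 200.0
```

With the absolute policy, the busy VM is cancelled even though data keeps flowing after the spike. With the relative policy it is left alone, while a migration whose counter never moves is still cancelled once the timeout elapses.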
Upstream PR linked. As its (incomplete) title indicates, it also fixes the error message emitted when the timeout is hit, which used to report a nanosecond value as seconds.
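The unit mix-up in the error message can be illustrated with a short Python sketch. The function names and the message wording are hypothetical, and the assumption that the timeout is held internally as an integer nanosecond count is mine; the point is only that such a value has to be converted before it is labelled "seconds".

```python
NS_PER_S = 1_000_000_000  # nanoseconds in one second


def buggy_message(timeout_ns: int) -> str:
    # Prints the raw nanosecond count but labels it as seconds.
    return f"migration made no progress in the last {timeout_ns} seconds"


def fixed_message(timeout_ns: int) -> str:
    # Converts to whole seconds before formatting.
    return f"migration made no progress in the last {timeout_ns // NS_PER_S} seconds"


if __name__ == "__main__":
    timeout_ns = 150 * NS_PER_S  # the 2.5-minute default, in nanoseconds
    print(buggy_message(timeout_ns))  # ... last 150000000000 seconds
    print(fixed_message(timeout_ns))  # ... last 150 seconds
```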
Verified on CNV v4.11.0-334
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:6526