Bug 2124528 - On upgrade, when a live migration fails due to an infra issue, virt-handler continuously and endlessly tries to migrate it
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.11.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Antonio Cardace
QA Contact: Denys Shchedrivyi
URL:
Whiteboard:
Depends On:
Blocks: 2149631
 
Reported: 2022-09-06 12:33 UTC by Oren Cohen
Modified: 2023-01-24 13:40 UTC
CC: 3 users

Fixed In Version: virt-operator-container-v4.12.0-249 hco-bundle-registry-container-v4.12.0-755
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2149631 (view as bug list)
Environment:
Last Closed: 2023-01-24 13:40:29 UTC
Target Upstream Version:
Embargoed:


Attachments
virt-handler.log (1.88 MB, text/plain)
2022-09-06 12:33 UTC, Oren Cohen


Links
System ID Private Priority Status Summary Last Updated
Github kubevirt kubevirt pull 8530 0 None open Use exponential backoff for failing migrations 2022-10-27 09:36:28 UTC
Github kubevirt kubevirt pull 8784 0 None Merged [release-0.58] Use exponential backoff for failing migrations 2022-11-30 13:04:26 UTC
Red Hat Issue Tracker CNV-21049 0 None None None 2022-11-15 05:01:49 UTC
Red Hat Product Errata RHSA-2023:0408 0 None None None 2023-01-24 13:40:41 UTC

Description Oren Cohen 2022-09-06 12:33:26 UTC
Created attachment 1909776 [details]
virt-handler.log

Description of problem:
When upgrading CNV, all of the VirtualMachines in the cluster are live-migrated in order to update their virt-launcher pods.
If there is an issue on a node hosting VMs that prevents migration from it to another node, because the migration-proxy can't establish a connection between the source and target nodes, the target virt-launcher pod exits in an Error state.
In that case, virt-handler tries to migrate the VMI again and fails for the same reason.
The default value of "parallelOutboundMigrationsPerNode" is 5, so failed virt-launcher pods accumulate on the cluster at a rate of 5 every few minutes.
If the root cause is not resolved, the number of pods in the Error state can reach a few thousand within several hours, which might bring the cluster down due to the enormous number of etcd objects.
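
For illustration, a minimal sketch of how the accumulation can be observed (Python with the kubernetes client; the "kubevirt.io=virt-launcher" label selector and the use of the Failed pod phase are assumptions made for this sketch, not details taken from the report):

  # Count virt-launcher pods that ended up in a failed state, grouped per node.
  # Assumes a cluster-admin kubeconfig and the kubernetes Python client.
  # The label selector below is an assumption about how launcher pods are labeled.
  from collections import Counter
  from kubernetes import client, config

  config.load_kube_config()
  v1 = client.CoreV1Api()

  pods = v1.list_pod_for_all_namespaces(label_selector="kubevirt.io=virt-launcher")
  failed = [p for p in pods.items if p.status.phase == "Failed"]

  print(f"virt-launcher pods in Failed phase: {len(failed)}")
  for node, count in Counter(p.spec.node_name for p in failed).items():
      print(f"  {node}: {count}")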

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Have a running VMI on a node with networking issues.
2. Complete an upgrade of CNV

Actual results:
Note the high number of errored virt-launcher pods that keep accumulating endlessly.


Expected results:
KubeVirt should detect the issue, stop migrations from the node in question, and raise a proper high-severity alert.

Additional info:
virt-handler pod logs from the node in question, captured while the issue was occurring, are attached.

Comment 1 sgott 2022-09-07 12:14:38 UTC
Looping endlessly is an intentional design pattern in Kubernetes, because it is a declarative system. If we were to halt the migration process, the upgrade couldn't finish. That's an even worse outcome.

What's a greater concern to me is that garbage collection does not appear to be occurring. There should not be thousands of defunct pods as a result of this.

Comment 2 sgott 2022-09-07 12:20:17 UTC
Prioritizing this as urgent because it is unpleasant during an upgrade process, and if this situation occurs, there's no immediate/easy way to remediate the issue.

Comment 3 Antonio Cardace 2022-09-15 10:09:12 UTC
@sgott We do garbage collection only on the migration objects, not on the target pods.

We can use the same garbage collection mechanism for pods, even if that is debatable, as those pods might contain useful information that a cluster admin might want to inspect to understand what's wrong with the cluster.
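
For the record, a rough sketch of what pod garbage collection along those lines could look like (Python with the kubernetes client; the label selector, the "keep the newest N failed pods" policy, and N=5 are illustrative assumptions, not the actual KubeVirt implementation):

  # Illustrative only: keep the newest N failed virt-launcher pods, delete the rest,
  # mirroring the "retain a few finalized objects" idea used for migration objects.
  from kubernetes import client, config

  KEEP_NEWEST = 5  # assumed retention count, for illustration only

  config.load_kube_config()
  v1 = client.CoreV1Api()

  pods = v1.list_pod_for_all_namespaces(label_selector="kubevirt.io=virt-launcher")
  failed = sorted(
      (p for p in pods.items if p.status.phase == "Failed"),
      key=lambda p: p.metadata.creation_timestamp,
      reverse=True,
  )

  for pod in failed[KEEP_NEWEST:]:
      print(f"garbage collecting {pod.metadata.namespace}/{pod.metadata.name}")
      v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)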

Comment 4 Antonio Cardace 2022-11-18 16:11:33 UTC
Deferring to 4.12.1 because we're past blockers-only and this is not considered a blocker for 4.12.

Comment 7 sgott 2022-12-12 15:24:50 UTC
To verify: repeat reproduction steps in description.

Comment 9 Denys Shchedrivyi 2022-12-20 21:12:18 UTC
Verified on v4.12.0-760 - looks good.

When a migration fails and is quickly re-created, it goes into the Pending state for some time. There is a message on the VMI:

  Warning  MigrationBackoff  44s (x14 over 41m)    virtualmachine-controller    backoff migrating vmi default/vm-fedora-bad-1
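
This backoff comes from the exponential-backoff change linked above (kubevirt/kubevirt pull 8530, backported in pull 8784). As a rough illustration of the retry pacing, a minimal sketch follows; the base delay and cap are assumed values for this sketch, not the values virt-controller actually uses:

  # Illustrative exponential backoff between migration retry attempts.
  # BASE_DELAY_SECONDS and MAX_DELAY_SECONDS are assumptions for this sketch.
  BASE_DELAY_SECONDS = 20
  MAX_DELAY_SECONDS = 3600

  def next_backoff(consecutive_failures: int) -> int:
      """Delay before the next migration attempt, doubling on each failure."""
      delay = BASE_DELAY_SECONDS * (2 ** max(consecutive_failures - 1, 0))
      return min(delay, MAX_DELAY_SECONDS)

  for n in range(1, 8):
      print(n, next_backoff(n))  # 20, 40, 80, 160, 320, 640, 1280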

Comment 12 errata-xmlrpc 2023-01-24 13:40:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.12.0 Images security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0408

