Bug 1965099 - Live Migration double handoff to virt-handler causes connection failures
Summary: Live Migration double handoff to virt-handler causes connection failures
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 2.5.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 2.6.6
Assignee: David Vossel
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Depends On:
Blocks: 1945532
 
Reported: 2021-05-26 20:08 UTC by David Vossel
Modified: 2023-09-15 01:07 UTC
CC List: 6 users

Fixed In Version: virt-operator-container-v2.6.6-3 hco-bundle-registry-container-v2.6.6-24
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-10 17:33:37 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHSA-2021:3119 (Last Updated: 2021-08-10 17:34:27 UTC)

Description David Vossel 2021-05-26 20:08:45 UTC
Description of problem:

When a migration is first being processed, a handoff occurs between virt-controller and the virt-handler instances on the target and source nodes. That handoff is coordinated through the vmi.Status.MigrationState field.

While investigating logs with the enhanced migration proxy debugging enabled, we discovered a race condition that causes migration connections to be set up, torn down almost immediately, and then re-established. Those connections are coordinated by the vmi.Status.MigrationState field that is managed during the handoff.

This race condition appears more likely when the control plane components are under load. When it is hit, it can result in live migration failures.
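
The following is a minimal, self-contained Go sketch of the kind of guard this handoff needs; it is not the real KubeVirt code or API, and all type and function names in it are hypothetical. The idea it illustrates is that proxy connections should only be torn down for the migration UID they were started for, so a stale MigrationState observation from an earlier handoff pass cannot destroy connections the current handoff just established.

package main

import "fmt"

// migrationState mirrors only the handful of fields this sketch needs from
// vmi.Status.MigrationState; it is not the real API type.
type migrationState struct {
	MigrationUID string
	Completed    bool
	Failed       bool
}

// proxyManager stands in for the per-migration proxy connections that
// virt-handler manages on the source and target nodes.
type proxyManager struct {
	activeMigrationUID string
}

func (p *proxyManager) start(uid string) {
	p.activeMigrationUID = uid
	fmt.Printf("setting up migration proxy for %s\n", uid)
}

func (p *proxyManager) stop(uid string) {
	fmt.Printf("tearing down migration proxy for %s\n", uid)
	p.activeMigrationUID = ""
}

// sync reacts to an observed MigrationState. The UID comparison is the guard:
// a terminal or superseded state only affects the proxy if it refers to the
// migration the proxy actually belongs to.
func (p *proxyManager) sync(observed *migrationState) {
	switch {
	case observed == nil:
		// No migration in flight; nothing to do.
	case observed.Completed || observed.Failed:
		if p.activeMigrationUID == observed.MigrationUID {
			p.stop(observed.MigrationUID)
		}
		// A stale terminal state for an older migration is ignored here,
		// which is the double-handoff case described above.
	case p.activeMigrationUID == "":
		p.start(observed.MigrationUID)
	case p.activeMigrationUID != observed.MigrationUID:
		p.stop(p.activeMigrationUID)
		p.start(observed.MigrationUID)
	}
}

func main() {
	p := &proxyManager{}
	current := &migrationState{MigrationUID: "uid-2"}
	stale := &migrationState{MigrationUID: "uid-1", Completed: true}

	p.sync(current) // handoff establishes the proxy
	p.sync(stale)   // stale observation is ignored; the proxy survives
	p.sync(current) // re-observing the same migration is a no-op
}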

Version-Release number of selected component (if applicable):
2.5.6

How reproducible:

Very rare. We have logs showing that this is indeed possible, but the conditions have not been reproduced in a lab setting.

Comment 1 David Vossel 2021-05-26 20:09:03 UTC
Backport PRs
CNV 2.5 - Kubevirt 0.34 - https://github.com/kubevirt/kubevirt/pull/5706
CNV 2.6 - KubeVirt 0.36 - https://github.com/kubevirt/kubevirt/pull/5705
CNV 4.8 - KubeVirt 0.41 - https://github.com/kubevirt/kubevirt/pull/5704

Comment 5 Kedar Bidarkar 2021-07-23 12:16:42 UTC
All 3 worker nodes were driven to 100% CPU utilization and 125% CPU saturation using a stress-ng image.

1) Ran stress-ng pods on all 3 worker nodes to generate 100% CPU load and 125% CPU saturation.
2) Switched from the Cirros containerDisk image to the Fedora containerDisk image.
3) Created a 2.5GB /home/fedora/disksump.img inside the Fedora VMI via the cloud-init "runcmd".
4) Created 100 VMI and 100 VMIM objects to live migrate each of them in a loop.

Ran almost 12 loops of creating 100 VMIs and live migrating them; a sketch of one such loop follows.
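
Below is a hedged Go sketch of a reproduction driver along these lines, not the actual QA tooling. It assumes 100 running Fedora containerDisk VMIs named vmi-0 .. vmi-99 in the current namespace; the names, counts, sleep interval, and kubevirt.io/v1alpha3 API version are placeholder assumptions that may differ from the real setup.

package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
	"time"
)

// VMIM manifest template; newer CNV/KubeVirt releases may use
// apiVersion: kubevirt.io/v1 instead of v1alpha3.
const vmimTemplate = `apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstanceMigration
metadata:
  generateName: migration-job-
spec:
  vmiName: %s
`

// createMigration submits one VMIM object for the named VMI via kubectl.
func createMigration(vmiName string) error {
	cmd := exec.Command("kubectl", "create", "-f", "-")
	cmd.Stdin = strings.NewReader(fmt.Sprintf(vmimTemplate, vmiName))
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("creating VMIM for %s: %v: %s", vmiName, err, out)
	}
	return nil
}

func main() {
	const vmiCount = 100 // 100 VMIs / 100 VMIM objects per round
	const rounds = 12    // roughly the number of loops run during verification

	for r := 0; r < rounds; r++ {
		for i := 0; i < vmiCount; i++ {
			if err := createMigration(fmt.Sprintf("vmi-%d", i)); err != nil {
				log.Println(err)
			}
		}
		// Crude pacing between rounds; the real test waited for each round of
		// migrations to finish while stress-ng kept the worker nodes loaded.
		time.Sleep(5 * time.Minute)
	}
}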

Summary: Was unable to observe this issue.

Comment 10 errata-xmlrpc 2021-08-10 17:33:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 2.6.6 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3119

Comment 11 Red Hat Bugzilla 2023-09-15 01:07:16 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

