Bug 1965099 - Live Migration double handoff to virt-handler causes connection failures
Summary: Live Migration double handoff to virt-handler causes connection failures
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 2.5.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 2.6.6
Assignee: David Vossel
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Depends On:
Blocks: 1945532
 
Reported: 2021-05-26 20:08 UTC by David Vossel
Modified: 2023-09-15 01:07 UTC
CC List: 6 users

Fixed In Version: virt-operator-container-v2.6.6-3 hco-bundle-registry-container-v2.6.6-24
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-10 17:33:37 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHSA-2021:3119 (Last Updated: 2021-08-10 17:34:27 UTC)

Description David Vossel 2021-05-26 20:08:45 UTC
Description of problem:

When a migration is first being processed, a handoff occurs between virt-controller and the virt-handler instances on the target and source nodes. That handoff is coordinated through the vmi.Status.MigrationState field.

While investigating logs with the enhanced migration proxy debugging enabled, we discovered a race condition that causes migration connections to be set up, torn down almost immediately, and then re-established. Those connections are coordinated by the vmi.Status.MigrationState field that is managed during the handoff.

This race condition appears more likely when the control plane components are under load. When it is hit, it can result in live migration failures.
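
The following is a minimal, self-contained Go sketch of the kind of guard this handoff needs; it is not the real KubeVirt code or API, and all type and function names in it are hypothetical. The idea it illustrates is that proxy connections should only be torn down for the migration UID they were started for, so a stale MigrationState observation from an earlier handoff pass cannot destroy connections the current handoff just established.

package main

import "fmt"

// migrationState mirrors only the handful of fields this sketch needs from
// vmi.Status.MigrationState; it is not the real API type.
type migrationState struct {
	MigrationUID string
	Completed    bool
	Failed       bool
}

// proxyManager stands in for the per-migration proxy connections that
// virt-handler manages on the source and target nodes.
type proxyManager struct {
	activeMigrationUID string
}

func (p *proxyManager) start(uid string) {
	p.activeMigrationUID = uid
	fmt.Printf("setting up migration proxy for %s\n", uid)
}

func (p *proxyManager) stop(uid string) {
	fmt.Printf("tearing down migration proxy for %s\n", uid)
	p.activeMigrationUID = ""
}

// sync reacts to an observed MigrationState. The UID comparison is the guard:
// a terminal or superseded state only affects the proxy if it refers to the
// migration the proxy actually belongs to.
func (p *proxyManager) sync(observed *migrationState) {
	switch {
	case observed == nil:
		// No migration in flight; nothing to do.
	case observed.Completed || observed.Failed:
		if p.activeMigrationUID == observed.MigrationUID {
			p.stop(observed.MigrationUID)
		}
		// A stale terminal state for an older migration is ignored here,
		// which is the double-handoff case described above.
	case p.activeMigrationUID == "":
		p.start(observed.MigrationUID)
	case p.activeMigrationUID != observed.MigrationUID:
		p.stop(p.activeMigrationUID)
		p.start(observed.MigrationUID)
	}
}

func main() {
	p := &proxyManager{}
	current := &migrationState{MigrationUID: "uid-2"}
	stale := &migrationState{MigrationUID: "uid-1", Completed: true}

	p.sync(current) // handoff establishes the proxy
	p.sync(stale)   // stale observation is ignored; the proxy survives
	p.sync(current) // re-observing the same migration is a no-op
}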

Version-Release number of selected component (if applicable):
2.5.6

How reproducible:

Very rare. We have logs showing that this is indeed possible, but the conditions have not been reproduced in a lab setting.

Comment 1 David Vossel 2021-05-26 20:09:03 UTC
Backport PRs
CNV 2.5 - Kubevirt 0.34 - https://github.com/kubevirt/kubevirt/pull/5706
CNV 2.6 - KubeVirt 0.36 - https://github.com/kubevirt/kubevirt/pull/5705
CNV 4.8 - KubeVirt 0.41 - https://github.com/kubevirt/kubevirt/pull/5704

Comment 5 Kedar Bidarkar 2021-07-23 12:16:42 UTC
All 3 worker nodes were driven to 100% CPU utilization and 125% CPU saturation using a stress-ng image.

1) Ran stress-ng pods on all 3 worker nodes to generate 100% CPU load and 125% CPU saturation.
2) Switched from the Cirros containerDisk image to the Fedora containerDisk image.
3) Created a 2.5GB /home/fedora/disksump.img inside the Fedora VMI via the cloud-init "runcmd".
4) Created 100 VMI and 100 VMIM objects to live migrate each of them in a loop.

Ran almost 12 loops of creating 100 VMIs and live migrating them; a sketch of one such loop follows.
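
Below is a hedged Go sketch of a reproduction driver along these lines, not the actual QA tooling. It assumes 100 running Fedora containerDisk VMIs named vmi-0 .. vmi-99 in the current namespace; the names, counts, sleep interval, and kubevirt.io/v1alpha3 API version are placeholder assumptions that may differ from the real setup.

package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
	"time"
)

// VMIM manifest template; newer CNV/KubeVirt releases may use
// apiVersion: kubevirt.io/v1 instead of v1alpha3.
const vmimTemplate = `apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstanceMigration
metadata:
  generateName: migration-job-
spec:
  vmiName: %s
`

// createMigration submits one VMIM object for the named VMI via kubectl.
func createMigration(vmiName string) error {
	cmd := exec.Command("kubectl", "create", "-f", "-")
	cmd.Stdin = strings.NewReader(fmt.Sprintf(vmimTemplate, vmiName))
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("creating VMIM for %s: %v: %s", vmiName, err, out)
	}
	return nil
}

func main() {
	const vmiCount = 100 // 100 VMIs / 100 VMIM objects per round
	const rounds = 12    // roughly the number of loops run during verification

	for r := 0; r < rounds; r++ {
		for i := 0; i < vmiCount; i++ {
			if err := createMigration(fmt.Sprintf("vmi-%d", i)); err != nil {
				log.Println(err)
			}
		}
		// Crude pacing between rounds; the real test waited for each round of
		// migrations to finish while stress-ng kept the worker nodes loaded.
		time.Sleep(5 * time.Minute)
	}
}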

Summary: Was unable to observe this issue.

Comment 10 errata-xmlrpc 2021-08-10 17:33:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 2.6.6 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3119

Comment 11 Red Hat Bugzilla 2023-09-15 01:07:16 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

