Description of problem: When a migration is first processed, a handoff occurs between virt-controller and the virt-handler instances on the source and target nodes. That handoff is coordinated through the vmi.Status.MigrationState field. While investigating logs with the enhanced migration proxy debugging enabled, a race condition was discovered in which migration connections are set up, torn down very quickly, and then re-established. Those connections are driven by the same vmi.Status.MigrationState field that is managed during the handoff. The race appears more likely when the control plane components are under load, and when it is hit it can result in live migration failures.

Version-Release number of selected component (if applicable): 2.5.6

How reproducible: Very rare. Logs confirm that this is possible, but the conditions have not been reproduced in a lab setting.
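For context, the handoff can be pictured as each virt-handler reconciling its migration proxy connections from whatever snapshot of vmi.Status.MigrationState it last observed. The Go sketch below is illustrative only and does not reproduce the actual KubeVirt code: the reduced MigrationState struct, the proxyManager type, and the syncProxyConnections helper are all hypothetical names. It only shows why keying the connection lifecycle on a stale or partially populated status snapshot can produce the setup/teardown/re-setup churn described above.

```go
package main

import "fmt"

// Hypothetical, reduced view of the fields carried in vmi.Status.MigrationState
// during the virt-controller -> virt-handler handoff. Field names are
// assumptions for illustration, not the exact KubeVirt API.
type MigrationState struct {
	MigrationUID      string // identity of the in-flight migration
	TargetNode        string // filled in by virt-controller
	TargetNodeAddress string // filled in by the target virt-handler
}

// proxyManager tracks which migration the node's proxy connections belong to.
type proxyManager struct {
	currentUID string
}

// syncProxyConnections is an invented helper that reconciles proxy connections
// against the observed MigrationState. If the handler reacts to a snapshot in
// which the handoff fields are not yet (or no longer) populated, it tears the
// connections down; a later, fully populated snapshot re-establishes them --
// the churn seen in the enhanced migration proxy debug logs.
func (p *proxyManager) syncProxyConnections(state *MigrationState) {
	if state == nil || state.TargetNodeAddress == "" {
		// Stale or partially written status: it looks as if there is nothing
		// to proxy for, so any existing connections get torn down.
		if p.currentUID != "" {
			fmt.Printf("tearing down proxy for migration %s\n", p.currentUID)
			p.currentUID = ""
		}
		return
	}
	if p.currentUID != state.MigrationUID {
		// New (or re-observed) migration: establish connections.
		fmt.Printf("setting up proxy to %s for migration %s\n",
			state.TargetNodeAddress, state.MigrationUID)
		p.currentUID = state.MigrationUID
	}
}

func main() {
	p := &proxyManager{}
	full := &MigrationState{MigrationUID: "abc-123", TargetNode: "node-b", TargetNodeAddress: "10.0.0.2"}
	stale := &MigrationState{MigrationUID: "abc-123", TargetNode: "node-b"} // address not yet populated

	p.syncProxyConnections(full)  // connections established
	p.syncProxyConnections(stale) // race: stale snapshot observed under load -> teardown
	p.syncProxyConnections(full)  // re-established shortly after
}
```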
Backport PRs:
CNV 2.5 - KubeVirt 0.34 - https://github.com/kubevirt/kubevirt/pull/5706
CNV 2.6 - KubeVirt 0.36 - https://github.com/kubevirt/kubevirt/pull/5705
CNV 4.8 - KubeVirt 0.41 - https://github.com/kubevirt/kubevirt/pull/5704
All 3 worker nodes were driven to 100% CPU utilization and 125% CPU saturation using the stress-ng image.
1) Ran stress-ng pods on all 3 worker nodes to generate 100% CPU load and 125% CPU saturation.
2) Switched from the Cirros containerDisk image to the Fedora containerDisk image.
3) Created a 2.5 GB /home/fedora/disksump.img inside the Fedora VMI via the cloud-init "runcmd".
4) Created 100 VMIs and 100 VMIM objects to live-migrate each of them in a loop; ran almost 12 iterations of creating 100 VMIs and live-migrating them (a minimal sketch of such a loop follows).
Summary: The issue was not observed.
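The exact manifests and scripts from the verification run are not shown here. The following is a minimal sketch of the kind of migration loop described in step 4, written against the Kubernetes dynamic client. The VMI names vmi-0..vmi-99, the "default" namespace, and the kubevirt.io/v1alpha3 API version are assumptions for illustration; adjust them to whatever the cluster under test serves.

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

// GroupVersionResource for VirtualMachineInstanceMigration objects. The API
// version is an assumption for the KubeVirt releases named above; adjust it
// to the version served by the cluster under test.
var vmimGVR = schema.GroupVersionResource{
	Group:    "kubevirt.io",
	Version:  "v1alpha3",
	Resource: "virtualmachineinstancemigrations",
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Create one VMIM per VMI (vmi-0 .. vmi-99 are hypothetical names) to
	// trigger a live migration of each running VMI.
	for i := 0; i < 100; i++ {
		vmiName := fmt.Sprintf("vmi-%d", i)
		vmim := &unstructured.Unstructured{Object: map[string]interface{}{
			"apiVersion": "kubevirt.io/v1alpha3",
			"kind":       "VirtualMachineInstanceMigration",
			"metadata": map[string]interface{}{
				"generateName": vmiName + "-migration-",
				"namespace":    "default",
			},
			"spec": map[string]interface{}{
				"vmiName": vmiName,
			},
		}}
		if _, err := client.Resource(vmimGVR).Namespace("default").
			Create(context.TODO(), vmim, metav1.CreateOptions{}); err != nil {
			fmt.Printf("failed to start migration for %s: %v\n", vmiName, err)
			continue
		}
		fmt.Printf("started migration for %s\n", vmiName)
		time.Sleep(time.Second) // crude pacing; the real run waited for each migration to complete
	}
}
```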
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 2.6.6 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3119
The needinfo request(s) on this closed bug have been removed, as they have been unresolved for 500 days.