Description of problem:
Performing migrations multiple times can lead to a migration being reported as failed even when all of the migrations actually succeed. The reported reason is:

  "reason": "FailedMigration",
  "message": "VMI's migration state was taken over by another migration job during active migration.",
  "source": {
    "component": "virtualmachine-controller"
  },

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
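For anyone hitting this, a quick way to tell whether it is the cosmetic variant is to compare the "Failed" VirtualMachineInstanceMigration object against the VMI's own migration state. Below is a rough sketch using the generic dynamic client; the kubevirt.io/v1 field paths (status.phase, spec.vmiName, status.migrationState.completed/failed, status.nodeName) are what I would expect, but double-check them against your cluster's CRDs.

// Sketch: cross-check migrations that report Failed against the VMI's migration state.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ns := "default" // assumption: namespace of the migrated VMs

	vmimGVR := schema.GroupVersionResource{Group: "kubevirt.io", Version: "v1", Resource: "virtualmachineinstancemigrations"}
	vmiGVR := schema.GroupVersionResource{Group: "kubevirt.io", Version: "v1", Resource: "virtualmachineinstances"}

	migrations, err := client.Resource(vmimGVR).Namespace(ns).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, m := range migrations.Items {
		phase, _, _ := unstructured.NestedString(m.Object, "status", "phase")
		if phase != "Failed" {
			continue
		}
		vmiName, _, _ := unstructured.NestedString(m.Object, "spec", "vmiName")
		vmi, err := client.Resource(vmiGVR).Namespace(ns).Get(context.TODO(), vmiName, metav1.GetOptions{})
		if err != nil {
			continue
		}
		// If the VMI's own migration state says the migration completed without failure,
		// the Failed phase on the migration object is likely the cosmetic state-tracking
		// problem described in this bug.
		completed, _, _ := unstructured.NestedBool(vmi.Object, "status", "migrationState", "completed")
		failed, _, _ := unstructured.NestedBool(vmi.Object, "status", "migrationState", "failed")
		node, _, _ := unstructured.NestedString(vmi.Object, "status", "nodeName")
		fmt.Printf("%s: migration object Failed, VMI completed=%v failed=%v node=%s\n", m.GetName(), completed, failed, node)
	}
}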
Because the migration appears to succeed, our impression is that this is a state-tracking bug as opposed to a scenario where an active migration was caused to fail. Targeting this for the next release.
Deferring to the next release because it appears to only be a display-related issue.
Looks like a duplicate migration was created and took over. This PR should prevent that from ever happening: https://github.com/kubevirt/kubevirt/pull/5242

Maybe there's still a race condition somewhere?

As for user impact, I do believe it's only cosmetic, but only thanks to libvirt being resilient enough.
> Looks like a duplicate migration was created and took over.

I'm not sure now. We saw this again recently here [1] and there wasn't a new migration object. It actually appeared that there is a sequence of events in which we interpret a previous migration object as having taken over a more recent one. More details are in my comment there [1].

1. https://bugzilla.redhat.com/show_bug.cgi?id=2021992#c15
> This PR should prevent that from ever happening: https://github.com/kubevirt/kubevirt/pull/5242

Are we missing a mutex there, so that we could potentially still submit two migrations?
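For the sake of discussion, here is a minimal sketch of the kind of guard such a change implies (not the actual code from that PR): refuse to accept a new VirtualMachineInstanceMigration while another migration for the same VMI is still in a non-final phase. The type and function names below are made up for illustration; the real check also has to be race-free (backed by the API server), since a plain in-process mutex would not help if two creations race through different replicas.

// migrationguard: illustrative only, not KubeVirt code.
package migrationguard

// finalPhases mirrors the terminal phases of a VirtualMachineInstanceMigration.
var finalPhases = map[string]bool{
	"Succeeded": true,
	"Failed":    true,
}

// MigrationInfo is a hypothetical, trimmed-down view of an existing migration object.
type MigrationInfo struct {
	VMIName string
	Phase   string
}

// CanCreateMigration returns false if any existing migration for the same VMI
// has not reached a final phase yet.
func CanCreateMigration(existing []MigrationInfo, vmiName string) bool {
	for _, m := range existing {
		if m.VMIName == vmiName && !finalPhases[m.Phase] {
			return false
		}
	}
	return true
}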
Deferring because this appears to be a status-reporting issue (and due to capacity). https://bugzilla.redhat.com/show_bug.cgi?id=2052752 was also deferred.
This was also reproduced several times on a 4.10 cluster while running a continuous migration test. The test had 3 RHEL 8.5 VMs (2 with OCS disks and 1 with a container disk) and migrated them multiple times. The issue occurred at around 100 migrations (300 at most). Only the VirtualMachineInstanceMigration object was in the "Failed" state. The VMIs showed the new node, and new virt-launcher pods were running while the old ones were in the "Completed" state. The virt-launcher logs also showed that the migration was successful.
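For reference, a rough sketch of the kind of loop that can reproduce this (not the actual QE test; the namespace, VMI name, and 300-iteration count are placeholders based on the numbers above): create a VirtualMachineInstanceMigration, wait for a terminal phase, repeat, and count how often the object lands in "Failed".

// Sketch: repeated migrations of one VMI, counting migrations reported as Failed.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

var vmimGVR = schema.GroupVersionResource{Group: "kubevirt.io", Version: "v1", Resource: "virtualmachineinstancemigrations"}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ns, vmi := "default", "rhel8-vm-1" // placeholders
	failed := 0

	for i := 0; i < 300; i++ { // the issue was seen at around 100-300 migrations
		name := fmt.Sprintf("mig-%s-%d", vmi, i)
		obj := &unstructured.Unstructured{Object: map[string]interface{}{
			"apiVersion": "kubevirt.io/v1",
			"kind":       "VirtualMachineInstanceMigration",
			"metadata":   map[string]interface{}{"name": name, "namespace": ns},
			"spec":       map[string]interface{}{"vmiName": vmi},
		}}
		if _, err := client.Resource(vmimGVR).Namespace(ns).Create(context.TODO(), obj, metav1.CreateOptions{}); err != nil {
			panic(err)
		}

		// Poll until this migration object reaches a terminal phase.
		for {
			m, err := client.Resource(vmimGVR).Namespace(ns).Get(context.TODO(), name, metav1.GetOptions{})
			if err != nil {
				panic(err)
			}
			phase, _, _ := unstructured.NestedString(m.Object, "status", "phase")
			if phase == "Succeeded" || phase == "Failed" {
				if phase == "Failed" {
					failed++
				}
				break
			}
			time.Sleep(5 * time.Second)
		}
	}
	fmt.Printf("migrations reported Failed: %d / 300\n", failed)
}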
Re-targeting back to 4.11. This issue is blocking scale testing being conducted by QE.
I was finally able to reproduce this. I've posted a fix, https://github.com/kubevirt/kubevirt/pull/7582, which explains the root cause in detail.
*** Bug 2052752 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:6526