Bug 2026357

Summary: Migration in sequence can be reported as failed even when it succeeded
Product: Container Native Virtualization (CNV) Reporter: lpivarc
Component: Virtualization Assignee: Antonio Cardace <acardace>
Status: CLOSED ERRATA QA Contact: vsibirsk
Severity: high Docs Contact:
Priority: high    
Version: 4.9.1 CC: cnv-qe-bugs, dbasunag, dshchedr, dvossel, jlejosne, kbidarka, sgott, vsibirsk
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: hco-bundle-registry-container-v4.11.0-254 virt-controller-container-v4.11.0-42 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2077920 (view as bug list) Environment:
Last Closed: 2022-09-14 19:28:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2077920    

Description lpivarc 2021-11-24 12:57:48 UTC
Description of problem:
Migrating a VMI several times in sequence can cause a migration to be reported as failed even though all migrations succeeded.
The failure event carries the following reason:
    "reason": "FailedMigration",
    "message": "VMI's migration state was taken over by another migration job during active migration.",
    "source": {
        "component": "virtualmachine-controller"
    },



Comment 1 sgott 2021-11-24 13:10:40 UTC
Because the migration appears to succeed, our impression is that this is a state tracking bug as opposed to a scenario where an active migration was caused to fail. Targeting this for the next release.

Comment 3 sgott 2022-01-26 20:50:04 UTC
Deferring to the next release because it appears to only be a display-related issue.

Comment 4 Jed Lejosne 2022-02-10 14:35:14 UTC
Looks like a duplicate migration was created and took over.
This PR should prevent that from ever happening: https://github.com/kubevirt/kubevirt/pull/5242
Maybe there's still a race condition somewhere?
As far as user impact, I do believe it's only cosmetic, but only thanks to libvirt being resilient enough.

Comment 5 David Vossel 2022-02-10 16:15:18 UTC
> Looks like a duplicate migration was created and took over.

I'm not sure now. We saw this again recently [1], and there wasn't a new migration object. It appeared that a sequence of events could occur in which we interpreted a previous migration object as taking over a more recent one. More details are in my comment [1].

1. https://bugzilla.redhat.com/show_bug.cgi?id=2021992#c15
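
To illustrate the kind of sequence described in this comment, here is a minimal sketch of a naive "taken over" check. It is not KubeVirt's code: the `Migration` and `MigrationState` classes, their field names, and the `reconcile` function are hypothetical, and the actual root cause is the one described in the linked PR, which may differ from this model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MigrationState:
    # UID of the migration job that last wrote the VMI's migration state
    # (hypothetical field name, for illustration only).
    migration_uid: str
    completed: bool = False


@dataclass
class Migration:
    uid: str
    phase: str = "Running"


def reconcile(migration: Migration, vmi_state: Optional[MigrationState]) -> str:
    """Return the phase a naive controller would assign to `migration`."""
    if migration.phase in ("Succeeded", "Failed"):
        return migration.phase  # final phases never change
    if vmi_state is not None and vmi_state.migration_uid != migration.uid:
        # The VMI state names a different job, so the naive check concludes
        # that this job was "taken over", whether the other job is newer or older.
        return "Failed"
    if vmi_state is not None and vmi_state.completed:
        return "Succeeded"
    return migration.phase


# Stale view: the VMI still carries the state left behind by the previous, finished job.
stale_state = MigrationState(migration_uid="migration-1", completed=True)
newer = Migration(uid="migration-2")
print(reconcile(newer, stale_state))  # the newer job is marked Failed
```

In this model the UID comparison alone cannot tell an older job's leftover state from a genuinely competing job, which is why the newer migration is reported as failed although nothing went wrong with it.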

Comment 6 lpivarc 2022-02-10 16:36:24 UTC
> This PR should prevent that from ever happening: https://github.com/kubevirt/kubevirt/pull/5242

Are we missing a mutex there, so that two migrations could potentially be submitted?
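
To make the question concrete, the following is a minimal sketch of a check-then-create step guarded by a lock. The `MigrationSubmitter` class and its methods are hypothetical and do not reflect how the linked PR works; the sketch only shows why the check and the create need to be atomic.

```python
import threading


class MigrationSubmitter:
    """Hypothetical in-memory registry allowing one active migration per VMI."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active: dict[str, str] = {}  # VMI name -> migration name

    def submit(self, vmi: str, migration: str) -> bool:
        # The check and the insert happen under one lock, so two concurrent
        # callers cannot both see "no active migration" and both proceed.
        with self._lock:
            if vmi in self._active:
                return False
            self._active[vmi] = migration
            return True

    def finish(self, vmi: str) -> None:
        with self._lock:
            self._active.pop(vmi, None)


submitter = MigrationSubmitter()
results = []
threads = [
    threading.Thread(
        target=lambda name=f"migration-{i}": results.append(submitter.submit("vmi-a", name))
    )
    for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # exactly one submission is accepted
```

An in-process lock like this only covers a single process; it is shown purely to illustrate the race being asked about.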

Comment 7 sgott 2022-03-24 14:22:06 UTC
Deferring because this appears to be a status reporting issue (and because of capacity).

https://bugzilla.redhat.com/show_bug.cgi?id=2052752 was also deferred.

Comment 8 vsibirsk 2022-04-11 14:17:06 UTC
This was also reproduced several times on a 4.10 cluster while running a continuous migration test.
The test used 3 RHEL 8.5 VMs (2 with an OCS disk and 1 with a container disk) and migrated them repeatedly.
The issue occurred at around 100 migrations (300 at most).
Only the VirtualMachineInstanceMigration object was in the "Failed" state.
The VMIs showed the new node, and the new virt-launcher pods were running while the old ones were in the "Complete" state.
The virt-launcher logs also showed that the migration was successful.
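
The observations above could be checked mechanically along these lines. This is a rough sketch that assumes the objects have already been exported as plain dictionaries; the field names used (`spec.vmiName`, `status.phase`, `status.nodeName`, `status.migrationState.completed`, `failed`, `targetNode`) are assumptions about the object layout, and `falsely_failed` is a hypothetical helper, not part of any tool.

```python
def falsely_failed(migrations, vmis):
    """Return names of migrations marked Failed whose VMI nevertheless moved."""
    by_name = {v["metadata"]["name"]: v for v in vmis}
    suspects = []
    for m in migrations:
        if m.get("status", {}).get("phase") != "Failed":
            continue
        vmi = by_name.get(m["spec"]["vmiName"])
        if vmi is None:
            continue
        state = vmi.get("status", {}).get("migrationState") or {}
        # "Moved" means the VMI reports a completed, non-failed migration
        # and is now running on the migration's target node.
        moved = (
            state.get("completed") is True
            and not state.get("failed")
            and vmi["status"].get("nodeName") == state.get("targetNode")
        )
        if moved:
            suspects.append(m["metadata"]["name"])
    return suspects


# Made-up sample data shaped like the assumed layout.
sample_migrations = [
    {"metadata": {"name": "mig-1"}, "spec": {"vmiName": "vm-a"}, "status": {"phase": "Failed"}},
    {"metadata": {"name": "mig-2"}, "spec": {"vmiName": "vm-b"}, "status": {"phase": "Succeeded"}},
]
sample_vmis = [
    {
        "metadata": {"name": "vm-a"},
        "status": {
            "nodeName": "node-2",
            "migrationState": {"completed": True, "failed": False, "targetNode": "node-2"},
        },
    },
    {"metadata": {"name": "vm-b"}, "status": {"nodeName": "node-1"}},
]
print(falsely_failed(sample_migrations, sample_vmis))
```

Only the migration object's phase is treated as suspect here; the VMI status is taken as the source of truth, matching what the virt-launcher logs indicated in this report.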

Comment 9 sgott 2022-04-13 12:13:12 UTC
Retargeting back to 4.11. This issue is blocking scale testing being conducted by QE.

Comment 10 Antonio Cardace 2022-04-15 16:13:44 UTC
I was finally able to reproduce this. I've posted a fix, https://github.com/kubevirt/kubevirt/pull/7582, which explains the root cause in detail.

Comment 11 Antonio Cardace 2022-04-21 11:16:22 UTC
*** Bug 2052752 has been marked as a duplicate of this bug. ***

Comment 16 errata-xmlrpc 2022-09-14 19:28:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6526