Bug 2026357 - Migration in sequence can be reported as failed even when it succeeded
Summary: Migration in sequence can be reported as failed even when it succeeded
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.9.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Antonio Cardace
QA Contact: vsibirsk
URL:
Whiteboard:
Duplicates: 2052752
Depends On:
Blocks: 2077920
 
Reported: 2021-11-24 12:57 UTC by lpivarc
Modified: 2023-11-13 08:17 UTC
CC List: 8 users

Fixed In Version: hco-bundle-registry-container-v4.11.0-254 virt-controller-container-v4.11.0-42
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2077920
Environment:
Last Closed: 2022-09-14 19:28:23 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Issue Tracker CNV-15042 (last updated 2023-11-13 08:17:34 UTC)

Description lpivarc 2021-11-24 12:57:48 UTC
Description of problem:
Performing migrations multiple times in sequence can lead to a migration being reported as failed even when all migrations succeed.
The reported event gives the following reason:

    "reason": "FailedMigration",
    "message": "VMI's migration state was taken over by another migration job during active migration.",
    "source": {
        "component": "virtualmachine-controller"
    },


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
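A minimal reproduction sketch, assuming the Python kubernetes client and an existing, migratable VMI (NAMESPACE, VMI_NAME, and the migration count are illustrative placeholders, not taken from this report):

import time

from kubernetes import client, config

NAMESPACE = "default"   # assumption: adjust to your namespace
VMI_NAME = "my-vmi"     # assumption: an existing, running VMI
MIGRATIONS = 100        # the issue reportedly shows up around ~100 migrations

config.load_kube_config()
crd = client.CustomObjectsApi()

for i in range(MIGRATIONS):
    name = f"migration-{i}"
    body = {
        "apiVersion": "kubevirt.io/v1",
        "kind": "VirtualMachineInstanceMigration",
        "metadata": {"name": name, "namespace": NAMESPACE},
        "spec": {"vmiName": VMI_NAME},
    }
    crd.create_namespaced_custom_object(
        "kubevirt.io", "v1", NAMESPACE,
        "virtualmachineinstancemigrations", body,
    )
    # Poll the migration object until it reaches a terminal phase.
    phase = ""
    while phase not in ("Succeeded", "Failed"):
        time.sleep(5)
        vmim = crd.get_namespaced_custom_object(
            "kubevirt.io", "v1", NAMESPACE,
            "virtualmachineinstancemigrations", name,
        )
        phase = vmim.get("status", {}).get("phase", "")
    print(f"{name}: {phase}")
    if phase == "Failed":
        # With this bug, "Failed" may be spurious: check the virt-launcher
        # logs and the VMI's migrationState before trusting it.
        break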

Comment 1 sgott 2021-11-24 13:10:40 UTC
Because the migration appears to succeed, our impression is that this is a state tracking bug as opposed to a scenario where an active migration was caused to fail. Targeting this for the next release.

Comment 3 sgott 2022-01-26 20:50:04 UTC
Deferring to the next release because it appears to only be a display-related issue.

Comment 4 Jed Lejosne 2022-02-10 14:35:14 UTC
Looks like a duplicate migration was created and took over.
This PR should prevent that from ever happening: https://github.com/kubevirt/kubevirt/pull/5242
Maybe there's still a race condition somewhere?
As far as user impact goes, I do believe it's only cosmetic, but only thanks to libvirt being resilient enough.

Comment 5 David Vossel 2022-02-10 16:15:18 UTC
> Looks like a duplicate migration was created and took over.

I'm not sure now. We saw this again recently here [1], and there wasn't a new migration object. It actually appeared that there is a sequence of events in which we interpret a previous migration object as having taken over a more recent one. More details are in my comment here [1].

1. https://bugzilla.redhat.com/show_bug.cgi?id=2021992#c15
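
To make that suspected sequence of events concrete, here is a toy model in Python (illustrative names only, not the actual virt-controller code) of how reconciling against a stale VMI status could mislabel the newer migration:

# Toy model of the suspected race. The VMI's status records the UID of the
# migration currently driving it. If the controller reconciles the new
# migration against a stale copy of the VMI status that still carries the
# *previous* migration's UID, the UIDs mismatch and the new migration is
# wrongly declared "taken over".
from dataclasses import dataclass


@dataclass
class VMIStatus:
    migration_uid: str  # UID of the migration the VMI believes is active


def reconcile(active_migration_uid: str, observed: VMIStatus) -> str:
    if observed.migration_uid != active_migration_uid:
        # This is the branch that would emit the "FailedMigration" event.
        return "FailedMigration: state taken over by another migration job"
    return "Running"


fresh = VMIStatus(migration_uid="uid-2")   # status after migration 2 started
stale = VMIStatus(migration_uid="uid-1")   # cached status from migration 1

print(reconcile("uid-2", fresh))   # Running
print(reconcile("uid-2", stale))   # spurious FailedMigration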

Comment 6 lpivarc 2022-02-10 16:36:24 UTC
> This PR should prevent that from ever happening: https://github.com/kubevirt/kubevirt/pull/5242

Are we missing a mutex there, so that we could potentially submit 2 migrations?

Comment 7 sgott 2022-03-24 14:22:06 UTC
Deferring because this appears to be a status reporting issue (and due to capacity).

https://bugzilla.redhat.com/show_bug.cgi?id=2052752 was also deferred.

Comment 8 vsibirsk 2022-04-11 14:17:06 UTC
This was also reproduced several times on a 4.10 cluster while running a continuous migration test.
The test ran 3 RHEL 8.5 VMs (2 with an OCS disk and 1 with a container disk) and migrated them multiple times.
The issue occurred at around 100 migrations (300 at most).
Only the VirtualMachineInstanceMigration object was in the "Failed" state.
The VMIs were showing the new node, with new virt-launcher pods running while the old ones were in the "Completed" state.
The virt-launcher logs also showed that the migration was successful. A cross-check sketch follows below.
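
A sketch of such a cross-check, assuming the Python kubernetes client (NAMESPACE is a placeholder; the VMI migrationState fields are read defensively):

# List "Failed" VirtualMachineInstanceMigrations and cross-check the target
# VMI's own migrationState, which this bug leaves reporting success.
from kubernetes import client, config

NAMESPACE = "default"  # assumption: adjust to the test namespace

config.load_kube_config()
crd = client.CustomObjectsApi()

vmims = crd.list_namespaced_custom_object(
    "kubevirt.io", "v1", NAMESPACE, "virtualmachineinstancemigrations"
)
for vmim in vmims.get("items", []):
    if vmim.get("status", {}).get("phase") != "Failed":
        continue
    vmi_name = vmim["spec"]["vmiName"]
    vmi = crd.get_namespaced_custom_object(
        "kubevirt.io", "v1", NAMESPACE, "virtualmachineinstances", vmi_name
    )
    mig_state = vmi.get("status", {}).get("migrationState", {})
    if mig_state.get("completed") and not mig_state.get("failed"):
        print(f"{vmim['metadata']['name']}: reported Failed, "
              f"but VMI {vmi_name} shows a completed migration")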

Comment 9 sgott 2022-04-13 12:13:12 UTC
Re-targeting back to 4.11. This issue is blocking scale testing being conducted by QE.

Comment 10 Antonio Cardace 2022-04-15 16:13:44 UTC
I was finally able to reproduce this; I've posted a fix, https://github.com/kubevirt/kubevirt/pull/7582, which explains the root cause in detail.

Comment 11 Antonio Cardace 2022-04-21 11:16:22 UTC
*** Bug 2052752 has been marked as a duplicate of this bug. ***

Comment 16 errata-xmlrpc 2022-09-14 19:28:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6526

