Bug 2021992

Summary: [cnv-4.10.0] After upgrade, live migration is Pending
Product: Container Native Virtualization (CNV)
Reporter: Ruth Netser <rnetser>
Component: Virtualization
Assignee: David Vossel <dvossel>
Status: CLOSED ERRATA
QA Contact: Kedar Bidarkar <kbidarka>
Severity: urgent
Priority: urgent
Version: 4.9.0
CC: cnv-qe-bugs, danken, dbasunag, dshchedr, dvossel, fdeutsch, giridhar.ramaraju, ibesso, kbidarka, lpivarc, oramraz, pelauter, sgott, stirabos, ycui
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: hco-bundle-registry-container-v4.10.0-648 virt-operator-container-v4.10.0-207
Clone Of: 2017394
Last Closed: 2022-03-16 15:56:33 UTC
Bug Depends On: 2017394, 2017573    

Comment 1 Kedar Bidarkar 2021-11-10 14:38:02 UTC
I believe this bug is meant to track the following:

- Add logic to cancel pending migrations after a timeout ( example: 10m) (which will cause a new randomly selected VMI to migrate) 
  ( David Vossel's PR to Add Logic to cancel pending migrations:  https://github.com/kubevirt/kubevirt/pull/6737 )
- Add pretty migration cli output, better migration events, and better logging to indicate when migrations are blocked due to global limits 
- Enable automatic workload updates now that pending migrations can timeout (https://issues.redhat.com/browse/CNV-8762), allowing migrations to be unblocked.

Comment 2 David Vossel 2021-11-10 21:09:55 UTC
(In reply to Kedar Bidarkar from comment #1)
> I believe this bug is to track to have the following:
> 
> - Add logic to cancel pending migrations after a timeout ( example: 10m)
> (which will cause a new randomly selected VMI to migrate) 
>   ( David Vossel's PR to Add Logic to cancel pending migrations: 
> https://github.com/kubevirt/kubevirt/pull/6737 )
> - Add pretty migration cli output, better migration events, and better
> logging to indicate when migrations are blocked due to global limits 
> - Enable automatic workload updates now that pending migrations can timeout
> (https://issues.redhat.com/browse/CNV-8762), allowing migrations to be
> unblocked.


In PR 6737 I target a very specific scenario: Pending target pods that are Unschedulable will time out after 5 minutes.

After giving this issue more thought, I think we need a follow-up PR which will time out a migration when the target pod doesn't reach a running phase within a reasonable period of time.

With the Unschedulable scenario that PR 6737 addresses, I felt comfortable with a 5-minute timeout period because I have a good understanding of exactly why something like this would occur. A "catch all" timeout would need to be longer for me to feel comfortable that it wouldn't be triggered inadvertently (for example, by a slow container image pull). I think for this catch-all we'll likely need to use a more relaxed timeout in the 15-30 minute range.

Comment 3 David Vossel 2021-12-14 22:50:16 UTC
Here's an update on the progress

=======
PR 6737 https://github.com/kubevirt/kubevirt/pull/6737
=======
This PR introduces timeout logic for pending migrations. I enhanced this PR from my previous comment to include a catch-all timeout for migrations stuck in pending for any reason (regardless of whether the target pod is unschedulable). This accounts for issues such as an inability to pull container images, or perhaps some future unknown reason for being stuck in "pending".

The logic for PR 6737 is as follows (a rough sketch is included below).
- a 5m timeout cancels a migration whose target pod has been stuck in an Unschedulable Pending state for > 5m
- a 15m timeout cancels a migration whose target pod has been stuck in Pending for any reason for > 15m
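
For illustration, a minimal sketch of that two-tier timeout decision. This is not the actual code from PR 6737; the package, helper names, and constants are hypothetical.

// Illustrative sketch only -- not the actual KubeVirt code from PR 6737.
// Package and helper names are hypothetical.
package migration

import (
	"time"

	k8sv1 "k8s.io/api/core/v1"
)

const (
	unschedulableTimeout = 5 * time.Minute  // Pending and marked Unschedulable
	catchAllTimeout      = 15 * time.Minute // Pending for any other reason
)

// isPodUnschedulable reports whether the scheduler has marked the pod as
// unschedulable via its PodScheduled condition.
func isPodUnschedulable(pod *k8sv1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == k8sv1.PodScheduled &&
			cond.Status == k8sv1.ConditionFalse &&
			cond.Reason == k8sv1.PodReasonUnschedulable {
			return true
		}
	}
	return false
}

// shouldTimeoutPendingTarget decides whether a migration target pod has been
// stuck in Pending long enough that the migration should be canceled.
func shouldTimeoutPendingTarget(pod *k8sv1.Pod, now time.Time) bool {
	if pod.Status.Phase != k8sv1.PodPending {
		return false
	}
	age := now.Sub(pod.CreationTimestamp.Time)
	if isPodUnschedulable(pod) {
		return age > unschedulableTimeout
	}
	return age > catchAllTimeout
}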

=======
PR 6836 https://github.com/kubevirt/kubevirt/pull/6836
=======

This PR adds pretty CLI output that allows us to easily map migration objects to the VMI they target. Before this, it was exceedingly difficult to determine which migration object was involved with which VMI.

=======
PR 6950 https://github.com/kubevirt/kubevirt/pull/6950
=======

This PR adds log messages to the migration controller that indicate when a global parallel migration limit has been hit. Before this, it was difficult to debug why migrations appeared to be "stuck" because we had no indication that a migration limit was hit. These log messages bring visibility there.
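
As a rough sketch of the kind of check and log line this adds (not the actual PR 6950 change; the function and variable names are hypothetical):

// Illustrative sketch only -- not the actual change from PR 6950.
// Function and variable names are hypothetical.
package migration

import "log"

// logIfGloballyBlocked emits a log line when the number of in-flight
// migrations has reached the cluster-wide parallel migration limit, so a
// "stuck" pending migration can be explained from the controller logs.
func logIfGloballyBlocked(inFlightMigrations, parallelMigrationsPerCluster int) bool {
	if inFlightMigrations >= parallelMigrationsPerCluster {
		log.Printf("migration is waiting: %d migrations in flight, cluster-wide limit is %d",
			inFlightMigrations, parallelMigrationsPerCluster)
		return true
	}
	return false
}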


Status:
PRs 6737 and 6836 are already merged upstream. We're still waiting on the log message PR 6950 to be merged, though.

Comment 4 Debarati Basu-Nag 2022-01-26 17:04:09 UTC
Verified against 4.9.2->4.10.0. Marking as failed QE due to missing garbage collection of old failed vmim objects associated with VMs that cannot be migrated because no appropriate target node exists.
======================
NAMESPACE                                        NAME                             PHASE        VMI
kmp-enabled-for-upgrade                          kubevirt-workload-update-6xvbj   Failed       vm-upgrade-a-1643206358-0244083
kmp-enabled-for-upgrade                          kubevirt-workload-update-7dggt   Failed       vm-upgrade-b-1643206398-419199
kmp-enabled-for-upgrade                          kubevirt-workload-update-7qj4g   Failed       vm-upgrade-a-1643206358-0244083
kmp-enabled-for-upgrade                          kubevirt-workload-update-bpkb5   Failed       vm-upgrade-b-1643206398-419199
kmp-enabled-for-upgrade                          kubevirt-workload-update-cxg5d   Failed       vm-upgrade-a-1643206358-0244083
kmp-enabled-for-upgrade                          kubevirt-workload-update-ff86h   Failed       vm-upgrade-b-1643206398-419199
kmp-enabled-for-upgrade                          kubevirt-workload-update-g4jz9   Failed       vm-upgrade-b-1643206398-419199
kmp-enabled-for-upgrade                          kubevirt-workload-update-hb6g4   Failed       vm-upgrade-a-1643206358-0244083
kmp-enabled-for-upgrade                          kubevirt-workload-update-k2n5w   Failed       vm-upgrade-a-1643206358-0244083
kmp-enabled-for-upgrade                          kubevirt-workload-update-mtpsl   Failed       vm-upgrade-b-1643206398-419199
kmp-enabled-for-upgrade                          kubevirt-workload-update-n6wnl   Failed       vm-upgrade-b-1643206398-419199
kmp-enabled-for-upgrade                          kubevirt-workload-update-pzcmm   Pending      vm-upgrade-b-1643206398-419199
kmp-enabled-for-upgrade                          kubevirt-workload-update-tzv2j   Failed       vm-upgrade-b-1643206398-419199
kmp-enabled-for-upgrade                          kubevirt-workload-update-v2jrr   Failed       vm-upgrade-a-1643206358-0244083
kmp-enabled-for-upgrade                          kubevirt-workload-update-w842s   Pending      vm-upgrade-a-1643206358-0244083
kmp-enabled-for-upgrade                          kubevirt-workload-update-xzchr   Failed       vm-upgrade-a-1643206358-0244083
=====================

Comment 5 sgott 2022-01-26 21:52:46 UTC
Proposing this as a release blocker. Preventing VMs from being able to migrate can affect the ability to update.

Comment 6 David Vossel 2022-01-26 21:56:17 UTC
PR posted which handles the migration garbage collection. https://github.com/kubevirt/kubevirt/pull/7144

With this PR, only the most recent 5 finalized migration objects for a VMI will remain in the system. This prevents an indefinitely long list of migration objects from building up for a single vmi.
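
As a rough sketch of the garbage-collection idea (keep only the newest N finalized migration objects per VMI); this is not the code from PR 7144, and the type and names are hypothetical:

// Illustrative sketch only -- not the actual code from PR 7144.
// The finalizedMigration type and keepCount value mirror the idea of
// retaining only the newest finalized migration objects per VMI.
package migration

import (
	"sort"
	"time"
)

const keepCount = 5 // finalized migration objects retained per VMI

type finalizedMigration struct {
	Name              string
	CreationTimestamp time.Time
}

// migrationsToGarbageCollect returns the finalized migrations for a single
// VMI that should be deleted, keeping only the keepCount newest objects.
func migrationsToGarbageCollect(finalized []finalizedMigration) []finalizedMigration {
	if len(finalized) <= keepCount {
		return nil
	}
	// Sort newest first; everything after the first keepCount entries goes.
	sort.Slice(finalized, func(i, j int) bool {
		return finalized[i].CreationTimestamp.After(finalized[j].CreationTimestamp)
	})
	return finalized[keepCount:]
}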

Comment 7 Fabian Deutsch 2022-01-28 10:09:46 UTC
Please clarify:
1. For the failed migrations: Were they expected to fail?
2. Please clarify if this bug is about a) a failing migration or b) a display bug around the list of failed migrations

Comment 8 David Vossel 2022-01-28 13:21:55 UTC
(In reply to Fabian Deutsch from comment #7)
> Please clarify:
> 1. For the failed migrations: Were they expected to fail?

Yes, these were the VMIs in the test that were supposed to fail live migration because only a single node has the correct resources to run them... so they'd get stuck in pending forever. The test is supposed to verify that those pending migrations eventually time out, which they do now.

The automated workload updates are eventually consistent, so those VMIs will continually get live migration re-tried over time. Since we don't garbage collect old migration objects, this leads to an ever-growing list of migration objects associated with a VMI.

The (hopefully) final change here is that live migrations need to be garbage collected, which is what my PR https://github.com/kubevirt/kubevirt/pull/7144 addresses.

> 2. Please clarify if this bug is about a) a failing migration or b) a
> display bug around the list of failed migrations

Neither, but closer to (b). The bug was kicked back from ON_QA because finalized live migrations are never garbage collected, which can lead to an indefinitely long (and growing) list of migration objects per VMI.

I had concerns about the number of migration objects we could potentially be storing in etcd under load, which is why I wanted this issue of garbage collecting migrations addressed here.

Comment 14 Debarati Basu-Nag 2022-02-09 20:56:37 UTC
@dvossel I checked the associated vmi; it looks like the image was up to date. It was pointing to:
launcherContainerImageVersion: registry.redhat.io/container-native-virtualization/virt-launcher@sha256:7e52dd9a08df07e909a729a941e6360250ab95fc3ee06c04c3d74537aed9f513
I cross referenced it with csv.spec.relatedImages and other migrated vms.
[cnv-qe-jenkins@c01-cnv410-upg-2dx9x-executor ~]$ kubectl get csv kubevirt-hyperconverged-operator.v4.10.0 -n openshift-cnv -o json | jq ".spec.relatedImages" | grep launcher
    "image": "registry.redhat.io/container-native-virtualization/virt-launcher@sha256:7e52dd9a08df07e909a729a941e6360250ab95fc3ee06c04c3d74537aed9f513",
    "name": "registry.redhat.io/container-native-virtualization/virt-launcher:v4.10.0-210"
[cnv-qe-jenkins@c01-cnv410-upg-2dx9x-executor ~]$ 

So in this case it looks like the migration actually completed (as noted in the virt-controller log), but the associated vmim continued to report its status as Failed and no other vmim was created. Please advise if a separate bz should be created for this.

Comment 15 David Vossel 2022-02-09 21:29:59 UTC
> So in this case it looks like the migration actually completed as noted in virt-controller log, the associated vmim continued to report status as Failed and no other vmim was created. Please advice if a separate bz should be created for this.

Here's my thought.

The original issue we needed to solve involved the entire cluster blocking current and future migrations indefinitely due to migration objects being stuck in pending. I believe we've proven that issue is resolved. Pending migrations now time out and new migrations can continue to be processed in time.

I believe the new issue you've identified is a real problem that needs to be investigated further; however, based on the changes involved with this BZ, I think it is unlikely that any of the code changes created for this PR are contributing to the new issue (anything is possible, though).

For the new issue, what has most likely occurred is that the migration controller's timing has slightly altered in this new CNV release in a way that causes the migration reconcile loop to pop slightly quicker during the VMI target pod handoff to virt-handler. Looking at the code, I can see how the migration controller might interpret that a vmi has been taken over by a new migration when in fact the controller is looking at a slightly outdated VMI that has data associated with a previous migration.

The flow would look like this (a rough sketch of the stale-state check follows the list):

1. VMI is migrated successfully previously (and has vmi.Status.MigrationState associated with it from previous migration)
2. New Migration is created and vmi.Status.MigrationState from previous migration is wiped clean
3. Reconcile loop pops for new migration before the VMI informer catches up to the new MigrationState (meaning that the old migration state is still present)
4. Migration controller interprets old migration state on the VMI as being a "newer" migration taking over the VMI.
5. Migration transitions to Failed... however, since the handoff to virt-handler did actually occur, the migration continues and succeeds.
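
For illustration, a minimal sketch of the stale-state check described in steps 3-4, assuming the VMI status records the UID of the migration currently driving it (as vmi.Status.MigrationState does in KubeVirt). The helper name and local type are hypothetical; this is not the actual migration controller code.

// Illustrative sketch only -- not the actual migration controller code.
// takenOverByAnotherMigration is a hypothetical helper showing how a stale,
// informer-cached VMI that still carries the previous migration's UID can be
// misread as the VMI having been handed over to a newer migration.
package migration

import "k8s.io/apimachinery/pkg/types"

type migrationState struct {
	MigrationUID types.UID
}

// takenOverByAnotherMigration returns true when the migration UID recorded on
// the (possibly stale) cached VMI differs from the migration being reconciled.
// If the cache has not yet caught up (step 3 above), the previous migration's
// UID produces a false positive and the migration is marked Failed even though
// the handoff to virt-handler succeeded.
func takenOverByAnotherMigration(cachedState *migrationState, currentMigrationUID types.UID) bool {
	return cachedState != nil &&
		cachedState.MigrationUID != "" &&
		cachedState.MigrationUID != currentMigrationUID
}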

So, all that said, my recommendation is that we make a new BZ to track the newly identified issue and treat it independently.

Comment 16 Debarati Basu-Nag 2022-02-09 22:49:41 UTC
Thanks David. I have logged https://bugzilla.redhat.com/show_bug.cgi?id=2052752 to track the newly identified issue. Marking this BZ as verified, as garbage collection of failed vmims is working as expected and cluster-wide current/future migrations no longer appear to be blocked.

Both verified against 4.10.0-662

Comment 21 errata-xmlrpc 2022-03-16 15:56:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0947