Bug 2021992
| Summary: | [cnv-4.10.0] After upgrade, live migration is Pending | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Ruth Netser <rnetser> |
| Component: | Virtualization | Assignee: | David Vossel <dvossel> |
| Status: | CLOSED ERRATA | QA Contact: | Kedar Bidarkar <kbidarka> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.9.0 | CC: | cnv-qe-bugs, danken, dbasunag, dshchedr, dvossel, fdeutsch, giridhar.ramaraju, ibesso, kbidarka, lpivarc, oramraz, pelauter, sgott, stirabos, ycui |
| Target Milestone: | --- | | |
| Target Release: | 4.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | hco-bundle-registry-container-v4.10.0-648 virt-operator-container-v4.10.0-207 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 2017394 | Environment: | |
| Last Closed: | 2022-03-16 15:56:33 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2017394, 2017573 | | |
| Bug Blocks: | | | |
Comment 1
Kedar Bidarkar
2021-11-10 14:38:02 UTC
(In reply to Kedar Bidarkar from comment #1)
> I believe this bug is to track the following:
>
> - Add logic to cancel pending migrations after a timeout (example: 10m),
>   which will cause a new randomly selected VMI to migrate.
>   (David Vossel's PR to add logic to cancel pending migrations:
>   https://github.com/kubevirt/kubevirt/pull/6737)
> - Add pretty migration CLI output, better migration events, and better
>   logging to indicate when migrations are blocked due to global limits.
> - Enable automatic workload updates now that pending migrations can time out
>   (https://issues.redhat.com/browse/CNV-8762), allowing migrations to be
>   unblocked.

PR 6737 targets a very specific scenario: Pending target pods that are Unschedulable time out after 5 minutes. After giving this issue more thought, I think we need a follow-up PR that times out a migration when the target pod doesn't reach the Running phase after a reasonable period of time. With the Unschedulable scenario that PR 6737 addresses, I felt comfortable with a 5-minute timeout period because I have a good understanding of exactly why something like this would occur. A "catch-all" timeout would need to be longer for me to feel comfortable that it wouldn't be triggered inadvertently (for example, by a slow container image pull). For this catch-all, we'll likely need a more relaxed timeout in the 15-30 minute range.

Here's an update on the progress.

======= PR 6737 https://github.com/kubevirt/kubevirt/pull/6737 =======

This PR introduces timeout logic for pending migrations. I enhanced it since my previous comment to include a catch-all timeout for migrations stuck in Pending for any reason (regardless of whether the target pod is unschedulable). This accounts for issues such as an inability to pull container images, or some future unknown reason for being stuck in Pending.

The logic for PR 6737 is as follows:
- A 5m timeout cancels any migration whose target pod is stuck specifically in an "unschedulable" Pending state for more than 5 minutes.
- A 15m timeout cancels any migration whose target pod is stuck in Pending for any reason for more than 15 minutes.

======= PR 6836 https://github.com/kubevirt/kubevirt/pull/6836 =======

This PR adds pretty CLI output that allows us to easily map migration objects to the VMI they target. Before this, it was exceedingly difficult to determine which migration object was involved with which VMI.

======= PR 6950 https://github.com/kubevirt/kubevirt/pull/6950 =======

This PR adds log messages to the migration controller that indicate when a global parallel migration limit has been hit. Before this, it was difficult to debug why migrations appeared to be "stuck" because we had no indication that a migration limit had been hit. These log messages bring visibility there.

Status: PR 6737 and PR 6836 are already merged upstream. We're still waiting on the log message PR 6950 to be merged.

Verified against 4.9.2 -> 4.10.0. Marking as failed QE for garbage collection of old failed VMIM objects associated with VMs that cannot be migrated because an appropriate target node does not exist.
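As a quick illustration of how these changes surface in practice (a sketch only; object and namespace names are placeholders, and `vmim` is the short name for the VirtualMachineInstanceMigration resource):

  # With the pretty-output change, the PHASE and VMI columns make it easy to
  # map each migration object to its target VMI.
  $ kubectl get vmim -A

  # Events on a specific migration object can show why it is stuck in Pending
  # before the timeout cancels it.
  $ kubectl describe vmim <migration-name> -n <namespace>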
======================
NAMESPACE                 NAME                             PHASE     VMI
kmp-enabled-for-upgrade   kubevirt-workload-update-6xvbj   Failed    vm-upgrade-a-1643206358-0244083
kmp-enabled-for-upgrade   kubevirt-workload-update-7dggt   Failed    vm-upgrade-b-1643206398-419199
kmp-enabled-for-upgrade   kubevirt-workload-update-7qj4g   Failed    vm-upgrade-a-1643206358-0244083
kmp-enabled-for-upgrade   kubevirt-workload-update-bpkb5   Failed    vm-upgrade-b-1643206398-419199
kmp-enabled-for-upgrade   kubevirt-workload-update-cxg5d   Failed    vm-upgrade-a-1643206358-0244083
kmp-enabled-for-upgrade   kubevirt-workload-update-ff86h   Failed    vm-upgrade-b-1643206398-419199
kmp-enabled-for-upgrade   kubevirt-workload-update-g4jz9   Failed    vm-upgrade-b-1643206398-419199
kmp-enabled-for-upgrade   kubevirt-workload-update-hb6g4   Failed    vm-upgrade-a-1643206358-0244083
kmp-enabled-for-upgrade   kubevirt-workload-update-k2n5w   Failed    vm-upgrade-a-1643206358-0244083
kmp-enabled-for-upgrade   kubevirt-workload-update-mtpsl   Failed    vm-upgrade-b-1643206398-419199
kmp-enabled-for-upgrade   kubevirt-workload-update-n6wnl   Failed    vm-upgrade-b-1643206398-419199
kmp-enabled-for-upgrade   kubevirt-workload-update-pzcmm   Pending   vm-upgrade-b-1643206398-419199
kmp-enabled-for-upgrade   kubevirt-workload-update-tzv2j   Failed    vm-upgrade-b-1643206398-419199
kmp-enabled-for-upgrade   kubevirt-workload-update-v2jrr   Failed    vm-upgrade-a-1643206358-0244083
kmp-enabled-for-upgrade   kubevirt-workload-update-w842s   Pending   vm-upgrade-a-1643206358-0244083
kmp-enabled-for-upgrade   kubevirt-workload-update-xzchr   Failed    vm-upgrade-a-1643206358-0244083
=====================

Proposing this as a release blocker. Preventing VMs from being able to migrate can affect the ability to update.

A PR which handles the migration garbage collection has been posted: https://github.com/kubevirt/kubevirt/pull/7144

With this PR, only the most recent 5 finalized migration objects for a VMI remain in the system. This prevents an indefinitely long list of migration objects from building up for a single VMI.

Please clarify:
1. For the failed migrations: were they expected to fail?
2. Is this bug about (a) a failing migration or (b) a display bug around the list of failed migrations?

(In reply to Fabian Deutsch from comment #7)
> 1. For the failed migrations: were they expected to fail?

Yes, these were the VMIs in the test that were supposed to fail live migration because they only have the resources to run on a single node, so they'd get stuck in Pending forever. The test is supposed to verify that those pending migrations eventually time out, which they do now.

The automated workload updates are eventually consistent, so live migration of those VMIs will continually be retried over time. Since we don't garbage collect old migration objects, this leads to an ever-growing list of migration objects associated with a VMI. The (hopefully) final change here is that live migrations need to be garbage collected, which is what my PR https://github.com/kubevirt/kubevirt/pull/7144 addresses.

> 2. Is this bug about (a) a failing migration or (b) a display bug around the list of failed migrations?

Neither, but closer to (b). The bug was kicked back from ON_QA because finalized live migrations are never garbage collected, which can lead to an indefinitely long (and growing) list of migration objects per VMI. I had concerns about the number of migration objects we could potentially be storing in etcd under load, which is why I wanted the issue of garbage collecting migrations addressed here.
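To double-check the garbage-collection behaviour described above, something like the following can be used (a sketch; it assumes the migration object references its target VMI via `.spec.vmiName`, and uses the namespace from the listing above):

  # Count migration objects per target VMI; with the garbage collection from
  # PR 7144 in place, no single VMI should accumulate more than the 5 most
  # recent finalized migrations (plus any in-flight one).
  $ kubectl get vmim -n kmp-enabled-for-upgrade -o json \
      | jq -r '.items[].spec.vmiName' | sort | uniq -c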
@dvossel I checked the associated VMI; it looks like the image was up to date. It was pointing to:

launcherContainerImageVersion: registry.redhat.io/container-native-virtualization/virt-launcher@sha256:7e52dd9a08df07e909a729a941e6360250ab95fc3ee06c04c3d74537aed9f513

I cross-referenced it with csv.spec.relatedImages and other migrated VMs.

[cnv-qe-jenkins@c01-cnv410-upg-2dx9x-executor ~]$ kubectl get csv kubevirt-hyperconverged-operator.v4.10.0 -n openshift-cnv -o json | jq ".spec.relatedImages" | grep launcher
    "image": "registry.redhat.io/container-native-virtualization/virt-launcher@sha256:7e52dd9a08df07e909a729a941e6360250ab95fc3ee06c04c3d74537aed9f513",
    "name": "registry.redhat.io/container-native-virtualization/virt-launcher:v4.10.0-210"
[cnv-qe-jenkins@c01-cnv410-upg-2dx9x-executor ~]$

So in this case it looks like the migration actually completed, as noted in the virt-controller log, but the associated VMIM continued to report its status as Failed and no other VMIM was created. Please advise whether a separate BZ should be created for this.

> So in this case it looks like the migration actually completed, as noted in the virt-controller log, but the associated VMIM continued to report its status as Failed and no other VMIM was created. Please advise whether a separate BZ should be created for this.
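For reference, the launcher image recorded on the VMI can also be read directly and compared with the CSV entry above (a sketch; the VMI name and namespace are placeholders, and the field path mirrors the launcherContainerImageVersion status field quoted earlier):

  # Print the launcher image the VMI is currently reporting.
  $ kubectl get vmi <vmi-name> -n <namespace> \
      -o jsonpath='{.status.launcherContainerImageVersion}{"\n"}'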
Here's my thought.
The original issue we needed to solve involved the entire cluster blocking current and future migrations indefinitely due to migration objects being stuck in Pending. I believe we've proven that issue is resolved: pending migrations now time out, and new migrations can continue to be processed in time.
I believe the new issue you've identified is a real problem that needs to be investigated further; however, based on the changes involved with this BZ, I think it is unlikely that any of the code changes made for it are contributing to the new issue (anything is possible, though).
For the new issue, what has most likely occurred is that the migration controller's timing has changed slightly in this new CNV release, in a way that causes the migration reconcile loop to pop slightly sooner during the handoff of the VMI's target pod to virt-handler. Looking at the code, I can see how the migration controller might conclude that a VMI has been taken over by a new migration when in fact the controller is looking at a slightly outdated VMI that still carries data from a previous migration.
The flow would look like this (a small diagnostic sketch follows the list):
1. The VMI was previously migrated successfully (and still has vmi.Status.MigrationState associated with it from that previous migration).
2. A new migration is created, and vmi.Status.MigrationState from the previous migration is wiped clean.
3. The reconcile loop pops for the new migration before the VMI informer catches up to the new MigrationState (meaning the old migration state is still present).
4. The migration controller interprets the old migration state on the VMI as a "newer" migration taking over the VMI.
5. The migration transitions to Failed; however, since the handoff to virt-handler did actually occur, the migration continues and succeeds.
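A rough way to observe this from the outside (a diagnostic sketch only; object names are placeholders and the field paths are assumptions based on the fields discussed in this comment, not a confirmed procedure) is to compare the migration state the VMI currently reports with the UID of the migration object that was marked Failed:

  # If the VMI still reports the previous migration's state/UID while the new
  # migration object is being reconciled, the informer lag described above is
  # a plausible explanation.
  $ kubectl get vmi <vmi-name> -n <namespace> \
      -o jsonpath='{.status.migrationState}{"\n"}'
  $ kubectl get vmim <migration-name> -n <namespace> \
      -o jsonpath='{.metadata.uid}{"\n"}'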
So, all that said, my recommendation is that we create a new BZ to track the newly identified issue and treat it independently.
Thanks David. I have logged https://bugzilla.redhat.com/show_bug.cgi?id=2052752 to track the newly identified issue. Marking this BZ as verified, as garbage collection of failed VMIMs is working as expected and cluster-wide current/future migration no longer appears to be blocked. Both were verified against 4.10.0-662.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0947