Bug 2017394 - After upgrade, live migration is Pending

Product: Container Native Virtualization (CNV)
Component: Installation
Version: 4.9.0
Target Release: 4.9.0
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Hardware: Unspecified
OS: Unspecified
Fixed In Version: hco-bundle-registry-container-v4.9.0-262
Reporter: Ruth Netser <rnetser>
Assignee: Simone Tiraboschi <stirabos>
QA Contact: Debarati Basu-Nag <dbasunag>
CC: cnv-qe-bugs, danken, dshchedr, dvossel, giridhar.ramaraju, ibesso, kbidarka, oramraz, pelauter, stirabos, ycui
Clones: 2017802, 2021992 (view as bug list)
Last Closed: 2021-11-02 16:01:44 UTC
Type: Bug
Bug Depends On: 2017573
Bug Blocks: 2017802, 2021992
Description (Ruth Netser, 2021-10-26 12:28:21 UTC)
------------------ What's happening? ------------------

- CNV by default limits in-flight live migrations to 2 per source node.
- The QE environment has 2 migrations stuck in "Pending" because anti-affinity rules prevent the target pods from scheduling.
- All future migrations are blocked until these two migrations complete.

Here's the warning indicating why the target pods can't schedule:

"Warning  FailedScheduling  3h52m  default-scheduler  0/6 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 2 Insufficient bridge.network.kubevirt.io/upg-br-mark, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate."

-------------------------- Why is this happening now? --------------------------

How did we get here when migrations in this environment worked in the past? Automated workload updates attempt to migrate any VMI that is capable of being migrated (the Migratable condition on the VMI is true). The two VMIs now stuck in Pending due to anti-affinity were never migrated in the past because they did not have EvictionStrategy: LiveMigrate. But under automated workload updates these VMIs are technically capable of being migrated, so migration is attempted.

For people on previous CNV versions, this means VMIs that were never migrated during OCP updates in the past (they got a hard shutdown instead) may now be migrated during automated workload updates. If those migrations get stuck in Pending, as in this QE environment, then all future migrations are blocked until those VMIs are either migrated, restarted, or the global in-flight migration limit is increased.

--------------- Path forward ---------------

The root cause of the blocked migrations is not new. It has always been possible for migration target pods to get stuck in Pending and block new migrations at the global level.
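For context, the kind of rule that pins the target pods looks roughly like the sketch below. This is a hypothetical illustration (the `app: myapp` label and values are made up, not taken from this environment), but the mechanism matches the FailedScheduling event above:

```yaml
# Hypothetical VMI spec fragment: every pod labeled app=myapp must land
# on a different node. The migration target virt-launcher pod inherits
# the VMI's affinity, so when no other eligible node is free the target
# pod stays Pending and the migration never starts.
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: myapp
        topologyKey: kubernetes.io/hostname
```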
What is new is that automated workload updates will attempt to migrate VMIs that may not have been migrated in the past, which can lead to this scenario. Here's the recommendation for moving forward.

4.9 - maintain the status quo (don't make the situation worse):
- Disable automatic opt-in of workload updates.

4.9.1 - fix the blocked-migration issue and enable workload updates:
- Add logic to cancel pending migrations after 10m (which will cause a new randomly selected VMI to migrate).
- Add pretty migration CLI output, better migration events, and better logging to indicate when migrations are blocked by global limits (this was far too difficult to understand).
- Enable automatic workload updates now that pending migrations can time out, allowing migrations to be unblocked.

If we take this recommendation, 4.9 will maintain previous behavior (the issue exists as it always has, but is not exacerbated by workload updates) and 4.9.1 will enable workload updates once we have logic to automatically unblock migrations that are stuck in Pending indefinitely.

Thoughts?

(In reply to David Vossel from comment #2)
> If we take this recommendation, 4.9 will maintain previous behavior (issue
> exists as it always has, but isn't exacerbated by workload updates) and
> 4.9.1 will enable workload updates once we have logic to automatically
> unblock when migrations are stuck in pending indefinitely.
>
> Thoughts?

This is not going to solve the issue, but at least it will make it far less impactful, at least no worse than it was before. I sent a couple of upstream PRs to change the HCO default for workloadUpdateMethods to [].

David, you suggest a number of reasonable actions in comment #2. A timeout and re-enabling migrations are both immediately straightforward. What do you feel the level of effort is for better CLI output?

Simone, since the immediate problem is being handled at the HCO level, is Virt the best component?

> What do you feel the level of effort is for better CLI output?
Low. It is a matter of adding more fields to the migration CRD's printable columns. I'd like to see the target VM name and Phase fields added so I can easily pick out which migration belongs to which VM when I run `oc get migrations`.

Otherwise, we're stuck trying to match potentially hundreds of migration objects with VMIs by introspecting migration YAML.
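As a rough sketch of what that CRD change could look like (the column names and JSON paths below are illustrative, not the merged implementation):

```yaml
# Hypothetical additionalPrinterColumns for the
# VirtualMachineInstanceMigration CRD, so `oc get migrations`
# shows the owning VMI and the migration phase at a glance.
additionalPrinterColumns:
- name: Phase
  type: string
  jsonPath: .status.phase
- name: VMI
  type: string
  jsonPath: .spec.vmiName
```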
We should file a doc bug to amend the 4.9.0 documentation and the release note.

@kbidarka already validated this:

===========================
Manual upgrade of 4.8.2 to 4.9.0 with 70 CirrOS VMIs was successful. Each VM had 256Mi of memory, and upon a single node drain, load would go up to 95% on each of the 2 remaining nodes. CirrOS VMs can probably go even lower than 256Mi; maybe 128Mi would also do, but that is something to try next time.

Steps performed:
1) Installed CNV 4.8.2 with OCP 4.8
2) Created 70 VMs with CirrOS images and started them successfully
3) Upgraded OCP 4.8 to 4.9.4
4) CNV 4.8.2 was running successfully on 4.9.4
5) Live migrations were fine during the OCP upgrade to 4.9.4
6) Upgraded CNV from 4.8.2 to 4.9.0
7) All CNV components upgraded successfully to 4.9.0
8) As we dropped "LiveMigrate" as the default behavior, virt-launcher pods were still using the 4.8.2 version
9) Patched the HCO CR with "LiveMigrate" to trigger a mass VMI migration, after which all the virt-launcher pods live migrated and also upgraded successfully to virt-launcher/images/v4.9.0-60

http://pastebin.test.redhat.com/1004548
============================

We also have a good upgrade via a Jenkins job:
https://main-jenkins-csb-cnvqe.apps.ocp-c1.prod.psi.redhat.com/job/cnv-tests-runner/1294/testReport/

Validated that automatic opt-in of workload updates is disabled:

============================
spec:
  certConfig:
    ca:
      duration: 48h0m0s
      renewBefore: 24h0m0s
    server:
      duration: 24h0m0s
      renewBefore: 12h0m0s
  featureGates:
    sriovLiveMigration: false
    withHostPassthroughCPU: false
  infra: {}
  liveMigrationConfig:
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 5
    parallelOutboundMigrationsPerNode: 2
    progressTimeout: 150
  workloads: {}
status:
===========================

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Virtualization 4.9.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4104

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
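For reference, the HCO setting discussed in this bug (defaulted to [] for 4.9 to disable automatic workload updates, and set back to "LiveMigrate" in step 9 of the validation above) would look roughly like the fragment below. The CR name and namespace shown are the usual CNV defaults and are assumptions here, not copied from this environment:

```yaml
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged   # assumed default HCO CR name
  namespace: openshift-cnv        # assumed default CNV namespace
spec:
  workloadUpdateStrategy:
    # [] disables automated workload updates (the 4.9 mitigation);
    # ["LiveMigrate"] opts back in and triggers mass VMI migration.
    workloadUpdateMethods: []
```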