Bug 1862701
Summary: OpenShift Virtualization defaults to LiveMigration eviction strategy even when not available (VM creation fails)
Product: Container Native Virtualization (CNV)
Component: Installation
Version: 2.4.0
Target Release: 4.8.0
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Stephen Gordon <sgordon>
Assignee: Simone Tiraboschi <stirabos>
QA Contact: ibesso <ibesso>
CC: alitke, cnv-qe-bugs, fdeutsch, ibesso, rmohr, stirabos
Hardware: Unspecified
OS: Unspecified
Fixed In Version: hco-bundle-registry:v4.8.0-312
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2021-07-27 14:20:49 UTC

Attachments:
Created attachment 1703170 [details]
CREATE-VM2
Created attachment 1703171 [details]
CREATE-VM3
Created attachment 1703172 [details]
CREATEVM-4
Created attachment 1703173 [details]
EVICTION-FAILURE
Just to be clear, when I say "As a user I had no ability to edit this via the UI (not even a hatch to hand edit the XML^WYAML)." - I know I can copy the YAML from the error, back all the way out of the wizard, and come back in as create-from-YAML, but this seems suboptimal.

None of the user, the UI, or the template can realistically know whether live migration is supportable for a given VM. Only kubevirt knows this (at the time a migration is attempted). Perhaps we should add two new evictionStrategies to cover the desired fallback behavior when a LiveMigration is not possible or fails:

1. LiveMigrateOrShutdown: migrate the VM if possible, otherwise shut it down.
2. LiveMigrateOrBlock: migrate the VM if possible, otherwise keep the VM running.

Option 1 would be useful if you can tolerate some downtime in order to facilitate cluster maintenance or upgrades.
Option 2 should be used when manual intervention is required or to prevent unplanned downtime.

(In reply to Adam Litke from comment #6)
> None of the user, the UI, or the template can realistically know whether
> live migration is supportable for a given VM. Only kubevirt knows this (at
> the time a migration is attempted). Perhaps we should add two new
> evictionStrategies to cover the desired fallback behavior when a
> LiveMigration is not possible or fails:
>
> 1. LiveMigrateOrShutdown: migrate the VM if possible, otherwise shut it down.
> 2. LiveMigrateOrBlock: migrate the VM if possible, otherwise keep the VM
> running.
>
> Option 1 would be useful if you can tolerate some downtime in order to
> facilitate cluster maintenance or upgrades.
> Option 2 should be used when manual intervention is required or to prevent
> unplanned downtime.

We can rethink whether we now want to enable migrations in general upstream. But the feature-gate is there to indicate that something is not quite good enough to be used yet. For OpenShift Virtualization the decision was to enable migrations. The feature-gate should just be enabled.
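The two proposed strategies are suggestions from this discussion, not existing kubevirt eviction strategies. A minimal sketch of the intended fallback semantics (hypothetical names, purely illustrative):

```python
# Sketch of the *proposed* fallback semantics. LiveMigrateOrShutdown and
# LiveMigrateOrBlock do not exist in kubevirt; they are suggestions from
# this bug's discussion.

def evict(strategy: str, migratable: bool) -> str:
    """Return what would happen to a VM when its node is drained."""
    if strategy == "LiveMigrateOrShutdown":
        # Tolerates downtime to let cluster maintenance proceed.
        return "migrated" if migratable else "shut down"
    if strategy == "LiveMigrateOrBlock":
        # Prevents unplanned downtime; drain waits for manual intervention.
        return "migrated" if migratable else "kept running (drain blocked)"
    raise ValueError(f"unknown strategy: {strategy}")

assert evict("LiveMigrateOrShutdown", migratable=True) == "migrated"
assert evict("LiveMigrateOrShutdown", migratable=False) == "shut down"
assert evict("LiveMigrateOrBlock", migratable=False) == "kept running (drain blocked)"
```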
There is no UI issue from that perspective. The only question is why the feature gate is not enabled although it should be.

(In reply to Adam Litke from comment #6)
> None of the user, the UI, or the template can realistically know whether
> live migration is supportable for a given VM. Only kubevirt knows this (at
> the time a migration is attempted). Perhaps we should add two new
> evictionStrategies to cover the desired fallback behavior when a
> LiveMigration is not possible or fails:
>
> 1. LiveMigrateOrShutdown: migrate the VM if possible, otherwise shut it down.
> 2. LiveMigrateOrBlock: migrate the VM if possible, otherwise keep the VM
> running.
>
> Option 1 would be useful if you can tolerate some downtime in order to
> facilitate cluster maintenance or upgrades.
> Option 2 should be used when manual intervention is required or to prevent
> unplanned downtime.

Yes, that could be a nice addition. But first let's ensure that the feature gate is enabled again, as it should be.

Simone, this may be in the HCO area. Could you have a look?

The relevant piece of code is here:
https://github.com/kubevirt/hyperconverged-cluster-operator/blob/master/pkg/controller/hyperconverged/hyperconverged_controller.go#L779-L811

Only the values for virtconfig.SmbiosConfigKey, virtconfig.MachineTypeKey and virtconfig.UseEmulationKey take their values from env variables defined at CSV level. Everything else (including virtconfig.FeatureGatesKey) uses a value that is exactly the same between upstream and downstream.

Only virtconfig.SmbiosConfigKey, virtconfig.MachineTypeKey, virtconfig.SELinuxLauncherTypeKey and virtconfig.UseEmulationKey are going to be manipulated by HCO, and only during HCO upgrades. This means that virtconfig.FeatureGatesKey is not currently going to be modified by HCO at all.
This is because a ConfigMap is generally considered a store for configuration data, decoupling it from application code; admins tend to think that they can change it and that their modifications will survive there. Having an operator that continuously reconciles a ConfigMap from data embedded in the code of the operator itself is a kind of anti-pattern.

(In reply to Simone Tiraboschi from comment #10)
> The relevant piece of code is here:
> https://github.com/kubevirt/hyperconverged-cluster-operator/blob/master/pkg/
> controller/hyperconverged/hyperconverged_controller.go#L779-L811
>
> Only the values for virtconfig.SmbiosConfigKey, virtconfig.MachineTypeKey
> and virtconfig.UseEmulationKey take their values from env variables defined
> at CSV level.
> Everything else (including virtconfig.FeatureGatesKey) uses a value that is
> exactly the same between upstream and downstream.
>
> Only virtconfig.SmbiosConfigKey, virtconfig.MachineTypeKey,
> virtconfig.SELinuxLauncherTypeKey and
> virtconfig.UseEmulationKey are going to be manipulated by HCO, and only
> during HCO upgrades.
>
> This means that virtconfig.FeatureGatesKey is not currently going to be
> modified by HCO at all.
>
> This is because a ConfigMap is generally considered a store for
> configuration data, decoupling it from application code; admins tend to
> think that they can change it and that their modifications will survive
> there.
> Having an operator that continuously reconciles a ConfigMap from data
> embedded in the code of the operator itself is a kind of anti-pattern.

Not sure I understand your reply. Is there another issue where we look into how Steve ended up with a cluster without migration enabled? If not, my request would have been to understand how this could happen. I can't follow your response.
I was just trying to say that the LiveMigration feature-gate is always set by HCO, upstream and downstream, on the kubevirt-config config map and, currently, we have no way to handle it differently between upstream and downstream, or to manipulate it during upgrades.

Created attachment 1710774 [details]
kubevirt-config CM
I just tried reproducing it and, as expected, I can confirm that HCO configures the LiveMigration feature-gate in the kubevirt-config CM.

I tried reproducing the VM exactly as described, but the VM started correctly for me (I'm going to open another UX bug about a lot of noisy events during the startup process).

Stephen, do you still have that environment? Was it deployed from scratch starting with 2.4.0, or is it an upgrade from the past? Can you please share the output of:
oc get ConfigMap -n openshift-cnv kubevirt-config -o yaml

(In reply to Simone Tiraboschi from comment #15)
> I tried reproducing the VM exactly as described, but the VM started
> correctly for me (I'm going to open another UX bug about a lot of noisy
> events during the startup process).

see also: https://bugzilla.redhat.com/1867122

(In reply to Simone Tiraboschi from comment #15)
> I tried reproducing the VM exactly as described, but the VM started
> correctly for me (I'm going to open another UX bug about a lot of noisy
> events during the startup process).
>
> Stephen, do you still have that environment?
> Was it deployed from scratch starting with 2.4.0, or is it an upgrade from
> the past?
> Can you please share the output of:
> oc get ConfigMap -n openshift-cnv kubevirt-config -o yaml

I don't, but I can do it again. It seemed pretty consistently reproducible. This was a fresh install.
kind: ConfigMap
apiVersion: v1
metadata:
  selfLink: /api/v1/namespaces/openshift-cnv/configmaps/kubevirt-config
  resourceVersion: '31254'
  name: kubevirt-config
  uid: 1b1f3871-ca08-4952-b5b2-e4d63c37e979
  creationTimestamp: '2020-08-14T16:21:16Z'
  managedFields:
    - manager: hyperconverged-cluster-operator
      operation: Update
      apiVersion: v1
      time: '2020-08-14T16:21:16Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:data':
          .: {}
          'f:feature-gates': {}
          'f:machine-type': {}
          'f:migrations': {}
          'f:selinuxLauncherType': {}
          'f:smbios': {}
        'f:metadata':
          'f:labels':
            .: {}
            'f:app': {}
          'f:ownerReferences':
            .: {}
            'k:{"uid":"dee1214a-7ee0-4386-8815-10ae7b091a51"}':
              .: {}
              'f:apiVersion': {}
              'f:blockOwnerDeletion': {}
              'f:controller': {}
              'f:kind': {}
              'f:name': {}
              'f:uid': {}
  namespace: openshift-cnv
  ownerReferences:
    - apiVersion: hco.kubevirt.io/v1alpha1
      kind: HyperConverged
      name: kubevirt-hyperconverged
      uid: dee1214a-7ee0-4386-8815-10ae7b091a51
      controller: true
      blockOwnerDeletion: true
  labels:
    app: kubevirt-hyperconverged
data:
  feature-gates: 'DataVolumes,SRIOV,LiveMigration,CPUManager,CPUNodeDiscovery,Sidecar'
  machine-type: pc-q35-rhel8.2.0
  migrations: '{"nodeDrainTaintKey" : "node.kubernetes.io/unschedulable"}'
  selinuxLauncherType: virt_launcher.process
  smbios: |-
    Family: Red Hat
    Product: Container-native virtualization
    Manufacturer: Red Hat
    Sku: 2.4.0
    Version: 2.4.0

Notably though, this time my virtual machine actually ran. I am wondering if the original issue was that I followed one of our instruction sets where there was a config entry to create before deploying (e.g. useEmulation). In that situation I would already have had a (pretty empty) kubevirt-config, but I am guessing the installation doesn't attempt to reconcile this at all and just assumes that if one is there it covers everything needed.

Steve, yes - that sounds like the root cause. Usually the "useEmulation" snippets do not contain the feature gate to also enable live migration.
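The suspected root cause - a kubevirt-config pre-created by the user that the installation never reconciles - can be modeled with a small sketch (hypothetical helper names; this is an illustration of the behavior described above, not the actual HCO code):

```python
from typing import Optional

# Sketch of "create the ConfigMap only if it is missing" semantics: a
# kubevirt-config pre-created by the user (e.g. to set useEmulation) is
# left as-is, so the default feature gates (including LiveMigration) are
# never added. Names are hypothetical, not the real HCO implementation.

DEFAULT_DATA = {
    "feature-gates": "DataVolumes,SRIOV,LiveMigration,CPUManager,CPUNodeDiscovery,Sidecar",
}

def ensure_configmap(existing: Optional[dict]) -> dict:
    """Return the ConfigMap data the cluster ends up with."""
    if existing is not None:
        # A pre-existing map wins: nothing is merged in.
        return existing
    return dict(DEFAULT_DATA)

# Fresh install: the defaults (with LiveMigration) are created.
fresh = ensure_configmap(None)
assert "LiveMigration" in fresh["feature-gates"]

# User pre-created the map just to enable emulation: the gate is missing,
# and any VM whose template sets evictionStrategy fails validation.
pre_created = ensure_configmap({"debug.useEmulation": "true"})
assert "feature-gates" not in pre_created
```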
We learned that HCO is configuring CNV to enable live migration, thus this is good. Because live migration is enabled by default in CNV, it is ok for the templates to rely on this and enable features (evictionStrategy) that assume it.

I see two ways:
1. Have HCO also reconcile kubevirt-config - which is ok, because HCO is an operator and has the operational knowledge of how kubevirt-config should look.
2. Introduce an alert to fire when this bug - or inconsistency - is discovered.

For (1), if a user wants to do custom configurations with the kubevirt-config, then the operator needs to be paused (a common pattern).

Moving this to HCO in order to implement (1).

(In reply to Fabian Deutsch from comment #20)
> Steve, yes - that sounds like the root cause. Usually the "useEmulation"
> snippets do not contain the feature gate to also enable live migration.
>
> We learned that HCO is configuring CNV to enable live migration, thus this
> is good.
> Because live migration is enabled by default in CNV, it is ok for the
> templates to rely on this and enable features (evictionStrategy) that
> assume it.
>
> I see two ways:
> 1. Have HCO also reconcile kubevirt-config - which is ok, because HCO is
> an operator and has the operational knowledge of how kubevirt-config should
> look.

Currently HCO is already reconciling it, but only virtconfig.SmbiosConfigKey, virtconfig.MachineTypeKey, virtconfig.SELinuxLauncherTypeKey and virtconfig.UseEmulationKey are going to be manipulated by HCO, and only during HCO upgrades. This means that virtconfig.FeatureGatesKey is not currently going to be modified by HCO at all.

Are we talking about adding virtconfig.FeatureGatesKey to that list, or about continuously reconciling the whole ConfigMap?

HCO is managing kubevirt, thus it must be managing the complete kubevirt-config. If somebody modifies kubevirt-config in any regard, then HCO needs to reconcile it back into a known good state.
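The "reconcile everything, customize only while paused" pattern proposed here can be sketched as follows (hypothetical names; a model of the pattern, not the actual HCO reconcile loop):

```python
# Sketch of full continuous reconciliation with an explicit pause escape
# hatch, as proposed in this discussion. Not the real HCO code.

KNOWN_GOOD = {
    "feature-gates": "DataVolumes,SRIOV,LiveMigration,CPUManager,CPUNodeDiscovery,Sidecar",
    "selinuxLauncherType": "virt_launcher.process",
}

def reconcile(observed: dict, paused: bool = False) -> dict:
    """Revert any drift back to the known-good state unless paused."""
    if paused:
        # Operator paused / scaled down to 0: admin edits survive.
        return observed
    # Full reconciliation: every key is forced back to the known state,
    # so a missing or edited feature gate cannot persist.
    return dict(KNOWN_GOOD)

# An admin edit is reverted on the next reconcile pass...
drifted = {**KNOWN_GOOD, "feature-gates": "DataVolumes"}
assert reconcile(drifted) == KNOWN_GOOD
# ...unless the operator is paused first.
assert reconcile(drifted, paused=True) == drifted
```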
And: if a user wants to do a customization, then HCO needs to be paused/scaled down to 0.

The key message is: so far - across all of HCO and its operands - our reconciliation loops were not covering all aspects (i.e. also not the placement fields of DS), but we need to improve all of CNV in order to make sure that all aspects are always reconciled. Customizations are only possible with paused operators.

All the relevant values have been migrated from the KubeVirt ConfigMap to the kubevirt CR, which is completely managed by HCO. Moving this to MODIFIED.

@ibesso this is part of the strict reconciliation tests. When you verify it there, you can verify this bug.

Verified on 4.8.0
-----------------
registry-proxy.engineering.redhat.com/rh-osbs/iib:80270
HCO: [v4.8.0-388]

+ HCO CR had the defaults:

$ oc get hco -n openshift-cnv kubevirt-hyperconverged -o json | jq '.spec.liveMigrationConfig'
{
  "bandwidthPerMigration": "64Mi",
  "completionTimeoutPerGiB": 800,
  "parallelMigrationsPerCluster": 5,
  "parallelOutboundMigrationsPerNode": 2,
  "progressTimeout": 150
}

+ KubeVirt CR had the defaults:

$ oc get kubevirt -n openshift-cnv kubevirt-kubevirt-hyperconverged -o json | jq '.spec.configuration.migrations'
{
  "bandwidthPerMigration": "64Mi",
  "completionTimeoutPerGiB": 800,
  "parallelMigrationsPerCluster": 5,
  "parallelOutboundMigrationsPerNode": 2,
  "progressTimeout": 150
}

+ Modification of the HCO CR propagated to the KubeVirt CR (modified the first two fields):

$ oc get kubevirt -n openshift-cnv kubevirt-kubevirt-hyperconverged -o json | jq '.spec.configuration.migrations'
{
  "bandwidthPerMigration": "32Mi",
  "completionTimeoutPerGiB": 700,
  "parallelMigrationsPerCluster": 5,
  "parallelOutboundMigrationsPerNode": 2,
  "progressTimeout": 150
}

Moving to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2920
Created attachment 1703169 [details]
CREATE-VM1

Description of problem:
Creation of a VM fails with the error "LiveMigration feature gate is not enabled" for field "spec.template.spec.evictionStrategy".

Version-Release number of selected component (if applicable):
2.4.0

How reproducible:
Very

Steps to Reproduce:
1. Create a VM using the wizard, select Fedora 31 or above, and use a Fedora 32 QCOW2 URL.
2. Accept defaults for storage and networking.
3. Create fails immediately on a cluster that is not enabled for live migration.

Actual results:
Error "LiveMigration feature gate is not enabled" for field "spec.template.spec.evictionStrategy".

Expected results:
1. The VM create wizard should not default to the LiveMigration eviction strategy if it is not available on the cluster.
2. As a user I had no ability to edit this via the UI (not even a hatch to hand edit the XML).

Additional info:
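For context, the field being rejected lives in the VM template spec. A minimal excerpt of the kind of manifest the wizard generates (name and surrounding fields are hypothetical/abbreviated; only the relevant field is shown):

```yaml
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  name: fedora32-example   # hypothetical name
spec:
  template:
    spec:
      # This is the field rejected with "LiveMigration feature gate is
      # not enabled" when the gate is missing from kubevirt-config:
      evictionStrategy: LiveMigrate
      domain:
        devices: {}
```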