Bug 1862701 - OpenShift Virtualization defaults to LiveMigration eviction strategy even when not available (VM creation fails)
Summary: OpenShift Virtualization defaults to LiveMigration eviction strategy even whe...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Installation
Version: 2.4.0
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: 4.8.0
Assignee: Simone Tiraboschi
QA Contact: ibesso
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-08-01 14:33 UTC by Stephen Gordon
Modified: 2021-07-27 14:21 UTC (History)
6 users (show)

Fixed In Version: hco-bundle-registry:v4.8.0-312
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 14:20:49 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
CREATE-VM1 (77.88 KB, image/png)
2020-08-01 14:33 UTC, Stephen Gordon
no flags Details
CREATE-VM2 (65.38 KB, image/png)
2020-08-01 14:34 UTC, Stephen Gordon
no flags Details
CREATE-VM3 (63.94 KB, image/png)
2020-08-01 14:34 UTC, Stephen Gordon
no flags Details
CREATEVM-4 (113.11 KB, image/png)
2020-08-01 14:34 UTC, Stephen Gordon
no flags Details
EVICTION-FAILURE (74.44 KB, image/png)
2020-08-01 14:35 UTC, Stephen Gordon
no flags Details
kubevirt-config CM (119.65 KB, image/png)
2020-08-07 09:36 UTC, Simone Tiraboschi
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github kubevirt hyperconverged-cluster-operator pull 1173 0 None closed Remove KubeVirt ConfigMap 2021-05-04 15:40:46 UTC
Github kubevirt hyperconverged-cluster-operator pull 753 0 None closed Continously reconcile all the resources created by HCO 2021-01-27 13:39:02 UTC
Red Hat Product Errata RHSA-2021:2920 0 None None None 2021-07-27 14:21:52 UTC

Internal Links: 1867122

Description Stephen Gordon 2020-08-01 14:33:29 UTC
Created attachment 1703169 [details]
CREATE-VM1

Description of problem:

Creation of VM fails with Error "LiveMigration feature gate is not enabled" for field "spec.template.spec.evictionStrategy".

Version-Release number of selected component (if applicable):

2.4.0

How reproducible:

Very

Steps to Reproduce:
1. Create VM using wizard, select Fedora 31 or above and use Fedora 32 QCOW2 URL.
2. Accept defaults for storage networking.
3. Create fails immediately on a cluster that is not enabled for live migration.

Actual results:

Error "LiveMigration feature gate is not enabled" for field "spec.template.spec.evictionStrategy".

Expected results:

1. VM create wizard should not default to LiveMigration eviction strategy if not available on the cluster.
2. As a user I had no ability to edit this via the UI (not even a hatch to hand edit the XML).

Additional info:

Comment 1 Stephen Gordon 2020-08-01 14:34:09 UTC
Created attachment 1703170 [details]
CREATE-VM2

Comment 2 Stephen Gordon 2020-08-01 14:34:25 UTC
Created attachment 1703171 [details]
CREATE-VM3

Comment 3 Stephen Gordon 2020-08-01 14:34:41 UTC
Created attachment 1703172 [details]
CREATEVM-4

Comment 4 Stephen Gordon 2020-08-01 14:35:01 UTC
Created attachment 1703173 [details]
EVICTION-FAILURE

Comment 5 Stephen Gordon 2020-08-01 14:43:04 UTC
Just to be clear when I say "As a user I had no ability to edit this via the UI (not even a hatch to hand edit the XML^WYAML)." - I know I can copy the YAML from the error back all the way out of the wizard and come back in as create from YAML, but this seems suboptimal.

Comment 6 Adam Litke 2020-08-04 17:14:42 UTC
None of: the user, the UI, or the template can realistically know if live migration is supportable for a given VM.  Only kubevirt knows this (at the time a migration is attempted).  Perhaps we should add two new evictionStrategies to cover the desired fallback behavior when a LiveMigration is not possible or fails:

1. LiveMigrateOrShutdown: migrate the VM if possible, otherwise shut it down.
- LiveMigrateOrBlock: migrate the VM if possible, otherwise keep the VM running.

Option 1 would be useful if you can tolerate some downtime in order to facilitate cluster maintenance or upgrades.
Option 2 should be used when manual intervention is required or to prevent unplanned downtime.

Comment 7 Roman Mohr 2020-08-05 12:10:06 UTC
(In reply to Adam Litke from comment #6)
> None of: the user, the UI, or the template can realistically know if live
> migration is supportable for a given VM.  Only kubevirt knows this (at the
> time a migration is attempted).  Perhaps we should add two new
> evictionStrategies to cover the desired fallback behavior when a
> LiveMigration is not possible or fails:
> 
> 1. LiveMigrateOrShutdown: migrate the VM if possible, otherwise shut it down.
> - LiveMigrateOrBlock: migrate the VM if possible, otherwise keep the VM
> running.
> 
> Option 1 would be useful if you can tolerate some downtime in order to
> facilitate cluster maintenance or upgrades.
> Option 2 should be used when manual intervention is required or to prevent
> unplanned downtime.

We can rethink if we now want to enable migrations in general upstream. But the feature-gate is there to indicate that something is not quite good enough to be used yet.
For OpenShift Virtualization the decision was to enable migrations. The feature-gate should just be enabled. There is not UI issue from that perspective. The only question is why the feature gate is not enabled although it should be.

Comment 8 Roman Mohr 2020-08-05 12:13:54 UTC
(In reply to Adam Litke from comment #6)
> None of: the user, the UI, or the template can realistically know if live
> migration is supportable for a given VM.  Only kubevirt knows this (at the
> time a migration is attempted).  Perhaps we should add two new
> evictionStrategies to cover the desired fallback behavior when a
> LiveMigration is not possible or fails:
> 
> 1. LiveMigrateOrShutdown: migrate the VM if possible, otherwise shut it down.
> - LiveMigrateOrBlock: migrate the VM if possible, otherwise keep the VM
> running.
> 
> Option 1 would be useful if you can tolerate some downtime in order to
> facilitate cluster maintenance or upgrades.
> Option 2 should be used when manual intervention is required or to prevent
> unplanned downtime.

Yes that could be a nice addition. But first let's ensure that the feature gate is enabled again, as it should.

Comment 9 Roman Mohr 2020-08-05 12:15:02 UTC
Simone, this may be in the HCO area. Could you have a look?

Comment 10 Simone Tiraboschi 2020-08-06 16:47:22 UTC
The relevant piece of code is here:
https://github.com/kubevirt/hyperconverged-cluster-operator/blob/master/pkg/controller/hyperconverged/hyperconverged_controller.go#L779-L811

Only the values for virtconfig.SmbiosConfigKey, virtconfig.MachineTypeKey and virtconfig.UseEmulationKey take their values from env variables defined at CSV level.
Everything else (including virtconfig.FeatureGatesKey) uses a value that is exactly the same between upstream and downstream.

Only virtconfig.SmbiosConfigKey, virtconfig.MachineTypeKey, virtconfig.SELinuxLauncherTypeKey and
virtconfig.UseEmulationKey are going to be manipulated by HCO and only during HCO upgrades.

This means that virtconfig.FeatureGatesKey is not currently going to be modified by HCO at all.

This because a ConfigMap is generally considered as a store for configuration data decoupling them from application code; the admin tend to think that he can eventually change it and his modifications are going to survive there.
Having an operator that continuously reconcile a ConfigMap from data embedded in the code of the operator itself is a kind of anti-pattern.

Comment 11 Roman Mohr 2020-08-06 18:35:21 UTC
(In reply to Simone Tiraboschi from comment #10)
> The relevant piece of code is here:
> https://github.com/kubevirt/hyperconverged-cluster-operator/blob/master/pkg/
> controller/hyperconverged/hyperconverged_controller.go#L779-L811
> 
> Only the values for virtconfig.SmbiosConfigKey, virtconfig.MachineTypeKey
> and virtconfig.UseEmulationKey take their values from env variables defined
> at CSV level.
> Everything else (including virtconfig.FeatureGatesKey) uses a value that is
> exactly the same between upstream and downstream.
> 
> Only virtconfig.SmbiosConfigKey, virtconfig.MachineTypeKey,
> virtconfig.SELinuxLauncherTypeKey and
> virtconfig.UseEmulationKey are going to be manipulated by HCO and only
> during HCO upgrades.
> 
> This means that virtconfig.FeatureGatesKey is not currently going to be
> modified by HCO at all.
> 
> This because a ConfigMap is generally considered as a store for
> configuration data decoupling them from application code; the admin tend to
> think that he can eventually change it and his modifications are going to
> survive there.
> Having an operator that continuously reconcile a ConfigMap from data
> embedded in the code of the operator itself is a kind of anti-pattern.

Not sure I understand your reply. Is there another issue where we look into how Steve ended up with a cluster without migration enabled? If not, my request would have been to understand how this could happen. I can't follow your response.

Comment 12 Simone Tiraboschi 2020-08-07 09:24:07 UTC
I was just trying to say that LiveMigration feature-gate is always set by HCO upstream and downstream on kubevirt-config config map and, currently, we have no way to handle it differently from upstream to downstream or manipulating it during upgrades.

Comment 13 Simone Tiraboschi 2020-08-07 09:36:18 UTC
Created attachment 1710774 [details]
kubevirt-config CM

Comment 14 Simone Tiraboschi 2020-08-07 09:37:41 UTC
I just tried reproducing it and, as expected, I can confirm that HCO configures LiveMigration feature-gate in kubevirt-config CM.

Comment 15 Simone Tiraboschi 2020-08-07 11:38:23 UTC
I tried reproducing the VM exactly as described but the VM correctly (I'm going to open another UX bug about a lot of noise events during the startup process) started for me.

Stephen, do you still have that environment?
Was it deployed from scratch starting with 2.4.0 or is it an upgrade from the past?
Can you please share the output of:
 oc get ConfigMap -n openshift-cnv kubevirt-config -o yaml

Comment 16 Simone Tiraboschi 2020-08-07 11:59:40 UTC
(In reply to Simone Tiraboschi from comment #15)
> I tried reproducing the VM exactly as described but the VM correctly (I'm
> going to open another UX bug about a lot of noise events during the startup
> process) started for me.

see also: https://bugzilla.redhat.com/1867122

Comment 17 Stephen Gordon 2020-08-12 13:26:44 UTC
(In reply to Simone Tiraboschi from comment #15)
> I tried reproducing the VM exactly as described but the VM correctly (I'm
> going to open another UX bug about a lot of noise events during the startup
> process) started for me.
> 
> Stephen, do you still have that environment?
> Was it deployed from scratch starting with 2.4.0 or is it an upgrade from
> the past?
> Can you please share the output of:
>  oc get ConfigMap -n openshift-cnv kubevirt-config -o yaml

I don't, but I can do it again. It seemed pretty consistently reproducible. This was a fresh install.

Comment 18 Stephen Gordon 2020-08-14 16:35:56 UTC
kind: ConfigMap
apiVersion: v1
metadata:
  selfLink: /api/v1/namespaces/openshift-cnv/configmaps/kubevirt-config
  resourceVersion: '31254'
  name: kubevirt-config
  uid: 1b1f3871-ca08-4952-b5b2-e4d63c37e979
  creationTimestamp: '2020-08-14T16:21:16Z'
  managedFields:
    - manager: hyperconverged-cluster-operator
      operation: Update
      apiVersion: v1
      time: '2020-08-14T16:21:16Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:data':
          .: {}
          'f:feature-gates': {}
          'f:machine-type': {}
          'f:migrations': {}
          'f:selinuxLauncherType': {}
          'f:smbios': {}
        'f:metadata':
          'f:labels':
            .: {}
            'f:app': {}
          'f:ownerReferences':
            .: {}
            'k:{"uid":"dee1214a-7ee0-4386-8815-10ae7b091a51"}':
              .: {}
              'f:apiVersion': {}
              'f:blockOwnerDeletion': {}
              'f:controller': {}
              'f:kind': {}
              'f:name': {}
              'f:uid': {}
  namespace: openshift-cnv
  ownerReferences:
    - apiVersion: hco.kubevirt.io/v1alpha1
      kind: HyperConverged
      name: kubevirt-hyperconverged
      uid: dee1214a-7ee0-4386-8815-10ae7b091a51
      controller: true
      blockOwnerDeletion: true
  labels:
    app: kubevirt-hyperconverged
data:
  feature-gates: 'DataVolumes,SRIOV,LiveMigration,CPUManager,CPUNodeDiscovery,Sidecar'
  machine-type: pc-q35-rhel8.2.0
  migrations: '{"nodeDrainTaintKey" : "node.kubernetes.io/unschedulable"}'
  selinuxLauncherType: virt_launcher.process
  smbios: |-
    Family: Red Hat
    Product: Container-native virtualization
    Manufacturer: Red Hat
    Sku: 2.4.0
    Version: 2.4.0

Comment 19 Stephen Gordon 2020-08-14 16:37:57 UTC
Notably though, this time my virtual machine actually ran. I am wondering if the original issue was that I followed one of our instruction sets where they was a config entry to create before deploying (e.g. useEmulation). In that situation I would have already had a (pretty empty) kubevirt-config but I am guessing the installation doesn't attempt to reconcile this at all and just assumes if one is there it covers everything needed.

Comment 20 Fabian Deutsch 2020-08-17 11:58:50 UTC
Steve, yes - that sounds like to root cause. Usually the "useEmulation" snippets do not contain the feature gate to also enable live migration.

We learned that HCO is configuring CNV to enable live migration, thus this is good.
Because Live Migration is enabled by default in CNV, it is ok for the templates to rely on this and enable features (evictionStrategy) to assume it.

I see two ways:
1. Have HCO also reconciling kubevirt-config - WHich is ok, because HCO is an operator, and has the operational knowledge of kubevirt-config should look.
2. Introduce an alert to fire when this bug - or inconsistency - is discovered

For (1) if a user wants to do custom configurations with the kubevirt-config, then the operator needs to be paused ( a common pattern).

Moving this to HCO in order to implement (1)

Comment 21 Simone Tiraboschi 2020-08-17 15:00:18 UTC
(In reply to Fabian Deutsch from comment #20)
> Steve, yes - that sounds like to root cause. Usually the "useEmulation"
> snippets do not contain the feature gate to also enable live migration.
> 
> We learned that HCO is configuring CNV to enable live migration, thus this
> is good.
> Because Live Migration is enabled by default in CNV, it is ok for the
> templates to rely on this and enable features (evictionStrategy) to assume
> it.
> 
> I see two ways:
> 1. Have HCO also reconciling kubevirt-config - WHich is ok, because HCO is
> an operator, and has the operational knowledge of kubevirt-config should
> look.

Currently HCO is already reconciling it but only virtconfig.SmbiosConfigKey, virtconfig.MachineTypeKey, virtconfig.SELinuxLauncherTypeKey and
virtconfig.UseEmulationKey are going to be manipulated by HCO and only during HCO upgrades.
This means that virtconfig.FeatureGatesKey is not currently going to be modified by HCO at all.

Are we talking about adding virtconfig.FeatureGatesKey to that list or about continuously reconciling the whole ConfigMap?

Comment 22 Fabian Deutsch 2020-08-17 15:11:31 UTC
HCO is managing kubevirt, thus it must be managing the complete kubevirt-config.


If somebody is modifying kubevirt-config in whatever regards, then HCO needs to reconcile it back into a known good state.
And: If a user wants to do a customization, then HCO needs to be paused/scaled down to 0.

The key message is: So far - accross all of HCO and operands - our reconcileation loops were not covering all aspects (i.e. also not the placement fields of DS), but we need to improve all of CNV in order to make sure that all aspects are always reconciled. Customizations are only possible with paused operators.

Comment 25 Simone Tiraboschi 2021-05-04 15:40:46 UTC
All the relevant values has been migrated from KubeVirt ConfigMap to kubevirt CR which is completely managed by HCO.
Moving this to modified.

Comment 26 Inbar Rose 2021-05-31 06:00:09 UTC
@ibesso  this is part of the strict reconciliation tests. when you verify it there, you can verify this bug.

Comment 27 ibesso 2021-06-10 17:33:34 UTC
Verified on 4.8.0
-----------------
registry-proxy.engineering.redhat.com/rh-osbs/iib:80270
HCO:[v4.8.0-388]

+ HCO CR had the defaults:
$ oc get hco -n openshift-cnv  kubevirt-hyperconverged -o json | jq '.spec.liveMigrationConfig'
{
  "bandwidthPerMigration": "64Mi",
  "completionTimeoutPerGiB": 800,
  "parallelMigrationsPerCluster": 5,
  "parallelOutboundMigrationsPerNode": 2,
  "progressTimeout": 150
}

+ KubeVirt CR had the defaults:
$ oc get kubevirt -n openshift-cnv  kubevirt-kubevirt-hyperconverged -o json | jq '.spec.configuration.migrations'
{
  "bandwidthPerMigration": "64Mi",
  "completionTimeoutPerGiB": 800,
  "parallelMigrationsPerCluster": 5,
  "parallelOutboundMigrationsPerNode": 2,
  "progressTimeout": 150
}


Modification of HCO CR propagated to KubeVirt CR (modified the first two fields):
$ oc get kubevirt -n openshift-cnv  kubevirt-kubevirt-hyperconverged -o json | jq '.spec.configuration.migrations'
{
  "bandwidthPerMigration": "32Mi",
  "completionTimeoutPerGiB": 700,
  "parallelMigrationsPerCluster": 5,
  "parallelOutboundMigrationsPerNode": 2,
  "progressTimeout": 150
}

Moving to VERIFIED.

Comment 30 errata-xmlrpc 2021-07-27 14:20:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2920


Note You need to log in before you can comment on or make changes to this bug.