Bug 2059338 - A fully upgraded 4.10 cluster defaults to HW-13 hardware version even if HW-15 is default (and supported)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: dmoiseev
QA Contact: Huali Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-28 19:43 UTC by Hemant Kumar
Modified: 2022-08-10 10:51 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The machine controller now checks the hardware version of the VM template during new machine creation (template clone). The machine goes into the 'Failed' state if the template's hardware version is lower than 15, which is the minimum supported HW version for OCP 4.11+.
Clone Of:
Environment:
Last Closed: 2022-08-10 10:51:27 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-api-operator pull 1016 | 0 | None | open | Bug 2059338: Add template HW version detection during clone | 2022-05-26 10:40:25 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 10:51:50 UTC

Description Hemant Kumar 2022-02-28 19:43:58 UTC
If we upgrade a cluster from 4.8 to 4.9 and then to 4.10, manually upgrade the hardware version of all VMs to HW-15, start using the CSI driver, and then add a new VM via Machine API, the new VM defaults to HW-13, which results in the cluster being degraded.

Expected Result:
If the underlying vSphere platform can support it, new VMs should be created with HW-15 and the cluster should not be degraded.

Comment 1 dmoiseev 2022-03-09 13:09:25 UTC
I think it's worth checking the HW version of the template that machine-api uses for the clone during new machine creation.

Comment 2 Joel Speed 2022-03-14 10:18:32 UTC
To add some more colour to this, there was some discussion on slack about this.

When we create a new VM in OCP, we typically create it based on a template.
The HW version is set by the template and is not something Machine API has control over.

The new VM will be created with whichever HW version the template was created with.

Now, templates can't be upgraded like VMs can; they have to be converted to a VM, then upgraded, then converted back.

The suggestion, I think, is that we fetch the template (this is done already) and then check the HW version of the template (I believe we can fetch this via the properties on the VM object). If the HW version is too low, we fail the machine and ask for an upgraded template.

The issue here is that users now need to intervene to upgrade the template, which is not trivial. The user must convert the template to a VM, then upgrade it, then convert back to a template.
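
The check suggested above can be sketched in a few lines. This is only an illustration in Python, not the code from the linked machine-api-operator pull request: it assumes the template's hardware version is available as a vSphere-style string such as "vmx-13", and the names `parse_hw_version`, `check_template_hw_version` and `TemplateTooOldError` are hypothetical. The minimum of 15 and the wording of the error message are taken from the Doc Text and the verification output in this bug.

```python
import re

# Minimum hardware version named in this bug's Doc Text for OCP 4.11+.
MIN_HW_VERSION = 15
# KCS article referenced in this bug for updating the template.
KCS_URL = "https://access.redhat.com/articles/6090681"


class TemplateTooOldError(Exception):
    """Hypothetical error type: the template's hardware version is too low."""


def parse_hw_version(version: str) -> int:
    """Turn a string such as 'vmx-13' into the integer 13."""
    match = re.fullmatch(r"vmx-(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognised hardware version string: {version!r}")
    return int(match.group(1))


def check_template_hw_version(version: str, minimum: int = MIN_HW_VERSION) -> int:
    """Return the detected version, or raise if it is below the minimum."""
    detected = parse_hw_version(version)
    if detected < minimum:
        # Fail before cloning, so no VM is created from an old template.
        raise TemplateTooOldError(
            f"Hardware lower than {minimum} is not supported, clone stopped. "
            f"Detected machine template version is {detected}. "
            f"Please update machine template: {KCS_URL}"
        )
    return detected
```

Failing before the clone keeps the cluster from receiving a node with an unsupported hardware version, at the cost of requiring the user to upgrade the template first.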

Comment 3 Hemant Kumar 2022-03-14 15:05:34 UTC
Jan on Slack brought up another important point: how long is MAO going to keep using a template that was uploaded for 4.5-4.6 in 4.11 or 4.12 if there is no documented mechanism for updating the template? What if the older template has security vulnerabilities? I know the OS will eventually get updated, but it seems problematic to boot a VM from a 4.5 template on 4.11 (correct me if I am wrong).

Comment 4 Hemant Kumar 2022-03-15 14:35:12 UTC
Do we know why we can't do what we did in installer - https://github.com/openshift/installer/pull/5163 ?

I hear the argument that MAO uses template as source of truth for HW version but if template is older then surely source of truth is incorrect. Shouldn't version of OCP be source of truth?

Comment 5 Joel Speed 2022-03-16 11:51:40 UTC
> Do we know why we can't do what we did in installer - https://github.com/openshift/installer/pull/5163 ?

We have no influence over the input image or template here. The customer configures a template as part of the Machine providerSpec. It is not up to us to modify that and attempt to upgrade it to a newer version.

> I hear the argument that MAO uses template as source of truth for HW version but if template is older then surely source of truth is incorrect. Shouldn't version of OCP be source of truth?

It's not that it's a source of truth in that way, it's that we have no influence on it.

If we clone a template, the new VM is of the HW version of the template. We can't change that.

What we could do, I guess, is attempt to upgrade the VM to a new version after it is created, though that seems like patching over the issue rather than solving it. The cluster would be temporarily degraded while the new VM was being upgraded.

IMO, the only sensible thing to do here is to report that the VM template is too old and needs to be upgraded. It will stop users from creating new VMs until their template is upgraded though which is not ideal, but if the cluster is degraded if they create a new VM, is it any worse?

Why exactly does the cluster go degraded if you have an older HW version?

Comment 6 Michael McCune 2022-04-22 13:17:27 UTC
curious if we have an update here? @jspeed or @dmoiseev

Comment 7 Joseph Callen 2022-04-22 13:30:43 UTC
For 4.11 the coreos folks switched to hardware 15 by default for the OVA. Not sure that exactly helps this specific case.

> IMO, the only sensible thing to do here is to report that the VM template is too old and needs to be upgraded.

^ I think this is the right answer. If the template is still hardware 13 on vSphere 6.7u2 or greater, then it's been around a while.
Since 4.9, the installer determines the ESXi version of all the nodes in the cluster and sets the hardware version accordingly.
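
As a rough illustration of that kind of decision (not the installer's actual logic), the sketch below picks hardware version 15 only when every ESXi host is at 6.7u2 or later, and otherwise falls back to 13. The threshold is the one mentioned in this comment. The `(major, minor, update)` tuple representation, the fallback value and the function name `choose_hw_version` are assumptions made for the example.

```python
from typing import Iterable, Tuple

# Hypothetical representation of an ESXi version: (major, minor, update).
EsxiVersion = Tuple[int, int, int]

# Threshold taken from the comment above: 6.7u2 or greater.
HW15_MIN_ESXI: EsxiVersion = (6, 7, 2)


def choose_hw_version(hosts: Iterable[EsxiVersion]) -> int:
    """Pick 15 if all hosts are at least 6.7u2, else fall back to 13."""
    host_list = list(hosts)
    if not host_list:
        raise ValueError("no ESXi hosts given")
    # Tuples compare element by element, so min() finds the oldest host.
    oldest = min(host_list)
    return 15 if oldest >= HW15_MIN_ESXI else 13
```

The oldest host decides the result because a VM has to be able to run on any host in the cluster.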

Comment 8 Joel Speed 2022-04-27 10:15:43 UTC
I had left a couple of questions in my previous comment, if anyone has any thoughts on those, opinions would be appreciated.

Comment 9 Hemant Kumar 2022-04-27 13:46:50 UTC
> If we clone a template, the new VM is of the HW version of the template. We can't change that.

After cloning the template to a VM and *before* starting the VM, don't we have an opportunity to upgrade the hardware version? I can see the point that if the template is pretty old, it may not be safe to update the hardware version, but I am not sure if that was your point.

Comment 10 Hemant Kumar 2022-04-27 13:51:30 UTC
Oops, sorry - I also don't know how MAO scales the VMs. If all it does is ask vCenter to create more copies and start them, without individually cloning the template to a VM and starting it, then yeah, we have a problem. It is a bit like maintaining an ASG in AWS vs an operator creating and destroying nodes as the machineset count is changed.

> IMO, the only sensible thing to do here is to report that the VM template is too old and needs to be upgraded. It will stop users from creating new VMs until their template is upgraded though which is not ideal, but if the cluster is degraded if they create a new VM, is it any worse?


I like this idea. Is there any procedure in place that documents how to update the VM template? We are going to need alerts at the very minimum if we can detect that the template the customer is using is too old.

Comment 11 Joseph Callen 2022-04-27 13:58:28 UTC
There is a KCS article 
https://access.redhat.com/articles/6090681

Comment 12 Hemant Kumar 2022-04-27 15:27:39 UTC
Okay, let's move that KCS article, maybe into the official documentation as part of the upgrade procedure, and add alerts for detecting older templates if we can't automatically set the HW version for created VMs. That solution works for me.

Comment 13 Joel Speed 2022-05-09 11:45:26 UTC
I was looking into detecting the older templates. As far as I can tell, the only way to do that is to clone the template into a VM. Do any of the vSphere experts know of a way we can add this check to the problem detector without having to clone to a VM first?

Comment 14 dmoiseev 2022-05-13 13:59:10 UTC
@jspeed I did look into the vSphere problem detector. AFAIK it knows nothing about machines and machinesets. We would need to look into machinesets there first to determine where the templates are. I don't know if this is what we want.

Comment 15 Hemant Kumar 2022-05-13 14:12:02 UTC
It is going to be tricky to perform such an expensive check all the time in vsphere-problem-detector. I wonder if this is something we should *only* perform in pre-flight upgrade checks. Currently, vsphere-problem-detector does not have a mechanism to detect an upgrade scenario.

Comment 18 Huali Liu 2022-06-06 11:20:03 UTC
Verified on 4.11.0-0.nightly-2022-06-04-014713

1. Upgrade a cluster from 4.8 to 4.9 to 4.10 and then 4.11

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.42    True        False         59m     Cluster version is 4.8.42

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.37    True        False         22m     Cluster version is 4.9.37

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.17   True        False         34m     Cluster version is 4.10.17

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-04-014713   True        False         8m40s   Cluster version is 4.11.0-0.nightly-2022-06-04-014713

2. Create a machineset and check that machine creation fails with "Hardware lower than 15 is not supported"

liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms2.yaml 
machineset.machine.openshift.io/huliu-vsphere4-8sdlm-test created
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                PHASE     TYPE   REGION   ZONE   AGE
huliu-vsphere4-8sdlm-master-0       Running                          8h
huliu-vsphere4-8sdlm-master-1       Running                          8h
huliu-vsphere4-8sdlm-master-2       Running                          8h
huliu-vsphere4-8sdlm-test-pmwfd     Failed                           6s
huliu-vsphere4-8sdlm-worker-6nhx5   Running                          8h
huliu-vsphere4-8sdlm-worker-njk88   Running                          8h
liuhuali@Lius-MacBook-Pro huali-test % oc get machine huliu-vsphere4-8sdlm-test-pmwfd -o yaml
...
  errorMessage: 'Hardware lower than 15 is not supported, clone stopped. Detected
    machine template version is 13. Please update machine template: https://access.redhat.com/articles/6090681'
  errorReason: InvalidConfiguration
  lastUpdated: "2022-06-06T11:00:30Z"
  phase: Failed
  providerStatus:
    conditions:
    - lastTransitionTime: "2022-06-06T11:00:30Z"
      message: 'Hardware lower than 15 is not supported, clone stopped. Detected machine
        template version is 13. Please update machine template: https://access.redhat.com/articles/6090681'
      reason: MachineCreationSucceeded
      status: "False"
      type: MachineCreation
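
To spot this failure across many machines, the same status fields can be read from JSON output rather than by eye. The sketch below is a hypothetical helper, not part of any OpenShift tooling: it assumes the input is the text produced by something like `oc get machine -n openshift-machine-api -o json`, and it relies only on the `status.phase` and `status.errorMessage` fields shown in the YAML above. The function name `failed_for_old_template` is invented for this example.

```python
import json
from typing import List, Tuple

# Prefix of the error message shown in the verification output above.
OLD_TEMPLATE_MARKER = "Hardware lower than"


def failed_for_old_template(machine_list_json: str) -> List[Tuple[str, str]]:
    """Return (name, errorMessage) for machines that failed the HW check."""
    data = json.loads(machine_list_json)
    hits = []
    for item in data.get("items", []):
        status = item.get("status", {})
        message = status.get("errorMessage", "")
        # Only machines in the Failed phase with the HW version message count.
        if status.get("phase") == "Failed" and OLD_TEMPLATE_MARKER in message:
            name = item.get("metadata", {}).get("name", "")
            hits.append((name, message))
    return hits
```

Matching on the message text is brittle if the wording changes between releases, so treat this as a quick triage aid only.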

Comment 20 errata-xmlrpc 2022-08-10 10:51:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

