If we upgrade a cluster from 4.8 to 4.9 and then to 4.10, manually upgrade the hardware version of all VMs to HW-15, start using the CSI driver, and then add a new VM via Machine API, the new VM defaults to HW-13, which results in the cluster being degraded. Expected result: if the underlying vSphere platform can support it, new VMs should be created with HW-15 and the cluster should not be degraded.
I think it's worth checking the HW version of the template that machine-api clones when creating a new machine.
To add some more colour to this, there was some discussion on Slack about it. When we create a new VM in OCP, we typically create it from a template. The HW version is set by the template and is not something Machine API has control over: the new VM is created with whichever HW version the template was created with. Templates can't be upgraded in place like VMs can. The suggestion, I think, is that we fetch the template (this is done already) and then check its HW version (I believe we can fetch this via the properties on the VM object). If the HW version is too low, we fail the machine and ask for an upgraded template. The issue here is that users now need to intervene to upgrade the template, which is not trivial: they must convert the template to a VM, upgrade it, and then convert it back to a template.
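A minimal sketch of the check suggested above, in Python rather than the Go that machine-api is written in. It assumes the template's hardware version is available as a string of the form "vmx-13", and it assumes a minimum of 15 because that is the version the VMs in this report were upgraded to. The names parse_hw_version, check_template_hw_version and TemplateTooOldError are hypothetical and are not part of machine-api.

```python
import re

# Assumed minimum; HW-15 is the version the VMs in this report were upgraded to.
MIN_HW_VERSION = 15


class TemplateTooOldError(Exception):
    """Raised when a template's hardware version is below the minimum."""


def parse_hw_version(version: str) -> int:
    """Turn a hardware version string such as 'vmx-13' into the integer 13."""
    match = re.fullmatch(r"vmx-(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognised hardware version: {version!r}")
    return int(match.group(1))


def check_template_hw_version(version: str, minimum: int = MIN_HW_VERSION) -> int:
    """Return the detected version, or raise if the template is too old to clone."""
    detected = parse_hw_version(version)
    if detected < minimum:
        raise TemplateTooOldError(
            f"template hardware version {detected} is lower than {minimum}; "
            "upgrade the template before cloning"
        )
    return detected
```

Failing before the clone starts means no VM is ever created from an outdated template, at the cost of blocking scale-up until the user has upgraded the template.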
Jan brought up another important point on Slack: how long is MAO going to keep using a template that was uploaded for 4.5-4.6 in 4.11 or 4.12 if there is no documented mechanism for updating the template? What if the older template has security vulnerabilities? I know the OS will eventually get updated, but it seems problematic to boot a VM from a 4.5 template on 4.11 (correct me if I am wrong).
Do we know why we can't do what we did in the installer - https://github.com/openshift/installer/pull/5163 ? I hear the argument that MAO uses the template as the source of truth for the HW version, but if the template is older then surely that source of truth is incorrect. Shouldn't the version of OCP be the source of truth?
> Do we know why we can't do what we did in installer - https://github.com/openshift/installer/pull/5163 ?

We have no influence over the input image or template here. The customer configures a template as part of the Machine providerSpec. It is not up to us to modify that and attempt to upgrade it to a newer version.

> I hear the argument that MAO uses template as source of truth for HW version but if template is older then surely source of truth is incorrect. Shouldn't version of OCP be source of truth?

It's not that it's a source of truth in that way, it's that we have no influence on it. If we clone a template, the new VM has the HW version of the template. We can't change that. What we could do, I guess, is attempt to upgrade the VM to a newer version after it is created, though that seems like patching over the issue rather than solving it. The cluster would be temporarily degraded while the new VM was being upgraded.

IMO, the only sensible thing to do here is to report that the VM template is too old and needs to be upgraded. That will stop users from creating new VMs until their template is upgraded, which is not ideal, but if the cluster goes degraded whenever they create a new VM, is it any worse?

Why exactly does the cluster go degraded if you have an older HW version?
Curious if we have an update here? @jspeed or @dmoiseev
For 4.11 the CoreOS folks switched to hardware version 15 by default for the OVA. Not sure that exactly helps this specific case.

> IMO, the only sensible thing to do here is to report that the VM template is too old and needs to be upgraded.

^ I think this is the right answer. If the template is still hardware version 13 on vSphere 6.7u2 or greater, then it's been around a while. Since 4.9, the installer determines the ESXi version of all the nodes in the cluster and sets the hardware version accordingly.
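To make the "sets the hardware version accordingly" step concrete, here is a simplified sketch in Python. It is not the installer's actual code. It assumes each ESXi host is described by a (major, minor, update) tuple, that hardware version 15 requires every host to be at 6.7u2 or newer (the threshold mentioned above), and that the fallback is hardware version 13. The name pick_hardware_version is hypothetical.

```python
from typing import Iterable, Tuple

# (major, minor, update), e.g. (6, 7, 2) for ESXi 6.7u2.
EsxiVersion = Tuple[int, int, int]

# Assumed threshold from the discussion: HW 15 needs ESXi 6.7u2 or newer.
HW15_MIN_ESXI: EsxiVersion = (6, 7, 2)


def pick_hardware_version(hosts: Iterable[EsxiVersion]) -> int:
    """Pick a hardware version that every given ESXi host can run."""
    versions = list(hosts)
    if not versions:
        raise ValueError("no ESXi hosts given")
    # Tuples compare element by element, so (6, 7, 3) >= (6, 7, 2) is True.
    if all(version >= HW15_MIN_ESXI for version in versions):
        return 15
    # Assumed fallback for clusters that still contain older hosts.
    return 13
```

A single older host is enough to hold the whole cluster at the lower version, since a VM may be scheduled onto any of them.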
I had left a couple of questions in my previous comment, if anyone has any thoughts on those, opinions would be appreciated.
> If we clone a template, the new VM is of the HW version of the template. We can't change that.

After cloning the template to a VM and *before* starting the VM, don't we have an opportunity to upgrade the hardware version? I can see the point that if the template is pretty old it may not be safe to update the hardware version, but I am not sure if that was your point.
Oops, sorry - I also don't know how MAO scales the VMs. If all it does is ask vCenter to create more copies and start them, without individually cloning the template to a VM and starting each one, then yeah, we have a problem. It is a bit like maintaining an ASG in AWS vs. an operator creating and destroying nodes as the machineset count changes.

> IMO, the only sensible thing to do here is to report that the VM template is too old and needs to be upgraded. It will stop users from creating new VMs until their template is upgraded though which is not ideal, but if the cluster is degraded if they create a new VM, is it any worse?

I like this idea. Is there any procedure in place that documents how to update the VM template? We are going to need alerts at the very minimum if we can detect that the template the customer is using is too old.
There is a KCS article https://access.redhat.com/articles/6090681
Okay, let's move that KCS article, maybe into the official documentation as part of the upgrade procedure, and add alerts for detecting older templates if we can't automatically set the HW version for created VMs. That solution works for me.
I was looking into detecting the older templates. As far as I can tell, the only way to do that is to clone the template into a VM. Do any of the vSphere experts know of a way we can add this check to the problem detector without having to clone to a VM first?
@jspeed I did look into vsphere-problem-detector. AFAIK it knows nothing about machines and machinesets. We would first need to look at the machinesets there to determine where the templates are. I don't know if this is what we want.
It is going to be tricky to perform such an expensive check all the time in vsphere-problem-detector. I wonder if this is something we should *only* perform in pre-flight upgrade checks. Currently vsphere-problem-detector has no mechanism to detect an upgrade scenario.
Verified on 4.11.0-0.nightly-2022-06-04-014713

1. Upgrade a cluster from 4.8 to 4.9 to 4.10 and then 4.11

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.42    True        False         59m     Cluster version is 4.8.42
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.37    True        False         22m     Cluster version is 4.9.37
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.17   True        False         34m     Cluster version is 4.10.17
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-04-014713   True        False         8m40s   Cluster version is 4.11.0-0.nightly-2022-06-04-014713

2. Create a machineset, check machine create failed with "Hardware lower than 15 is not supported"

liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms2.yaml
machineset.machine.openshift.io/huliu-vsphere4-8sdlm-test created
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                PHASE     TYPE   REGION   ZONE   AGE
huliu-vsphere4-8sdlm-master-0       Running                          8h
huliu-vsphere4-8sdlm-master-1       Running                          8h
huliu-vsphere4-8sdlm-master-2       Running                          8h
huliu-vsphere4-8sdlm-test-pmwfd     Failed                           6s
huliu-vsphere4-8sdlm-worker-6nhx5   Running                          8h
huliu-vsphere4-8sdlm-worker-njk88   Running                          8h
liuhuali@Lius-MacBook-Pro huali-test % oc get machine huliu-vsphere4-8sdlm-test-pmwfd -o yaml
...
  errorMessage: 'Hardware lower than 15 is not supported, clone stopped. Detected machine template version is 13. Please update machine template: https://access.redhat.com/articles/6090681'
  errorReason: InvalidConfiguration
  lastUpdated: "2022-06-06T11:00:30Z"
  phase: Failed
  providerStatus:
    conditions:
    - lastTransitionTime: "2022-06-06T11:00:30Z"
      message: 'Hardware lower than 15 is not supported, clone stopped. Detected machine template version is 13. Please update machine template: https://access.redhat.com/articles/6090681'
      reason: MachineCreationSucceeded
      status: "False"
      type: MachineCreation
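For anyone repeating this verification on a larger cluster, a small Python sketch for pulling the failed machines and their errors out of the JSON form of the listing (oc get machine -o json). It assumes the standard list shape with an "items" array and reads only the status fields shown in the output above (phase, errorReason, errorMessage). The function name failed_machines is hypothetical.

```python
from typing import Any, Dict, List, Tuple


def failed_machines(machine_list: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (name, errorReason, errorMessage) for every machine in phase Failed.

    machine_list is the parsed JSON of a machine listing; missing fields
    are reported as empty strings rather than raising.
    """
    failed = []
    for item in machine_list.get("items", []):
        status = item.get("status", {})
        if status.get("phase") != "Failed":
            continue
        name = item.get("metadata", {}).get("name", "")
        failed.append(
            (name, status.get("errorReason", ""), status.get("errorMessage", ""))
        )
    return failed
```

Usage, assuming the listing was saved to a file first: load it with json.load and pass the resulting dictionary to failed_machines.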
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069