Bug 1917838 - MachineSet scaling from 0 is not available or evaluated incorrectly for the new or changed instance types
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On: 1896321 1942966
Blocks: 1918307
Reported: 2021-01-19 13:55 UTC by Joel Speed
Modified: 2021-03-25 12:04 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The Cluster Autoscaler relies on a static list of instance types with associated details (CPU/Memory) to make decisions when scaling a MachineSet from zero. This list may be out of date.
Consequence: Newer instance types cannot be found in the list, so a MachineSet that uses them cannot be scaled from zero instances.
Fix: Update the list to include newer instance types.
Result: The newer instance types may now be scaled from zero.
Clone Of: 1896321
Environment:
Last Closed: 2021-02-24 15:54:42 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-api-provider-azure pull 192 | 0 | None | closed | Bug 1917838: Updating Azure VMSize list from autoscaler. | 2021-02-12 05:00:33 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:54:59 UTC

Description Joel Speed 2021-01-19 13:55:40 UTC
Using this bug to update the generated instance type list for Azure specifically. A longer-term solution is still required.

+++ This bug was initially created as a clone of Bug #1896321 +++

Description of problem:

MachineSet uses a set of annotations to provide a source of truth for autoscaling from 0.

https://github.com/openshift/cluster-api-provider-azure/blob/master/pkg/cloud/azure/actuators/machineset/controller.go#L39-L41

The data for the annotations is gathered from a static list, which becomes outdated over time. As a result, the values may be estimated incorrectly, or nothing is returned for instance types that are not listed.

The referenced PR regenerates these lists using code taken from the upstream autoscaler. The differences are visible in the updated "pkg/actuators/machineset/ec2_instance_types.go" file: https://github.com/openshift/cluster-api-provider-aws/pull/367/files
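
The failure mode described above can be illustrated with a minimal Go sketch. This is not the provider's actual code: the `instanceType` struct, the `instanceTypes` table, and `setScaleFromZeroAnnotations` are hypothetical names, and the annotation keys are assumed for illustration (see the linked controller.go for the real constants). The table deliberately omits Standard_D2as_v4 to stand in for an outdated list.

```go
package main

import (
	"fmt"
	"strconv"
)

// Annotation keys assumed for illustration only; the authoritative
// constants live in the provider's controller.go.
const (
	cpuKey    = "machine.openshift.io/vCPU"
	memoryKey = "machine.openshift.io/memoryMb"
)

// instanceType is a hypothetical entry in the static list.
type instanceType struct {
	VCPU     int64
	MemoryMb int64
}

// instanceTypes stands in for the generated static list. It is
// intentionally incomplete to mimic a list that has gone stale.
var instanceTypes = map[string]instanceType{
	"Standard_D2s_v3": {VCPU: 2, MemoryMb: 8192},
	"Standard_D8s_v3": {VCPU: 8, MemoryMb: 32768},
}

// setScaleFromZeroAnnotations writes CPU and memory annotations for a
// known VM size, or returns an error when the size is not in the list.
func setScaleFromZeroAnnotations(annotations map[string]string, vmSize string) error {
	info, ok := instanceTypes[vmSize]
	if !ok {
		return fmt.Errorf("unknown instance type %q: cannot set scale-from-zero annotations", vmSize)
	}
	annotations[cpuKey] = strconv.FormatInt(info.VCPU, 10)
	annotations[memoryKey] = strconv.FormatInt(info.MemoryMb, 10)
	return nil
}

func main() {
	for _, size := range []string{"Standard_D2s_v3", "Standard_D2as_v4"} {
		annotations := map[string]string{}
		if err := setScaleFromZeroAnnotations(annotations, size); err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(size, annotations)
	}
}
```

In this sketch a size missing from the table leaves the annotations empty, so the autoscaler would have no capacity information to scale that MachineSet from zero. Regenerating the table is the quick fix; it does not remove the need to keep the list current.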

Version-Release number of selected component (if applicable):
4.7

How reproducible:

Sometimes

Steps to Reproduce:
1. Create a MachineSet for AWS using p4d.24xlarge instance type
2. Check the annotations on the resource
3. Observe that no annotations are set and that error messages appear in the logs
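
The check in step 2 could be automated along the lines of the Go sketch below. It assumes the annotations have already been read into a map; `requiredKeys` and `missingAnnotations` are hypothetical names, and the keys are assumed for illustration rather than copied from the provider.

```go
package main

import "fmt"

// requiredKeys lists the annotations expected for scaling from zero.
// The key names are assumptions made for this example.
var requiredKeys = []string{
	"machine.openshift.io/vCPU",
	"machine.openshift.io/memoryMb",
}

// missingAnnotations returns the required keys that are absent or empty.
func missingAnnotations(annotations map[string]string) []string {
	var missing []string
	for _, key := range requiredKeys {
		if annotations[key] == "" {
			missing = append(missing, key)
		}
	}
	return missing
}

func main() {
	// An unlisted instance type is expected to leave the map empty.
	unlisted := map[string]string{}
	fmt.Println("missing:", missingAnnotations(unlisted))
}
```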

Actual results:

Scaling from 0 is not available for new instance types, such as p4d.24xlarge (AWS).

Expected results:

MachineSet annotation logic should return correct values for any available instance.

Additional info:

--- Additional comment from Joel Speed on 2020-11-13 12:05:00 UTC ---

This is low priority right now as it works for most instance types. We may be able to add a quick fix (including the new types) during the next sprint and look into a proper long-term fix at a later date.

--- Additional comment from Michael McCune on 2020-12-04 18:53:02 UTC ---

Adding the UpcomingSprint tag. The team should have good bandwidth to address this after feature freeze.

--- Additional comment from Joel Speed on 2021-01-05 17:21:31 UTC ---

We haven't worked out whether we are going to apply a quick fix or implement a permanent solution (this will depend on whether there is an API for instance types). Setting the target to --- so that we triage this for future releases.

Comment 2 sunzhaohua 2021-01-26 01:53:01 UTC
Verified
clusterversion: 4.7.0-0.nightly-2021-01-25-160335
Scaled up a machineset with instance type Standard_D2as_v4 from replicas=0; the machines scaled up successfully.
$ oc get po
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-default-6b79dffcd9-lcnzz    1/1     Running   0          31m
cluster-autoscaler-operator-76d45449c5-lxtzf   2/2     Running   0          5h35m
cluster-baremetal-operator-7d64d4868-sk8q2     1/1     Running   0          5h35m
machine-api-controllers-bd978fb9b-59qdt        7/7     Running   0          5h35m
machine-api-operator-7469f45d9b-mkghf          2/2     Running   0          5h35m
scale-up-cc5d548d6-282vq                       0/1     Pending   0          4s
scale-up-cc5d548d6-2b95w                       0/1     Pending   0          4s
scale-up-cc5d548d6-4gklr                       0/1     Pending   0          4s
scale-up-cc5d548d6-6c2g6                       0/1     Pending   0          4s
scale-up-cc5d548d6-6vcph                       1/1     Running   0          30m
scale-up-cc5d548d6-72bjj                       0/1     Pending   0          4s
scale-up-cc5d548d6-8b88k                       0/1     Pending   0          4s
scale-up-cc5d548d6-8cpdn                       0/1     Pending   0          4s
scale-up-cc5d548d6-8nr8p                       0/1     Pending   0          4s
scale-up-cc5d548d6-9fklb                       0/1     Pending   0          4s
scale-up-cc5d548d6-cn8z2                       0/1     Pending   0          4s
scale-up-cc5d548d6-dvqcx                       0/1     Pending   0          4s
scale-up-cc5d548d6-dwv4g                       0/1     Pending   0          4s
scale-up-cc5d548d6-ggk2x                       0/1     Pending   0          4s
scale-up-cc5d548d6-kd25m                       1/1     Running   0          30m
scale-up-cc5d548d6-l8pvc                       0/1     Pending   0          4s
scale-up-cc5d548d6-ll7zf                       0/1     Pending   0          4s
scale-up-cc5d548d6-qftvn                       0/1     Pending   0          4s
scale-up-cc5d548d6-wj5dt                       0/1     Pending   0          4s
scale-up-cc5d548d6-xk6kp                       0/1     Pending   0          4s

$ oc get machine
NAME                                           PHASE     TYPE               REGION       ZONE   AGE
zhsunazure251-w72th-master-0                   Running   Standard_D8s_v3    westeurope   3      6h42m
zhsunazure251-w72th-master-1                   Running   Standard_D8s_v3    westeurope   2      6h42m
zhsunazure251-w72th-master-2                   Running   Standard_D8s_v3    westeurope   1      6h42m
zhsunazure251-w72th-worker-westeurope1-vdtp5   Running   Standard_D2s_v3    westeurope   1      6h36m
zhsunazure251-w72th-worker-westeurope2-w98kv   Running   Standard_D2s_v3    westeurope   2      6h36m
zhsunazure251-w72th-worker-westeurope3-6mxqq   Running   Standard_D2as_v4   westeurope   3      25m
zhsunazure251-w72th-worker-westeurope3-gxhjn   Running   Standard_D2as_v4   westeurope   3      25m
zhsunazure251-w72th-worker-westeurope3-hk2qp   Running   Standard_D2as_v4   westeurope   3      25m

Comment 5 errata-xmlrpc 2021-02-24 15:54:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

