Bug 1942966

Summary: MachineSet scaling from 0 is not available or evaluated incorrectly for the new or changed instance types
Product: OpenShift Container Platform
Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component: Cloud Compute
Cloud Compute sub component: Other Providers
Assignee: Joel Speed <jspeed>
QA Contact: Milind Yadav <miyadav>
Status: CLOSED ERRATA
Severity: low
Priority: low
CC: mimccune, skrenger
Version: 4.7
Target Milestone: ---
Target Release: 4.7.z
Hardware: x86_64
OS: Linux
Last Closed: 2021-06-29 04:19:47 UTC
Bug Depends On: 1896321    
Bug Blocks: 1917838    

Description OpenShift BugZilla Robot 2021-03-25 12:04:31 UTC
+++ This bug was initially created as a clone of Bug #1896321 +++

Description of problem:

MachineSet uses a set of annotations to provide a source of truth for autoscaling from 0.

https://github.com/openshift/cluster-api-provider-azure/blob/master/pkg/cloud/azure/actuators/machineset/controller.go#L39-L41

The data for the annotations is gathered from a static list, which becomes outdated over time. As a result, the values are estimated incorrectly for changed instance types, and nothing is returned for instance types that are not listed.
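
To illustrate the failure mode, here is a minimal Go sketch of a static lookup of this kind. It is not the provider's actual code: the type and function names (instanceType, staticInstanceTypes, annotationsFor) are hypothetical, the table holds only two illustrative entries, and the annotation keys are assumed to be the ones the MachineSet controller uses.

```go
package main

import (
	"fmt"
	"strconv"
)

// Annotation keys assumed for this sketch; check the controller source
// for the authoritative names.
const (
	cpuKey    = "machine.openshift.io/vCPU"
	memoryKey = "machine.openshift.io/memoryMb"
	gpuKey    = "machine.openshift.io/GPU"
)

// instanceType is a hypothetical record of the capacity of one instance type.
type instanceType struct {
	VCPU     int64
	MemoryMb int64
	GPU      int64
}

// staticInstanceTypes stands in for the generated static list.
// It deliberately lacks p4d.24xlarge to mimic an outdated list.
var staticInstanceTypes = map[string]instanceType{
	"m4.xlarge": {VCPU: 4, MemoryMb: 16384, GPU: 0},
	"m5.xlarge": {VCPU: 4, MemoryMb: 16384, GPU: 0},
}

// annotationsFor returns the scale-from-zero annotations for an instance
// type, or an error when the type is missing from the static list.
func annotationsFor(name string) (map[string]string, error) {
	info, ok := staticInstanceTypes[name]
	if !ok {
		return nil, fmt.Errorf("unknown instance type %q: cannot set scale-from-zero annotations", name)
	}
	return map[string]string{
		cpuKey:    strconv.FormatInt(info.VCPU, 10),
		memoryKey: strconv.FormatInt(info.MemoryMb, 10),
		gpuKey:    strconv.FormatInt(info.GPU, 10),
	}, nil
}

func main() {
	for _, name := range []string{"m5.xlarge", "p4d.24xlarge"} {
		ann, err := annotationsFor(name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(name, ann)
	}
}
```

With a table like this, a listed type gets its annotations, while a newer type such as p4d.24xlarge only produces an error, so the autoscaler has no capacity information for scaling from 0.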

The referenced PR regenerates these lists using the code from the upstream autoscaler, and shows the differences in the updated "pkg/actuators/machineset/ec2_instance_types.go" file: https://github.com/openshift/cluster-api-provider-aws/pull/367/files

Version-Release number of selected component (if applicable):
4.7

How reproducible:

Sometimes

Steps to Reproduce:
1. Create a MachineSet for AWS using the p4d.24xlarge instance type
2. Check the annotations on the resource
3. Observe that no annotations are set and that error messages appear in the logs
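
The check in step 2 can be sketched in Go as follows. This is only an illustration: missingAnnotations is a hypothetical helper, the annotation keys are assumed to be the ones the MachineSet controller sets, and the annotation map would in practice come from the MachineSet's metadata.

```go
package main

import "fmt"

// scaleFromZeroKeys lists the annotation keys assumed to be required for
// scaling from 0; verify them against the controller source.
var scaleFromZeroKeys = []string{
	"machine.openshift.io/vCPU",
	"machine.openshift.io/memoryMb",
	"machine.openshift.io/GPU",
}

// missingAnnotations returns the scale-from-zero keys that are absent
// or empty in the given MachineSet annotations.
func missingAnnotations(annotations map[string]string) []string {
	var missing []string
	for _, key := range scaleFromZeroKeys {
		if value, ok := annotations[key]; !ok || value == "" {
			missing = append(missing, key)
		}
	}
	return missing
}

func main() {
	// An affected MachineSet carries none of the expected annotations.
	affected := map[string]string{}
	fmt.Println("missing:", missingAnnotations(affected))
}
```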

Actual results:

Scaling from 0 is not available for new instance types, such as p4d.24xlarge (AWS).

Expected results:

The MachineSet annotation logic should return correct values for any available instance type.

Additional info:

--- Additional comment from jspeed on 2020-11-13 12:05:00 UTC ---

This is low priority right now as it works for most instance types. We may be able to add a quick fix (including the new types) during the next sprint and look into a proper long-term fix at a later date.

--- Additional comment from mimccune on 2020-12-04 18:53:02 UTC ---

Adding the UpcomingSprint tag; the team should have good bandwidth to address this after feature freeze.

--- Additional comment from jspeed on 2021-01-05 17:21:31 UTC ---

We haven't worked out whether we are going to quick fix this or be able to implement a permanent solution (this will depend on whether there is an API for instance types). Setting the target to --- so that we triage this for future releases.

--- Additional comment from jspeed on 2021-02-08 09:51:55 UTC ---

Since we need to implement a permanent solution to this for all providers, I will convert this work to a Jira card and ensure we create a quick fix in the meantime to update the list of instances.

--- Additional comment from jspeed on 2021-03-25 12:01:10 UTC ---

I've created a Jira card for tracking the dynamic fetching idea longer term. I'm going to use this BZ for the temporary AWS list update for now.

If you want to know the progress of a permanent solution, please see https://issues.redhat.com/browse/OCPCLOUD-1131
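
One possible shape for the dynamic fetching idea, sketched in Go under stated assumptions: nothing here reflects the design tracked in the Jira card. The lookupFunc type and the firstSuccessful helper are hypothetical, and the "dynamic" source is a stub map standing in for whatever provider API may be available.

```go
package main

import (
	"errors"
	"fmt"
)

// capacity is a hypothetical record of one instance type's resources.
type capacity struct {
	VCPU, MemoryMb, GPU int64
}

// lookupFunc is a hypothetical abstraction over any source of instance
// type data (a provider API, a static list, and so on).
type lookupFunc func(instanceType string) (capacity, error)

var errUnknownType = errors.New("unknown instance type")

// tableLookup builds a lookupFunc backed by an in-memory table.
func tableLookup(table map[string]capacity) lookupFunc {
	return func(name string) (capacity, error) {
		c, ok := table[name]
		if !ok {
			return capacity{}, fmt.Errorf("%w: %s", errUnknownType, name)
		}
		return c, nil
	}
}

// firstSuccessful tries each source in order and returns the first hit,
// so a dynamic source can be preferred with the static list as fallback.
func firstSuccessful(lookups ...lookupFunc) lookupFunc {
	return func(name string) (capacity, error) {
		lastErr := fmt.Errorf("%w: %s", errUnknownType, name)
		for _, lookup := range lookups {
			c, err := lookup(name)
			if err == nil {
				return c, nil
			}
			lastErr = err
		}
		return capacity{}, lastErr
	}
}

func main() {
	// Stub for a dynamic source; a real one would query the provider.
	dynamic := tableLookup(map[string]capacity{
		"p4d.24xlarge": {VCPU: 96, MemoryMb: 1179648, GPU: 8},
	})
	// Stand-in for the static list that can go out of date.
	static := tableLookup(map[string]capacity{
		"m5.xlarge": {VCPU: 4, MemoryMb: 16384, GPU: 0},
	})

	resolve := firstSuccessful(dynamic, static)
	for _, name := range []string{"p4d.24xlarge", "m5.xlarge", "not-a-type"} {
		c, err := resolve(name)
		fmt.Println(name, c, err)
	}
}
```

Keeping the static list as the last source means a failed or unavailable dynamic lookup degrades to the current behaviour instead of removing scale-from-0 support entirely.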

Comment 1 sunzhaohua 2021-04-08 01:51:39 UTC
Verified after testing on a cluster launched with cluster-bot.
clusterversion: 4.7.0-0.ci.test-2021-04-08-004722-ci-ln-xzssv92

After adding a workload, the MachineSet could scale up from 0 with instanceType: p4d.24xlarge
$ oc get clusterversion
NAME      VERSION                                           AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.ci.test-2021-04-08-004722-ci-ln-xzssv92   True        False         18m     Cluster version is 4.7.0-0.ci.test-2021-04-08-004722-ci-ln-xzssv92

$ oc get machineautoscaler
NAME                  REF KIND     REF NAME                                      MIN   MAX   AGE
machineautoscaler-b   MachineSet   ci-ln-xzssv92-d5d6b-j75m8-worker-us-east-2b   0     2     2m38s

$ oc get machine
NAME                                                PHASE      TYPE           REGION      ZONE         AGE
ci-ln-xzssv92-d5d6b-j75m8-master-0                  Running    m5.xlarge      us-east-2   us-east-2a   54m
ci-ln-xzssv92-d5d6b-j75m8-master-1                  Running    m5.xlarge      us-east-2   us-east-2b   54m
ci-ln-xzssv92-d5d6b-j75m8-master-2                  Running    m5.xlarge      us-east-2   us-east-2a   54m
ci-ln-xzssv92-d5d6b-j75m8-worker-us-east-2a-v478m   Running    m4.xlarge      us-east-2   us-east-2a   48m
ci-ln-xzssv92-d5d6b-j75m8-worker-us-east-2a-wn82p   Running    m4.xlarge      us-east-2   us-east-2a   48m
ci-ln-xzssv92-d5d6b-j75m8-worker-us-east-2b-t5rwn   Running    p4d.24xlarge   us-east-2   us-east-2b   4m24s

Comment 7 Milind Yadav 2021-06-21 06:38:33 UTC
Pre-merge verified by @Zhsun, hence moved to VERIFIED.

Comment 10 errata-xmlrpc 2021-06-29 04:19:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.18 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2502