Bug 1835160

Summary: MachineSets: missing replicas in spec breaks autoscaler
Product: OpenShift Container Platform Reporter: Joel Speed <jspeed>
Component: Cloud Compute Assignee: Joel Speed <jspeed>
Cloud Compute sub component: Other Providers QA Contact: Milind Yadav <miyadav>
Status: CLOSED CURRENTRELEASE Docs Contact:
Severity: low    
Priority: unspecified CC: agarcial, deads, hongkliu, jhou, jspeed, mgugino, zhsun
Version: 4.4   
Target Milestone: ---   
Target Release: 4.4.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: A MachineSet can have a nil replicas field.
Consequence: The autoscaler cannot determine the size of the MachineSet as it currently is.
Fix: Allow the autoscaler to read the replica count from the status replica field.
Result: The autoscaler will always be able to determine the current size of a MachineSet.
Story Points: ---
Clone Of: 1820654 Environment:
Last Closed: 2020-08-18 10:08:33 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1820654, 1852061    
Bug Blocks:    

Comment 3 sunzhaohua 2020-06-29 06:01:55 UTC
Failed to verify
clusterversion: 4.4.0-0.nightly-2020-06-27-171816

steps: 
1. Edit machineset zhsun62944-fvwvq-worker-us-east-2c with "replicas: "
# oc get machineset
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsun62944-fvwvq-worker-us-east-2a   1         1         1       1           7h8m
zhsun62944-fvwvq-worker-us-east-2b   1         1         1       1           7h8m
zhsun62944-fvwvq-worker-us-east-2c             1         1       1           7h8m
2. Create clusterautoscaler
3. Create machineautoscaler with machineset zhsun62944-fvwvq-worker-us-east-2c
# oc get machineautoscaler
NAME       REF KIND     REF NAME                             MIN   MAX   AGE
worker-c   MachineSet   zhsun62944-fvwvq-worker-us-east-2c   1     3     2m34s
4. Check the autoscaler logs:

E0629 05:58:58.268468       1 pre_filtering_processor.go:62] Error while checking node group size openshift-machine-api/zhsun62944-fvwvq-worker-us-east-2c: group size not found
I0629 05:58:58.269214       1 scale_down.go:776] No candidates for scale down
I0629 05:59:08.282893       1 static_autoscaler.go:343] No unschedulable pods
E0629 05:59:08.283011       1 pre_filtering_processor.go:62] Error while checking node group size openshift-machine-api/zhsun62944-fvwvq-worker-us-east-2c: group size not found


5. Create a workload to scale up the cluster. The autoscaler couldn't scale up:

I0629 05:52:06.869902       1 scale_up.go:271] Pod openshift-machine-api/scale-up-788b7f7c75-dg8mh is unschedulable
I0629 05:52:06.869911       1 scale_up.go:271] Pod openshift-machine-api/scale-up-788b7f7c75-6h7k4 is unschedulable
I0629 05:52:06.871060       1 scale_up.go:423] No expansion options
E0629 05:52:06.871367       1 pre_filtering_processor.go:62] Error while checking node group size openshift-machine-api/zhsun62944-fvwvq-worker-us-east-2c: group size not found

Comment 4 Joel Speed 2020-06-29 10:04:36 UTC
I've just taken a look at the code paths that lead to the log line `pre_filtering_processor.go:62] Error while checking node group size openshift-machine-api/zhsun62944-fvwvq-worker-us-east-2c: group size not found`, and I think this issue is unrelated to this bug.

That log comes from https://github.com/openshift/kubernetes-autoscaler/blob/2ec541e3e31778428a75e0fc16469ba22d94e4bd/cluster-autoscaler/processors/nodes/pre_filtering_processor.go#L60-L64, which checks for the node group ID as a key in a map.

That map comes from https://github.com/openshift/kubernetes-autoscaler/blob/2ec541e3e31778428a75e0fc16469ba22d94e4bd/cluster-autoscaler/utils/utils.go#L26-L38, which suggests the only reason the group size would not be present is an error in `TargetSize()`. However, in our implementation, it's impossible for that method to return an error.
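
The lookup described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the linked files: `nodeGroup`, `sizeMap`, and `checkGroup` are hypothetical names, and the `sizeKnown` flag is an assumed stand-in for whatever prevents a group's size from being recorded.

```go
package main

import "fmt"

// nodeGroup is a hypothetical stand-in for an autoscaler node group.
type nodeGroup struct {
	id        string
	size      int
	sizeKnown bool // assumption: false models a group whose size could not be recorded
}

// sizeMap records a size for every group whose size is known.
// A group that is skipped here never becomes a key in the map.
func sizeMap(groups []nodeGroup) map[string]int {
	sizes := make(map[string]int)
	for _, g := range groups {
		if !g.sizeKnown {
			continue
		}
		sizes[g.id] = g.size
	}
	return sizes
}

// checkGroup looks the group ID up as a map key and reports a
// "group size not found" error when the key is absent.
func checkGroup(sizes map[string]int, id string) error {
	if _, found := sizes[id]; !found {
		return fmt.Errorf("checking node group size %s: group size not found", id)
	}
	return nil
}

func main() {
	sizes := sizeMap([]nodeGroup{
		{id: "worker-a", size: 1, sizeKnown: true},
		{id: "worker-c"}, // size never recorded
	})
	fmt.Println(checkGroup(sizes, "worker-a"))
	fmt.Println(checkGroup(sizes, "worker-c"))
}
```

In this sketch the error surfaces at lookup time, far from the place where the group was skipped, which is one reason the log line alone does not point at the root cause.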

I'll spin up a cluster and see if I can reproduce this

Comment 5 Joel Speed 2020-06-29 11:55:06 UTC
OK, I've found the issue. The bug isn't fully resolved, but it is hidden in the later releases by the autoscaling-from-zero work that was done.

https://github.com/openshift/kubernetes-autoscaler/blob/2ec541e3e31778428a75e0fc16469ba22d94e4bd/cluster-autoscaler/cloudprovider/openshiftmachineapi/machineapi_controller.go#L365 This line requires that the replicas be set; otherwise it will fail.
https://github.com/openshift/kubernetes-autoscaler/blob/4abdca547be45a251ba33486324f0cac8664ca25/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller.go#L510 In the master branch this was hidden, because this bug was verified on a platform that supports scaling from zero.
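
For illustration, the fix described in the Doc Text (reading the replica count from the status field when the spec field is unset) might look roughly like the sketch below. The `machineSet` type and `currentSize` function are hypothetical and heavily simplified; they are not the types or functions used in the linked controller code.

```go
package main

import "fmt"

// machineSet is a hypothetical, trimmed-down stand-in for a MachineSet.
type machineSet struct {
	specReplicas   *int32 // nil when spec.replicas is omitted
	statusReplicas int32  // replica count reported in status
}

// currentSize prefers spec.replicas and falls back to the status
// replica count when spec.replicas is nil, so a size is always returned.
func currentSize(ms machineSet) int32 {
	if ms.specReplicas != nil {
		return *ms.specReplicas
	}
	return ms.statusReplicas
}

func main() {
	three := int32(3)
	fmt.Println(currentSize(machineSet{specReplicas: &three, statusReplicas: 2})) // 3
	fmt.Println(currentSize(machineSet{statusReplicas: 1}))                       // 1
}
```

A check that simply requires `specReplicas` to be non-nil would reject the second MachineSet above, which matches the failure seen in the verification steps.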

Comment 6 Joel Speed 2020-06-29 16:42:22 UTC
We've decided we need to revert this, as the fix is not complete for 4.4. We will complete the fix in master and start the backport process, but it is unlikely to be worth completing for 4.4 by the time it is backported to 4.5.

Comment 8 Joel Speed 2020-08-18 10:08:33 UTC
This is fixed in newer releases and is of low priority and low impact to users. It is not worth backporting this at this point.