Bug 1835160

Summary:	MachineSets: missing replicas in spec breaks autoscaler
Product:	OpenShift Container Platform	Reporter:	Joel Speed <jspeed>
Component:	Cloud Compute	Assignee:	Joel Speed <jspeed>
Cloud Compute sub component:	Other Providers	QA Contact:	Milind Yadav <miyadav>
Status:	CLOSED CURRENTRELEASE	Docs Contact:
Severity:	low
Priority:	unspecified	CC:	agarcial, deads, hongkliu, jhou, jspeed, mgugino, zhsun
Version:	4.4
Target Milestone:	---
Target Release:	4.4.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: A MachineSet can have a nil replicas field. Consequence: The autoscaler cannot determine the current size of the MachineSet. Fix: Allow the autoscaler to read the replica count from the status replicas field. Result: The autoscaler will always be able to determine the current size of a MachineSet.	Story Points:	---
Clone Of:	1820654	Environment:
Last Closed:	2020-08-18 10:08:33 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1820654, 1852061
Bug Blocks:

Comment 3 sunzhaohua 2020-06-29 06:01:55 UTC

Failed to verify
clusterversion: 4.4.0-0.nightly-2020-06-27-171816

steps: 
1. Edit machineset zhsun62944-fvwvq-worker-us-east-2c with "replicas: "
# oc get machineset
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsun62944-fvwvq-worker-us-east-2a   1         1         1       1           7h8m
zhsun62944-fvwvq-worker-us-east-2b   1         1         1       1           7h8m
zhsun62944-fvwvq-worker-us-east-2c             1         1       1           7h8m
2. Create clusterautoscaler
3. Create machineautoscaler with machineset zhsun62944-fvwvq-worker-us-east-2c
# oc get machineautoscaler
NAME       REF KIND     REF NAME                             MIN   MAX   AGE
worker-c   MachineSet   zhsun62944-fvwvq-worker-us-east-2c   1     3     2m34s
4. Check the autoscaler logs:

E0629 05:58:58.268468       1 pre_filtering_processor.go:62] Error while checking node group size openshift-machine-api/zhsun62944-fvwvq-worker-us-east-2c: group size not found
I0629 05:58:58.269214       1 scale_down.go:776] No candidates for scale down
I0629 05:59:08.282893       1 static_autoscaler.go:343] No unschedulable pods
E0629 05:59:08.283011       1 pre_filtering_processor.go:62] Error while checking node group size openshift-machine-api/zhsun62944-fvwvq-worker-us-east-2c: group size not found


5. Create a workload to scale up the cluster; the autoscaler couldn't scale up

I0629 05:52:06.869902       1 scale_up.go:271] Pod openshift-machine-api/scale-up-788b7f7c75-dg8mh is unschedulable
I0629 05:52:06.869911       1 scale_up.go:271] Pod openshift-machine-api/scale-up-788b7f7c75-6h7k4 is unschedulable
I0629 05:52:06.871060       1 scale_up.go:423] No expansion options
E0629 05:52:06.871367       1 pre_filtering_processor.go:62] Error while checking node group size openshift-machine-api/zhsun62944-fvwvq-worker-us-east-2c: group size not found

Comment 4 Joel Speed 2020-06-29 10:04:36 UTC

I've just taken a look at the code paths that lead to the log line `pre_filtering_processor.go:62] Error while checking node group size openshift-machine-api/zhsun62944-fvwvq-worker-us-east-2c: group size not found`, and I think this issue is unrelated to this bug.

That log comes from here https://github.com/openshift/kubernetes-autoscaler/blob/2ec541e3e31778428a75e0fc16469ba22d94e4bd/cluster-autoscaler/processors/nodes/pre_filtering_processor.go#L60-L64, checking the nodegroup ID as a key in a map

That map comes from https://github.com/openshift/kubernetes-autoscaler/blob/2ec541e3e31778428a75e0fc16469ba22d94e4bd/cluster-autoscaler/utils/utils.go#L26-L38, which suggests the only reason the group size would not be present is an error in `TargetSize()`. However, in our implementation it's impossible for that method to return an error.
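
To make the reasoning above concrete, here is a minimal Go sketch of the pattern being described: a map from node group ID to size is built by calling `TargetSize()` on each group, and a later lookup reports "group size not found" when the key is absent. This is not the autoscaler's actual code. The `nodeGroup` interface, `fakeGroup`, `groupSizes` and `checkGroupSize` are hypothetical names invented for illustration, and the error path in `TargetSize()` is assumed only to show how an entry can go missing.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeGroup is a hypothetical, trimmed-down stand-in for the autoscaler's
// node group abstraction; only the two methods used here are modelled.
type nodeGroup interface {
	ID() string
	TargetSize() (int, error)
}

// fakeGroup is illustrative only. A nil replicas pointer makes TargetSize fail.
type fakeGroup struct {
	id       string
	replicas *int
}

func (g fakeGroup) ID() string { return g.id }

func (g fakeGroup) TargetSize() (int, error) {
	if g.replicas == nil {
		return 0, errors.New("replicas not set")
	}
	return *g.replicas, nil
}

// groupSizes builds the ID -> size map, skipping groups whose size is unknown.
func groupSizes(groups []nodeGroup) map[string]int {
	sizes := make(map[string]int)
	for _, g := range groups {
		size, err := g.TargetSize()
		if err != nil {
			continue // this group never gets an entry in the map
		}
		sizes[g.ID()] = size
	}
	return sizes
}

// checkGroupSize performs the map lookup and fails when the key is absent.
func checkGroupSize(sizes map[string]int, id string) (int, error) {
	size, found := sizes[id]
	if !found {
		return 0, fmt.Errorf("node group %s: group size not found", id)
	}
	return size, nil
}

func main() {
	one := 1
	sizes := groupSizes([]nodeGroup{
		fakeGroup{id: "worker-us-east-2a", replicas: &one},
		fakeGroup{id: "worker-us-east-2c"}, // replicas unset
	})
	if _, err := checkGroupSize(sizes, "worker-us-east-2c"); err != nil {
		fmt.Println(err)
	}
}
```

In this sketch, a group that is never passed to `groupSizes` at all ends up missing from the map in exactly the same way, so a failed lookup alone does not prove that `TargetSize()` returned an error.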

I'll spin up a cluster and see if I can reproduce this

Comment 5 Joel Speed 2020-06-29 11:55:06 UTC

OK, I've found the issue. The bug isn't fully resolved, but it is hidden in later releases by the autoscaling-from-zero work that was done.

https://github.com/openshift/kubernetes-autoscaler/blob/2ec541e3e31778428a75e0fc16469ba22d94e4bd/cluster-autoscaler/cloudprovider/openshiftmachineapi/machineapi_controller.go#L365 This line requires replicas to be set; otherwise it fails.
https://github.com/openshift/kubernetes-autoscaler/blob/4abdca547be45a251ba33486324f0cac8664ca25/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller.go#L510 In the master branch this was hidden, because the bug was verified on a platform that supports scaling from zero.
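
For reference, the fix described in the Doc Text (read the replica count from the status replicas field when the spec field is nil) can be sketched in Go roughly as follows. This is an illustration under stated assumptions, not the code in either linked file: the `machineSet`, `machineSetSpec` and `machineSetStatus` types and the `currentReplicas` helper are hypothetical, simplified stand-ins.

```go
package main

import "fmt"

// Hypothetical, simplified MachineSet shapes. Only the replica fields are
// modelled; the real API types carry many more fields.
type machineSetSpec struct {
	Replicas *int32 // nil when "replicas:" is left empty in the spec
}

type machineSetStatus struct {
	Replicas int32 // replica count reported in the status
}

type machineSet struct {
	Name   string
	Spec   machineSetSpec
	Status machineSetStatus
}

// currentReplicas prefers the desired count from the spec and falls back to
// the count reported in the status when the spec field is nil.
func currentReplicas(ms machineSet) int32 {
	if ms.Spec.Replicas != nil {
		return *ms.Spec.Replicas
	}
	return ms.Status.Replicas
}

func main() {
	ms := machineSet{
		Name:   "zhsun62944-fvwvq-worker-us-east-2c",
		Status: machineSetStatus{Replicas: 1},
	}
	fmt.Printf("%s: %d replicas\n", ms.Name, currentReplicas(ms))
}
```

The point of the fallback is that any check which needs a replica count has a value to work with even when the spec field is nil, rather than treating the MachineSet as having no known size.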

Comment 6 Joel Speed 2020-06-29 16:42:22 UTC

We've decided we need to revert this, as the fix is not complete for 4.4. We will complete the fix in master and start the backport process, but by the time it is backported to 4.5 it is unlikely to be worth completing for 4.4.

Comment 8 Joel Speed 2020-08-18 10:08:33 UTC

This is fixed in newer releases and is of low priority and low impact to users. It is not worth backporting at this point.