1835160 – MachineSets: missing replicas in spec breaks autoscaler

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	low
Target Milestone:	---
Target Release:	4.4.z
Assignee:	Joel Speed
QA Contact:	Milind Yadav
Docs Contact:
URL:
Whiteboard:
Depends On:	1820654 1852061
Blocks:

Reported:	2020-05-13 09:46 UTC by Joel Speed
Modified:	2020-08-18 10:08 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: A MachineSet can have a nil replicas field. Consequence: The autoscaler cannot determine the current size of the MachineSet. Fix: Allow the autoscaler to read the replica count from the status replicas field. Result: The autoscaler will always be able to determine the current size of a MachineSet.
Clone Of:	1820654
Environment:
Last Closed:	2020-08-18 10:08:33 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift kubernetes-autoscaler pull 153	0	None	closed	[release-4.4] BUG 1835160: Fallback to status if replicas nil in spec	2021-01-27 06:59:25 UTC

Comment 3 sunzhaohua 2020-06-29 06:01:55 UTC

Failed to verify
clusterversion: 4.4.0-0.nightly-2020-06-27-171816

steps: 
1. Edit machineset zhsun62944-fvwvq-worker-us-east-2c with "replicas: "
# oc get machineset
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsun62944-fvwvq-worker-us-east-2a   1         1         1       1           7h8m
zhsun62944-fvwvq-worker-us-east-2b   1         1         1       1           7h8m
zhsun62944-fvwvq-worker-us-east-2c             1         1       1           7h8m
2. Create a clusterautoscaler
3. Create a machineautoscaler for machineset zhsun62944-fvwvq-worker-us-east-2c
# oc get machineautoscaler
NAME       REF KIND     REF NAME                             MIN   MAX   AGE
worker-c   MachineSet   zhsun62944-fvwvq-worker-us-east-2c   1     3     2m34s
4. Check the autoscaler logs:

E0629 05:58:58.268468       1 pre_filtering_processor.go:62] Error while checking node group size openshift-machine-api/zhsun62944-fvwvq-worker-us-east-2c: group size not found
I0629 05:58:58.269214       1 scale_down.go:776] No candidates for scale down
I0629 05:59:08.282893       1 static_autoscaler.go:343] No unschedulable pods
E0629 05:59:08.283011       1 pre_filtering_processor.go:62] Error while checking node group size openshift-machine-api/zhsun62944-fvwvq-worker-us-east-2c: group size not found


5. Create a workload to scale up the cluster; the autoscaler could not scale up:

I0629 05:52:06.869902       1 scale_up.go:271] Pod openshift-machine-api/scale-up-788b7f7c75-dg8mh is unschedulable
I0629 05:52:06.869911       1 scale_up.go:271] Pod openshift-machine-api/scale-up-788b7f7c75-6h7k4 is unschedulable
I0629 05:52:06.871060       1 scale_up.go:423] No expansion options
E0629 05:52:06.871367       1 pre_filtering_processor.go:62] Error while checking node group size openshift-machine-api/zhsun62944-fvwvq-worker-us-east-2c: group size not found

Comment 4 Joel Speed 2020-06-29 10:04:36 UTC

I've just taken a look at the code paths that lead to the log line `pre_filtering_processor.go:62] Error while checking node group size openshift-machine-api/zhsun62944-fvwvq-worker-us-east-2c: group size not found`, and I think this issue is unrelated to this bug.

That log comes from https://github.com/openshift/kubernetes-autoscaler/blob/2ec541e3e31778428a75e0fc16469ba22d94e4bd/cluster-autoscaler/processors/nodes/pre_filtering_processor.go#L60-L64, which looks up the node group ID as a key in a map.

That map comes from https://github.com/openshift/kubernetes-autoscaler/blob/2ec541e3e31778428a75e0fc16469ba22d94e4bd/cluster-autoscaler/utils/utils.go#L26-L38, which suggests that the only reason the group size would be missing is an error in `TargetSize()`. In our implementation, however, that method cannot return an error.
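
To make the lookup concrete, here is a minimal Go sketch of the pattern described in this comment. It is not copied from the autoscaler source: the `nodeGroup` type, its `failing` flag, and the helpers `sizeMap` and `checkGroup` are hypothetical names used only for illustration. A group whose size lookup fails is left out of the map, so a later lookup by ID reports the size as not found.

```go
package main

import "fmt"

// nodeGroup is a hypothetical stand-in for an autoscaler node group.
type nodeGroup struct {
	id      string
	size    int
	failing bool // hypothetical switch that makes TargetSize return an error
}

// TargetSize returns the desired size, or an error when it cannot be determined.
func (g nodeGroup) TargetSize() (int, error) {
	if g.failing {
		return 0, fmt.Errorf("cannot determine size of %s", g.id)
	}
	return g.size, nil
}

// sizeMap builds an ID -> size map and skips groups whose size lookup fails.
func sizeMap(groups []nodeGroup) map[string]int {
	sizes := make(map[string]int)
	for _, g := range groups {
		size, err := g.TargetSize()
		if err != nil {
			continue // the group never makes it into the map
		}
		sizes[g.id] = size
	}
	return sizes
}

// checkGroup turns a missing map key into a "group size not found" error.
func checkGroup(sizes map[string]int, id string) error {
	if _, ok := sizes[id]; !ok {
		return fmt.Errorf("error while checking node group size %s: group size not found", id)
	}
	return nil
}

func main() {
	groups := []nodeGroup{
		{id: "worker-a", size: 1},
		{id: "worker-c", failing: true},
	}
	sizes := sizeMap(groups)
	fmt.Println(checkGroup(sizes, "worker-a"))
	fmt.Println(checkGroup(sizes, "worker-c"))
}
```

In this sketch a group is also reported as not found if it was never passed to `sizeMap` at all, so a failing size lookup is not the only way to end up with a missing key.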

I'll spin up a cluster and see if I can reproduce this

Comment 5 Joel Speed 2020-06-29 11:55:06 UTC

OK, I've found the issue: the bug isn't fully resolved, but in later releases it is hidden by the scale-from-zero work.

https://github.com/openshift/kubernetes-autoscaler/blob/2ec541e3e31778428a75e0fc16469ba22d94e4bd/cluster-autoscaler/cloudprovider/openshiftmachineapi/machineapi_controller.go#L365 This line requires replicas to be set; otherwise it fails.
https://github.com/openshift/kubernetes-autoscaler/blob/4abdca547be45a251ba33486324f0cac8664ca25/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller.go#L510 In the master branch the problem was hidden, because this bug was verified on a platform that supports scaling from zero.
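
For illustration, the difference between a lookup that requires replicas in the spec and the fallback described in the Doc Text can be sketched in Go as follows. This is a minimal sketch under stated assumptions, not the code at either link: the `machineSet` struct and the functions `strictReplicas` and `replicasWithFallback` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// machineSet is a hypothetical, trimmed-down view of a MachineSet.
type machineSet struct {
	specReplicas   *int32 // nil when spec.replicas is omitted
	statusReplicas int32  // observed replica count from status
}

// strictReplicas fails whenever spec.replicas is missing.
func strictReplicas(ms machineSet) (int32, error) {
	if ms.specReplicas == nil {
		return 0, errors.New("spec.replicas not set")
	}
	return *ms.specReplicas, nil
}

// replicasWithFallback prefers spec.replicas and falls back to status.replicas.
func replicasWithFallback(ms machineSet) int32 {
	if ms.specReplicas != nil {
		return *ms.specReplicas
	}
	return ms.statusReplicas
}

func main() {
	// A MachineSet edited to have no spec.replicas but one running machine.
	ms := machineSet{statusReplicas: 1}
	if _, err := strictReplicas(ms); err != nil {
		fmt.Println("strict lookup failed:", err)
	}
	fmt.Println("fallback size:", replicasWithFallback(ms))
}
```

The point of the sketch is that the fallback only helps if every code path that reads the size uses it; a single remaining strict lookup is enough to keep the MachineSet from being handled.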

Comment 6 Joel Speed 2020-06-29 16:42:22 UTC

We've decided to revert this, as the fix is not complete for 4.4. We will complete the fix in master and start the backport process, but by the time it has been backported to 4.5 it is unlikely to be worth completing for 4.4.

Comment 8 Joel Speed 2020-08-18 10:08:33 UTC

This is fixed in newer releases and is of low priority and low impact to users. It is not worth backporting at this point.