Bug 1820654

Summary: MachineSets: missing replicas in spec breaks autoscaler
Product: OpenShift Container Platform
Component: Cloud Compute
Sub component: Other Providers
Reporter: Michael Gugino <mgugino>
Assignee: Joel Speed <jspeed>
QA Contact: Milind Yadav <miyadav>
Status: CLOSED ERRATA
Severity: low
Priority: unspecified
CC: agarcial, deads, hongkliu, jhou, jspeed, mgugino
Version: 4.3.0
Keywords: Reopened
Target Release: 4.5.z
Doc Type: Bug Fix
Doc Text:
Cause: The replicas field in the spec of a MachineSet can be set to a nil value.
Consequence: The autoscaler cannot determine the current number of replicas within the MachineSet and therefore cannot perform any scaling operations.
Fix: If the replicas field is not set within the spec, fall back to the replicas field within the status, which holds the last number of replicas observed by the MachineSet controller.
Result: The autoscaler can make scaling decisions when the spec replicas field is nil, assuming the MachineSet controller has recently synced the number of replicas to the status.
Clone Of: 1819029
Last Closed: 2020-08-17 20:05:19 UTC
Bug Depends On: 1852061
Bug Blocks: 1827307, 1835160

Comment 1 Michael Gugino 2020-04-03 14:23:32 UTC
Problem:  Cluster Autoscaler depends on reading a value from MachineSet.Spec.Replicas to determine the current count.  Since we cannot enforce validation against this field being present, we should fall back to reading Replicas from MachineSet.Status.Replicas if it's unset in Spec.
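
A minimal sketch of the proposed fallback, in Go. The helper name is illustrative; the real autoscaler reads these fields through unstructured accessors rather than the typed API:

    import machinev1beta1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1"

    // replicaCount returns the desired replica count for a MachineSet,
    // falling back to the last observed count in the status when
    // spec.replicas is nil. The fallback is only accurate if the
    // MachineSet controller has recently synced the status.
    func replicaCount(ms *machinev1beta1.MachineSet) int32 {
        if ms.Spec.Replicas != nil {
            return *ms.Spec.Replicas
        }
        return ms.Status.Replicas
    }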

Comment 2 Joel Speed 2020-04-06 13:15:54 UTC
Adding some context about "cannot enforce validation" to this description:

CRDs do support validation if they are structural: things like minimum or maximum values can be set, or defaults applied if the user has not set a value.

The MachineSet CRD is not currently structural, but will be once https://bugzilla.redhat.com/show_bug.cgi?id=1769004 is fixed (a PR is open for this).

Once the MachineSet CRDs are structural, we should be able to set `// +kubebuilder:default=1` on the `replicas` field. However, there is currently a bug in the Kubernetes OpenAPI implementation around defaulting the replicas field within an object with a scale subresource, which prevents us from doing this (see https://github.com/kubernetes/kubernetes/pull/89833 for more info).
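
For illustration, the marker would sit on the field roughly like this (a sketch of the eventual change, blocked on the bug above):

    type MachineSetSpec struct {
        // Replicas is the number of desired replicas.
        // With a structural schema, the marker below would have the
        // API server default a missing value to 1.
        // +kubebuilder:default=1
        // +optional
        Replicas *int32 `json:"replicas,omitempty"`

        // ... other fields elided ...
    }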

Comment 3 Michael Gugino 2020-04-06 13:23:04 UTC
Regardless of whether or not we can default or enforce the field on a technical level, we have released the API with the field behavior as-is.  This particular problem was discovered because the user had automation to maintain the state of an object (e.g., make sure it's present, but don't set the replica count).  If we add any defaulting or setting of the spec, there may be external actors enforcing the state of the spec and we'll enter some kind of dueling-controllers problem.

I propose we leave the field as-is, and just add the ability for the autoscaler to read from the status if the spec field is not present.

Comment 4 Joel Speed 2020-04-06 14:20:16 UTC
> This particular problem was discovered because the user had automation to maintain the state of an object (e.g., make sure it's present, but don't set the replica count).

Having read through the slack conversation, this was not the impression that I got.
Towards the end of the conversation, the problem of not being able to scale was attributed to insufficient resources: the disks they had requested and whether the cloud provider had the right quota.

Currently, if the autoscaler cannot get a value from the replicas field (because `replicas` is nil), it returns that there are 0 replicas. In this case, if scale from zero were enabled, autoscaling could have continued as expected (i.e. shortly, this wouldn't block autoscaling), but currently, as mentioned, this means autoscaling would be blocked.

That aside, instead of relying on the status field, which has a different inference (as mentioned in the slack conversation), I would suggest that the behaviour of the autoscaler should match what the MachineSet controller would do if it saw a nil value for the `replicas` field, which, as per the docs, would be to expect the MachineSet to have 1 replica (https://github.com/openshift/machine-api-operator/blob/79ccf91b68eda5ae352e6dca9d2382f161f9b9e7/pkg/apis/machine/v1beta1/machineset_types.go#L57).
That would mean changing the autoscaler's expectation from `0` to `1` replica.

However, the docs are wrong. The actual behaviour of the MachineSet controller is that it will not reconcile a MachineSet if the replicas field is nil (https://github.com/openshift/machine-api-operator/blob/79ccf91b68eda5ae352e6dca9d2382f161f9b9e7/pkg/controller/machineset/controller.go#L268-L270); instead it returns an error, requeueing the MachineSet.
The fact that the MachineSet controller considers this an error suggests to me that a nil replicas field is an error and a fundamentally broken situation, and that we shouldn't cater for users manually doing this, especially as setting it to nil once it has been set requires explicitly patching to drop the field.
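
For reference, the guard looks roughly like this (simplified from the linked controller code, using the github.com/pkg/errors package that codebase imports):

    // Reconciliation bails out before doing any work when spec.replicas
    // is nil, returning an error that requeues the MachineSet forever.
    if machineSet.Spec.Replicas == nil {
        return errors.Errorf("the Replicas field in Spec for machineset %v is nil, this is unexpected", machineSet.Name)
    }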

Given that a nil replicas is an error state already within our controllers, changing the behaviour of the field to have a default value would not, I believe, be a breaking change as it would not change the behaviour of any of our controllers.
Also, given that kubectl apply's three-way merging accounts for OpenAPI defaulting, anyone programmatically running `kubectl apply -f machineset.yaml` with an empty replicas field, as is done by the team who found this problem, would again not be affected by a change to add an OpenAPI default.

An intermediate solution to this problem, until the OpenAPI defaulting bug is fixed, would be to introduce a mutating webhook to default the field, though this would require a reasonable amount of effort.
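
The webhook body itself would be tiny; a sketch using controller-runtime's Defaulter interface (the effort is in deploying the webhook and managing its certificates, not in this code):

    import "k8s.io/utils/pointer"

    // Default fills in spec.replicas with 1 when it is unset. With a
    // mutating webhook registered, a nil value would never be persisted.
    func (m *MachineSet) Default() {
        if m.Spec.Replicas == nil {
            m.Spec.Replicas = pointer.Int32Ptr(1)
        }
    }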

Comment 5 Michael Gugino 2020-04-06 14:45:22 UTC
We released machineset.spec.replicas as an optional field: https://github.com/openshift/machine-api-operator/blob/master/pkg/apis/machine/v1beta1/machineset_types.go#L58-L59

The machineset controller should act on the replicas in the spec.  If replicas is empty, it should do nothing.  That API is the source of truth, not the controller implementation.

> Currently, if the autoscaler cannot get a value from the replicas field (because `replicas` is nil), it returns that there are 0 replicas. In this case, if scale from zero were enabled, autoscaling could have continued as expected (i.e. shortly, this wouldn't block autoscaling), but currently, as mentioned, this means autoscaling would be blocked.

There were a couple of issues going on in this case.  After we sorted out the issues with the CPU and memory limits, the autoscaler wouldn't scale.  This was because it was unable to determine the current replica count.  The replicas were actually above 0 at the time, so it was not a scale-from-zero issue.

Changing replicas from optional to mandatory and updating controller logic would require an API migration.  Upstream also treats this field as optional: https://github.com/kubernetes-sigs/cluster-api/blob/master/api/v1alpha3/machineset_types.go#L38

Comment 6 Alberto 2020-04-08 08:23:12 UTC
> Problem:  Cluster Autoscaler depends on reading a value from MachineSet.Spec.Replicas to determine the current count.  Since we cannot enforce validation against this field being present, we should fall back to reading Replicas from MachineSet.Status.Replicas if it's unset in Spec.

> This was because it was unable to determine the current replica count.  The replicas were actually above 0 at the time, so it was not a scale-from-zero issue.

Can you elaborate on the details above, adding code references?

1 - To me it seems that, for the autoscaler, a nil replicas is the same as replicas=0, and this assumption seems reasonable to me (see the sketch after this list):
https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_machineset.go#L66
https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller.go#L508

2 - The fact that the replicas field is a pointer seems like a mistake to me. I can't think of any scenario where you want to differentiate between nil and zero. Anyone explicitly setting this to nil today is broken as the controller won't reconcile and we should prevent this from happening.
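
To make point 1 concrete, here is a sketch of how the unstructured read at the links above collapses nil to zero (illustrative, not the actual autoscaler code):

    import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

    // specReplicas reads spec.replicas from an unstructured MachineSet.
    // NestedInt64 reports "not found" for a nil field, so a nil value
    // is indistinguishable from an explicit zero here.
    func specReplicas(obj map[string]interface{}) int64 {
        replicas, found, err := unstructured.NestedInt64(obj, "spec", "replicas")
        if err != nil || !found {
            return 0
        }
        return replicas
    }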

My suggestions would be:
1 - Go through the steps that followed the user behind this ticket and see where we can improve autoscaler logging.
2 - Prevent users from persisting a nil replicas to disk by defaulting to either 0 or 1. My only concern with defaulting to 1 is that it would change behaviour for users explicitly setting this to nil on creation; I would consider that a bug fix though.

Thoughts?

Comment 7 Michael Gugino 2020-04-08 13:44:19 UTC
> Anyone explicitly setting this to nil today is broken as the controller won't reconcile and we should prevent this from happening.

They're not necessarily completely broken.  The scenario that could be playing out is:

1) I want some automation to ensure I have a particular machineset object defined in the cluster, without specifying the number of replicas.

2) I will scale the machineset as I see fit.  That will set the number of replicas, the machineset will reconcile to that number.

3) After reconciliation, we have new machines as desired.

4) Periodic automation removes replicas again because it detects a diff between what's in the cluster and what's defined in the source of truth.

In this scenario, treating nil replicas as 0 is bad.  This scenario is essentially what our CI cluster wants to do, but the autoscaler breaks when no replica count is found.  We confirmed with them that setting spec.replicas to the actual value (matching what's in the status) instantly fixed the autoscaler.
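
For the record, restoring the field is a one-line patch (the MachineSet name here is hypothetical):

    $ oc patch machineset ci-worker-us-east-1a -n openshift-machine-api \
        --type merge -p '{"spec":{"replicas":3}}'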


Since we specified this field as optional in the API, we should not change it arbitrarily.  That might break users that were otherwise happy with how things were working.  Also, upstream the field is optional as well, and I don't want to deviate from that if at all possible.

If we convince upstream to make that field non-optional, I think we can entertain it in the future.  But for now, just fix the autoscaler and the machine-controller (just tell the machine-controller to no-op if replicas is nil).

Comment 8 Joel Speed 2020-04-08 14:09:37 UTC
> 1) I want some automation to ensure I have a particular machineset object defined in the cluster, without specifying the number of replicas.

The idea here being that the automation would create the MachineSet with a nil value, so no machines get created, and then it is scaled later, maybe by the autoscaler or manually?
Setting a default of 1 would slightly change this behaviour, I agree: a machine would be created, and the autoscaler or a manual action would then need to scale it. It's not impossible to automate that creation and setting, even when using kubectl/kubectl apply.
Do you think there are a lot of people who do this? Creating machinesets and constantly reconciling them with a controller?

> 4) Periodic automation removes replicas again because it detects a diff between what's in the cluster and what's defined in the source of truth.

Assuming the MachineSet controller is bug free, this should never happen; the MachineSet controller is responsible for ensuring the number of replicas specified in the spec matches the number of machines in the cluster.
If someone is relying on this bug to pause reconciliation then they're leveraging what we would consider a bug, no? I would not expect 99.9% of users to leverage a bug like this, as this is not the behaviour we promise from our MachineSet functionality.

> Also, upstream the field is optional as well, and I don't want to deviate from that if at all possible.

Just wanted to add: upstream don't let you set the value to nil. They have, since v1alpha1, had a defaulting webhook that sets the value if it comes through as nil, so our behaviour is currently different from theirs and has always been that way.

https://github.com/kubernetes-sigs/cluster-api/blob/b1b35203e680c02a7a707e30927224812aeb122f/api/v1alpha3/machineset_webhook.go#L51-L53

Comment 9 Michael Gugino 2020-04-08 15:04:31 UTC
> Do you think there are a lot of people who do this? Creating machinesets and constantly reconciling them with a controller?

More than none.  I know our internal users (dedicated) are most likely doing it (not just the CI environment) as they have a lot of config enforcement stuff as well.

> If someone is relying on this bug to pause reconciliation then they're leveraging what we would consider a bug, no?

Debatable.  The field is optional, so by definition, the field being nil is not a bug.  We never defined what should happen when that field is nil, so I believe we should maintain the current behavior.  The behavior I'm suggesting impacts 0 people in a negative way.  If you want 0 machines, set replicas to 0 machines.

> Just wanted to add: upstream don't let you set the value to nil. They have, since v1alpha1, had a defaulting webhook that sets the value if it comes through as nil, so our behaviour is currently different from theirs and has always been that way.

Okay, I didn't realize that.  Looks like CRDs support defaulting in 1.17, so we can utilize that instead of a webhook going forward.

My position is this would require an API migration (e.g., v1beta2) to handle properly.

Comment 10 David Eads 2020-04-08 17:48:20 UTC
Some observations and positions
 1. For CRD scale subresources, on spec write a missing replicas is allowed but treated as zero: https://github.com/kubernetes/kubernetes/blob/3c3fc800df015955d7c38f7b39c94f3d261fd8d7/staging/src/k8s.io/apiextensions-apiserver/pkg/registry/customresource/validator.go#L149-L156
 2. For CRD scale subresources, on scale read a missing replicas results in an error: https://github.com/kubernetes/kubernetes/blob/3c3fc800df015955d7c38f7b39c94f3d261fd8d7/staging/src/k8s.io/apiextensions-apiserver/pkg/registry/customresource/etcd.go#L229-L235
 3. I think we want this API to eventually participate in scale subresources.
 4. Zero does not work well for machineset updates; you cannot scale back up.  Don't break users.
 5. Having the field optional allows a client to use updates to set all the field values except .spec.replicas.

  
Any change we make will be breaking for writers, but doesn't have to be for readers.  Consider one possible migration path:
 1. Force create and update to have replicas set to a value >=0. This is a new validation rule, enforced by admission if need be.  This gives parity with upstream scale validation on spec.
 2. In controllers we have that set .spec.replicas (do we have any?): if .spec.replicas is missing, set it to .status.replicas.  This ensures that our controllers can continue to update.
 3. In admission, on create, set a missing replicas value to 0.  This is ok because nothing existed before.  They don't have machines we're scaling down.
 4. In admission, on update, set a missing replicas value to the previous .spec.replicas if it existed.  If the value of .spec.replicas didn't exist, set it to .status.replicas to avoid scaling down.  If there isn't a .status.replicas, set it to 1.  Setting it to one here ensures that we never scale them down to a point of not being able to come back up.  (Sketched in code after this list.)
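
To make steps 3 and 4 concrete, a hypothetical Go sketch of the admission defaulting (not shipped code; assumes the machinev1beta1 types and k8s.io/utils/pointer):

    // defaultReplicas applies the create/update defaulting described in
    // the migration path above. oldMS is nil on create.
    func defaultReplicas(oldMS, newMS *machinev1beta1.MachineSet) {
        if newMS.Spec.Replicas != nil {
            return // the writer supplied a value; nothing to do
        }
        if oldMS == nil {
            // Create: nothing existed before, so zero is safe.
            newMS.Spec.Replicas = pointer.Int32Ptr(0)
            return
        }
        switch {
        case oldMS.Spec.Replicas != nil:
            // Update: preserve the previous spec value.
            newMS.Spec.Replicas = oldMS.Spec.Replicas
        case oldMS.Status.Replicas != 0:
            // No previous spec value: fall back to status to avoid
            // scaling down.
            newMS.Spec.Replicas = pointer.Int32Ptr(oldMS.Status.Replicas)
        default:
            // No observed replicas either: set 1 so the MachineSet can
            // always come back up.
            newMS.Spec.Replicas = pointer.Int32Ptr(1)
        }
    }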

This causes the following effects:
 1. A user who was using this "feature" to trigger a calculation ("if there is no .spec.replicas, decide what the new size should be") will need to find a new trigger.  This person should probably have found a better trigger anyway.
 2. A user who was using this "feature" to be able to update the rest of spec without affecting replicas will still be able to do this effectively.


Conclusions:
 1. Unless there is a strong driver I would not choose to make this behavior change in a z-stream
 2. If we choose to tighten, we should immediately, in 4.4, issue a release note stating our intent to fix the bug in a future release.
 3. We should attempt to discover clusters that are leveraging this "feature".  I think we could determine this by seeing a .spec.replicas transition from having a value -> becoming empty -> having a value that doesn't match .status.replicas.  Wire that to an alert, plumb it through telemetry, and we'll have a list of customers to contact.  If we don't see transitions like that, then I think the migration described above should be equivalent for existing users.
 4. This is complicated enough to deserve an enhancement to describe what drives us to this decision, what benefit we get, how we mitigate for users, how users can react, and how we found these users.

Wow, this response got way too long

Comment 11 Michael Gugino 2020-04-08 18:41:24 UTC
I think this is overboard.  One or two lines of code in the autoscaler and the machineset controller vs doing all that.

I'm not at all a fan of having an optional field that we default to an integer.  There's not really a compelling reason to do this.  This seems like making a change for change's sake.

Comment 12 Joel Speed 2020-04-22 08:57:36 UTC
We had a conversation regarding this topic out of band and I believe came to the following decision:

Firstly, we will make a small PR to fix the autoscaler in the short term, with a proviso/TODO that the code should eventually be removed, and warn users that we are inferring the value of replicas and that this is "not supported".

Secondly, we will start the process of investigating changing the value to a default via OpenAPI:
- Collect telemetry to determine how many users this change might break
- Warn users that using a nil replicas is unsupported
- Write up an enhancement to explain the change and the motivation for doing so
- Add a release note to current notes stating intention to change the behaviour
- Aim to add defaulting into the 4.y stream release (release tbd)

(Pretty much as described by David https://bugzilla.redhat.com/show_bug.cgi?id=1820654#c10)

As I see it, the main reasons for setting the default:
- Other resources using the scale subresource default to 1 (e.g. Deployments, ReplicaSets, upstream Cluster API MachineDeployments/MachineSets); we should aim to be consistent with their behaviour
- Ideally behaviour should match the rest of the ecosystem to reduce surprises for users
- Our docs state that the field defaults to 1
- Convergence with behaviour of upstream cluster-api will reduce differences for customers as we try to reach parity
- Prevents customers unknowingly having an "invalid" value that doesn't make sense to controllers or blocks controller behaviour
- We won't have to infer the replica count in the autoscaler and will have a concrete value (there may be reasons that the spec.replicas and status.replicas differ?)

We could aim for the 4.5 release, but given this has taken some time, plus if we want to collect telemetry around this topic, I think we will likely include this in the 4.6 release.

Comment 15 sunzhaohua 2020-05-12 03:44:00 UTC
Verified
clusterversion: 4.5.0-0.nightly-2020-05-10-231314

steps:
1. Create a new machineset with replicas set to nil
2. Create a clusterautoscaler
3. Create a machineautoscaler targeting the new machineset
4. Check the autoscaler logs: the status.replicas field is used when the spec.replicas field is not present

W0512 03:11:22.099912       1 clusterapi_machineset.go:77] MachineSet "zhsunazure511-4grc9-worker-westus1" has nil spec.replicas. This is unsupported behaviour. Falling back to status.replicas.

5. Create a workload to scale up the cluster; the autoscaler shouldn't break.

W0512 03:12:09.587564       1 clusterapi_machineset.go:77] MachineSet "zhsunazure511-4grc9-worker-westus1" has nil spec.replicas. This is unsupported behaviour. Falling back to status.replicas.
I0512 03:12:09.617210       1 scale_up.go:452] Best option to resize: openshift-machine-api/zhsunazure511-4grc9-worker-westus1
I0512 03:12:09.617237       1 scale_up.go:456] Estimated 20 nodes needed in openshift-machine-api/zhsunazure511-4grc9-worker-westus1
W0512 03:12:09.784997       1 clusterapi_machineset.go:77] MachineSet "zhsunazure511-4grc9-worker-westus1" has nil spec.replicas. This is unsupported behaviour. Falling back to status.replicas.
W0512 03:12:09.983178       1 clusterapi_machineset.go:77] MachineSet "zhsunazure511-4grc9-worker-westus1" has nil spec.replicas. This is unsupported behaviour. Falling back to status.replicas.
I0512 03:12:09.983218       1 scale_up.go:570] Final scale-up plan: [{openshift-machine-api/zhsunazure511-4grc9-worker-westus1 0->3 (max: 3)}]
I0512 03:12:09.983248       1 scale_up.go:659] Scale-up: setting group openshift-machine-api/zhsunazure511-4grc9-worker-westus1 size to 3

6. Check machines and nodes
$ oc get machine
NAME                                       PHASE     TYPE              REGION   ZONE   AGE
zhsunazure511-4grc9-master-0               Running   Standard_D8s_v3   westus          25h
zhsunazure511-4grc9-master-1               Running   Standard_D8s_v3   westus          25h
zhsunazure511-4grc9-master-2               Running   Standard_D8s_v3   westus          25h
zhsunazure511-4grc9-worker-westus-cwfvd    Running   Standard_D2s_v3   westus          25h
zhsunazure511-4grc9-worker-westus-fq5s8    Running   Standard_D2s_v3   westus          25h
zhsunazure511-4grc9-worker-westus-nxhlw    Running   Standard_D2s_v3   westus          25h
zhsunazure511-4grc9-worker-westus1-87w5j   Running   Standard_D2s_v3   westus          22m
zhsunazure511-4grc9-worker-westus1-gzprk   Running   Standard_D2s_v3   westus          22m
zhsunazure511-4grc9-worker-westus1-zgs62   Running   Standard_D2s_v3   westus          22m

$ oc get node
NAME                                       STATUS   ROLES    AGE   VERSION
zhsunazure511-4grc9-master-0               Ready    master   25h   v1.18.0-rc.1
zhsunazure511-4grc9-master-1               Ready    master   25h   v1.18.0-rc.1
zhsunazure511-4grc9-master-2               Ready    master   25h   v1.18.0-rc.1
zhsunazure511-4grc9-worker-westus-cwfvd    Ready    worker   25h   v1.18.0-rc.1
zhsunazure511-4grc9-worker-westus-fq5s8    Ready    worker   25h   v1.18.0-rc.1
zhsunazure511-4grc9-worker-westus-nxhlw    Ready    worker   25h   v1.18.0-rc.1
zhsunazure511-4grc9-worker-westus1-87w5j   Ready    worker   16m   v1.18.0-rc.1
zhsunazure511-4grc9-worker-westus1-gzprk   Ready    worker   16m   v1.18.0-rc.1
zhsunazure511-4grc9-worker-westus1-zgs62   Ready    worker   16m   v1.18.0-rc.1

Comment 16 Hongkai Liu 2020-05-12 23:43:13 UTC
Would it be backported to 4.4?

Comment 17 Joel Speed 2020-05-13 09:15:45 UTC
Yes, I will backport this to 4.4.z

Comment 18 Joel Speed 2020-06-29 16:45:46 UTC
We have discovered this isn't entirely fixed; reopening.

Comment 22 Milind Yadav 2020-08-10 05:51:14 UTC
VERIFIED on 4.5.0-0.nightly-2020-08-06-014216

The automation CI results confirm the fix is applied.

Hence, moving to VERIFIED.

Comment 24 errata-xmlrpc 2020-08-17 20:05:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.6 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3330