Bug 1872659 - ClusterAutoscaler doesn't scale down when a node is not needed anymore
Summary: ClusterAutoscaler doesn't scale down when a node is not needed anymore
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Steven Hardy
QA Contact: Daniel
URL:
Whiteboard:
Depends On: 1853267 1901040 1909682
Blocks:
 
Reported: 2020-08-26 10:23 UTC by Daniel
Modified: 2021-07-27 22:33 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:32:47 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 22:33:31 UTC)

Description Daniel 2020-08-26 10:23:53 UTC
Description of problem:
ClusterAutoscaler doesn't scale down when a node is not needed anymore

Version-Release number of selected component (if applicable):
Client Version: 4.6.0-0.nightly-2020-08-26-032807
Server Version: 4.6.0-0.nightly-2020-08-26-032807
Kubernetes Version: v1.19.0-rc.2+aaf4ce1-dirty


How reproducible:
Every time

Steps to Reproduce:
Cluster setup: 2 deployed workers and 1 worker that is only provisioned
* Full instructions are in the attached test case.

1. Create a new bmh and wait for it to reach the "Ready" state
2. Create a new ClusterAutoscaler
3. Create a new MachineAutoscaler
4. Create a stress deployment (a custom image that runs the "stress" tool to simulate memory pressure) that requests 6500Mi of memory for each container, so that each node can only run 1 pod, and create 1 pod from this deployment (a hedged manifest sketch follows this list).
5. Scale the stress deployment to 3 replicas.
6. Two pods should be running (1 on each worker), but 1 application pod remains pending because the cluster does not have enough resources to schedule it.
7. After some time a new machine should be created, the bmh should become provisioned, and a new worker should join the cluster. Now all 3 pods are running (1 on each worker).
8. Scale the stress deployment to 2 replicas.
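
For illustration, a minimal sketch of the stress deployment and scale operations from steps 4, 5 and 8. The deployment name, namespace and image are placeholders (the actual test uses a custom image running the "stress" tool); only the 6500Mi memory request and the replica counts come from the steps above:

$ cat <<EOF | oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress
  template:
    metadata:
      labels:
        app: stress
    spec:
      containers:
      - name: stress
        image: quay.io/example/stress:latest    # placeholder for the custom stress image
        resources:
          requests:
            memory: 6500Mi                      # large enough that each node fits only 1 pod
EOF

$ oc scale deployment/stress --replicas=3    # step 5: the third pod should trigger a scale-up
$ oc scale deployment/stress --replicas=2    # step 8: the extra node should become unneeded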

Actual results:
The newly created pod is deleted, but even after waiting more than an hour the machineset doesn't scale down. The two relevant settings (a sketch of the corresponding ClusterAutoscaler fields follows this list):
* scale-down-delay-after-add: how long after a scale-up before scale-down evaluation resumes (defaults to 10 min)
* scale-down-unneeded-time: how long a node must be unneeded before it is eligible for scale-down (defaults to 10 min)
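
For reference, these two settings correspond to fields under spec.scaleDown in the OpenShift ClusterAutoscaler resource. A hedged sketch using the default values mentioned above (this is not the reporter's actual manifest):

$ cat <<EOF | oc apply -f -
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  scaleDown:
    enabled: true
    delayAfterAdd: 10m    # scale-down-delay-after-add
    unneededTime: 10m     # scale-down-unneeded-time
EOF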


Expected results:
Based on the scale-down-delay-after-add and scale-down-unneeded-time, the machineset should scale down when the new node is not needed anymore.

Additional info:

Comment 1 Michael Gugino 2020-08-28 19:00:23 UTC
Need a must-gather.

Comment 3 Daniel 2020-09-15 06:48:40 UTC
I deployed a cluster with 3 running workers and no pods running on them. I created a ClusterAutoscaler and a MachineAutoscaler configured to start scaling down after 30 seconds of unneeded time.
The minimum replicas I specified is 1, so since there is no load on the workers I expected it to start scaling down.
Instead, when looking at the cluster-autoscaler pod I noticed a log message saying "ignoring 3 nodes unremovable".
Link to must-gather: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/must-gather-bz1872659.tar.gz
ClusterAutoscaler and MachineAutoscaler manifests are available at https://gist.github.com/dmaizel/17be497171e9a46a4ccc66da7ac8c5a5.
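
The actual manifests are in the gist above; purely as a rough sketch of the settings described in this comment (30 seconds of unneeded time, a minimum of 1 replica), the MachineAutoscaler would look something like the following, paired with spec.scaleDown.unneededTime: 30s on the ClusterAutoscaler. The maxReplicas value and MachineSet name are placeholders only:

$ cat <<EOF | oc apply -f -
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-0
  namespace: openshift-machine-api
spec:
  minReplicas: 1                       # minimum replicas described in this comment
  maxReplicas: 3                       # placeholder
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: ostest-jmgbm-worker-0        # placeholder MachineSet name
EOF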

Comment 4 Michael McCune 2020-09-15 16:16:42 UTC
Just wanted to drop a comment here: I looked through the must-gather info and didn't see anything that immediately stood out as incorrect in the ClusterAutoscaler or MachineAutoscaler (although it looks like the resource manifests are different from the ones linked in the gist). I also looked through the cluster-autoscaler logs and the Machine and MachineSet resources; nothing looked suspicious to me.

@Daniel would you be able to re-run these tests but increase the verbosity of the autoscaler logs by injecting the environment variable `CLUSTER_AUTOSCALER_VERBOSITY=5` into the cluster-autoscaler deployment?

This would allow us to see the utilization metrics, and other internal info, about how the scaler is deciding to skip those machines. I'm curious whether there might be some discrepancy between what we consider to be no load on a machine and what the autoscaler thinks; it appears to be failing because it thinks those machines' utilization is above the minimum threshold.
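
One way to inject that variable, assuming the openshift-machine-api namespace and a deployment named cluster-autoscaler-default (the name is derived from the ClusterAutoscaler resource name, so check `oc get deployments` first); note that the cluster-autoscaler-operator manages this deployment and may revert a direct edit, so this is a sketch rather than the definitive procedure:

$ oc -n openshift-machine-api get deployments
$ oc -n openshift-machine-api set env deployment/cluster-autoscaler-default CLUSTER_AUTOSCALER_VERBOSITY=5
$ oc -n openshift-machine-api set env deployment/cluster-autoscaler-default --list    # verify the variable is present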

Comment 5 Daniel 2020-09-16 09:53:23 UTC
I0916 08:39:43.976179       1 static_autoscaler.go:449] Calculating unneeded nodes
I0916 08:39:43.976284       1 pre_filtering_processor.go:57] Skipping worker-0-2 - no node group config
I0916 08:39:43.976325       1 pre_filtering_processor.go:57] Skipping master-0-0 - no node group config
I0916 08:39:43.976375       1 pre_filtering_processor.go:57] Skipping master-0-1 - no node group config
I0916 08:39:43.976409       1 pre_filtering_processor.go:57] Skipping master-0-2 - no node group config
I0916 08:39:43.976551       1 pre_filtering_processor.go:57] Skipping worker-0-0 - no node group config
I0916 08:39:43.976654       1 pre_filtering_processor.go:57] Skipping worker-0-1 - no node group config

Comment 6 Steven Hardy 2020-09-16 10:30:58 UTC
I traced through the autoscaler code, and it seems the likely reason for that error is either that the node/machine ProviderID doesn't match (which we fixed already), or that the machine's ownerReference to the MachineSet is missing; however, both seem OK AFAICS:

$ oc get machine -o json ostest-jmgbm-worker-0-r7qrm | jq .metadata.name,.metadata.ownerReferences,.spec.providerID
"ostest-jmgbm-worker-0-r7qrm"
[
  {
    "apiVersion": "machine.openshift.io/v1beta1",
    "blockOwnerDeletion": true,
    "controller": true,
    "kind": "MachineSet",
    "name": "ostest-jmgbm-worker-0",
    "uid": "f0a8ae43-6089-46db-8a28-10e4c7cd0be9"
  }
]
"baremetalhost:///openshift-machine-api/ostest-worker-1"

$ oc get machineset
NAME                    DESIRED   CURRENT   READY   AVAILABLE   AGE
ostest-jmgbm-worker-0   3         3         3       3           20h

$ oc get node worker-1 -o json | jq .spec.providerID
"baremetalhost:///openshift-machine-api/ostest-worker-1"

Comment 11 Daniel 2021-02-10 13:32:56 UTC
Not an issue anymore.

Comment 15 errata-xmlrpc 2021-07-27 22:32:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

