Bug 1872659
| Summary: | ClusterAutoscaler doesn't scale down when a node is not needed anymore | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Daniel <dmaizel> |
| Component: | Cloud Compute | Assignee: | Steven Hardy <shardy> |
| Cloud Compute sub component: | BareMetal Provider | QA Contact: | Daniel <dmaizel> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | augol, beth.white, dmaizel, jrouth, mgugino, mimccune, rbartal, shardy, stbenjam |
| Version: | 4.6 | Keywords: | Triaged |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 22:32:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1853267, 1901040, 1909682 | | |
| Bug Blocks: | | | |
Description
Daniel
2020-08-26 10:23:53 UTC
Need must gather.

I deployed a cluster with 3 running workers with no pods running on them. I created a ClusterAutoscaler and a MachineAutoscaler which are configured to start scaling down after 30 seconds of unneeded time. The minimum replicas I specified is 1. Since there is no load on the workers, I expected it to start scaling down. Instead, when looking at the cluster-autoscaler pod logs I noticed a log message saying "ignoring 3 nodes unremovable".

Link to must-gather: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/must-gather-bz1872659.tar.gz

ClusterAutoscaler and MachineAutoscaler manifests are available at https://gist.github.com/dmaizel/17be497171e9a46a4ccc66da7ac8c5a5

Just wanted to drop a comment here. I looked through the must-gather info and I didn't see anything that immediately stood out as incorrect with the ClusterAutoscaler or MachineAutoscaler (although it looks like the resource manifests are different than the ones linked in the gist). I also looked through the cluster-autoscaler logs and the Machine and MachineSet resources; nothing looked suspicious to me.

@Daniel, would you be able to re-run these tests but increase the verbosity of the autoscaler logs by injecting the environment variable `CLUSTER_AUTOSCALER_VERBOSITY=5` into the cluster-autoscaler deployment? This would allow us to see the utilization metrics, and other internal info, about how the scaler is deciding to skip those machines. I'm curious whether there might be some discrepancy between what we consider to be no load on a machine versus what the autoscaler thinks.

It appears to be failing because it thinks those machines' utilization is above the minimum threshold:

I0916 08:39:43.976179 1 static_autoscaler.go:449] Calculating unneeded nodes
I0916 08:39:43.976284 1 pre_filtering_processor.go:57] Skipping worker-0-2 - no node group config
I0916 08:39:43.976325 1 pre_filtering_processor.go:57] Skipping master-0-0 - no node group config
I0916 08:39:43.976375 1 pre_filtering_processor.go:57] Skipping master-0-1 - no node group config
I0916 08:39:43.976409 1 pre_filtering_processor.go:57] Skipping master-0-2 - no node group config
I0916 08:39:43.976551 1 pre_filtering_processor.go:57] Skipping worker-0-0 - no node group config
I0916 08:39:43.976654 1 pre_filtering_processor.go:57] Skipping worker-0-1 - no node group config

I traced through the autoscaler code and it seems like the reason for that error is likely to be either that the node/machine ProviderID doesn't match, or that the machine ownerReference for the MachineSet is missing (which we fixed already); however, both seem OK AFAICS:
$ oc get machine -o json ostest-jmgbm-worker-0-r7qrm | jq .metadata.name,.metadata.ownerReferences,.spec.providerID
"ostest-jmgbm-worker-0-r7qrm"
[
{
"apiVersion": "machine.openshift.io/v1beta1",
"blockOwnerDeletion": true,
"controller": true,
"kind": "MachineSet",
"name": "ostest-jmgbm-worker-0",
"uid": "f0a8ae43-6089-46db-8a28-10e4c7cd0be9"
}
]
"baremetalhost:///openshift-machine-api/ostest-worker-1"
$ oc get machineset
NAME DESIRED CURRENT READY AVAILABLE AGE
ostest-jmgbm-worker-0 3 3 3 3 20h
$ oc get node worker-1 -o json | jq .spec.providerID
"baremetalhost:///openshift-machine-api/ostest-worker-1"
Not an issue anymore.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438