1880930 – ClusterAutoscaler not scaling when there are pending pods - only after increasing number of replicas massively

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.4
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Joel Speed
QA Contact:	sunzhaohua
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-09-21 07:38 UTC by Simon Reber
Modified:	2024-03-25 16:32 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-11-30 15:33:56 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	5586281	0	None	None	None	2020-11-19 09:10:52 UTC

Description Simon Reber 2020-09-21 07:38:43 UTC

Description of problem:

It was reported that the ClusterAutoscaler did not trigger provisioning of additional OpenShift nodes even though pods were in the Pending state, waiting for additional capacity.

This remained the case until the deployment was scaled to 40 replicas. Only then did the ClusterAutoscaler trigger provisioning of nodes to satisfy the needs of the pods.

The question is why it did not trigger provisioning earlier, when 2 or more pods were already pending and waiting for capacity.

We checked all deployment details, including matching node-selector, taints, MachineSet configuration, etc., and did not find any culprit.

Also, if something like a node-selector or similar had caused the issue, scaling would not have been triggered when the replica count was increased by 40.

Version-Release number of selected component (if applicable):

 - 4.4.17

How reproducible:

 - N/A

Steps to Reproduce:
1. N/A

Actual results:

The ClusterAutoscaler did not trigger provisioning of additional OpenShift nodes, even though pods were in the Pending state, waiting for resources.

Expected results:

Additional OpenShift nodes are provisioned when pods are in the Pending state, waiting for resources, and match the criteria for auto-scaling.

Additional info:

Comment 6 Alberto 2020-09-21 09:38:36 UTC

I can't find machineAutoscaler/clusterAutoscaler resources in the must-gather logs.
The cluster autoscaler logs don't show any nginx-deployment-* pod, only one named nginx-app-1-build. All scale-up cycles return with:
"2020-09-16T08:58:21.943253101Z I0916 08:58:21.943194       1 scale_up.go:423] No expansion options"

How long was it since you scaled up to 40 replicas?
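
For anyone triaging similar logs, a minimal sketch of tallying the scale-up messages follows. It assumes the log lines have the same shape as the one quoted above (runtime timestamp followed by a klog header); the names KLOG_LINE and summarize_scale_up are made up for this example and are not part of any autoscaler tooling.

```python
import re
from collections import Counter

# Matches a klog line prefixed with a container-runtime timestamp,
# modelled only on the single line quoted in this comment (assumption).
KLOG_LINE = re.compile(
    r"^(?P<ts>\S+)\s+"                    # runtime timestamp, e.g. 2020-09-16T08:58:21.943253101Z
    r"(?P<level>[IWEF])\d{4}\s+"          # klog severity letter followed by MMDD
    r"\d{2}:\d{2}:\d{2}\.\d+\s+"          # klog time of day
    r"\d+\s+"                             # process/thread id
    r"(?P<src>[\w.]+):(?P<line>\d+)\]\s"  # source file and line number
    r"(?P<msg>.*)$"                       # free-form message
)


def summarize_scale_up(lines):
    """Count the distinct messages logged from scale_up.go (hypothetical helper)."""
    counts = Counter()
    for raw in lines:
        # Strip whitespace and any surrounding double quotes before matching.
        match = KLOG_LINE.match(raw.strip().strip('"'))
        if match and match.group("src") == "scale_up.go":
            counts[match.group("msg")] += 1
    return counts


sample = [
    '"2020-09-16T08:58:21.943253101Z I0916 08:58:21.943194       1 scale_up.go:423] No expansion options"',
]
print(summarize_scale_up(sample))
```

If every scale-up cycle in the captured window reports the same message, that points at the node groups being rejected as a whole rather than at a single pod.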

Comment 19 Joel Speed 2020-10-29 10:30:15 UTC

I've been working on an issue that seems very similar to this one: https://bugzilla.redhat.com/show_bug.cgi?id=1891551

Is it possible that, while the scale-up decisions were being made, the nodes in the node groups that would have matched the pods were all unhealthy?
If they were unhealthy, the autoscaler discards them and assumes either that there will be a healthy node within the node group, or that no node in the node group is ever healthy, in which case it won't scale up.
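
To illustrate the hypothesis, here is a toy model only. It is not the autoscaler's actual code, and Node, template_nodes and expansion_options are hypothetical names. It assumes a node group can be considered for scale-up only if at least one of its existing nodes is healthy.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    group: str
    ready: bool


def template_nodes(nodes):
    """Pick the first ready node of each group as that group's template (assumed behaviour)."""
    templates = {}
    for node in nodes:
        # Unhealthy nodes are discarded and never become a template.
        if node.ready and node.group not in templates:
            templates[node.group] = node
    return templates


def expansion_options(nodes, groups_matching_pod):
    """Return the matching groups that still have a template after the health filter."""
    templates = template_nodes(nodes)
    return [group for group in groups_matching_pod if group in templates]


all_unhealthy = [Node("worker-a-1", "worker-a", False), Node("worker-a-2", "worker-a", False)]
one_healthy = [Node("worker-a-1", "worker-a", False), Node("worker-a-2", "worker-a", True)]

print(expansion_options(all_unhealthy, ["worker-a"]))
print(expansion_options(one_healthy, ["worker-a"]))
```

Under this model, a group whose nodes are all unhealthy yields no expansion option regardless of how many pods are pending, which would be consistent with the "No expansion options" message if the hypothesis holds.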