Bug 1819029 - How to handle alert ClusterAutoscalerUnschedulablePods
Summary: How to handle alert ClusterAutoscalerUnschedulablePods
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
: 4.6.0
Assignee: Michael McCune
QA Contact: sunzhaohua
Depends On:
Blocks: 1827307
Reported: 2020-03-31 01:33 UTC by Hongkai Liu
Modified: 2020-09-18 15:36 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1820654 (view as bug list)
Last Closed:
Target Upstream Version:

Attachments (Terms of Use)
openshift-machine-api.cluster-autoscaler-default-5476d56447-5ww92.24h.log (5.37 MB, text/plain)
2020-03-31 01:35 UTC, Hongkai Liu
no flags Details
openshift-machine-api.machine-api-controllers-7c696b9657-m8t4c.machine-controller.24h.log (2.45 MB, text/plain)
2020-03-31 01:36 UTC, Hongkai Liu
no flags Details
prometheus.query (183.60 KB, image/png)
2020-03-31 01:38 UTC, Hongkai Liu
no flags Details

System ID Priority Status Summary Last Updated
Github openshift cluster-autoscaler-operator pull 163 None closed Bug 1819029: add a document about autoscaler alerts 2020-09-18 15:35:28 UTC

Description Hongkai Liu 2020-03-31 01:33:50 UTC
Description of problem:
Alertmanager on an OCP 4.3 cluster (with the autoscaler configured) fired this alert this afternoon.

[FIRING:1] ClusterAutoscalerUnschedulablePods cluster-autoscaler-default (metrics openshift-machine-api cluster-autoscaler-default-5476d56447-5ww92 openshift-monitoring/k8s cluster-autoscaler-default warning)
Cluster Autoscaler has 32 unschedulable pods
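For context, an alert like this is normally defined by a PrometheusRule shipped with the cluster-autoscaler-operator. The sketch below is illustrative only; the metric name is the upstream autoscaler metric, but the threshold, duration, and rule name are assumptions, not the shipped rule:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-autoscaler-alerts   # hypothetical name
  namespace: openshift-machine-api
spec:
  groups:
  - name: cluster-autoscaler
    rules:
    - alert: ClusterAutoscalerUnschedulablePods
      # cluster_autoscaler_unschedulable_pods_count is the upstream
      # cluster-autoscaler metric; the threshold and "for" duration
      # here are illustrative, not the operator's actual values.
      expr: cluster_autoscaler_unschedulable_pods_count{service="cluster-autoscaler-default"} > 0
      for: 20m
      labels:
        severity: warning
      annotations:
        message: Cluster Autoscaler has {{ $value }} unschedulable pods
```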


Version-Release number of selected component (if applicable):
oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-23-130439   True        False         6d12h   Cluster version is 4.3.0-0.nightly-2020-03-23-130439

How reproducible:
The alert has fired only once so far.

Should I expect no such alerts when the autoscaler is working correctly?
If not, what should I do when I see this alert?

Additional info:
I will attach the pod logs and a Prometheus screenshot.
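As a first pass at the "what should I do" question above, the checks below are a sketch of how one might triage this alert. The namespace and the default autoscaler deployment name match this cluster's setup; adjust them for other configurations:

```shell
# Which pods are actually stuck Pending?
oc get pods --all-namespaces --field-selector=status.phase=Pending

# What does the autoscaler say about its scale-up attempts?
oc -n openshift-machine-api logs deployment/cluster-autoscaler-default

# Are the MachineSets scaling toward their desired replica counts?
oc -n openshift-machine-api get machinesets
```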

Comment 1 Hongkai Liu 2020-03-31 01:35:40 UTC
Created attachment 1674917 [details]

Comment 2 Hongkai Liu 2020-03-31 01:36:54 UTC
Created attachment 1674918 [details]

Comment 3 Hongkai Liu 2020-03-31 01:38:02 UTC
Created attachment 1674920 [details]

Comment 4 Michael Gugino 2020-03-31 17:23:27 UTC
This alert fires when the cluster autoscaler is unable to scale up.  Depending on the cluster autoscaler's configuration, the alert can be normal and expected.  In this particular case, however, there is a bug in the cluster autoscaler.  I'm going to open a new BZ and link it here.

In the meantime, this bug should remain open until we document the cause and remedy of this alert under normal circumstances.
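To illustrate the "normal and expected depending on configuration" point: a ClusterAutoscaler resource that caps cluster growth can leave pods unschedulable (and the alert firing) even though the autoscaler is behaving exactly as configured. A minimal sketch, with example limit values:

```yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  # With caps like these, a burst of pods that needs more capacity than
  # the limits allow will stay Pending, and the alert can fire even
  # though the autoscaler is working as configured.
  resourceLimits:
    maxNodesTotal: 10
    cores:
      min: 8
      max: 64
    memory:
      min: 4
      max: 256
```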

Comment 5 Hongkai Liu 2020-03-31 20:06:06 UTC
Thanks to Michael for helping me fix the autoscaler.

Comment 7 Joel Speed 2020-05-13 15:10:12 UTC
Assigning to Michael McCune, as he has a Jira card to document all of the alerts over the next sprint.

Comment 8 Alberto 2020-05-29 10:54:27 UTC
tagging with upcomingSprint to re-evaluate priority.

Comment 9 Michael McCune 2020-06-19 20:20:11 UTC
Just adding a note here that I am starting to investigate this issue.

Comment 10 Michael McCune 2020-06-24 18:16:06 UTC
I think the next best action we can take is to start a document for the cluster-autoscaler-operator describing these alerts and possible guidance around them. Michael Gugino started a pull request[0] for the machine-api-operator to document its alerts; we should do the same for the cluster-autoscaler-operator.

[0] https://github.com/openshift/machine-api-operator/pull/606

Comment 11 Michael McCune 2020-06-24 18:20:37 UTC
I have created an issue on the cluster-autoscaler-operator to track this: https://github.com/openshift/cluster-autoscaler-operator/issues/153

Comment 12 Michael McCune 2020-08-17 19:11:43 UTC
Ideally we will have a PR in place for the documentation in the next sprint.
