Bug 1819029 - How to handle alert ClusterAutoscalerUnschedulablePods
Summary: How to handle alert ClusterAutoscalerUnschedulablePods
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
: 4.6.0
Assignee: Michael McCune
QA Contact: sunzhaohua
Depends On:
Blocks: 1827307
Reported: 2020-03-31 01:33 UTC by Hongkai Liu
Modified: 2020-09-18 15:36 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1820654 (view as bug list)
Last Closed:
Target Upstream Version:

Attachments (Terms of Use)
openshift-machine-api.cluster-autoscaler-default-5476d56447-5ww92.24h.log (5.37 MB, text/plain)
2020-03-31 01:35 UTC, Hongkai Liu
no flags Details
openshift-machine-api.machine-api-controllers-7c696b9657-m8t4c.machine-controller.24h.log (2.45 MB, text/plain)
2020-03-31 01:36 UTC, Hongkai Liu
no flags Details
prometheus.query (183.60 KB, image/png)
2020-03-31 01:38 UTC, Hongkai Liu
no flags Details

System ID Priority Status Summary Last Updated
Github openshift cluster-autoscaler-operator pull 163 None closed Bug 1819029: add a document about autoscaler alerts 2020-09-18 15:35:28 UTC

Description Hongkai Liu 2020-03-31 01:33:50 UTC
Description of problem:
Alertmanager on an OCP 4.3 cluster (with the autoscaler configured) fired this alert this afternoon.

[FIRING:1] ClusterAutoscalerUnschedulablePods cluster-autoscaler-default (metrics openshift-machine-api cluster-autoscaler-default-5476d56447-5ww92 openshift-monitoring/k8s cluster-autoscaler-default warning)
Cluster Autoscaler has 32 unschedulable pods
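For context, an alert like this is normally defined by a PrometheusRule shipped with the cluster-autoscaler-operator. The sketch below is illustrative only; the metric name is the upstream autoscaler metric, but the threshold, duration, and rule name are assumptions, not the shipped rule:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-autoscaler-alerts   # hypothetical name
  namespace: openshift-machine-api
spec:
  groups:
  - name: cluster-autoscaler
    rules:
    - alert: ClusterAutoscalerUnschedulablePods
      # cluster_autoscaler_unschedulable_pods_count is the upstream
      # cluster-autoscaler metric; the threshold and "for" duration
      # here are illustrative, not the operator's actual values.
      expr: cluster_autoscaler_unschedulable_pods_count{service="cluster-autoscaler-default"} > 0
      for: 20m
      labels:
        severity: warning
      annotations:
        message: Cluster Autoscaler has {{ $value }} unschedulable pods
```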


Version-Release number of selected component (if applicable):
oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-23-130439   True        False         6d12h   Cluster version is 4.3.0-0.nightly-2020-03-23-130439

How reproducible:
The alert has fired only once so far.

Should I expect no such alerts when the autoscaler is working correctly?
If not, what should I do when I see this alert?

Additional info:
I will attach the pod logs and a Prometheus screenshot.
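As a first pass at the "what should I do" question above, the checks below are a sketch of how one might triage this alert. The namespace and the default autoscaler deployment name match this cluster's setup; adjust them for other configurations:

```shell
# Which pods are actually stuck Pending?
oc get pods --all-namespaces --field-selector=status.phase=Pending

# What does the autoscaler say about its scale-up attempts?
oc -n openshift-machine-api logs deployment/cluster-autoscaler-default

# Are the MachineSets scaling toward their desired replica counts?
oc -n openshift-machine-api get machinesets
```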

Comment 1 Hongkai Liu 2020-03-31 01:35:40 UTC
Created attachment 1674917 [details]

Comment 2 Hongkai Liu 2020-03-31 01:36:54 UTC
Created attachment 1674918 [details]

Comment 3 Hongkai Liu 2020-03-31 01:38:02 UTC
Created attachment 1674920 [details]

Comment 4 Michael Gugino 2020-03-31 17:23:27 UTC
This alert fires when the cluster autoscaler is unable to scale up.  Depending on the cluster autoscaler's configuration, the alert can be normal and expected.  In this particular case, however, there is a bug in the cluster autoscaler.  I'm going to open a new BZ and link it here.

In the meantime, this bug should remain open until we document the cause and remedy of this alert under normal circumstances.
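To illustrate the "normal and expected depending on configuration" point: a ClusterAutoscaler resource that caps cluster growth can leave pods unschedulable (and the alert firing) even though the autoscaler is behaving exactly as configured. A minimal sketch, with example limit values:

```yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  # With caps like these, a burst of pods that needs more capacity than
  # the limits allow will stay Pending, and the alert can fire even
  # though the autoscaler is working as configured.
  resourceLimits:
    maxNodesTotal: 10
    cores:
      min: 8
      max: 64
    memory:
      min: 4
      max: 256
```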

Comment 5 Hongkai Liu 2020-03-31 20:06:06 UTC
Thanks to Michael for helping me fix the autoscaler.

Comment 7 Joel Speed 2020-05-13 15:10:12 UTC
Assigning to Michael McCune, as he has a Jira card to document all of the alerts over the next sprint.

Comment 8 Alberto 2020-05-29 10:54:27 UTC
tagging with upcomingSprint to re-evaluate priority.

Comment 9 Michael McCune 2020-06-19 20:20:11 UTC
Just adding a note here that I am starting to investigate this issue.

Comment 10 Michael McCune 2020-06-24 18:16:06 UTC
I think the next best action we can take is to start a document for the cluster-autoscaler-operator describing these alerts and possible guidance around them. Michael Gugino started a pull request[0] for the machine-api-operator to document its alerts; we should do the same for the cluster-autoscaler-operator.

[0] https://github.com/openshift/machine-api-operator/pull/606

Comment 11 Michael McCune 2020-06-24 18:20:37 UTC
I have created an issue on the cluster-autoscaler-operator to track this: https://github.com/openshift/cluster-autoscaler-operator/issues/153

Comment 12 Michael McCune 2020-08-17 19:11:43 UTC
Ideally we will have a PR in place for the documentation in the next sprint.
