Description of problem:
AlertManager on an OCP 4.3 cluster (with the autoscaler configured) fired this alert this afternoon:
[FIRING:1] ClusterAutoscalerUnschedulablePods cluster-autoscaler-default (metrics 10.130.0.16:8085 openshift-machine-api cluster-autoscaler-default-5476d56447-5ww92 openshift-monitoring/k8s cluster-autoscaler-default warning)
Cluster Autoscaler has 32 unschedulable pods
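
A rough sketch of commands that could be used to look at the state behind this alert (assuming the autoscaler deployment is named cluster-autoscaler-default, as the alert labels suggest, and that the count reflects pods stuck in Pending):

  # list pods that currently cannot be scheduled anywhere
  oc get pods --all-namespaces --field-selector=status.phase=Pending

  # check the autoscaler's own view of why it cannot place them
  oc -n openshift-machine-api logs deployment/cluster-autoscaler-default | grep -i unschedulable

  # the Prometheus metric backing the alert is presumably
  # cluster_autoscaler_unschedulable_pods_count, which can be graphed in the console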
Version-Release number of selected component (if applicable):
oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-23-130439   True        False         6d12h   Cluster version is 4.3.0-0.nightly-2020-03-23-130439
I got this alert only once.
Should I expect no such alerts when the autoscaler is working correctly?
Or what should I do when I see this alert?
I will attach the pod logs and a Prometheus screenshot.
Created attachment 1674917
Created attachment 1674918
Created attachment 1674920
This alert is caused by the cluster autoscaler's inability to scale up. Depending on the cluster autoscaler's configuration, the alert can be normal and expected. In this particular case, there is a bug in the cluster autoscaler; I'm going to open a new BZ and link it here.
In the meantime, this bug should remain open until we document the cause of and remedy for this alert under normal circumstances.
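
For anyone hitting this alert, a sketch of configuration checks that can show whether the autoscaler has simply hit its configured limits (assuming the ClusterAutoscaler resource is named "default" and the MachineAutoscalers live in openshift-machine-api, which matches this cluster):

  # inspect the cluster-wide resource limits (maxNodesTotal, cores, memory)
  oc get clusterautoscaler default -o yaml

  # inspect the per-MachineSet min/max replica bounds
  oc -n openshift-machine-api get machineautoscaler

  # compare against current MachineSet replicas; if a MachineSet is already at
  # its configured maximum, the autoscaler cannot add nodes and this alert can
  # fire even though the autoscaler itself is healthy
  oc -n openshift-machine-api get machinesets

If those limits are intentional, the alert is informational; if not, raising the MachineAutoscaler's maxReplicas (or the ClusterAutoscaler's resource limits) lets the autoscaler add nodes again.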
Thanks to Michael for helping me fix the autoscaler.
Assigning to Michael McCune, as he has a Jira card to document all of the alerts over the next sprint.
Tagging with upcomingSprint to re-evaluate priority.
Just adding a note here that I am starting to investigate this issue.
I think the next best action we can take is to create a document for the cluster-autoscaler-operator that describes these alerts and offers guidance around them. Michael Gugino started a pull request for the machine-api-operator to document its alerts; we should do the same for the cluster-autoscaler-operator.
I have created an issue on the cluster-autoscaler-operator to track this: https://github.com/openshift/cluster-autoscaler-operator/issues/153
Ideally we will have a PR in place for the documentation in the next sprint.