Bug 1813069 - How to handle alert ClusterAutoscalerNodesNotReady
Summary: How to handle alert ClusterAutoscalerNodesNotReady
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.5.0
Assignee: Alberto
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-12 20:44 UTC by Hongkai Liu
Modified: 2020-08-04 18:05 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-04 18:05:16 UTC
Target Upstream Version:
Embargoed:
miyadav: needinfo+


Attachments
alert.KubeCPUOvercommit (262.86 KB, image/png), uploaded 2020-03-12 20:44 UTC by Hongkai Liu
Alerting message (223.29 KB, image/png), uploaded 2020-04-20 04:13 UTC by Milind Yadav


Links
Github openshift/cluster-autoscaler-operator pull 139 (closed): BUG 1813069: Drop ClusterAutoscalerNodesNotReady alert (last updated 2021-02-05 14:11:57 UTC)
Red Hat Product Errata RHBA-2020:2409 (last updated 2020-08-04 18:05:18 UTC)

Description Hongkai Liu 2020-03-12 20:44:25 UTC
Created attachment 1669779 [details]
alert.KubeCPUOvercommit

Alertmanager is set up on a CI build farm cluster (OCP 4.3) to send notifications to Slack.

Recently the following alerts fired, and I am not sure how I should debug/fix them.

[FIRING:1] ClusterAutoscalerNodesNotReady cluster-autoscaler-default (metrics 10.128.0.23:8085 openshift-machine-api cluster-autoscaler-default-8776dcb6c-ml4wc openshift-monitoring/k8s cluster-autoscaler-default warning unready)
Cluster Autoscaler has 1 unready nodes
https://coreos.slack.com/archives/CV1UZU53R/p1583952670051400


[FIRING:1] KubeCPUOvercommit (openshift-monitoring/k8s warning)
Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.
https://coreos.slack.com/archives/CV1UZU53R/p1584031232061900


It would be nice to have documentation covering:

* What is the timing for the autoscaler to trigger the ClusterAutoscalerNodesNotReady alert?
* How is ClusterAutoscalerNodesNotReady related to KubeCPUOvercommit? Do they share similar timing?
* Are we supposed to silence these alerts when the autoscaler is working properly (since the autoscaler should add more nodes when the cluster is short on CPU)? If not, what should the debugging procedure be?

Another (not very related) issue:

There are 2 alerts with the same name "KubeCPUOvercommit". See the screenshot.
Is that intended?

Comment 1 Hongkai Liu 2020-03-12 21:28:11 UTC
To isolate the issues:
This bug is for ClusterAutoscalerNodesNotReady only.

Will file another one for KubeCPUOvercommit.

Comment 2 Alberto 2020-03-13 09:02:27 UTC
cluster_autoscaler_nodes_count records the total number of "Autoscaler nodes" (i.e. expected candidates to join the cluster), labeled by node state. The possible states are ready, unready, and notStarted:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/metrics.md#cluster-state

The alert triggers if there is at least one node with a state other than "ready" in cluster_autoscaler_nodes_count for more than 20 minutes: https://github.com/openshift/cluster-autoscaler-operator/blob/8e6f95038c9eee84ef7e305e2e1f4960c918b30d/pkg/controller/clusterautoscaler/monitoring.go#L184
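
If you want to see the exact expression and duration on a live cluster, here is a minimal sketch (assuming the standard openshift-machine-api and openshift-monitoring namespaces, the usual prometheus-k8s-0 pod / prometheus container naming, and curl being available in that container):

# Hedged sketch: inspect the installed rule and query the metric behind it.
# The state="unready" selector mirrors the "unready" label in the firing alert
# quoted in the description; the authoritative expression is in the monitoring.go
# linked above.
oc -n openshift-machine-api get prometheusrules -o yaml | grep -B2 -A6 ClusterAutoscalerNodesNotReady
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
  curl -s --data-urlencode 'query=cluster_autoscaler_nodes_count{state="unready"}' \
  http://localhost:9090/api/v1/query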

I reckon this might trigger during periods when the workload constantly increases/decreases, resulting in constant scale in/out of the cluster.

An "Autoscaler nodes" is always backed by a machine resource. Since we have particular alerts for machines orthogonal to the autoscaler https://github.com/openshift/machine-api-operator/blob/master/install/0000_90_machine-api-operator_04_alertrules.yaml#L23 which covers machines not getting to become nodes and I reckon there must be kubelet/node healthiness alerts owned by node team, we might want to consider dropping the one created by the autoscaler if it's introducing confusion.

Comment 7 Milind Yadav 2020-04-20 04:13:47 UTC
Created attachment 1680189 [details]
Alerting message

Does this look good as per the pull request, and from the reporter's point of view, is this what they are looking for? I am not sure how to get the alert on a Slack channel.

Comment 8 Milind Yadav 2020-04-20 04:17:47 UTC
Additional info to add to comment #7.

Validated on :
[miyadav@miyadav Jira]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-04-18-184707   True        False         38m     Cluster version is 4.5.0-0.nightly-2020-04-18-184707


Steps :
1. Edited the machineset to add a new machine to the cluster (an equivalent oc scale command is sketched at the end of this comment):
[miyadav@miyadav Jira]$ oc edit machineset miyadav-2004-czb9n-worker
machineset.machine.openshift.io/miyadav-2004-czb9n-worker edited

Actual & Expected : machineset edited successfully.

2. Checked the status of the machines after 20 mins; the new machine still had no node attached to it:
NAME                              PHASE          TYPE   REGION   ZONE   AGE   NODE   PROVIDERID   STATE
miyadav-2004-czb9n-worker-7zjwx   Provisioning                          37m                       
[miyadav@miyadav Jira]$ 

Actual: attached a screenshot of the alerting message.

Need to get a review of these.
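
As mentioned in step 1, a minimal sketch of an equivalent way to grow the machineset from the CLI (the machineset name is taken from the output above; the openshift-machine-api namespace and the target replica count are assumptions):

# Hedged sketch: scale the machineset up by one replica instead of editing it by
# hand, then watch for the new machine to appear and (eventually) join as a node.
oc -n openshift-machine-api scale machineset miyadav-2004-czb9n-worker --replicas=2
oc -n openshift-machine-api get machines -w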

Comment 9 Milind Yadav 2020-04-20 08:24:10 UTC
As per comment #1, and since the ClusterAutoscalerNodesNotReady alert is not shown (comment #2):

Changing the status to VERIFIED

Comment 10 Marcel Härri 2020-05-12 12:59:42 UTC
As a side note, this also happens in 4.4 when you deploy your cluster with 2 workers, then later add an autoscaler for that machineset but never delete the original machines:


I0512 12:41:21.270671       1 static_autoscaler.go:269] 2 unregistered nodes present
I0512 12:41:21.270699       1 static_autoscaler.go:528] Removing unregistered node azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-5vb8b
W0512 12:41:21.270722       1 static_autoscaler.go:544] Failed to remove node azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-5vb8b: node group min size reached, skipping unregistered node removal
I0512 12:41:21.270730       1 static_autoscaler.go:528] Removing unregistered node azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-d5kzk
W0512 12:41:21.270750       1 static_autoscaler.go:544] Failed to remove node azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-d5kzk: node group min size reached, skipping unregistered node removal
I0512 12:41:21.271526       1 static_autoscaler.go:343] No unschedulable pods
I0512 12:41:21.271653       1 pre_filtering_processor.go:66] Skipping foo-worker-switzerlandnorth-5vb8b - node group min size reached
I0512 12:41:21.271701       1 pre_filtering_processor.go:66] Skipping foo-worker-switzerlandnorth-d5kzk - node group min size reached
I0512 12:41:21.272612       1 scale_down.go:776] No candidates for scale down
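
A quick way to confirm this situation is to compare the MachineAutoscaler bounds with the machinesets' current replica counts; "node group min size reached" in the log above means the autoscaler will not remove the unregistered machines because that would drop the node group below its minimum. A minimal sketch, assuming the standard openshift-machine-api namespace:

# Hedged sketch: list the autoscaler bounds, the machinesets they target, and the
# machines behind them, to see which machines the autoscaler considers unregistered.
oc -n openshift-machine-api get machineautoscaler
oc -n openshift-machine-api get machineset
oc -n openshift-machine-api get machines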


Could we have this backported to 4.4 as well?

Comment 12 errata-xmlrpc 2020-08-04 18:05:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

