Bug 1813069 - How to handle alert ClusterAutoscalerNodesNotReady
Summary: How to handle alert ClusterAutoscalerNodesNotReady
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.5.0
Assignee: Alberto
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-12 20:44 UTC by Hongkai Liu
Modified: 2020-08-04 18:05 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-04 18:05:16 UTC
Target Upstream Version:
Embargoed:
miyadav: needinfo+


Attachments
alert.KubeCPUOvercommit (262.86 KB, image/png), uploaded 2020-03-12 20:44 UTC by Hongkai Liu
Alerting message (223.29 KB, image/png), uploaded 2020-04-20 04:13 UTC by Milind Yadav


Links
Github openshift/cluster-autoscaler-operator pull 139 (closed): BUG 1813069: Drop ClusterAutoscalerNodesNotReady alert (last updated 2021-02-05 14:11:57 UTC)
Red Hat Product Errata RHBA-2020:2409 (last updated 2020-08-04 18:05:18 UTC)

Description Hongkai Liu 2020-03-12 20:44:25 UTC
Created attachment 1669779 [details]
alert.KubeCPUOvercommit

Alertmanager is set up on a CI build farm cluster (OCP 4.3) to send notifications to Slack.

Recently the following alerts fired, and I am not sure how I should debug/fix them.

[FIRING:1] ClusterAutoscalerNodesNotReady cluster-autoscaler-default (metrics 10.128.0.23:8085 openshift-machine-api cluster-autoscaler-default-8776dcb6c-ml4wc openshift-monitoring/k8s cluster-autoscaler-default warning unready)
Cluster Autoscaler has 1 unready nodes
https://coreos.slack.com/archives/CV1UZU53R/p1583952670051400


[FIRING:1] KubeCPUOvercommit (openshift-monitoring/k8s warning)
Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.
https://coreos.slack.com/archives/CV1UZU53R/p1584031232061900


It would be nice to have documentation covering:

* What is the timing for the autoscaler to trigger the ClusterAutoscalerNodesNotReady alert?
* How is ClusterAutoscalerNodesNotReady related to KubeCPUOvercommit? Do they share similar timing?
* Are we supposed to silence these alerts when the autoscaler is working properly (since the autoscaler should add more nodes when the cluster is short on CPU)? If not, what should the debugging procedure be?

Another (not very related) issue:

There are 2 alerts with the same name "KubeCPUOvercommit". See the screenshot.
Is that intended?

Comment 1 Hongkai Liu 2020-03-12 21:28:11 UTC
To isolate the issues:
This bug is for ClusterAutoscalerNodesNotReady only.

Will file another one for KubeCPUOvercommit.

Comment 2 Alberto 2020-03-13 09:02:27 UTC
cluster_autoscaler_nodes_count records the total number of "Autoscaler nodes" (i.e. expected candidates to join the cluster), labeled by node state. The possible states are ready, unready, and notStarted:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/metrics.md#cluster-state

The alert triggers if there is at least one node with a state other than "ready" in cluster_autoscaler_nodes_count for more than 20 minutes: https://github.com/openshift/cluster-autoscaler-operator/blob/8e6f95038c9eee84ef7e305e2e1f4960c918b30d/pkg/controller/clusterautoscaler/monitoring.go#L184
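
If you want to see the exact expression and duration on a live cluster, here is a minimal sketch (assuming the standard openshift-machine-api and openshift-monitoring namespaces, the usual prometheus-k8s-0 pod / prometheus container naming, and curl being available in that container):

# Hedged sketch: inspect the installed rule and query the metric behind it.
# The state="unready" selector mirrors the "unready" label in the firing alert
# quoted in the description; the authoritative expression is in the monitoring.go
# linked above.
oc -n openshift-machine-api get prometheusrules -o yaml | grep -B2 -A6 ClusterAutoscalerNodesNotReady
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
  curl -s --data-urlencode 'query=cluster_autoscaler_nodes_count{state="unready"}' \
  http://localhost:9090/api/v1/query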

I reckon this might trigger during periods when the workload constantly increases/decreases, resulting in constant scale in/out of the cluster.

An "Autoscaler nodes" is always backed by a machine resource. Since we have particular alerts for machines orthogonal to the autoscaler https://github.com/openshift/machine-api-operator/blob/master/install/0000_90_machine-api-operator_04_alertrules.yaml#L23 which covers machines not getting to become nodes and I reckon there must be kubelet/node healthiness alerts owned by node team, we might want to consider dropping the one created by the autoscaler if it's introducing confusion.

Comment 7 Milind Yadav 2020-04-20 04:13:47 UTC
Created attachment 1680189 [details]
Alerting message

Does this look good as per the pull request, and from the reporter's point of view, is this what they are looking for? I am not sure how to get the alert on a Slack channel.

Comment 8 Milind Yadav 2020-04-20 04:17:47 UTC
Additional info to add to comment #7.

Validated on :
[miyadav@miyadav Jira]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-04-18-184707   True        False         38m     Cluster version is 4.5.0-0.nightly-2020-04-18-184707


Steps :
1. Edited the machineset to add a new machine to the cluster (an equivalent oc scale command is sketched at the end of this comment):
[miyadav@miyadav Jira]$ oc edit machineset miyadav-2004-czb9n-worker
machineset.machine.openshift.io/miyadav-2004-czb9n-worker edited

Actual & Expected : machineset edited successfully.

2. Checked the status of the machines after 20 mins; the new machine still had no node attached to it:
NAME                              PHASE          TYPE   REGION   ZONE   AGE   NODE   PROVIDERID   STATE
miyadav-2004-czb9n-worker-7zjwx   Provisioning                          37m                       
[miyadav@miyadav Jira]$ 

Actual: attached a screenshot of the alerting message.

Need to get a review of these.
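
As mentioned in step 1, a minimal sketch of an equivalent way to grow the machineset from the CLI (the machineset name is taken from the output above; the openshift-machine-api namespace and the target replica count are assumptions):

# Hedged sketch: scale the machineset up by one replica instead of editing it by
# hand, then watch for the new machine to appear and (eventually) join as a node.
oc -n openshift-machine-api scale machineset miyadav-2004-czb9n-worker --replicas=2
oc -n openshift-machine-api get machines -w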

Comment 9 Milind Yadav 2020-04-20 08:24:10 UTC
As per comment #1, and since the ClusterAutoscalerNodesNotReady alert is not shown (comment #2):

Changing the status to VERIFIED

Comment 10 Marcel Härri 2020-05-12 12:59:42 UTC
As a side note, this also happens in 4.4 when you deploy your cluster with 2 workers, then later add an autoscaler for that machineset but never delete the original machines:


I0512 12:41:21.270671       1 static_autoscaler.go:269] 2 unregistered nodes present
I0512 12:41:21.270699       1 static_autoscaler.go:528] Removing unregistered node azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-5vb8b
W0512 12:41:21.270722       1 static_autoscaler.go:544] Failed to remove node azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-5vb8b: node group min size reached, skipping unregistered node removal
I0512 12:41:21.270730       1 static_autoscaler.go:528] Removing unregistered node azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-d5kzk
W0512 12:41:21.270750       1 static_autoscaler.go:544] Failed to remove node azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-d5kzk: node group min size reached, skipping unregistered node removal
I0512 12:41:21.271526       1 static_autoscaler.go:343] No unschedulable pods
I0512 12:41:21.271653       1 pre_filtering_processor.go:66] Skipping foo-worker-switzerlandnorth-5vb8b - node group min size reached
I0512 12:41:21.271701       1 pre_filtering_processor.go:66] Skipping foo-worker-switzerlandnorth-d5kzk - node group min size reached
I0512 12:41:21.272612       1 scale_down.go:776] No candidates for scale down
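
A quick way to confirm this situation is to compare the MachineAutoscaler bounds with the machinesets' current replica counts; "node group min size reached" in the log above means the autoscaler will not remove the unregistered machines because that would drop the node group below its minimum. A minimal sketch, assuming the standard openshift-machine-api namespace:

# Hedged sketch: list the autoscaler bounds, the machinesets they target, and the
# machines behind them, to see which machines the autoscaler considers unregistered.
oc -n openshift-machine-api get machineautoscaler
oc -n openshift-machine-api get machineset
oc -n openshift-machine-api get machines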


Could we have this backported to 4.4 as well?

Comment 12 errata-xmlrpc 2020-08-04 18:05:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

