Bug 1813069

Summary: How to handle alert ClusterAutoscalerNodesNotReady
Product: OpenShift Container Platform
Component: Cloud Compute
Sub component: Other Providers
Version: 4.3.0
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Severity: low
Priority: unspecified
Status: CLOSED ERRATA
Reporter: Hongkai Liu <hongkliu>
Assignee: Alberto <agarcial>
QA Contact: Jianwei Hou <jhou>
CC: agarcial, aos-bugs, hongkliu, jokerman, mharri
Flags: miyadav: needinfo+
Type: Bug
Last Closed: 2020-08-04 18:05:16 UTC
Attachments:
* alert.KubeCPUOvercommit
* Alerting message

Description Hongkai Liu 2020-03-12 20:44:25 UTC
Created attachment 1669779
alert.KubeCPUOvercommit

AlertManager is set up on a CI build farm cluster (OCP 4.3) to send notifications to Slack.

Recently, alerts have been firing and I am not sure how to debug or fix them.

[FIRING:1] ClusterAutoscalerNodesNotReady cluster-autoscaler-default (metrics 10.128.0.23:8085 openshift-machine-api cluster-autoscaler-default-8776dcb6c-ml4wc openshift-monitoring/k8s cluster-autoscaler-default warning unready)
Cluster Autoscaler has 1 unready nodes
https://coreos.slack.com/archives/CV1UZU53R/p1583952670051400


[FIRING:1] KubeCPUOvercommit (openshift-monitoring/k8s warning)
Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.
https://coreos.slack.com/archives/CV1UZU53R/p1584031232061900


It would be nice if there were documentation covering:

* What is the timing for the autoscaler to trigger the ClusterAutoscalerNodesNotReady alert?
* How is ClusterAutoscalerNodesNotReady related to KubeCPUOvercommit? Do they share similar timing?
* Are we supposed to silence these alerts when the autoscaler is working properly (since the autoscaler should add more nodes when the cluster is short on CPU)? If not, what should the debugging procedure be? (See the silencing sketch after this list.)
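For reference, if silencing does turn out to be appropriate, one way to do it is with amtool pointed at the cluster's Alertmanager; the URL below is a placeholder, not the build farm's real endpoint:

amtool silence add alertname=ClusterAutoscalerNodesNotReady \
  --alertmanager.url=https://alertmanager.example.com \
  --duration=2h \
  --comment="autoscaler is actively scaling; investigating unready nodes"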

Another (not very related) issue:

There are 2 alerts with the same name "KubeCPUOvercommit". See the screenshot.
Is this intended?

Comment 1 Hongkai Liu 2020-03-12 21:28:11 UTC
Isolating the issue: this bug is for ClusterAutoscalerNodesNotReady only.

Will file another one for KubeCPUOvercommit.

Comment 2 Alberto 2020-03-13 09:02:27 UTC
cluster_autoscaler_nodes_count records the total number of "Autoscaler nodes" (i.e. expected candidates to join the cluster), labeled by node state. Possible states are ready, unready, and notStarted:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/metrics.md#cluster-state
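For reference, the per-state counts can be inspected in the cluster's Prometheus (e.g. via the prometheus-k8s route in openshift-monitoring) with a query like the following, using the metric and state label documented in the link above:

sum by (state) (cluster_autoscaler_nodes_count)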

The alert triggers if there is at least one node labeled with a state other than "ready" in the nodes_count metric for more than 20 minutes: https://github.com/openshift/cluster-autoscaler-operator/blob/8e6f95038c9eee84ef7e305e2e1f4960c918b30d/pkg/controller/clusterautoscaler/monitoring.go#L184
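For illustration, a minimal sketch of what such a rule looks like in Prometheus alerting-rule form, assuming the 20-minute window and state label described above; the linked monitoring.go is the authoritative definition and its exact expression and labels may differ:

- alert: ClusterAutoscalerNodesNotReady
  expr: cluster_autoscaler_nodes_count{state!="ready"} > 0
  for: 20m
  labels:
    severity: warning
  annotations:
    message: Cluster Autoscaler has {{ $value }} unready nodes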

I reckon this might trigger during periods of constantly increasing/decreasing workload, resulting in constant scale-in/out of the cluster.

An "Autoscaler nodes" is always backed by a machine resource. Since we have particular alerts for machines orthogonal to the autoscaler https://github.com/openshift/machine-api-operator/blob/master/install/0000_90_machine-api-operator_04_alertrules.yaml#L23 which covers machines not getting to become nodes and I reckon there must be kubelet/node healthiness alerts owned by node team, we might want to consider dropping the one created by the autoscaler if it's introducing confusion.

Comment 7 Milind Yadav 2020-04-20 04:13:47 UTC
Created attachment 1680189
Alerting message

Does this look good as per the pull request, and from the reporter's point of view, is this what they are looking for? I am not sure how to get the alert on a Slack channel.

Comment 8 Milind Yadav 2020-04-20 04:17:47 UTC
Additional info to add to comment #7:

Validated on:
[miyadav@miyadav Jira]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-04-18-184707   True        False         38m     Cluster version is 4.5.0-0.nightly-2020-04-18-184707


Steps:
1. Edited the machineset to add a new machine to the cluster:
[miyadav@miyadav Jira]$ oc edit machineset miyadav-2004-czb9n-worker
machineset.machine.openshift.io/miyadav-2004-czb9n-worker edited

Actual & Expected : machineset edited successfully.

2. Checked the status of the machines after 20 minutes; the new machine still had no node attached:
NAME                              PHASE          TYPE   REGION   ZONE   AGE   NODE   PROVIDERID   STATE
miyadav-2004-czb9n-worker-7zjwx   Provisioning                          37m                       
[miyadav@miyadav Jira]$ 

Actual: attached a snapshot of the alert message.

Need to get a review of these.
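As a hedged aside, the alert state can also be confirmed directly in the Prometheus UI (prometheus-k8s route in the openshift-monitoring namespace) via the built-in ALERTS metric, whose alertstate label shows whether the alert is pending or firing:

ALERTS{alertname="ClusterAutoscalerNodesNotReady"}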

Comment 9 Milind Yadav 2020-04-20 08:24:10 UTC
As per comment #1, and since the alert ClusterAutoscalerNodesNotReady is not shown (see comment #2), changing the status to VERIFIED.

Comment 10 Marcel Härri 2020-05-12 12:59:42 UTC
As a side note, this also happens in 4.4 when you deploy your cluster with 2 workers, then later add an autoscaler for that machineset but never delete the original machines:


I0512 12:41:21.270671       1 static_autoscaler.go:269] 2 unregistered nodes present
I0512 12:41:21.270699       1 static_autoscaler.go:528] Removing unregistered node azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-5vb8b
W0512 12:41:21.270722       1 static_autoscaler.go:544] Failed to remove node azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-5vb8b: node group min size reached, skipping unregistered node removal
I0512 12:41:21.270730       1 static_autoscaler.go:528] Removing unregistered node azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-d5kzk
W0512 12:41:21.270750       1 static_autoscaler.go:544] Failed to remove node azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-d5kzk: node group min size reached, skipping unregistered node removal
I0512 12:41:21.271526       1 static_autoscaler.go:343] No unschedulable pods
I0512 12:41:21.271653       1 pre_filtering_processor.go:66] Skipping foo-worker-switzerlandnorth-5vb8b - node group min size reached
I0512 12:41:21.271701       1 pre_filtering_processor.go:66] Skipping foo-worker-switzerlandnorth-d5kzk - node group min size reached
I0512 12:41:21.272612       1 scale_down.go:776] No candidates for scale down
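For context, the "node group min size reached" messages come from the autoscaler refusing to shrink the node group below its configured minimum, so the unregistered pre-existing machines are never removed. A hedged sketch of the resource that sets those bounds, with placeholder names and values inferred from the log above:

apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: foo-worker-switzerlandnorth      # placeholder, matching the log's machineset
  namespace: openshift-machine-api
spec:
  minReplicas: 2                         # the two pre-existing workers pin the group
                                         # at its minimum, blocking their removal
  maxReplicas: 6                         # illustrative value
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: foo-worker-switzerlandnorth    # placeholder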


Could we have this backported to 4.4 as well?

Comment 12 errata-xmlrpc 2020-08-04 18:05:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409