Created attachment 1669779 [details]
alert.KubeCPUOvercommit

AlertManager is set up on a CI build farm cluster (OCP 4.3) to send notifications to Slack. Recently the following alerts fired and I am not sure how to debug/fix them.

[FIRING:1] ClusterAutoscalerNodesNotReady cluster-autoscaler-default (metrics 10.128.0.23:8085 openshift-machine-api cluster-autoscaler-default-8776dcb6c-ml4wc openshift-monitoring/k8s cluster-autoscaler-default warning unready)
Cluster Autoscaler has 1 unready nodes
https://coreos.slack.com/archives/CV1UZU53R/p1583952670051400

[FIRING:1] KubeCPUOvercommit (openshift-monitoring/k8s warning)
Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.
https://coreos.slack.com/archives/CV1UZU53R/p1584031232061900

It would be nice to have documentation covering:
* What is the timing for the autoscaler to trigger ClusterAutoscalerNodesNotReady?
* How is ClusterAutoscalerNodesNotReady related to KubeCPUOvercommit? Do they share similar timing?
* Are we supposed to silence these alerts when the autoscaler works properly (since the autoscaler should add more nodes when the cluster is short of CPU)? If not, what should the debugging procedure be?

Another (not very related) issue: there are 2 alerts with the same name "KubeCPUOvercommit". See the attached snapshot. Is this intended?
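For context, this is roughly the first-pass triage I would run when the alert fires (a sketch; the namespace and deployment name are inferred from the pod name in the alert above, not confirmed against this cluster):

# Any node NotReady, or any machine stuck without a node?
oc get nodes
oc -n openshift-machine-api get machines
# What the autoscaler itself thinks is going on:
oc -n openshift-machine-api logs deployment/cluster-autoscaler-default | tail -n 50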
Isolating the issue: this bug is for ClusterAutoscalerNodesNotReady only. I will file another one for KubeCPUOvercommit.
cluster_autoscaler_nodes_count records the total number of "autoscaler nodes" (i.e., expected candidates to join the cluster), labeled by node state. Possible states are ready, unready, and notStarted:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/metrics.md#cluster-state

The alert triggers if at least one node in the nodes_count metric is labeled with a state other than "ready" for more than 20 minutes:
https://github.com/openshift/cluster-autoscaler-operator/blob/8e6f95038c9eee84ef7e305e2e1f4960c918b30d/pkg/controller/clusterautoscaler/monitoring.go#L184

I reckon this might trigger during periods of constantly increasing/decreasing workload, which results in constant scale in/out of the cluster.

An autoscaler node is always backed by a machine resource. We already have dedicated alerts for machines, orthogonal to the autoscaler:
https://github.com/openshift/machine-api-operator/blob/master/install/0000_90_machine-api-operator_04_alertrules.yaml#L23
which cover machines that never become nodes, and I reckon there must be kubelet/node healthiness alerts owned by the node team, so we might want to consider dropping the alert created by the autoscaler if it's introducing confusion.
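A quick way to look at the raw metric and the installed rule (a sketch; the 8085 metrics port and deployment name are taken from the alert labels in the report above, and the assumption that the PrometheusRule lives in openshift-machine-api is mine):

# Look at the per-state counters the alert is built on:
oc -n openshift-machine-api port-forward deploy/cluster-autoscaler-default 8085 &
sleep 2
curl -s http://localhost:8085/metrics | grep cluster_autoscaler_nodes_count
# Check the rule the operator actually installed, including its "for" duration:
oc -n openshift-machine-api get prometheusrule -o yaml | grep -A5 ClusterAutoscalerNodesNotReady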
Created attachment 1680189 [details]
Alerting message

Does this look good per the pull request, and from the reporter's point of view is it what they are looking for? I am not sure how to get the alert on the Slack channel.
Additional info to add to comment #7.

Validated on:
[miyadav@miyadav Jira]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-04-18-184707   True        False         38m     Cluster version is 4.5.0-0.nightly-2020-04-18-184707

Steps:
1. Edited the machineset to add a new machine to the cluster:
[miyadav@miyadav Jira]$ oc edit machineset miyadav-2004-czb9n-worker
machineset.machine.openshift.io/miyadav-2004-czb9n-worker edited
Actual & expected: machineset edited successfully.

2. Checked the status of the machines after 20 minutes; the new machine still had no node attached to it:
NAME                              PHASE          TYPE   REGION   ZONE   AGE   NODE   PROVIDERID   STATE
miyadav-2004-czb9n-worker-7zjwx   Provisioning                          37m

Actual: attached snap of the alert message. Need to get a review of these.
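For anyone else reproducing this, the same condition can also be set up by scaling the MachineSet instead of editing it by hand (a sketch, reusing the machineset name from the output above; the replica count of 2 is an assumption, use current replicas + 1):

# Equivalent to step 1: add a machine by scaling the machineset
oc -n openshift-machine-api scale machineset miyadav-2004-czb9n-worker --replicas=2
# Watch whether the new machine ever gets a node; the alert should only fire
# once it has been stuck past the rule's "for" window (~20 minutes)
oc -n openshift-machine-api get machines -w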
As per comment #1, and since the alert ClusterAutoscalerNodesNotReady is no longer shown (comment #2), changing the status to VERIFIED.
As a side note, this also happens in 4.4 when you deploy your cluster with 2 workers and later add an autoscaler for that machineset but never delete the original machines:

I0512 12:41:21.270671       1 static_autoscaler.go:269] 2 unregistered nodes present
I0512 12:41:21.270699       1 static_autoscaler.go:528] Removing unregistered node azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-5vb8b
W0512 12:41:21.270722       1 static_autoscaler.go:544] Failed to remove node azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-5vb8b: node group min size reached, skipping unregistered node removal
I0512 12:41:21.270730       1 static_autoscaler.go:528] Removing unregistered node azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-d5kzk
W0512 12:41:21.270750       1 static_autoscaler.go:544] Failed to remove node azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-d5kzk: node group min size reached, skipping unregistered node removal
I0512 12:41:21.271526       1 static_autoscaler.go:343] No unschedulable pods
I0512 12:41:21.271653       1 pre_filtering_processor.go:66] Skipping foo-worker-switzerlandnorth-5vb8b - node group min size reached
I0512 12:41:21.271701       1 pre_filtering_processor.go:66] Skipping foo-worker-switzerlandnorth-d5kzk - node group min size reached
I0512 12:41:21.272612       1 scale_down.go:776] No candidates for scale down

Could we have this backported to 4.4 as well?
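In case it helps anyone hitting the same thing, this is a rough way to see what the autoscaler is working with in that scenario (a sketch; the namespace is the standard one, everything else depends on your machineset/autoscaler names):

oc -n openshift-machine-api get machineautoscaler     # configured min/max replicas
oc -n openshift-machine-api get machineset
oc -n openshift-machine-api get machines -o wide      # which machines are (not) backing a node
oc -n openshift-machine-api logs deployment/cluster-autoscaler-default | grep -i unregistered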
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409