Bug 1997396
| Summary: | No alerts have triggered for CPU and Memory limit with Cluster Autoscaler | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Yihao Guo <yihguo> |
| Component: | Cloud Compute | Assignee: | Michael McCune <mimccune> |
| Cloud Compute sub component: | Cluster Autoscaler | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED ERRATA | Docs Contact: | Jeana Routh <jrouth> |
| Severity: | medium | | |
| Priority: | medium | CC: | aos-bugs, mfedosin, mimccune, zhsun |
| Version: | 4.9 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |

Doc Text:
* Previously, the cluster autoscaler metrics for cluster CPU and memory usage would never reach, or exceed, the limits set by the `ClusterAutoscaler` resource. As a result, no alerts were fired when the cluster autoscaler could not scale due to resource limitations. With this release, a new metric called `cluster_autoscaler_skipped_scale_events_count` is added to the cluster autoscaler to more accurately detect when resource limits are reached or exceeded. Alerts will now fire when the cluster autoscaler is unable to scale the cluster up because it has reached the cluster resource limits. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1997396[*BZ#1997396*])

| Story Points: | --- | | |
|---|---|---|---|
| Clone Of: | | Environment: | |
| Last Closed: | 2023-01-17 19:46:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description  Yihao Guo  2021-08-25 07:16:21 UTC
looking at how we are processing these metrics to create the alert, i have a feeling we might have a logic issue. the alerts will fire if the resource >= resource_maximum, but if the autoscaler prevents the action from happening then this alert will never fire. we might need to redesign how these alerts are working, i will talk with the team to see if there are some thoughts about the best path forward.
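In PromQL terms, the pre-fix CPU condition described above amounts to something like the following. This is a paraphrase of the logic in this comment, not necessarily the exact expression shipped by the operator:

    # Paraphrased pre-fix alert condition: fire when current cluster CPU
    # capacity reaches the configured maximum. ignoring(direction) is only
    # there to make the two series comparable in PromQL.
    cluster_autoscaler_cluster_cpu_current_cores
      >= ignoring(direction) cluster_autoscaler_cpu_limits_cores{direction="maximum"}

Because the autoscaler caps any scale-up that would cross the maximum, the current-cores gauge stalls just below the limit, so this comparison never becomes true; the reproduction below shows it stuck at 28 of 30 cores.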
Feel the alert should be fired when cluster_autoscaler_cluster_cpu_current_cores + one instance's cores > cluster_autoscaler_cpu_limits_cores{direction="maximum"}.

For example, when I set max cores to 30 and one instance has 4 cores: when we scale up, cluster_autoscaler_cluster_cpu_current_cores will be 28 and cluster_autoscaler_cpu_limits_cores is 30, so the alert will never fire. Not sure if that is the reason.

$ oc edit clusterautoscaler default
spec:
  resourceLimits:
    cores:
      max: 30
      min: 8

$ oc edit machineset zhsungp826-xv86z-worker-c
annotations:
  autoscaling.openshift.io/machineautoscaler: openshift-machine-api/machineautoscaler-c
  machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "8"
  machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "1"
  machine.openshift.io/memoryMb: "15360"
  machine.openshift.io/vCPU: "4"

I0826 06:10:21.402734 1 scale_up.go:726] Capping scale-up size due to limit for resource cpu
I0826 06:10:21.602614 1 scale_up.go:586] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsungp826-xv86z-worker-c 1->2 (max: 8)}]
I0826 06:10:21.602680 1 scale_up.go:675] Scale-up: setting group MachineSet/openshift-machine-api/zhsungp826-xv86z-worker-c size to 2
I0826 06:10:28.415182 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
W0826 06:10:28.425270 1 clusterapi_controller.go:455] Machine "zhsungp826-xv86z-worker-c-6t47n" has no providerID
I0826 06:10:28.425301 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 10.090666ms
W0826 06:10:33.058633 1 clusterapi_controller.go:455] Machine "zhsungp826-xv86z-worker-c-6t47n" has no providerID
I0826 06:10:33.825972 1 klogx.go:86] Pod openshift-machine-api/scale-up-6cc4bdd5db-8rghp is unschedulable
I0826 06:10:33.826002 1 klogx.go:86] Pod openshift-machine-api/scale-up-6cc4bdd5db-6b9hw is unschedulable

$ oc exec cluster-autoscaler-default-76cfbb67bd-vckpc -- curl -k -H "Authorization: Bearer `oc sa get-token prometheus-k8s -n openshift-monitoring`" -H "Content-type: application/json" http://10.129.0.47:8085/metrics | grep "cluster_autoscaler_cluster_cpu_current_cores"
# HELP cluster_autoscaler_cluster_cpu_current_cores [ALPHA] Current number of cores in the cluster, minus deleting nodes.
# TYPE cluster_autoscaler_cluster_cpu_current_cores gauge
cluster_autoscaler_cluster_cpu_current_cores 28

$ oc exec cluster-autoscaler-default-76cfbb67bd-vckpc -- curl -k -H "Authorization: Bearer `oc sa get-token prometheus-k8s -n openshift-monitoring`" -H "Content-type: application/json" http://10.129.0.47:8085/metrics | grep "cluster_autoscaler_cpu_limits_cores"
# HELP cluster_autoscaler_cpu_limits_cores [ALPHA] Minimum and maximum number of cores in the cluster.
# TYPE cluster_autoscaler_cpu_limits_cores gauge
cluster_autoscaler_cpu_limits_cores{direction="maximum"} 30
cluster_autoscaler_cpu_limits_cores{direction="minimum"} 8

(In reply to sunzhaohua from comment #2)
> Feel the alert should be fired when cluster_autoscaler_cluster_cpu_current_cores + one instance's cores > cluster_autoscaler_cpu_limits_cores{direction="maximum"}.

i had thought about that, but i think we will not be able to calculate it easily. there could be different instance types in the various node groups that the autoscaler knows about. i am investigating to see if there is a way we can pair this with the scale up metric to catch the failure reason, but i'm not sure we will be able to use that signal either.

> For example, when I set max cores to 30 and one instance has 4 cores: when we scale up, cluster_autoscaler_cluster_cpu_current_cores will be 28 and cluster_autoscaler_cpu_limits_cores is 30, so the alert will never fire. Not sure if that is the reason.

yes, you've got it exactly correct. the autoscaler won't let us go over the max, so the metric will never reach the max. this was an oversight during the design process, and i think we will need to redesign how this alert gets fired.

i have discussed with the team and done more code diving, and i think we are going to need to create a new metric or supplement the failed scale-up metric to solve this completely. the issue here is that the current metric for failed scale-ups is only focused on failures that can happen once the autoscaler has decided to make a scale-up, but the check for resource limits happens before the failure check. this means the autoscaler doesn't consider itself to have failed a scale-up when it is at its resource limit. i will need to discuss more with upstream to see if there are any constraints on adding a new metric that can address this situation, and perhaps others as well. i will report back here when i have some information about the result of that conversation.

we still need to have a deeper discussion with the upstream autoscaling community about adding this extra metric. i am hopeful that with a small PR and demonstration, the upstream community will accept the change. will post further details as they become available.

just adding a note here, i have not had a chance to demonstrate the new metric for the upstream sig. i think that we will need to replace the metric that this alert is based on with a new metric that tracks the number of failed machines due to resource limits. i am still working towards a solution with upstream.

Discussing this in our bug triage session, Mike is going to double check for an upstream issue and, if there isn't one, create it and try to have a conversation about appropriate fixes and how we can move forward.

Mike is preparing a patch to propose to the upstream; we are expecting an update within the next week or two.

i have posted the patch for this upstream. assuming it is accepted, we will need to cherry-pick it and create a new alert for this metric.
https://github.com/kubernetes/autoscaler/pull/5059

my patch has merged in the upstream but i don't think it will make the 1.25 release. my next step will be to make a carry patch for our autoscaler and then update the alerting rules.

my patch will be in the 1.25 release, which means we will pick it up for the next release. i am preparing a new alerts PR to update with the new values. i don't expect this will make it in for our 4.12 release, but it might be available in the first 4.12.z stream.
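As an illustration of what an alert over the new counter could look like, here is a PromQL sketch only. The metric name comes from the fix described in this bug, but the window, threshold, and any label filtering (for example on scale direction or limit reason) are assumptions, not the operator's actual rule:

    # Illustrative sketch: fire when the autoscaler has recently skipped
    # scale events because a cluster resource limit was reached. The real
    # rule may filter on direction/reason labels and use other thresholds.
    increase(cluster_autoscaler_skipped_scale_events_count[30m]) > 0

The verification later in this bug shows the alerts the operator actually ships, ClusterAutoscalerUnableToScaleCPULimitReached and ClusterAutoscalerUnableToScaleMemoryLimitReached, firing once the configured limits are hit.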
i have created a PR to address the problems here, but it will need our autoscaler rebase first.

i've found an error with the original implementation of this and have prepared a patch. i'm not sure if it is appropriate to move this back to POST status, but i am including the new PR link.
https://github.com/openshift/cluster-autoscaler-operator/pull/254

i'm moving this back to POST so that the automation will work with github.

Verified on clusterversion 4.12.0-0.nightly-2022-10-08-162647.

1. ClusterAutoscalerUnableToScaleCPULimitReached:

$ oc edit clusterautoscaler default
spec:
  resourceLimits:
    cores:
      max: 30
      min: 8

I1009 06:07:59.109055 1 scale_up.go:751] Capping scale-up size due to limit for resource cpu

$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq | grep ClusterAutoscaler
"alertname": "ClusterAutoscalerUnschedulablePods",
"alertname": "ClusterAutoscalerUnableToScaleCPULimitReached",
"description": "The number of total cores in the cluster has exceeded the maximum number set on the\ncluster autoscaler. This is calculated by summing the cpu capacity for all nodes in the cluster and comparing that number against the maximum cores value set for the\ncluster autoscaler (default 320000 cores). Limits can be adjusted by modifying the ClusterAutoscaler resource.",

2. ClusterAutoscalerUnableToScaleMemoryLimitReached:

$ oc edit clusterautoscaler default
spec:
  resourceLimits:
    memory:
      min: 4
      max: 110

I1009 06:49:47.688212 1 scale_up.go:751] Capping scale-up size due to limit for resource memory

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq | grep ClusterAutoscaler
"alertname": "ClusterAutoscalerUnschedulablePods",
"alertname": "ClusterAutoscalerUnableToScaleMemoryLimitReached",
"description": "The number of total bytes of RAM in the cluster has exceeded the maximum number set on\nthe cluster autoscaler. This is calculated by summing the memory capacity for all nodes in the cluster and comparing that number against the maximum memory bytes value set\nfor the cluster autoscaler (default 6400000 gigabytes). Limits can be adjusted by modifying the ClusterAutoscaler resource.",

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399