Description of problem:
These alerts will trigger off of metrics from the Cluster Autoscaler:

ClusterAutoscalerUnableToScaleCPULimitReached - fires when the Cluster Autoscaler is unable to add more nodes because the maximum CPU resource threshold has been reached.
ClusterAutoscalerUnableToScaleMemoryLimitReached - fires when the Cluster Autoscaler is unable to add more nodes because the maximum memory resource threshold has been reached.

The cluster autoscaler has added the cluster_autoscaler_cpu_limits_cores and cluster_autoscaler_memory_limits_bytes metrics, but no related alerts have been fired.

How reproducible: always

Steps to Reproduce:
1. Create a ClusterAutoscaler:

apiVersion: "autoscaling.openshift.io/v1alpha1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  resourceLimits:
    memory:
      min: 4
      max: 128
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s

2. Create a MachineAutoscaler:

apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  finalizers:
  - machinetarget.autoscaling.openshift.io
  name: machineautoscaler-b
  namespace: openshift-machine-api
spec:
  maxReplicas: 8
  minReplicas: 1
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: zhsun824-dj9cx-worker-a

3. Create a workload.
4. Check the autoscaler log and check for the alert.

Actual results:
The autoscaler reports "Capping scale-up size due to limit for resource memory", but no alert fires.

I0824 12:46:32.239610 1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsun824-dj9cx-worker-a
I0824 12:46:32.239683 1 scale_up.go:472] Estimated 4 nodes needed in MachineSet/openshift-machine-api/zhsun824-dj9cx-worker-a
I0824 12:46:32.239702 1 scale_up.go:726] Capping scale-up size due to limit for resource memory
I0824 12:46:32.435805 1 scale_up.go:586] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsun824-dj9cx-worker-a 2->3 (max: 8)}]
I0824 12:46:32.435844 1 scale_up.go:675] Scale-up: setting group MachineSet/openshift-machine-api/zhsun824-dj9cx-worker-a size to 3
I0824 12:46:44.055888 1 static_autoscaler.go:335] 2 unregistered nodes present
I0824 12:46:44.661240 1 klogx.go:86] Pod openshift-machine-api/scale-up-7b8b8658cf-w5wgx is unschedulable

$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq | grep ClusterAutoscaler
"alertname": "ClusterAutoscalerUnschedulablePods",

Expected results:
The ClusterAutoscalerUnableToScaleMemoryLimitReached alert fires.

Additional info:
Looking at how we are processing these metrics to create the alert, I suspect we have a logic issue. The alerts fire when resource >= resource_maximum, but if the autoscaler prevents the action from happening, the metric never reaches the maximum and the alert will never fire. We may need to redesign how these alerts work; I will talk with the team to see if there are thoughts about the best path forward.
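For context, the rules we ship are shaped roughly like the sketch below. The expression is reconstructed from the alert descriptions in this bug, not copied from the actual PrometheusRule, so treat the details as an assumption; it illustrates the flaw described above.

```yaml
# Sketch of the CPU-limit alert as described in this bug (reconstructed, not
# the shipped rule). Because the autoscaler caps scale-ups *before* the limit
# is exceeded, current cores can hover just below the maximum forever and the
# >= comparison never becomes true.
groups:
- name: cluster-autoscaler.rules
  rules:
  - alert: ClusterAutoscalerUnableToScaleCPULimitReached
    expr: |
      cluster_autoscaler_cluster_cpu_current_cores
        >= cluster_autoscaler_cpu_limits_cores{direction="maximum"}
    for: 15m
    labels:
      severity: warning
```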
I feel the alert should fire when cluster_autoscaler_cluster_cpu_current_cores + one instance's cores > cluster_autoscaler_cpu_limits_cores{direction="maximum"}.

For example, when I set max cores to 30 and one instance has 4 cores: when we scale up, cluster_autoscaler_cluster_cpu_current_cores will be 28 and cluster_autoscaler_cpu_limits_cores is 30, so the alert will never fire. Not sure if this is the reason.

$ oc edit clusterautoscaler default
spec:
  resourceLimits:
    cores:
      max: 30
      min: 8

$ oc edit machineset zhsungp826-xv86z-worker-c
annotations:
  autoscaling.openshift.io/machineautoscaler: openshift-machine-api/machineautoscaler-c
  machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "8"
  machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "1"
  machine.openshift.io/memoryMb: "15360"
  machine.openshift.io/vCPU: "4"

I0826 06:10:21.402734 1 scale_up.go:726] Capping scale-up size due to limit for resource cpu
I0826 06:10:21.602614 1 scale_up.go:586] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsungp826-xv86z-worker-c 1->2 (max: 8)}]
I0826 06:10:21.602680 1 scale_up.go:675] Scale-up: setting group MachineSet/openshift-machine-api/zhsungp826-xv86z-worker-c size to 2
I0826 06:10:28.415182 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
W0826 06:10:28.425270 1 clusterapi_controller.go:455] Machine "zhsungp826-xv86z-worker-c-6t47n" has no providerID
I0826 06:10:28.425301 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 10.090666ms
W0826 06:10:33.058633 1 clusterapi_controller.go:455] Machine "zhsungp826-xv86z-worker-c-6t47n" has no providerID
I0826 06:10:33.825972 1 klogx.go:86] Pod openshift-machine-api/scale-up-6cc4bdd5db-8rghp is unschedulable
I0826 06:10:33.826002 1 klogx.go:86] Pod openshift-machine-api/scale-up-6cc4bdd5db-6b9hw is unschedulable

$ oc exec cluster-autoscaler-default-76cfbb67bd-vckpc -- curl -k -H "Authorization: Bearer `oc sa get-token prometheus-k8s -n openshift-monitoring`" -H "Content-type: application/json" http://10.129.0.47:8085/metrics | grep "cluster_autoscaler_cluster_cpu_current_cores"
# HELP cluster_autoscaler_cluster_cpu_current_cores [ALPHA] Current number of cores in the cluster, minus deleting nodes.
# TYPE cluster_autoscaler_cluster_cpu_current_cores gauge
cluster_autoscaler_cluster_cpu_current_cores 28

$ oc exec cluster-autoscaler-default-76cfbb67bd-vckpc -- curl -k -H "Authorization: Bearer `oc sa get-token prometheus-k8s -n openshift-monitoring`" -H "Content-type: application/json" http://10.129.0.47:8085/metrics | grep "cluster_autoscaler_cpu_limits_cores"
# HELP cluster_autoscaler_cpu_limits_cores [ALPHA] Minimum and maximum number of cores in the cluster.
# TYPE cluster_autoscaler_cpu_limits_cores gauge
cluster_autoscaler_cpu_limits_cores{direction="maximum"} 30
cluster_autoscaler_cpu_limits_cores{direction="minimum"} 8
(In reply to sunzhaohua from comment #2)
> Feel the alert should be fired when
> cluster_autoscaler_cluster_cpu_current_cores + one instance cores >
> cluster_autoscaler_cpu_limits_cores{direction="maximum"}

I had thought about that, but I don't think we can calculate it easily: there could be different instance types across the various node groups the autoscaler knows about. I am investigating whether we can pair this with the scale-up metric to catch the failure reason, but I'm not sure we will be able to use that signal either.

> For example, when I set max cores as 30, one instance has 4 cores. When we
> scaleup, the cluster_autoscaler_cluster_cpu_current_cores will be 28, the
> cluster_autoscaler_cpu_limits_cores is 30, so the alert will never fire, not
> sure if it is the reason.

Yes, you've got it exactly right. The autoscaler won't let us go over the max, so the metric will never reach the max. This was an oversight during the design process, and I think we will need to redesign how this alert gets fired.
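To make the difficulty concrete, the "current + one instance > max" proposal would have to look something like the rule below, with the instance size hardcoded (4 cores, taken from the example in comment #2). The alert name and the 4-core constant are illustrative assumptions; with mixed instance types there is no metric exposing how many cores the next scale-up would add, which is why this shape doesn't generalize.

```yaml
# Hypothetical rule for the "current + one instance > max" proposal. The
# 4-core instance size is hardcoded from the example above; the autoscaler
# exposes no per-node-group instance-size metric to substitute for it.
- alert: ClusterAutoscalerCPUHeadroomExhausted
  expr: |
    (cluster_autoscaler_cluster_cpu_current_cores + 4)
      > cluster_autoscaler_cpu_limits_cores{direction="maximum"}
  for: 15m
```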
I have discussed this with the team and done more code diving, and I think we will need to create a new metric, or supplement the failed scale-up metric, to solve this completely. The issue is that the current metric for failed scale-ups only covers failures that can happen once the autoscaler has already decided to make a scale-up, but the check for resource limits happens before that failure check. This means the autoscaler does not consider itself to have failed a scale-up when it is at its resource limit. I will need to discuss with upstream whether there are constraints on adding a new metric that can address this situation, and perhaps others as well. I will report back here when I have the result of that conversation.
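For illustration, if upstream accepted a counter that increments whenever a scale-up is skipped or capped because of a resource limit, the alert could fire on increases of that counter rather than on a gauge threshold. The metric name and label values below are purely hypothetical at this point, not an agreed design.

```yaml
# Hypothetical alert over a not-yet-existing counter of scale-ups skipped due
# to resource limits. Metric name and label values are illustrative only.
- alert: ClusterAutoscalerUnableToScaleCPULimitReached
  expr: |
    increase(cluster_autoscaler_skipped_scale_events_count{reason="CpuResourceLimit"}[15m]) > 0
  labels:
    severity: warning
```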
We still need to have a deeper discussion with the upstream autoscaling community about adding this extra metric. I am hopeful that with a small PR and a demonstration, the upstream community will accept the change. I will post further details as they become available.
Just adding a note here: I have not yet had a chance to demonstrate the new metric for the upstream SIG. I think we will need to replace the metric that this alert is based on with a new metric that tracks the number of failed machines due to resource limits.
I am still working towards a solution with upstream.
Discussing this in our bug triage session: Mike is going to double-check for an upstream issue and, if there isn't one, create it and start a conversation about appropriate fixes and how we can move forward.
Mike is preparing a patch to propose upstream; we are expecting an update within the next week or two.
I have posted the patch for this upstream. Assuming it is accepted, we will need to cherry-pick it and create a new alert for this metric. https://github.com/kubernetes/autoscaler/pull/5059
My patch has merged upstream, but I don't think it will make the 1.25 release. My next step will be to make a carry patch for our autoscaler and then update the alerting rules.
My patch will be in the 1.25 release, which means we will pick it up for the next release. I am preparing a new alerts PR to update with the new values. I don't expect this will make it in for our 4.12 release, but it might be available in the first 4.12.z stream.
I have created a PR to address the problems here, but it will need our autoscaler rebase first.
I've found an error with the original implementation and have prepared a patch. I'm not sure if it is appropriate to move this back to POST status, but I am including the new PR link. https://github.com/openshift/cluster-autoscaler-operator/pull/254
I'm moving this back to POST so that the automation will work with GitHub.
Blocked by https://issues.redhat.com/browse/OCPBUGS-2121
Verified. clusterversion: 4.12.0-0.nightly-2022-10-08-162647

1. ClusterAutoscalerUnableToScaleCPULimitReached:

$ oc edit clusterautoscaler default
spec:
  resourceLimits:
    cores:
      max: 30
      min: 8

I1009 06:07:59.109055 1 scale_up.go:751] Capping scale-up size due to limit for resource cpu

$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq | grep ClusterAutoscaler
"alertname": "ClusterAutoscalerUnschedulablePods",
"alertname": "ClusterAutoscalerUnableToScaleCPULimitReached",
"description": "The number of total cores in the cluster has exceeded the maximum number set on the\ncluster autoscaler. This is calculated by summing the cpu capacity for all nodes in the cluster and comparing that number against the maximum cores value set for the\ncluster autoscaler (default 320000 cores). Limits can be adjusted by modifying the ClusterAutoscaler resource.",

2. ClusterAutoscalerUnableToScaleMemoryLimitReached:

$ oc edit clusterautoscaler default
spec:
  resourceLimits:
    memory:
      min: 4
      max: 110

I1009 06:49:47.688212 1 scale_up.go:751] Capping scale-up size due to limit for resource memory

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq | grep ClusterAutoscaler
"alertname": "ClusterAutoscalerUnschedulablePods",
"alertname": "ClusterAutoscalerUnableToScaleMemoryLimitReached",
"description": "The number of total bytes of RAM in the cluster has exceeded the maximum number set on\nthe cluster autoscaler. This is calculated by summing the memory capacity for all nodes in the cluster and comparing that number against the maximum memory bytes value set\nfor the cluster autoscaler (default 6400000 gigabytes). Limits can be adjusted by modifying the ClusterAutoscaler resource.",
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399