Bug 1997396
| Summary: | No alerts have triggered for CPU and Memory limit with Cluster Autoscaler | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Yihao Guo <yihguo> |
| Component: | Cloud Compute | Assignee: | Michael McCune <mimccune> |
| Cloud Compute sub component: | Cluster Autoscaler | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED ERRATA | Docs Contact: | Jeana Routh <jrouth> |
| Severity: | medium | | |
| Priority: | medium | CC: | aos-bugs, mfedosin, mimccune, zhsun |
| Version: | 4.9 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |

Doc Text:
* Previously, the cluster autoscaler metrics for cluster CPU and memory usage would never reach, or exceed, the limits set by the `ClusterAutoscaler` resource. As a result, no alerts were fired when the cluster autoscaler could not scale due to resource limitations. With this release, a new metric called `cluster_autoscaler_skipped_scale_events_count` is added to the cluster autoscaler to more accurately detect when resource limits are reached or exceeded. Alerts will now fire when the cluster autoscaler is unable to scale the cluster up because it has reached the cluster resource limits. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1997396[*BZ#1997396*])

| Story Points: | --- | | |
|---|---|---|---|
| Clone Of: | | Environment: | |
| Last Closed: | 2023-01-17 19:46:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description  Yihao Guo  2021-08-25 07:16:21 UTC
looking at how we are processing these metrics to create the alert, i have a feeling we might have a logic issue. the alerts will fire if the resource >= resource_maximum, but if the autoscaler prevents the action from happening then this alert will never fire. we might need to redesign how these alerts are working, i will talk with the team to see if there are some thoughts about the best path forward.
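In PromQL terms, the pre-fix CPU condition described above amounts to something like the following. This is a paraphrase of the logic in this comment, not necessarily the exact expression shipped by the operator:

    # Paraphrased pre-fix alert condition: fire when current cluster CPU
    # capacity reaches the configured maximum. ignoring(direction) is only
    # there to make the two series comparable in PromQL.
    cluster_autoscaler_cluster_cpu_current_cores
      >= ignoring(direction) cluster_autoscaler_cpu_limits_cores{direction="maximum"}

Because the autoscaler caps any scale-up that would cross the maximum, the current-cores gauge stalls just below the limit, so this comparison never becomes true; the reproduction below shows it stuck at 28 of 30 cores.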
Feel the alert should be fired when cluster_autoscaler_cluster_cpu_current_cores + one instance's cores > cluster_autoscaler_cpu_limits_cores{direction="maximum"}.

For example, when I set max cores to 30 and one instance has 4 cores: when we scale up, cluster_autoscaler_cluster_cpu_current_cores will be 28 and cluster_autoscaler_cpu_limits_cores is 30, so the alert will never fire. Not sure if that is the reason.

$ oc edit clusterautoscaler default
spec:
  resourceLimits:
    cores:
      max: 30
      min: 8

$ oc edit machineset zhsungp826-xv86z-worker-c
annotations:
  autoscaling.openshift.io/machineautoscaler: openshift-machine-api/machineautoscaler-c
  machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "8"
  machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "1"
  machine.openshift.io/memoryMb: "15360"
  machine.openshift.io/vCPU: "4"

I0826 06:10:21.402734 1 scale_up.go:726] Capping scale-up size due to limit for resource cpu
I0826 06:10:21.602614 1 scale_up.go:586] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsungp826-xv86z-worker-c 1->2 (max: 8)}]
I0826 06:10:21.602680 1 scale_up.go:675] Scale-up: setting group MachineSet/openshift-machine-api/zhsungp826-xv86z-worker-c size to 2
I0826 06:10:28.415182 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
W0826 06:10:28.425270 1 clusterapi_controller.go:455] Machine "zhsungp826-xv86z-worker-c-6t47n" has no providerID
I0826 06:10:28.425301 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 10.090666ms
W0826 06:10:33.058633 1 clusterapi_controller.go:455] Machine "zhsungp826-xv86z-worker-c-6t47n" has no providerID
I0826 06:10:33.825972 1 klogx.go:86] Pod openshift-machine-api/scale-up-6cc4bdd5db-8rghp is unschedulable
I0826 06:10:33.826002 1 klogx.go:86] Pod openshift-machine-api/scale-up-6cc4bdd5db-6b9hw is unschedulable

$ oc exec cluster-autoscaler-default-76cfbb67bd-vckpc -- curl -k -H "Authorization: Bearer `oc sa get-token prometheus-k8s -n openshift-monitoring`" -H "Content-type: application/json" http://10.129.0.47:8085/metrics | grep "cluster_autoscaler_cluster_cpu_current_cores"
# HELP cluster_autoscaler_cluster_cpu_current_cores [ALPHA] Current number of cores in the cluster, minus deleting nodes.
# TYPE cluster_autoscaler_cluster_cpu_current_cores gauge
cluster_autoscaler_cluster_cpu_current_cores 28

$ oc exec cluster-autoscaler-default-76cfbb67bd-vckpc -- curl -k -H "Authorization: Bearer `oc sa get-token prometheus-k8s -n openshift-monitoring`" -H "Content-type: application/json" http://10.129.0.47:8085/metrics | grep "cluster_autoscaler_cpu_limits_cores"
# HELP cluster_autoscaler_cpu_limits_cores [ALPHA] Minimum and maximum number of cores in the cluster.
# TYPE cluster_autoscaler_cpu_limits_cores gauge
cluster_autoscaler_cpu_limits_cores{direction="maximum"} 30
cluster_autoscaler_cpu_limits_cores{direction="minimum"} 8

(In reply to sunzhaohua from comment #2)
> Feel the alert should be fired when cluster_autoscaler_cluster_cpu_current_cores + one instance's cores > cluster_autoscaler_cpu_limits_cores{direction="maximum"}.

i had thought about that, but i think we will not be able to calculate it easily. there could be different instance types in the various node groups that the autoscaler knows about. i am investigating to see if there is a way we can pair this with the scale up metric to catch the failure reason, but i'm not sure we will be able to use that signal either.

> For example, when I set max cores to 30 and one instance has 4 cores: when we scale up, cluster_autoscaler_cluster_cpu_current_cores will be 28 and cluster_autoscaler_cpu_limits_cores is 30, so the alert will never fire. Not sure if that is the reason.

yes, you've got it exactly correct. the autoscaler won't let us go over the max, so the metric will never reach the max. this was an oversight during the design process, and i think we will need to redesign how this alert gets fired.

i have discussed with the team and done more code diving, and i think we are going to need to create a new metric or supplement the failed scale-up metric to solve this completely. the issue here is that the current metric for failed scale-ups is only focused on failures that can happen once the autoscaler has decided to make a scale-up, but the check for resource limits happens before the failure check. this means the autoscaler doesn't consider itself to have failed a scale-up when it is at its resource limit. i will need to discuss more with upstream to see if there are any constraints on adding a new metric that can address this situation, and perhaps others as well. i will report back here when i have some information about the result of that conversation.

we still need to have a deeper discussion with the upstream autoscaling community about adding this extra metric. i am hopeful that with a small PR and demonstration, the upstream community will accept the change. will post further details as they become available.

just adding a note here, i have not had a chance to demonstrate the new metric for the upstream sig. i think that we will need to replace the metric that this alert is based on with a new metric that tracks the number of failed machines due to resource limits. i am still working towards a solution with upstream.

Discussing this in our bug triage session, Mike is going to double check for an upstream issue and, if there isn't one, create it and try to have a conversation about appropriate fixes and how we can move forward.

Mike is preparing a patch to propose to the upstream; we are expecting an update within the next week or two.

i have posted the patch for this upstream. assuming it is accepted, we will need to cherry-pick it and create a new alert for this metric.
https://github.com/kubernetes/autoscaler/pull/5059

my patch has merged in the upstream but i don't think it will make the 1.25 release. my next step will be to make a carry patch for our autoscaler and then update the alerting rules.

my patch will be in the 1.25 release, which means we will pick it up for the next release. i am preparing a new alerts PR to update with the new values. i don't expect this will make it in for our 4.12 release, but it might be available in the first 4.12.z stream.
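As an illustration of what an alert over the new counter could look like, here is a PromQL sketch only. The metric name comes from the fix described in this bug, but the window, threshold, and any label filtering (for example on scale direction or limit reason) are assumptions, not the operator's actual rule:

    # Illustrative sketch: fire when the autoscaler has recently skipped
    # scale events because a cluster resource limit was reached. The real
    # rule may filter on direction/reason labels and use other thresholds.
    increase(cluster_autoscaler_skipped_scale_events_count[30m]) > 0

The verification later in this bug shows the alerts the operator actually ships, ClusterAutoscalerUnableToScaleCPULimitReached and ClusterAutoscalerUnableToScaleMemoryLimitReached, firing once the configured limits are hit.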
i have created a PR to address the problems here, but it will need our autoscaler rebase first.

i've found an error with the original implementation of this and have prepared a patch. i'm not sure if it is appropriate to move this back to POST status, but i am including the new PR link.
https://github.com/openshift/cluster-autoscaler-operator/pull/254

i'm moving this back to POST so that the automation will work with github.

Verified on clusterversion 4.12.0-0.nightly-2022-10-08-162647.

1. ClusterAutoscalerUnableToScaleCPULimitReached:

$ oc edit clusterautoscaler default
spec:
  resourceLimits:
    cores:
      max: 30
      min: 8

I1009 06:07:59.109055 1 scale_up.go:751] Capping scale-up size due to limit for resource cpu

$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq | grep ClusterAutoscaler
"alertname": "ClusterAutoscalerUnschedulablePods",
"alertname": "ClusterAutoscalerUnableToScaleCPULimitReached",
"description": "The number of total cores in the cluster has exceeded the maximum number set on the\ncluster autoscaler. This is calculated by summing the cpu capacity for all nodes in the cluster and comparing that number against the maximum cores value set for the\ncluster autoscaler (default 320000 cores). Limits can be adjusted by modifying the ClusterAutoscaler resource.",

2. ClusterAutoscalerUnableToScaleMemoryLimitReached:

$ oc edit clusterautoscaler default
spec:
  resourceLimits:
    memory:
      min: 4
      max: 110

I1009 06:49:47.688212 1 scale_up.go:751] Capping scale-up size due to limit for resource memory

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq | grep ClusterAutoscaler
"alertname": "ClusterAutoscalerUnschedulablePods",
"alertname": "ClusterAutoscalerUnableToScaleMemoryLimitReached",
"description": "The number of total bytes of RAM in the cluster has exceeded the maximum number set on\nthe cluster autoscaler. This is calculated by summing the memory capacity for all nodes in the cluster and comparing that number against the maximum memory bytes value set\nfor the cluster autoscaler (default 6400000 gigabytes). Limits can be adjusted by modifying the ClusterAutoscaler resource.",

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399