Description of problem:
Autoscaler alerts without summary and description annotations

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-11-22-195410

How reproducible:
Always

Steps to Reproduce:
liuhuali@Lius-MacBook-Pro ~ % oc create -f clusterautoscaler.yaml
clusterautoscaler.autoscaling.openshift.io/default created
liuhuali@Lius-MacBook-Pro ~ % cat clusterautoscaler.yaml
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  resourceLimits:
    memory:
      min: 4
      max: 128
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s
liuhuali@Lius-MacBook-Pro ~ % oc -n openshift-machine-api get prometheusrule
NAME                                    AGE
cluster-autoscaler-default              2m32s
machine-api-operator-prometheus-rules   62m
liuhuali@Lius-MacBook-Pro ~ % oc -n openshift-machine-api get prometheusrule cluster-autoscaler-default -o yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2021-11-23T14:44:28Z"
  generation: 1
  labels:
    prometheus: k8s
    role: alert-rules
  name: cluster-autoscaler-default
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: autoscaling.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterAutoscaler
    name: default
    uid: 057f8bf9-ec0b-4fc8-8ab2-a3afb40a0251
  resourceVersion: "37912"
  uid: 3a070375-df61-43e8-b9e2-acaf45dfcadd
spec:
  groups:
  - name: general.rules
    rules:
    - alert: ClusterAutoscalerUnschedulablePods
      annotations:
        message: Cluster Autoscaler has {{ $value }} unschedulable pods
      expr: cluster_autoscaler_unschedulable_pods_count{service="cluster-autoscaler-default"} > 0
      for: 20m
      labels:
        severity: info
    - alert: ClusterAutoscalerNotSafeToScale
      annotations:
        message: Cluster Autoscaler is reporting that the cluster is not ready for scaling
      expr: cluster_autoscaler_cluster_safe_to_autoscale{service="cluster-autoscaler-default"} != 1
      for: 15m
      labels:
        severity: warning
    - alert: ClusterAutoscalerUnableToScaleCPULimitReached
      annotations:
        message: Cluster Autoscaler has reached its CPU core limit and is unable to scale out
      expr: cluster_autoscaler_cluster_cpu_current_cores >= cluster_autoscaler_cpu_limits_cores{direction="maximum"}
      for: 15m
      labels:
        severity: info
    - alert: ClusterAutoscalerUnableToScaleMemoryLimitReached
      annotations:
        message: Cluster Autoscaler has reached its Memory bytes limit and is unable to scale out
      expr: cluster_autoscaler_cluster_memory_current_bytes >= cluster_autoscaler_memory_limits_bytes{direction="maximum"}
      for: 15m
      labels:
        severity: info
liuhuali@Lius-MacBook-Pro ~ %

Actual results:
Alerts without summary and description annotations:
- ClusterAutoscalerUnschedulablePods
- ClusterAutoscalerNotSafeToScale
- ClusterAutoscalerUnableToScaleCPULimitReached
- ClusterAutoscalerUnableToScaleMemoryLimitReached

Expected results:
Alerts MUST include summary and description annotations.

Additional info:
https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md
https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide
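
The expected result above can also be checked mechanically rather than by reading the YAML. The following is a minimal sketch, not part of the operator or its test suite: the function name `alerts_missing_annotations` and the `SAMPLE` data are hypothetical, and it assumes the PrometheusRule has already been loaded into a Python dict (for example from `oc ... -o json` output parsed with `json.load`).

```python
# Annotation keys that the alerting-consistency style guide requires on every alert.
REQUIRED_ANNOTATIONS = ("summary", "description")


def alerts_missing_annotations(rule):
    """Return {alert name: [missing annotation keys]} for a PrometheusRule dict.

    Hypothetical helper for illustration; 'rule' is the parsed object as
    returned by the API (top-level keys such as 'metadata' and 'spec').
    """
    missing = {}
    for group in rule.get("spec", {}).get("groups", []):
        for entry in group.get("rules", []):
            name = entry.get("alert")
            if name is None:
                continue  # recording rules have no 'alert' key, skip them
            annotations = entry.get("annotations") or {}
            absent = [key for key in REQUIRED_ANNOTATIONS if not annotations.get(key)]
            if absent:
                missing[name] = absent
    return missing


# Trimmed, hand-written sample: the first alert mirrors the reported state
# (only a 'message' annotation), the second carries placeholder annotations.
SAMPLE = {
    "kind": "PrometheusRule",
    "spec": {
        "groups": [
            {
                "name": "general.rules",
                "rules": [
                    {
                        "alert": "ClusterAutoscalerUnschedulablePods",
                        "annotations": {
                            "message": "Cluster Autoscaler has {{ $value }} unschedulable pods"
                        },
                        "for": "20m",
                        "labels": {"severity": "info"},
                    },
                    {
                        "alert": "ClusterAutoscalerNotSafeToScale",
                        "annotations": {
                            "summary": "placeholder summary",
                            "description": "placeholder description",
                        },
                        "for": "15m",
                        "labels": {"severity": "warning"},
                    },
                ],
            }
        ]
    },
}

if __name__ == "__main__":
    for alert, keys in alerts_missing_annotations(SAMPLE).items():
        print(f"{alert}: missing {', '.join(keys)}")
```

An empty result means every alert in the rule carries both annotations; the check deliberately ignores the legacy `message` annotation, since its presence does not satisfy the requirement.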

Set up a cluster using cluster-bot with https://github.com/openshift/cluster-autoscaler-operator/pull/233

Verified that the four alerts now include summary and description annotations:
- ClusterAutoscalerUnschedulablePods
- ClusterAutoscalerNotSafeToScale
- ClusterAutoscalerUnableToScaleCPULimitReached
- ClusterAutoscalerUnableToScaleMemoryLimitReached

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.ci.test-2021-11-26-035937-ci-ln-ptd866t-latest   True        False         2m7s    Cluster version is 4.10.0-0.ci.test-2021-11-26-035937-ci-ln-ptd866t-latest
liuhuali@Lius-MacBook-Pro huali-test % oc create -f clusterautoscale.yaml
clusterautoscaler.autoscaling.openshift.io/default created
liuhuali@Lius-MacBook-Pro huali-test % oc -n openshift-machine-api get prometheusrule
NAME                                    AGE
cluster-autoscaler-default              51m
machine-api-operator-prometheus-rules   83m
liuhuali@Lius-MacBook-Pro huali-test % oc -n openshift-machine-api get prometheusrule cluster-autoscaler-default -o yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2021-11-26T04:39:25Z"
  generation: 1
  labels:
    prometheus: k8s
    role: alert-rules
  name: cluster-autoscaler-default
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: autoscaling.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterAutoscaler
    name: default
    uid: a2dce283-5acc-4d18-aa77-eb73902ef24b
  resourceVersion: "34050"
  uid: 2678c69c-b9f9-47c2-aaf6-577da492e0e8
spec:
  groups:
  - name: general.rules
    rules:
    - alert: ClusterAutoscalerUnschedulablePods
      annotations:
        description: |-
          The cluster autoscaler is unable to scale up and is alerting that there are unschedulable pods because of this condition.
          This may be caused by the cluster autoscaler reaching its resources limits, or by Kubernetes waiting for new nodes to become ready.
        summary: Cluster Autoscaler has {{ $value }} unschedulable pods
      expr: cluster_autoscaler_unschedulable_pods_count{service="cluster-autoscaler-default"} > 0
      for: 20m
      labels:
        severity: info
    - alert: ClusterAutoscalerNotSafeToScale
      annotations:
        description: |-
          The cluster autoscaler has detected that the number of unready nodes is too high and it is not safe to continute scaling operations.
          It makes this determination by checking that the number of ready nodes is greater than the minimum ready count (default of 3) and the ratio of unready to ready nodes is less than the maximum unready node percentage (default of 45%).
          If either of those conditions are not true then the cluster autoscaler will enter an unsafe to scale state until the conditions change.
        summary: Cluster Autoscaler is reporting that the cluster is not ready for scaling
      expr: cluster_autoscaler_cluster_safe_to_autoscale{service="cluster-autoscaler-default"} != 1
      for: 15m
      labels:
        severity: warning
    - alert: ClusterAutoscalerUnableToScaleCPULimitReached
      annotations:
        description: |-
          The number of total cores in the cluster has exceeded the maximum number set on the cluster autoscaler.
          This is calculated by summing the cpu capacity for all nodes in the cluster and comparing that number against the maximum cores value set for the cluster autoscaler (default 320000 cores).
        summary: Cluster Autoscaler has reached its CPU core limit and is unable to scale out
      expr: cluster_autoscaler_cluster_cpu_current_cores >= cluster_autoscaler_cpu_limits_cores{direction="maximum"}
      for: 15m
      labels:
        severity: info
    - alert: ClusterAutoscalerUnableToScaleMemoryLimitReached
      annotations:
        description: |-
          The number of total bytes of RAM in the cluster has exceeded the maximum number set on the cluster autoscaler.
          This is calculated by summing the memory capacity for all nodes in the cluster and comparing that number against the maximum memory bytes value set for the cluster autoscaler (default 6400000 gigabytes).
        summary: Cluster Autoscaler has reached its Memory bytes limit and is unable to scale out
      expr: cluster_autoscaler_cluster_memory_current_bytes >= cluster_autoscaler_memory_limits_bytes{direction="maximum"}
      for: 15m
      labels:
        severity: info
liuhuali@Lius-MacBook-Pro huali-test %

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056