Bug 2026178 - OpenShift Alerting Rules Style-Guide Compliance
Summary: OpenShift Alerting Rules Style-Guide Compliance
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.10.0
Assignee: Michael McCune
QA Contact: Huali Liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-11-24 01:40 UTC by Huali Liu
Modified: 2022-03-12 04:39 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-12 04:39:01 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-autoscaler-operator pull 233 | 0 | None | open | Bug 2026178: update alerts to match style guidance | 2021-11-24 14:44:06 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-12 04:39:16 UTC

Description Huali Liu 2021-11-24 01:40:07 UTC
Description of problem:
The Cluster Autoscaler alerts carry only a message annotation; they lack the summary and description annotations required by the alerting style guide.

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-11-22-195410

How reproducible:
Always

Steps to Reproduce:

liuhuali@Lius-MacBook-Pro ~ % oc create -f clusterautoscaler.yaml
clusterautoscaler.autoscaling.openshift.io/default created
liuhuali@Lius-MacBook-Pro ~ % cat clusterautoscaler.yaml 
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  resourceLimits:
    memory:
      min: 4
      max: 128
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s 

liuhuali@Lius-MacBook-Pro ~ % oc -n openshift-machine-api get prometheusrule
NAME                                    AGE
cluster-autoscaler-default              2m32s
machine-api-operator-prometheus-rules   62m
liuhuali@Lius-MacBook-Pro ~ % oc -n openshift-machine-api get prometheusrule cluster-autoscaler-default -o yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2021-11-23T14:44:28Z"
  generation: 1
  labels:
    prometheus: k8s
    role: alert-rules
  name: cluster-autoscaler-default
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: autoscaling.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterAutoscaler
    name: default
    uid: 057f8bf9-ec0b-4fc8-8ab2-a3afb40a0251
  resourceVersion: "37912"
  uid: 3a070375-df61-43e8-b9e2-acaf45dfcadd
spec:
  groups:
  - name: general.rules
    rules:
    - alert: ClusterAutoscalerUnschedulablePods
      annotations:
        message: Cluster Autoscaler has {{ $value }} unschedulable pods
      expr: cluster_autoscaler_unschedulable_pods_count{service="cluster-autoscaler-default"}
        > 0
      for: 20m
      labels:
        severity: info
    - alert: ClusterAutoscalerNotSafeToScale
      annotations:
        message: Cluster Autoscaler is reporting that the cluster is not ready for
          scaling
      expr: cluster_autoscaler_cluster_safe_to_autoscale{service="cluster-autoscaler-default"}
        != 1
      for: 15m
      labels:
        severity: warning
    - alert: ClusterAutoscalerUnableToScaleCPULimitReached
      annotations:
        message: Cluster Autoscaler has reached its CPU core limit and is unable to
          scale out
      expr: cluster_autoscaler_cluster_cpu_current_cores >= cluster_autoscaler_cpu_limits_cores{direction="maximum"}
      for: 15m
      labels:
        severity: info
    - alert: ClusterAutoscalerUnableToScaleMemoryLimitReached
      annotations:
        message: Cluster Autoscaler has reached its Memory bytes limit and is unable
          to scale out
      expr: cluster_autoscaler_cluster_memory_current_bytes >= cluster_autoscaler_memory_limits_bytes{direction="maximum"}
      for: 15m
      labels:
        severity: info
liuhuali@Lius-MacBook-Pro ~ %


Actual results:

Alerts without summary and description annotations:
  - ClusterAutoscalerUnschedulablePods
  - ClusterAutoscalerNotSafeToScale
  - ClusterAutoscalerUnableToScaleCPULimitReached
  - ClusterAutoscalerUnableToScaleMemoryLimitReached


Expected results:

Alerts MUST include summary and description annotations.
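
A minimal sketch of how this requirement could be checked automatically, assuming the PrometheusRule has been exported as JSON (for example with `-o json` instead of `-o yaml`). The function name `missing_annotations` and the inline sample are illustrative only and are not part of the operator or its test suite.

```python
# Required annotations per the alerting style guide referenced in this bug.
REQUIRED = ("summary", "description")


def missing_annotations(rule_doc):
    """Map each alert name to the required annotations it lacks.

    rule_doc is a PrometheusRule object already parsed into a dict.
    """
    problems = {}
    for group in rule_doc.get("spec", {}).get("groups", []):
        for rule in group.get("rules", []):
            name = rule.get("alert")
            if name is None:
                # Recording rules have no "alert" key; skip them.
                continue
            annotations = rule.get("annotations") or {}
            missing = [key for key in REQUIRED if not annotations.get(key)]
            if missing:
                problems[name] = missing
    return problems


# Hypothetical sample mirroring one rule from the output above,
# which has only a "message" annotation.
sample = {
    "spec": {
        "groups": [
            {
                "name": "general.rules",
                "rules": [
                    {
                        "alert": "ClusterAutoscalerNotSafeToScale",
                        "annotations": {
                            "message": "Cluster Autoscaler is reporting that "
                                       "the cluster is not ready for scaling"
                        },
                    }
                ],
            }
        ]
    }
}

for alert, missing in missing_annotations(sample).items():
    print(f"{alert}: missing {', '.join(missing)}")
```

An empty result means every alert in the rule object has both annotations; any other result lists the offending alerts, as in the "Actual results" section.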

Additional info:

https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md
https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide

Comment 1 Huali Liu 2021-11-26 05:32:56 UTC
Set up a cluster using cluster-bot with https://github.com/openshift/cluster-autoscaler-operator/pull/233

Verified that the four alerts now include summary and description annotations:
  - ClusterAutoscalerUnschedulablePods
  - ClusterAutoscalerNotSafeToScale
  - ClusterAutoscalerUnableToScaleCPULimitReached
  - ClusterAutoscalerUnableToScaleMemoryLimitReached

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.ci.test-2021-11-26-035937-ci-ln-ptd866t-latest   True        False         2m7s    Cluster version is 4.10.0-0.ci.test-2021-11-26-035937-ci-ln-ptd866t-latest
liuhuali@Lius-MacBook-Pro huali-test % oc create -f clusterautoscale.yaml 
clusterautoscaler.autoscaling.openshift.io/default created
liuhuali@Lius-MacBook-Pro huali-test % oc -n openshift-machine-api get prometheusrule 
NAME                                    AGE
cluster-autoscaler-default              51m
machine-api-operator-prometheus-rules   83m
liuhuali@Lius-MacBook-Pro huali-test % oc -n openshift-machine-api get prometheusrule cluster-autoscaler-default -o yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2021-11-26T04:39:25Z"
  generation: 1
  labels:
    prometheus: k8s
    role: alert-rules
  name: cluster-autoscaler-default
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: autoscaling.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterAutoscaler
    name: default
    uid: a2dce283-5acc-4d18-aa77-eb73902ef24b
  resourceVersion: "34050"
  uid: 2678c69c-b9f9-47c2-aaf6-577da492e0e8
spec:
  groups:
  - name: general.rules
    rules:
    - alert: ClusterAutoscalerUnschedulablePods
      annotations:
        description: |-
          The cluster autoscaler is unable to scale up and is alerting that there are unschedulable pods because of this condition.
          This may be caused by the cluster autoscaler reaching its resources limits, or by Kubernetes waiting for new nodes to become ready.
        summary: Cluster Autoscaler has {{ $value }} unschedulable pods
      expr: cluster_autoscaler_unschedulable_pods_count{service="cluster-autoscaler-default"}
        > 0
      for: 20m
      labels:
        severity: info
    - alert: ClusterAutoscalerNotSafeToScale
      annotations:
        description: |-
          The cluster autoscaler has detected that the number of unready nodes is too high
          and it is not safe to continute scaling operations. It makes this determination by checking that the number of ready nodes is greater than the minimum ready count
          (default of 3) and the ratio of unready to ready nodes is less than the maximum unready node percentage (default of 45%). If either of those conditions are not
          true then the cluster autoscaler will enter an unsafe to scale state until the conditions change.
        summary: Cluster Autoscaler is reporting that the cluster is not ready for
          scaling
      expr: cluster_autoscaler_cluster_safe_to_autoscale{service="cluster-autoscaler-default"}
        != 1
      for: 15m
      labels:
        severity: warning
    - alert: ClusterAutoscalerUnableToScaleCPULimitReached
      annotations:
        description: |-
          The number of total cores in the cluster has exceeded the maximum number set on the
          cluster autoscaler. This is calculated by summing the cpu capacity for all nodes in the cluster and comparing that number against the maximum cores value set for the
          cluster autoscaler (default 320000 cores).
        summary: Cluster Autoscaler has reached its CPU core limit and is unable to
          scale out
      expr: cluster_autoscaler_cluster_cpu_current_cores >= cluster_autoscaler_cpu_limits_cores{direction="maximum"}
      for: 15m
      labels:
        severity: info
    - alert: ClusterAutoscalerUnableToScaleMemoryLimitReached
      annotations:
        description: |-
          The number of total bytes of RAM in the cluster has exceeded the maximum number set on
          the cluster autoscaler. This is calculated by summing the memory capacity for all nodes in the cluster and comparing that number against the maximum memory bytes value set
          for the cluster autoscaler (default 6400000 gigabytes).
        summary: Cluster Autoscaler has reached its Memory bytes limit and is unable
          to scale out
      expr: cluster_autoscaler_cluster_memory_current_bytes >= cluster_autoscaler_memory_limits_bytes{direction="maximum"}
      for: 15m
      labels:
        severity: info
liuhuali@Lius-MacBook-Pro huali-test %
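
For readers of the ClusterAutoscalerNotSafeToScale description above, here is a rough sketch of the check exactly as that text words it. It is an assumption based on the description only, not the cluster autoscaler's actual implementation, and the function and parameter names are hypothetical. The defaults (3 and 45%) are the ones quoted in the description.

```python
def is_safe_to_scale(ready, unready, min_ready=3, max_unready_percent=45.0):
    """Literal reading of the alert description (not the real autoscaler code).

    Safe only if both conditions hold:
      1. ready nodes > minimum ready count
      2. unready-to-ready ratio < maximum unready node percentage
    """
    if ready <= 0 or ready <= min_ready:
        # Condition 1 fails (also avoids dividing by zero below).
        return False
    unready_ratio_percent = 100.0 * unready / ready
    return unready_ratio_percent < max_unready_percent


# Example: 10 ready nodes and 1 unready node gives a 10% ratio.
print(is_safe_to_scale(ready=10, unready=1))
# Example: 10 ready nodes and 5 unready nodes gives a 50% ratio.
print(is_safe_to_scale(ready=10, unready=5))
```

Under this reading, the alert would fire when the function returns False continuously for the rule's `for: 15m` window.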

Comment 7 errata-xmlrpc 2022-03-12 04:39:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

