Bug 1992541 - all the alert rules' annotations "summary" and "description" should comply with the OpenShift alerting guidelines
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Luis Sanchez
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-11 09:49 UTC by hongyan li
Modified: 2022-03-10 16:05 UTC
CC: 3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:05:09 UTC
Target Upstream Version:
Embargoed:




Links
- Github: openshift cluster-kube-apiserver-operator pull 1215 (last updated 2021-08-27 01:32:02 UTC)
- Red Hat Product Errata: RHSA-2022:0056 (last updated 2022-03-10 16:05:28 UTC)

Description hongyan li 2021-08-11 09:49:17 UTC
Description of problem:
All of the alert rules' "summary" and "description" annotations should comply with the OpenShift alerting guidelines.
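
For reference, the guidelines call for a short one-line "summary" plus a more detailed "description" on every alerting rule, rather than the legacy "message" annotation. A minimal sketch of a compliant rule (the alert name, expression, and wording here are illustrative only):

      - alert: ExampleAlert
        annotations:
          summary: Short, one-line statement of what is wrong.
          description: Longer explanation of the impact and how to resolve it;
            may use templated labels such as {{ $labels.resource }}.
        expr: |
          example_metric > 0
        for: 10m
        labels:
          severity: warning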

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-07-175228

How reproducible:
always

Steps to Reproduce:
1. Inspect the alert rules in the openshift-kube-apiserver and openshift-kube-apiserver-operator namespaces (commands shown under Actual results).
2. Check whether each alert carries both the "summary" and "description" annotations.

Actual results:
$ oc get prometheusrules -n openshift-kube-apiserver -oyaml|grep -A10 'alert:'
      - alert: APIRemovedInNextReleaseInUse
        annotations:
          message: Deprecated API that will be removed in the next version is being
            used. Removing the workload that is using the {{ $labels.group }}.{{ $labels.version
            }}/{{ $labels.resource }} API might be necessary for a successful upgrade
            to the next cluster version. Refer to `oc get apirequestcounts {{ $labels.resource
            }}.{{ $labels.version }}.{{ $labels.group }} -o yaml` to identify the
            workload.
        expr: |
          group(apiserver_requested_deprecated_apis{removed_release="1.22"}) by (group,version,resource) and (sum by(group,version,resource) (rate(apiserver_request_total{system_client!="kube-controller-manager",system_client!="cluster-policy-controller"}[4h]))) > 0
        for: 1h
--
      - alert: APIRemovedInNextEUSReleaseInUse
        annotations:
          message: Deprecated API that will be removed in the next EUS version is
            being used. Removing the workload that is using the {{ $labels.group }}.{{
            $labels.version }}/{{ $labels.resource }} API might be necessary for a
            successful upgrade to the next EUS cluster version. Refer to `oc get apirequestcounts
            {{ $labels.resource }}.{{ $labels.version }}.{{ $labels.group }} -o yaml`
            to identify the workload.
        expr: |
          group(apiserver_requested_deprecated_apis{removed_release=~"1\\.2[123]"}) by (group,version,resource) and (sum by(group,version,resource) (rate(apiserver_request_total{system_client!="kube-controller-manager",system_client!="cluster-policy-controller"}[4h]))) > 0
        for: 1h
--
      - alert: HighOverallControlPlaneCPU
        annotations:
          message: Given three control plane nodes, the overall CPU utilization may
            only be about 2/3 of all available capacity. This is because if a single
            control plane node fails, the remaining two must handle the load of the
            cluster in order to be HA. If the cluster is using more than 2/3 of all
            capacity, if one control plane node fails, the remaining two are likely
            to fail when they take the load. To fix this, increase the CPU and memory
            on your control plane nodes.
          summary: CPU utilization across all three control plane nodes is higher
            than two control plane nodes can sustain; a single control plane node
--
      - alert: ExtremelyHighIndividualControlPlaneCPU
        annotations:
          message: Extreme CPU pressure can cause slow serialization and poor performance
            from the kube-apiserver and etcd. When this happens, there is a risk of
            clients seeing non-responsive API requests which are issued again causing
            even more CPU pressure. It can also cause failing liveness probes due
            to slow etcd responsiveness on the backend. If one kube-apiserver fails
            under this condition, chances are you will experience a cascade as the
            remaining kube-apiservers are also under-provisioned. To fix this, increase
            the CPU and memory on your control plane nodes.
          summary: CPU utilization on a single control plane node is very high, more
--
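
As the alert annotations themselves suggest, the workload triggering one of the deprecated-API alerts can be identified through the APIRequestCount resource, substituting the resource, version, and group from the alert's labels:

$ oc get apirequestcounts <resource>.<version>.<group> -o yaml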



Expected results:
Alert rules have both "summary" and "description" annotations.

Additional info:
the "summary" and "description" annotations comply with the OpenShift alerting guidelines [1]

[1] https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#documentation-required
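
One way to spot-check a namespace for alerts missing either annotation (a sketch, assuming jq is installed):

$ oc get prometheusrules -n openshift-kube-apiserver -o json \
  | jq -r '.items[].spec.groups[].rules[] | select(has("alert"))
      | "\(.alert): summary=\((.annotations // {}) | has("summary")) description=\((.annotations // {}) | has("description"))"'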

Comment 1 hongyan li 2021-08-11 09:51:56 UTC
The following rule also has this issue:
$ oc get prometheusrules -n openshift-kube-apiserver-operator -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      exclude.release.openshift.io/internal-openshift-hosted: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:11:59Z"
    generation: 1
    name: kube-apiserver-operator
    namespace: openshift-kube-apiserver-operator
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "1674"
    uid: ca725633-cd62-4d9e-a9f6-c9f4b260e98d
  spec:
    groups:
    - name: cluster-version
      rules:
      - alert: TechPreviewNoUpgrade
        annotations:
          message: Cluster has enabled tech preview features that will prevent upgrades.
        expr: |
          cluster_feature_set{name!="", namespace="openshift-kube-apiserver-operator"} == 0
        for: 10m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
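
A guideline-compliant version of this rule would carry both annotations, roughly like the following sketch (the wording is illustrative):

      - alert: TechPreviewNoUpgrade
        annotations:
          description: Cluster has enabled Technology Preview features that cannot
            be undone and will prevent upgrades.
          summary: Cluster has enabled tech preview features that will prevent upgrades.
        expr: |
          cluster_feature_set{name!="", namespace="openshift-kube-apiserver-operator"} == 0
        for: 10m
        labels:
          severity: warning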

Comment 2 Luis Sanchez 2021-08-27 01:37:14 UTC
You can find more of these like this:

oc -n namespace get PrometheusRule -o json | \
jq '.items[]|{namespace: .metadata.namespace,PrometheusRule: "\(.metadata.namespace)/\(.metadata.name)",alert: (..|objects|select(has("alert"))|select(.annotations|(has("description") and has("summary"))|not)|{name:.alert,summary: .annotations|has("summary"),description: .annotations|has("description"),message: .annotations|has("message")})}'

Use either -n with a specific namespace or --all-namespaces.

Only the rules in the openshift-kube-apiserver and openshift-kube-apiserver-operator namespaces are considered in scope for this bug.
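
For example, to scan the whole cluster in one pass (the same filter, just with --all-namespaces; a PrometheusRule whose alerts all carry both annotations produces no output):

oc get PrometheusRule --all-namespaces -o json | \
jq '.items[]|{namespace: .metadata.namespace,PrometheusRule: "\(.metadata.namespace)/\(.metadata.name)",alert: (..|objects|select(has("alert"))|select(.annotations|(has("description") and has("summary"))|not)|{name:.alert,summary: .annotations|has("summary"),description: .annotations|has("description"),message: .annotations|has("message")})}'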

Comment 3 hongyan li 2021-08-27 03:04:50 UTC
Tested with PR

Comment 4 hongyan li 2021-08-27 03:09:05 UTC
Please ignore comment 3; I put the wrong comment here.

Comment 7 Ke Wang 2021-09-22 03:53:37 UTC
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-09-21-181111   True        False         2m42s   Cluster version is 4.10.0-0.nightly-2021-09-21-181111

$ oc get prometheusrules -n openshift-kube-apiserver -oyaml|grep -A10 'alert:'
      - alert: APIRemovedInNextReleaseInUse
        annotations:
          description: Deprecated API that will be removed in the next version is
            being used. Removing the workload that is using the {{ $labels.group }}.{{
            $labels.version }}/{{ $labels.resource }} API might be necessary for a
            successful upgrade to the next cluster version. Refer to `oc get apirequestcounts
            {{ $labels.resource }}.{{ $labels.version }}.{{ $labels.group }} -o yaml`
            to identify the workload.
          summary: Deprecated API that will be removed in the next version is being
            used.
        expr: |
--
      - alert: APIRemovedInNextEUSReleaseInUse
        annotations:
          description: Deprecated API that will be removed in the next EUS version
            is being used. Removing the workload that is using the {{ $labels.group
            }}.{{ $labels.version }}/{{ $labels.resource }} API might be necessary
            for a successful upgrade to the next EUS cluster version. Refer to `oc
            get apirequestcounts {{ $labels.resource }}.{{ $labels.version }}.{{ $labels.group
            }} -o yaml` to identify the workload.
          summary: Deprecated API that will be removed in the next EUS version is
            being used.
        expr: |
...
--
      - alert: HighOverallControlPlaneCPU
        annotations:
          description: Given three control plane nodes, the overall CPU utilization
            may only be about 2/3 of all available capacity. This is because if a
            single control plane node fails, the remaining two must handle the load
            of the cluster in order to be HA. If the cluster is using more than 2/3
            of all capacity, if one control plane node fails, the remaining two are
            likely to fail when they take the load. To fix this, increase the CPU
            and memory on your control plane nodes.
          summary: CPU utilization across all three control plane nodes is higher
            than two control plane nodes can sustain; a single control plane node
--
      - alert: ExtremelyHighIndividualControlPlaneCPU
        annotations:
          description: Extreme CPU pressure can cause slow serialization and poor
            performance from the kube-apiserver and etcd. When this happens, there
            is a risk of clients seeing non-responsive API requests which are issued
            again causing even more CPU pressure. It can also cause failing liveness
            probes due to slow etcd responsiveness on the backend. If one kube-apiserver
            fails under this condition, chances are you will experience a cascade
            as the remaining kube-apiservers are also under-provisioned. To fix this,
            increase the CPU and memory on your control plane nodes.
          summary: CPU utilization on a single control plane node is very high, more
--

$ oc get prometheusrules -n openshift-kube-apiserver-operator -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
   ...
    name: kube-apiserver-operator
    namespace: openshift-kube-apiserver-operator
    ...
      rules:
      - alert: TechPreviewNoUpgrade
        annotations:
          description: Cluster has enabled Technology Preview features that cannot
            be undone and will prevent upgrades. The TechPreviewNoUpgrade feature
            set is not recommended on production clusters.
          summary: Cluster has enabled tech preview features that will prevent upgrades.
        expr: |
          cluster_feature_set{name!="", namespace="openshift-kube-apiserver-operator"} == 0
        for: 10m
...

Based on the above results, the bug is fixed; moving the bug to VERIFIED.

Comment 10 errata-xmlrpc 2022-03-10 16:05:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

