Bug 1992541 - all the alert rules' annotations "summary" and "description" should comply with the OpenShift alerting guidelines
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Luis Sanchez
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-11 09:49 UTC by hongyan li
Modified: 2022-03-10 16:05 UTC
CC: 3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:05:09 UTC
Target Upstream Version:
Embargoed:




Links
- Github: openshift cluster-kube-apiserver-operator pull 1215 (last updated 2021-08-27 01:32:02 UTC)
- Red Hat Product Errata: RHSA-2022:0056 (last updated 2022-03-10 16:05:28 UTC)

Description hongyan li 2021-08-11 09:49:17 UTC
Description of problem:
All of the alert rules' "summary" and "description" annotations should comply with the OpenShift alerting guidelines.
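
For reference, the guidelines call for a short one-line "summary" plus a more detailed "description" on every alerting rule, rather than the legacy "message" annotation. A minimal sketch of a compliant rule (the alert name, expression, and wording here are illustrative only):

      - alert: ExampleAlert
        annotations:
          summary: Short, one-line statement of what is wrong.
          description: Longer explanation of the impact and how to resolve it;
            may use templated labels such as {{ $labels.resource }}.
        expr: |
          example_metric > 0
        for: 10m
        labels:
          severity: warning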

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-07-175228

How reproducible:
always

Steps to Reproduce:
1. Inspect the alert rules in the openshift-kube-apiserver and openshift-kube-apiserver-operator namespaces (commands shown under Actual results).
2. Check whether each alert carries both the "summary" and "description" annotations.

Actual results:
$ oc get prometheusrules -n openshift-kube-apiserver -oyaml|grep -A10 'alert:'
      - alert: APIRemovedInNextReleaseInUse
        annotations:
          message: Deprecated API that will be removed in the next version is being
            used. Removing the workload that is using the {{ $labels.group }}.{{ $labels.version
            }}/{{ $labels.resource }} API might be necessary for a successful upgrade
            to the next cluster version. Refer to `oc get apirequestcounts {{ $labels.resource
            }}.{{ $labels.version }}.{{ $labels.group }} -o yaml` to identify the
            workload.
        expr: |
          group(apiserver_requested_deprecated_apis{removed_release="1.22"}) by (group,version,resource) and (sum by(group,version,resource) (rate(apiserver_request_total{system_client!="kube-controller-manager",system_client!="cluster-policy-controller"}[4h]))) > 0
        for: 1h
--
      - alert: APIRemovedInNextEUSReleaseInUse
        annotations:
          message: Deprecated API that will be removed in the next EUS version is
            being used. Removing the workload that is using the {{ $labels.group }}.{{
            $labels.version }}/{{ $labels.resource }} API might be necessary for a
            successful upgrade to the next EUS cluster version. Refer to `oc get apirequestcounts
            {{ $labels.resource }}.{{ $labels.version }}.{{ $labels.group }} -o yaml`
            to identify the workload.
        expr: |
          group(apiserver_requested_deprecated_apis{removed_release=~"1\\.2[123]"}) by (group,version,resource) and (sum by(group,version,resource) (rate(apiserver_request_total{system_client!="kube-controller-manager",system_client!="cluster-policy-controller"}[4h]))) > 0
        for: 1h
--
      - alert: HighOverallControlPlaneCPU
        annotations:
          message: Given three control plane nodes, the overall CPU utilization may
            only be about 2/3 of all available capacity. This is because if a single
            control plane node fails, the remaining two must handle the load of the
            cluster in order to be HA. If the cluster is using more than 2/3 of all
            capacity, if one control plane node fails, the remaining two are likely
            to fail when they take the load. To fix this, increase the CPU and memory
            on your control plane nodes.
          summary: CPU utilization across all three control plane nodes is higher
            than two control plane nodes can sustain; a single control plane node
--
      - alert: ExtremelyHighIndividualControlPlaneCPU
        annotations:
          message: Extreme CPU pressure can cause slow serialization and poor performance
            from the kube-apiserver and etcd. When this happens, there is a risk of
            clients seeing non-responsive API requests which are issued again causing
            even more CPU pressure. It can also cause failing liveness probes due
            to slow etcd responsiveness on the backend. If one kube-apiserver fails
            under this condition, chances are you will experience a cascade as the
            remaining kube-apiservers are also under-provisioned. To fix this, increase
            the CPU and memory on your control plane nodes.
          summary: CPU utilization on a single control plane node is very high, more
--
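
As the alert annotations themselves suggest, the workload triggering one of the deprecated-API alerts can be identified through the APIRequestCount resource, substituting the resource, version, and group from the alert's labels:

$ oc get apirequestcounts <resource>.<version>.<group> -o yaml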



Expected results:
Alert rules have both "summary" and "description" annotations.

Additional info:
the "summary" and "description" annotations comply with the OpenShift alerting guidelines [1]

[1] https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#documentation-required
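
One way to spot-check a namespace for alerts missing either annotation (a sketch, assuming jq is installed):

$ oc get prometheusrules -n openshift-kube-apiserver -o json \
  | jq -r '.items[].spec.groups[].rules[] | select(has("alert"))
      | "\(.alert): summary=\((.annotations // {}) | has("summary")) description=\((.annotations // {}) | has("description"))"'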

Comment 1 hongyan li 2021-08-11 09:51:56 UTC
The following rule also has this issue:
$ oc get prometheusrules -n openshift-kube-apiserver-operator -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      exclude.release.openshift.io/internal-openshift-hosted: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:11:59Z"
    generation: 1
    name: kube-apiserver-operator
    namespace: openshift-kube-apiserver-operator
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "1674"
    uid: ca725633-cd62-4d9e-a9f6-c9f4b260e98d
  spec:
    groups:
    - name: cluster-version
      rules:
      - alert: TechPreviewNoUpgrade
        annotations:
          message: Cluster has enabled tech preview features that will prevent upgrades.
        expr: |
          cluster_feature_set{name!="", namespace="openshift-kube-apiserver-operator"} == 0
        for: 10m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
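
A guideline-compliant version of this rule would carry both annotations, roughly like the following sketch (the wording is illustrative):

      - alert: TechPreviewNoUpgrade
        annotations:
          description: Cluster has enabled Technology Preview features that cannot
            be undone and will prevent upgrades.
          summary: Cluster has enabled tech preview features that will prevent upgrades.
        expr: |
          cluster_feature_set{name!="", namespace="openshift-kube-apiserver-operator"} == 0
        for: 10m
        labels:
          severity: warning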

Comment 2 Luis Sanchez 2021-08-27 01:37:14 UTC
You can find more of these like this:

oc -n namespace get PrometheusRule -o json | \
jq '.items[]|{namespace: .metadata.namespace,PrometheusRule: "\(.metadata.namespace)/\(.metadata.name)",alert: (..|objects|select(has("alert"))|select(.annotations|(has("description") and has("summary"))|not)|{name:.alert,summary: .annotations|has("summary"),description: .annotations|has("description"),message: .annotations|has("message")})}'

Use either -n with a specific namespace or --all-namespaces.

Only the rules in the openshift-kube-apiserver and openshift-kube-apiserver-operator namespaces are considered in scope for this bug.
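
For example, to scan the whole cluster in one pass (the same filter, just with --all-namespaces; a PrometheusRule whose alerts all carry both annotations produces no output):

oc get PrometheusRule --all-namespaces -o json | \
jq '.items[]|{namespace: .metadata.namespace,PrometheusRule: "\(.metadata.namespace)/\(.metadata.name)",alert: (..|objects|select(has("alert"))|select(.annotations|(has("description") and has("summary"))|not)|{name:.alert,summary: .annotations|has("summary"),description: .annotations|has("description"),message: .annotations|has("message")})}'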

Comment 3 hongyan li 2021-08-27 03:04:50 UTC
Tested with PR

Comment 4 hongyan li 2021-08-27 03:09:05 UTC
Please ignore comment 3; I put the wrong comment here.

Comment 7 Ke Wang 2021-09-22 03:53:37 UTC
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-09-21-181111   True        False         2m42s   Cluster version is 4.10.0-0.nightly-2021-09-21-181111

$ oc get prometheusrules -n openshift-kube-apiserver -oyaml|grep -A10 'alert:'
      - alert: APIRemovedInNextReleaseInUse
        annotations:
          description: Deprecated API that will be removed in the next version is
            being used. Removing the workload that is using the {{ $labels.group }}.{{
            $labels.version }}/{{ $labels.resource }} API might be necessary for a
            successful upgrade to the next cluster version. Refer to `oc get apirequestcounts
            {{ $labels.resource }}.{{ $labels.version }}.{{ $labels.group }} -o yaml`
            to identify the workload.
          summary: Deprecated API that will be removed in the next version is being
            used.
        expr: |
--
      - alert: APIRemovedInNextEUSReleaseInUse
        annotations:
          description: Deprecated API that will be removed in the next EUS version
            is being used. Removing the workload that is using the {{ $labels.group
            }}.{{ $labels.version }}/{{ $labels.resource }} API might be necessary
            for a successful upgrade to the next EUS cluster version. Refer to `oc
            get apirequestcounts {{ $labels.resource }}.{{ $labels.version }}.{{ $labels.group
            }} -o yaml` to identify the workload.
          summary: Deprecated API that will be removed in the next EUS version is
            being used.
        expr: |
...
--
      - alert: HighOverallControlPlaneCPU
        annotations:
          description: Given three control plane nodes, the overall CPU utilization
            may only be about 2/3 of all available capacity. This is because if a
            single control plane node fails, the remaining two must handle the load
            of the cluster in order to be HA. If the cluster is using more than 2/3
            of all capacity, if one control plane node fails, the remaining two are
            likely to fail when they take the load. To fix this, increase the CPU
            and memory on your control plane nodes.
          summary: CPU utilization across all three control plane nodes is higher
            than two control plane nodes can sustain; a single control plane node
--
      - alert: ExtremelyHighIndividualControlPlaneCPU
        annotations:
          description: Extreme CPU pressure can cause slow serialization and poor
            performance from the kube-apiserver and etcd. When this happens, there
            is a risk of clients seeing non-responsive API requests which are issued
            again causing even more CPU pressure. It can also cause failing liveness
            probes due to slow etcd responsiveness on the backend. If one kube-apiserver
            fails under this condition, chances are you will experience a cascade
            as the remaining kube-apiservers are also under-provisioned. To fix this,
            increase the CPU and memory on your control plane nodes.
          summary: CPU utilization on a single control plane node is very high, more
--

$ oc get prometheusrules -n openshift-kube-apiserver-operator -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
   ...
    name: kube-apiserver-operator
    namespace: openshift-kube-apiserver-operator
    ...
      rules:
      - alert: TechPreviewNoUpgrade
        annotations:
          description: Cluster has enabled Technology Preview features that cannot
            be undone and will prevent upgrades. The TechPreviewNoUpgrade feature
            set is not recommended on production clusters.
          summary: Cluster has enabled tech preview features that will prevent upgrades.
        expr: |
          cluster_feature_set{name!="", namespace="openshift-kube-apiserver-operator"} == 0
        for: 10m
...

Based on the above results, the bug is fixed; moving the bug to VERIFIED.

Comment 10 errata-xmlrpc 2022-03-10 16:05:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

