Bug 1992541
| Summary: | all the alert rules' annotations "summary" and "description" should comply with the OpenShift alerting guidelines | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | hongyan li <hongyli> |
| Component: | openshift-apiserver | Assignee: | Luis Sanchez <sanchezl> |
| Status: | CLOSED ERRATA | QA Contact: | Ke Wang <kewang> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.9 | CC: | aos-bugs, joedavenportjd25, kewang, mfojtik |
| Target Milestone: | --- | ||
| Target Release: | 4.10.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | No Doc Update | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-03-10 16:05:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
The following rule also has this issue:
$ oc get prometheusrules -n openshift-kube-apiserver-operator -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      exclude.release.openshift.io/internal-openshift-hosted: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:11:59Z"
    generation: 1
    name: kube-apiserver-operator
    namespace: openshift-kube-apiserver-operator
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "1674"
    uid: ca725633-cd62-4d9e-a9f6-c9f4b260e98d
  spec:
    groups:
    - name: cluster-version
      rules:
      - alert: TechPreviewNoUpgrade
        annotations:
          message: Cluster has enabled tech preview features that will prevent upgrades.
        expr: |
          cluster_feature_set{name!="", namespace="openshift-kube-apiserver-operator"} == 0
        for: 10m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
You can find more of these like this:
oc -n namespace get PrometheusRule -o json | \
jq '.items[]|{namespace: .metadata.namespace,PrometheusRule: "\(.metadata.namespace)/\(.metadata.name)",alert: (..|objects|select(has("alert"))|select(.annotations|(has("description") and has("summary"))|not)|{name:.alert,summary: .annotations|has("summary"),description: .annotations|has("description"),message: .annotations|has("message")})}'
Use either -n with a specific namespace or --all-namespaces.
Only the rules in the openshift-kube-apiserver and openshift-kube-apiserver-operator namespaces are considered in scope for this bug.
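For readers who would rather not parse the jq filter, the same check can be sketched in Python. This is a minimal sketch, not part of the original report: the function name `alerts_missing_annotations` and the `SAMPLE` input are hypothetical, and unlike the recursive jq search it assumes the alerts sit under `spec.groups[].rules[]`. It expects the JSON object printed by `oc get PrometheusRule -o json`.

```python
import json


def alerts_missing_annotations(rule_list):
    """Return alert rules that lack a `summary` or a `description` annotation."""
    findings = []
    for item in rule_list.get("items", []):
        meta = item.get("metadata", {})
        namespace = meta.get("namespace", "")
        for group in item.get("spec", {}).get("groups", []):
            for rule in group.get("rules", []):
                if "alert" not in rule:
                    continue  # recording rules are not alerts, skip them
                annotations = rule.get("annotations") or {}
                if "summary" in annotations and "description" in annotations:
                    continue  # both required annotations are present
                findings.append({
                    "namespace": namespace,
                    "PrometheusRule": f"{namespace}/{meta.get('name', '')}",
                    "alert": rule["alert"],
                    "summary": "summary" in annotations,
                    "description": "description" in annotations,
                    "message": "message" in annotations,
                })
    return findings


# Hypothetical, trimmed input modelled on the TechPreviewNoUpgrade rule above.
SAMPLE = {
    "items": [{
        "metadata": {
            "name": "kube-apiserver-operator",
            "namespace": "openshift-kube-apiserver-operator",
        },
        "spec": {"groups": [{
            "name": "cluster-version",
            "rules": [{
                "alert": "TechPreviewNoUpgrade",
                "annotations": {"message": "Cluster has enabled tech preview features that will prevent upgrades."},
            }],
        }]},
    }],
}

print(json.dumps(alerts_missing_annotations(SAMPLE), indent=2))
```

Each record mirrors the fields the jq filter emits, so the two outputs can be compared side by side.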
Tested with PR
Ignore #C3, put wrong comments here.
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.0-0.nightly-2021-09-21-181111 True False 2m42s Cluster version is 4.10.0-0.nightly-2021-09-21-181111
$ oc get prometheusrules -n openshift-kube-apiserver -oyaml|grep -A10 'alert:'
      - alert: APIRemovedInNextReleaseInUse
        annotations:
          description: Deprecated API that will be removed in the next version is
            being used. Removing the workload that is using the {{ $labels.group }}.{{
            $labels.version }}/{{ $labels.resource }} API might be necessary for a
            successful upgrade to the next cluster version. Refer to `oc get apirequestcounts
            {{ $labels.resource }}.{{ $labels.version }}.{{ $labels.group }} -o yaml`
            to identify the workload.
          summary: Deprecated API that will be removed in the next version is being
            used.
        expr: |
--
      - alert: APIRemovedInNextEUSReleaseInUse
        annotations:
          description: Deprecated API that will be removed in the next EUS version
            is being used. Removing the workload that is using the {{ $labels.group
            }}.{{ $labels.version }}/{{ $labels.resource }} API might be necessary
            for a successful upgrade to the next EUS cluster version. Refer to `oc
            get apirequestcounts {{ $labels.resource }}.{{ $labels.version }}.{{ $labels.group
            }} -o yaml` to identify the workload.
          summary: Deprecated API that will be removed in the next EUS version is
            being used.
        expr: |
...
--
      - alert: HighOverallControlPlaneCPU
        annotations:
          description: Given three control plane nodes, the overall CPU utilization
            may only be about 2/3 of all available capacity. This is because if a
            single control plane node fails, the remaining two must handle the load
            of the cluster in order to be HA. If the cluster is using more than 2/3
            of all capacity, if one control plane node fails, the remaining two are
            likely to fail when they take the load. To fix this, increase the CPU
            and memory on your control plane nodes.
          summary: CPU utilization across all three control plane nodes is higher
            than two control plane nodes can sustain; a single control plane node
--
      - alert: ExtremelyHighIndividualControlPlaneCPU
        annotations:
          description: Extreme CPU pressure can cause slow serialization and poor
            performance from the kube-apiserver and etcd. When this happens, there
            is a risk of clients seeing non-responsive API requests which are issued
            again causing even more CPU pressure. It can also cause failing liveness
            probes due to slow etcd responsiveness on the backend. If one kube-apiserver
            fails under this condition, chances are you will experience a cascade
            as the remaining kube-apiservers are also under-provisioned. To fix this,
            increase the CPU and memory on your control plane nodes.
          summary: CPU utilization on a single control plane node is very high, more
--
$ oc get prometheusrules -n openshift-kube-apiserver-operator -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    ...
    name: kube-apiserver-operator
    namespace: openshift-kube-apiserver-operator
    ...
      rules:
      - alert: TechPreviewNoUpgrade
        annotations:
          description: Cluster has enabled Technology Preview features that cannot
            be undone and will prevent upgrades. The TechPreviewNoUpgrade feature
            set is not recommended on production clusters.
          summary: Cluster has enabled tech preview features that will prevent upgrades.
        expr: |
          cluster_feature_set{name!="", namespace="openshift-kube-apiserver-operator"} == 0
        for: 10m
...
Based on the above results, the bug is fixed; moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056
It looks like you're facing an issue where the annotations in your OpenShift alert rules, specifically the "summary" and "description" fields, do not comply with the OpenShift alerting guidelines.
Steps to troubleshoot and resolve the issue:
Review the current alert rules:
You’ve already identified and provided examples of alert rules that use annotations such as summary and message instead of summary and description.
The existing rules may not fully align with OpenShift's guidelines.
Update annotations to align with the guidelines:
According to the OpenShift alerting consistency guidelines, alerts should contain clear and concise summary and description fields.
The summary field should briefly explain the issue, while the description should provide more detailed context and potential remediation steps.
Modify the alert rules:
Based on the YAML snippets provided, here’s how you can update the alert rule to include both a summary and description:
- alert: APIRemovedInNextReleaseInUse
  annotations:
    summary: Deprecated API in use.
    description: Deprecated API {{ $labels.group }}.{{ $labels.version }}/{{ $labels.resource }} will be removed in the next version. Ensure that the workload using this API is updated for a successful upgrade.
    message: Deprecated API that will be removed in the next version is being used. Removing the workload that is using the {{ $labels.group }}.{{ $labels.version }}/{{ $labels.resource }} API might be necessary for a successful upgrade to the next cluster version. Refer to `oc get apirequestcounts {{ $labels.resource }}.{{ $labels.version }}.{{ $labels.group }} -o yaml` to identify the workload.
Similarly, for other alerts such as APIRemovedInNextEUSReleaseInUse, the same pattern of updating summary and description can be applied.
Validate the changes:
After updating the alert rules, validate the Prometheus rules and ensure that all fields comply with the OpenShift guidelines.
Follow the OpenShift guidelines:
Reference the OpenShift alerting consistency guidelines, especially for the required documentation: Alerting Consistency.
Deploy the updated rules:
Once the rules are updated and validated, apply them to your cluster using the appropriate command:
oc apply -f updated-alert-rules.yaml
Key points from the OpenShift alerting guidelines:
Summary: A concise statement of the issue.
Description: A detailed explanation with any context that could help the user understand and resolve the issue.
By ensuring your alert rules follow these guidelines, you can improve the clarity and consistency of alerts in your OpenShift environment.
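The annotation update discussed in this comment could be sketched as a small Python helper. This is only an illustration under stated assumptions: the helper name `migrate_message_annotation` is hypothetical, and moving the old `message` text into `description` is one possible mapping, not necessarily what the actual fix did (for TechPreviewNoUpgrade the verified rule reuses the old message text as the `summary` and adds a new `description`). A missing `summary` is only reported, never invented.

```python
import copy


def migrate_message_annotation(rule):
    """Return (updated_rule, needs_summary) for a single alert rule dict.

    Assumption: the legacy `message` text is moved to `description` when no
    `description` exists yet. The input rule is left unmodified.
    """
    updated = copy.deepcopy(rule)
    annotations = updated.get("annotations") or {}
    updated["annotations"] = annotations
    if "message" in annotations and "description" not in annotations:
        annotations["description"] = annotations.pop("message")
    # A summary has to be written by a person, so only flag its absence.
    needs_summary = "summary" not in annotations
    return updated, needs_summary


legacy_rule = {
    "alert": "TechPreviewNoUpgrade",
    "annotations": {
        "message": "Cluster has enabled tech preview features that will prevent upgrades.",
    },
}
updated_rule, needs_summary = migrate_message_annotation(legacy_rule)
print(updated_rule["annotations"], needs_summary)
```

Returning a copy keeps the original rule available for comparison, which helps when reviewing the change before it is applied to a cluster.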
(In reply to FrancisMoses from comment #11)
This comment was flagged a spam, view the edit history to see the original text if required.
Description of problem:
all the alert rules' annotations "summary" and "description" should comply with the OpenShift alerting guidelines
Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-07-175228
How reproducible:
always
Steps to Reproduce:
1.
2.
3.
Actual results:
$ oc get prometheusrules -n openshift-kube-apiserver -oyaml|grep -A10 'alert:'
      - alert: APIRemovedInNextReleaseInUse
        annotations:
          message: Deprecated API that will be removed in the next version is being used. Removing the workload that is using the {{ $labels.group }}.{{ $labels.version }}/{{ $labels.resource }} API might be necessary for a successful upgrade to the next cluster version. Refer to `oc get apirequestcounts {{ $labels.resource }}.{{ $labels.version }}.{{ $labels.group }} -o yaml` to identify the workload.
        expr: |
          group(apiserver_requested_deprecated_apis{removed_release="1.22"}) by (group,version,resource) and (sum by(group,version,resource) (rate(apiserver_request_total{system_client!="kube-controller-manager",system_client!="cluster-policy-controller"}[4h]))) > 0
        for: 1h
--
      - alert: APIRemovedInNextEUSReleaseInUse
        annotations:
          message: Deprecated API that will be removed in the next EUS version is being used. Removing the workload that is using the {{ $labels.group }}.{{ $labels.version }}/{{ $labels.resource }} API might be necessary for a successful upgrade to the next EUS cluster version. Refer to `oc get apirequestcounts {{ $labels.resource }}.{{ $labels.version }}.{{ $labels.group }} -o yaml` to identify the workload.
        expr: |
          group(apiserver_requested_deprecated_apis{removed_release=~"1\\.2[123]"}) by (group,version,resource) and (sum by(group,version,resource) (rate(apiserver_request_total{system_client!="kube-controller-manager",system_client!="cluster-policy-controller"}[4h]))) > 0
        for: 1h
--
      - alert: HighOverallControlPlaneCPU
        annotations:
          message: Given three control plane nodes, the overall CPU utilization may only be about 2/3 of all available capacity. This is because if a single control plane node fails, the remaining two must handle the load of the cluster in order to be HA. If the cluster is using more than 2/3 of all capacity, if one control plane node fails, the remaining two are likely to fail when they take the load. To fix this, increase the CPU and memory on your control plane nodes.
          summary: CPU utilization across all three control plane nodes is higher than two control plane nodes can sustain; a single control plane node
--
      - alert: ExtremelyHighIndividualControlPlaneCPU
        annotations:
          message: Extreme CPU pressure can cause slow serialization and poor performance from the kube-apiserver and etcd. When this happens, there is a risk of clients seeing non-responsive API requests which are issued again causing even more CPU pressure. It can also cause failing liveness probes due to slow etcd responsiveness on the backend. If one kube-apiserver fails under this condition, chances are you will experience a cascade as the remaining kube-apiservers are also under-provisioned. To fix this, increase the CPU and memory on your control plane nodes.
          summary: CPU utilization on a single control plane node is very high, more
--
Expected results:
alert rules have annotations "summary" and "description"
Additional info:
the "summary" and "description" annotations comply with the OpenShift alerting guidelines [1]
[1] https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#documentation-required