Description of problem:
All the alert rules' "summary" and "description" annotations should comply with the OpenShift alerting guidelines.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-07-175228

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:

$ oc get prometheusrules -n openshift-kube-apiserver -oyaml | grep -A10 'alert:'
    - alert: APIRemovedInNextReleaseInUse
      annotations:
        message: Deprecated API that will be removed in the next version is being used.
          Removing the workload that is using the {{ $labels.group }}.{{ $labels.version }}/{{ $labels.resource }} API
          might be necessary for a successful upgrade to the next cluster version. Refer to
          `oc get apirequestcounts {{ $labels.resource }}.{{ $labels.version }}.{{ $labels.group }} -o yaml`
          to identify the workload.
      expr: |
        group(apiserver_requested_deprecated_apis{removed_release="1.22"}) by (group,version,resource)
        and
        (sum by(group,version,resource) (rate(apiserver_request_total{system_client!="kube-controller-manager",system_client!="cluster-policy-controller"}[4h]))) > 0
      for: 1h
--
    - alert: APIRemovedInNextEUSReleaseInUse
      annotations:
        message: Deprecated API that will be removed in the next EUS version is being used.
          Removing the workload that is using the {{ $labels.group }}.{{ $labels.version }}/{{ $labels.resource }} API
          might be necessary for a successful upgrade to the next EUS cluster version. Refer to
          `oc get apirequestcounts {{ $labels.resource }}.{{ $labels.version }}.{{ $labels.group }} -o yaml`
          to identify the workload.
      expr: |
        group(apiserver_requested_deprecated_apis{removed_release=~"1\\.2[123]"}) by (group,version,resource)
        and
        (sum by(group,version,resource) (rate(apiserver_request_total{system_client!="kube-controller-manager",system_client!="cluster-policy-controller"}[4h]))) > 0
      for: 1h
--
    - alert: HighOverallControlPlaneCPU
      annotations:
        message: Given three control plane nodes, the overall CPU utilization may only be about
          2/3 of all available capacity. This is because if a single control plane node fails,
          the remaining two must handle the load of the cluster in order to be HA. If the
          cluster is using more than 2/3 of all capacity, if one control plane node fails, the
          remaining two are likely to fail when they take the load. To fix this, increase the
          CPU and memory on your control plane nodes.
        summary: CPU utilization across all three control plane nodes is higher than two
          control plane nodes can sustain; a single control plane node
--
    - alert: ExtremelyHighIndividualControlPlaneCPU
      annotations:
        message: Extreme CPU pressure can cause slow serialization and poor performance from
          the kube-apiserver and etcd. When this happens, there is a risk of clients seeing
          non-responsive API requests which are issued again causing even more CPU pressure.
          It can also cause failing liveness probes due to slow etcd responsiveness on the
          backend. If one kube-apiserver fails under this condition, chances are you will
          experience a cascade as the remaining kube-apiservers are also under-provisioned.
          To fix this, increase the CPU and memory on your control plane nodes.
        summary: CPU utilization on a single control plane node is very high, more
--

Expected results:
Alert rules have "summary" and "description" annotations.

Additional info:
The "summary" and "description" annotations should comply with the OpenShift alerting guidelines [1].

[1] https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#documentation-required
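For illustration, a guideline-compliant version of the first rule would carry the detailed text in a "description" annotation and add a short "summary", instead of using "message". A minimal sketch (the exact wording is illustrative, not the shipped text):

    - alert: APIRemovedInNextReleaseInUse
      annotations:
        summary: Deprecated API that will be removed in the next version is being used.
        description: Deprecated API that will be removed in the next version is being used.
          Removing the workload that is using the {{ $labels.group }}.{{ $labels.version }}/{{ $labels.resource }} API
          might be necessary for a successful upgrade to the next cluster version. Refer to
          `oc get apirequestcounts {{ $labels.resource }}.{{ $labels.version }}.{{ $labels.group }} -o yaml`
          to identify the workload.

Per the guidelines, "summary" is a one-line statement of the problem and "description" carries the longer explanation and remediation steps.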
The following rule has an issue as well:

$ oc get prometheusrules -n openshift-kube-apiserver-operator -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      exclude.release.openshift.io/internal-openshift-hosted: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:11:59Z"
    generation: 1
    name: kube-apiserver-operator
    namespace: openshift-kube-apiserver-operator
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "1674"
    uid: ca725633-cd62-4d9e-a9f6-c9f4b260e98d
  spec:
    groups:
    - name: cluster-version
      rules:
      - alert: TechPreviewNoUpgrade
        annotations:
          message: Cluster has enabled tech preview features that will prevent upgrades.
        expr: |
          cluster_feature_set{name!="", namespace="openshift-kube-apiserver-operator"} == 0
        for: 10m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
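The fix here follows the same pattern: replace "message" with a short "summary" and a longer "description". A sketch of the expected shape (wording illustrative):

      - alert: TechPreviewNoUpgrade
        annotations:
          summary: Cluster has enabled tech preview features that will prevent upgrades.
          description: Cluster has enabled Technology Preview features that cannot be undone
            and will prevent upgrades. The TechPreviewNoUpgrade feature set is not recommended
            on production clusters.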
You can find more of these like this:

$ oc -n <namespace> get PrometheusRule -o json | \
    jq '.items[]|{namespace: .metadata.namespace,PrometheusRule: "\(.metadata.namespace)/\(.metadata.name)",alert: (..|objects|select(has("alert"))|select(.annotations|(has("description") and has("summary"))|not)|{name:.alert,summary: .annotations|has("summary"),description: .annotations|has("description"),message: .annotations|has("message")})}'

Use either -n with a specific namespace or --all-namespaces. Only the rules in the openshift-kube-apiserver and openshift-kube-apiserver-operator namespaces are in scope for this bug.
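The filter emits one object per alert that lacks either annotation, and prints nothing for rules where every alert has both. For example, against the openshift-kube-apiserver-operator namespace above, the output should look roughly like this (shape derived from the rule shown in the previous comment):

{
  "namespace": "openshift-kube-apiserver-operator",
  "PrometheusRule": "openshift-kube-apiserver-operator/kube-apiserver-operator",
  "alert": {
    "name": "TechPreviewNoUpgrade",
    "summary": false,
    "description": false,
    "message": true
  }
}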
Tested with PR
Ignore comment #3; I put the wrong comments there.
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-09-21-181111   True        False         2m42s   Cluster version is 4.10.0-0.nightly-2021-09-21-181111

$ oc get prometheusrules -n openshift-kube-apiserver -oyaml | grep -A10 'alert:'
    - alert: APIRemovedInNextReleaseInUse
      annotations:
        description: Deprecated API that will be removed in the next version is being used.
          Removing the workload that is using the {{ $labels.group }}.{{ $labels.version }}/{{ $labels.resource }} API
          might be necessary for a successful upgrade to the next cluster version. Refer to
          `oc get apirequestcounts {{ $labels.resource }}.{{ $labels.version }}.{{ $labels.group }} -o yaml`
          to identify the workload.
        summary: Deprecated API that will be removed in the next version is being used.
      expr: |
--
    - alert: APIRemovedInNextEUSReleaseInUse
      annotations:
        description: Deprecated API that will be removed in the next EUS version is being used.
          Removing the workload that is using the {{ $labels.group }}.{{ $labels.version }}/{{ $labels.resource }} API
          might be necessary for a successful upgrade to the next EUS cluster version. Refer to
          `oc get apirequestcounts {{ $labels.resource }}.{{ $labels.version }}.{{ $labels.group }} -o yaml`
          to identify the workload.
        summary: Deprecated API that will be removed in the next EUS version is being used.
      expr: |
        ...
--
    - alert: HighOverallControlPlaneCPU
      annotations:
        description: Given three control plane nodes, the overall CPU utilization may only be
          about 2/3 of all available capacity. This is because if a single control plane node
          fails, the remaining two must handle the load of the cluster in order to be HA. If
          the cluster is using more than 2/3 of all capacity, if one control plane node fails,
          the remaining two are likely to fail when they take the load. To fix this, increase
          the CPU and memory on your control plane nodes.
        summary: CPU utilization across all three control plane nodes is higher than two
          control plane nodes can sustain; a single control plane node
--
    - alert: ExtremelyHighIndividualControlPlaneCPU
      annotations:
        description: Extreme CPU pressure can cause slow serialization and poor performance
          from the kube-apiserver and etcd. When this happens, there is a risk of clients
          seeing non-responsive API requests which are issued again causing even more CPU
          pressure. It can also cause failing liveness probes due to slow etcd responsiveness
          on the backend. If one kube-apiserver fails under this condition, chances are you
          will experience a cascade as the remaining kube-apiservers are also under-provisioned.
          To fix this, increase the CPU and memory on your control plane nodes.
        summary: CPU utilization on a single control plane node is very high, more
--

$ oc get prometheusrules -n openshift-kube-apiserver-operator -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    ...
    name: kube-apiserver-operator
    namespace: openshift-kube-apiserver-operator
    ...
      rules:
      - alert: TechPreviewNoUpgrade
        annotations:
          description: Cluster has enabled Technology Preview features that cannot be undone
            and will prevent upgrades. The TechPreviewNoUpgrade feature set is not recommended
            on production clusters.
          summary: Cluster has enabled tech preview features that will prevent upgrades.
        expr: |
          cluster_feature_set{name!="", namespace="openshift-kube-apiserver-operator"} == 0
        for: 10m
  ...

Based on the above results, the bug is fixed. Moving the bug to VERIFIED.
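As an extra check, one could re-run the jq filter from comment #3 against the two namespaces; on a fixed cluster it should print nothing, since every alert now carries both annotations:

$ oc -n openshift-kube-apiserver get PrometheusRule -o json | \
    jq '.items[]|{namespace: .metadata.namespace,PrometheusRule: "\(.metadata.namespace)/\(.metadata.name)",alert: (..|objects|select(has("alert"))|select(.annotations|(has("description") and has("summary"))|not)|{name:.alert,summary: .annotations|has("summary"),description: .annotations|has("description"),message: .annotations|has("message")})}'
(empty output expected)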
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056
This comment was flagged as spam; view the edit history to see the original text if required.