Description of problem:
all the alert rules' annotations "summary" and "description" should comply with the OpenShift alerting guidelines

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-07-175228

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:
$ oc get prometheusrules -n openshift-kube-apiserver -oyaml|grep -A10 'alert:'
  - alert: APIRemovedInNextReleaseInUse
    annotations:
      message: Deprecated API that will be removed in the next version is being used. Removing the workload that is using the {{ $labels.group }}.{{ $labels.version }}/{{ $labels.resource }} API might be necessary for a successful upgrade to the next cluster version. Refer to `oc get apirequestcounts {{ $labels.resource }}.{{ $labels.version }}.{{ $labels.group }} -o yaml` to identify the workload.
    expr: |
      group(apiserver_requested_deprecated_apis{removed_release="1.22"}) by (group,version,resource) and (sum by(group,version,resource) (rate(apiserver_request_total{system_client!="kube-controller-manager",system_client!="cluster-policy-controller"}[4h]))) > 0
    for: 1h
--
  - alert: APIRemovedInNextEUSReleaseInUse
    annotations:
      message: Deprecated API that will be removed in the next EUS version is being used. Removing the workload that is using the {{ $labels.group }}.{{ $labels.version }}/{{ $labels.resource }} API might be necessary for a successful upgrade to the next EUS cluster version. Refer to `oc get apirequestcounts {{ $labels.resource }}.{{ $labels.version }}.{{ $labels.group }} -o yaml` to identify the workload.
    expr: |
      group(apiserver_requested_deprecated_apis{removed_release=~"1\\.2[123]"}) by (group,version,resource) and (sum by(group,version,resource) (rate(apiserver_request_total{system_client!="kube-controller-manager",system_client!="cluster-policy-controller"}[4h]))) > 0
    for: 1h
--
  - alert: HighOverallControlPlaneCPU
    annotations:
      message: Given three control plane nodes, the overall CPU utilization may only be about 2/3 of all available capacity. This is because if a single control plane node fails, the remaining two must handle the load of the cluster in order to be HA. If the cluster is using more than 2/3 of all capacity, if one control plane node fails, the remaining two are likely to fail when they take the load. To fix this, increase the CPU and memory on your control plane nodes.
      summary: CPU utilization across all three control plane nodes is higher than two control plane nodes can sustain; a single control plane node
--
  - alert: ExtremelyHighIndividualControlPlaneCPU
    annotations:
      message: Extreme CPU pressure can cause slow serialization and poor performance from the kube-apiserver and etcd. When this happens, there is a risk of clients seeing non-responsive API requests which are issued again causing even more CPU pressure. It can also cause failing liveness probes due to slow etcd responsiveness on the backend. If one kube-apiserver fails under this condition, chances are you will experience a cascade as the remaining kube-apiservers are also under-provisioned. To fix this, increase the CPU and memory on your control plane nodes.
      summary: CPU utilization on a single control plane node is very high, more
--

Expected results:
alert rules have annotations "summary" and "description"

Additional info:
the "summary" and "description" annotations comply with the OpenShift alerting guidelines [1]

[1] https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#documentation-required
The following rule also has this issue:

$ oc get prometheusrules -n openshift-kube-apiserver-operator -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      exclude.release.openshift.io/internal-openshift-hosted: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:11:59Z"
    generation: 1
    name: kube-apiserver-operator
    namespace: openshift-kube-apiserver-operator
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "1674"
    uid: ca725633-cd62-4d9e-a9f6-c9f4b260e98d
  spec:
    groups:
    - name: cluster-version
      rules:
      - alert: TechPreviewNoUpgrade
        annotations:
          message: Cluster has enabled tech preview features that will prevent upgrades.
        expr: |
          cluster_feature_set{name!="", namespace="openshift-kube-apiserver-operator"} == 0
        for: 10m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
You can find more of these like this:

oc -n namespace get PrometheusRule -o json | \
  jq '.items[]|{namespace: .metadata.namespace,PrometheusRule: "\(.metadata.namespace)/\(.metadata.name)",alert: (..|objects|select(has("alert"))|select(.annotations|(has("description") and has("summary"))|not)|{name:.alert,summary: .annotations|has("summary"),description: .annotations|has("description"),message: .annotations|has("message")})}'

Use either -n with a specific namespace or --all-namespaces.

Only the alerts in the openshift-kube-apiserver and openshift-kube-apiserver-operator namespaces are in scope for this bug.
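For anyone who would rather script the check than read the jq filter, here is a rough Python sketch of the same idea: report every alert that does not carry both "summary" and "description". The function name find_noncompliant_alerts and the SAMPLE document are made up for illustration (SAMPLE is an abridged copy of the TechPreviewNoUpgrade rule quoted above). Unlike the jq filter, which searches recursively for any object with an "alert" key, this sketch assumes the usual spec.groups[].rules[] layout.

```python
import json
from typing import Any, Dict, Iterator


def find_noncompliant_alerts(rule_list: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one record per alert lacking a summary or description annotation.

    rule_list is the parsed JSON of `oc get PrometheusRule -o json` (a List).
    """
    for item in rule_list.get("items", []):
        meta = item.get("metadata", {})
        namespace = meta.get("namespace", "")
        for group in item.get("spec", {}).get("groups", []):
            for rule in group.get("rules", []):
                if "alert" not in rule:
                    continue  # recording rules have no alert name; skip them
                annotations = rule.get("annotations") or {}
                if "summary" in annotations and "description" in annotations:
                    continue  # both required annotations are present
                yield {
                    "namespace": namespace,
                    "PrometheusRule": f"{namespace}/{meta.get('name', '')}",
                    "alert": rule["alert"],
                    "summary": "summary" in annotations,
                    "description": "description" in annotations,
                    "message": "message" in annotations,
                }


# Hypothetical sample, abridged from the TechPreviewNoUpgrade rule in this bug.
SAMPLE = {
    "items": [{
        "metadata": {
            "name": "kube-apiserver-operator",
            "namespace": "openshift-kube-apiserver-operator",
        },
        "spec": {"groups": [{"name": "cluster-version", "rules": [{
            "alert": "TechPreviewNoUpgrade",
            "annotations": {
                "message": "Cluster has enabled tech preview features that will prevent upgrades.",
            },
        }]}]},
    }],
}

for finding in find_noncompliant_alerts(SAMPLE):
    print(json.dumps(finding))
```

To run it against a live cluster, the output of `oc get PrometheusRule --all-namespaces -o json` could be saved to a file, loaded with json.load, and passed to the function in place of SAMPLE.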
Tested with PR
Ignore #C3; that comment was posted here by mistake.
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-09-21-181111   True        False         2m42s   Cluster version is 4.10.0-0.nightly-2021-09-21-181111

$ oc get prometheusrules -n openshift-kube-apiserver -oyaml|grep -A10 'alert:'
  - alert: APIRemovedInNextReleaseInUse
    annotations:
      description: Deprecated API that will be removed in the next version is being used. Removing the workload that is using the {{ $labels.group }}.{{ $labels.version }}/{{ $labels.resource }} API might be necessary for a successful upgrade to the next cluster version. Refer to `oc get apirequestcounts {{ $labels.resource }}.{{ $labels.version }}.{{ $labels.group }} -o yaml` to identify the workload.
      summary: Deprecated API that will be removed in the next version is being used.
    expr: |
--
  - alert: APIRemovedInNextEUSReleaseInUse
    annotations:
      description: Deprecated API that will be removed in the next EUS version is being used. Removing the workload that is using the {{ $labels.group }}.{{ $labels.version }}/{{ $labels.resource }} API might be necessary for a successful upgrade to the next EUS cluster version. Refer to `oc get apirequestcounts {{ $labels.resource }}.{{ $labels.version }}.{{ $labels.group }} -o yaml` to identify the workload.
      summary: Deprecated API that will be removed in the next EUS version is being used.
    expr: |
      ...
--
  - alert: HighOverallControlPlaneCPU
    annotations:
      description: Given three control plane nodes, the overall CPU utilization may only be about 2/3 of all available capacity. This is because if a single control plane node fails, the remaining two must handle the load of the cluster in order to be HA. If the cluster is using more than 2/3 of all capacity, if one control plane node fails, the remaining two are likely to fail when they take the load. To fix this, increase the CPU and memory on your control plane nodes.
      summary: CPU utilization across all three control plane nodes is higher than two control plane nodes can sustain; a single control plane node
--
  - alert: ExtremelyHighIndividualControlPlaneCPU
    annotations:
      description: Extreme CPU pressure can cause slow serialization and poor performance from the kube-apiserver and etcd. When this happens, there is a risk of clients seeing non-responsive API requests which are issued again causing even more CPU pressure. It can also cause failing liveness probes due to slow etcd responsiveness on the backend. If one kube-apiserver fails under this condition, chances are you will experience a cascade as the remaining kube-apiservers are also under-provisioned. To fix this, increase the CPU and memory on your control plane nodes.
      summary: CPU utilization on a single control plane node is very high, more
--

$ oc get prometheusrules -n openshift-kube-apiserver-operator -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    ...
    name: kube-apiserver-operator
    namespace: openshift-kube-apiserver-operator
  ...
      rules:
      - alert: TechPreviewNoUpgrade
        annotations:
          description: Cluster has enabled Technology Preview features that cannot be undone and will prevent upgrades. The TechPreviewNoUpgrade feature set is not recommended on production clusters.
          summary: Cluster has enabled tech preview features that will prevent upgrades.
        expr: |
          cluster_feature_set{name!="", namespace="openshift-kube-apiserver-operator"} == 0
        for: 10m
...

Based on the above results, the bug is fixed: every alert in both namespaces now carries "summary" and "description" annotations. Moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056