Description of problem:

All critical alerts shipped as part of OpenShift need a proper runbook in [1], and a "runbook_url" annotation should be present in the alert definition as per [2].

[1] https://github.com/openshift/runbooks
[2] https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#documentation-required

Version-Release number of selected component (if applicable):
4.9

How reproducible:
Always

Steps to Reproduce:
1. Look for all alerting rules with severity=critical shipped by cluster-monitoring-operator.

Actual results:
"runbook_url" annotation links are missing.

Expected results:
All critical alerts have a proper "runbook_url" annotation.

Additional info:
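For reference, the requirement in [2] expects each critical alert to carry a runbook_url annotation pointing at the matching page in [1]. A minimal sketch of what a compliant rule looks like (the alert name, expression, and runbook path below are illustrative only, not an actual shipped rule):

# Illustrative only: the alert name and runbook path are hypothetical; the exact
# path follows whatever layout the openshift/runbooks repository uses.
- alert: ExampleOperatorDown
  annotations:
    summary: Example operator has disappeared from Prometheus target discovery.
    description: The example operator may be down or disabled.
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/ExampleOperatorDown.md
  expr: |
    absent(up{job="example-operator"} == 1)
  for: 15m
  labels:
    severity: critical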
Tested with payload 4.9.0-0.nightly-2021-09-05-192114.

The following critical alert rules have no runbook_url; those that are not shipped by monitoring are out of scope for this bug.

$ oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -oyaml | grep -B15 'severity: critical'
rules:
- alert: ClusterVersionOperatorDown
  annotations:
    description: The operator may be down or disabled. The cluster will not be kept up to date and upgrades will not be possible. Inspect the openshift-cluster-version namespace for events or changes to the cluster-version-operator deployment or pods to diagnose and repair. {{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url ) ) ) 0}} For more information refer to {{ label "url" (first $console_url ) }}/k8s/cluster/projects/openshift-cluster-version.{{ end }}{{ end }}
    summary: Cluster version operator has disappeared from Prometheus target discovery.
  expr: |
    absent(up{job="cluster-version-operator"} == 1)
  for: 10m
  labels:
    severity: critical
- alert: ClusterOperatorDown
  annotations:
    description: The {{ $labels.name }} operator may be down or disabled, and the components it manages may be unavailable or degraded. Cluster upgrades may not complete. For more information refer to 'oc get -o yaml clusteroperator {{ $labels.name }}'{{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url ) ) ) 0}} or {{ label "url" (first $console_url ) }}/settings/cluster/{{ end }}{{ end }}.
    summary: Cluster operator has not been available for 10 minutes.
  expr: |
    cluster_operator_up{job="cluster-version-operator"} == 0
  for: 10m
  labels:
    severity: critical
- name: cluster-version
  rules:
  - alert: KubeControllerManagerDown
    annotations:
      message: KubeControllerManager has disappeared from Prometheus target discovery.
    expr: |
      absent(up{job="kube-controller-manager"} == 1)
    for: 15m
    labels:
      severity: critical
- alert: HAProxyDown
  annotations:
    message: HAProxy metrics are reporting that HAProxy is down on pod {{ $labels.namespace }} / {{ $labels.pod }}
  expr: haproxy_up == 0
  for: 5m
  labels:
    severity: critical
--
    message: Extreme CPU pressure can cause slow serialization and poor performance from the kube-apiserver and etcd. When this happens, there is a risk of clients seeing non-responsive API requests which are issued again causing even more CPU pressure. It can also cause failing liveness probes due to slow etcd responsiveness on the backend. If one kube-apiserver fails under this condition, chances are you will experience a cascade as the remaining kube-apiservers are also under-provisioned. To fix this, increase the CPU and memory on your control plane nodes.
    summary: CPU utilization on a single control plane node is very high, more CPU pressure is likely to cause a failover; increase available CPU.
  expr: |
    100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) > 90 AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )
  for: 5m
  labels:
    namespace: openshift-kube-apiserver
    severity: critical
- alert: PodDisruptionBudgetLimit
  annotations:
    message: The pod disruption budget is below the minimum number allowed pods.
  expr: |
    max by (namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_current_healthy < kube_poddisruptionbudget_status_desired_healthy)
  for: 15m
  labels:
    severity: critical
- name: cluster-version
  rules:
  - alert: KubeSchedulerDown
    annotations:
      message: KubeScheduler has disappeared from Prometheus target discovery.
    expr: |
      absent(up{job="scheduler"} == 1)
    for: 15m
    labels:
      severity: critical
- name: machine-api-operator-metrics-collector-up
  rules:
  - alert: MachineAPIOperatorMetricsCollectionFailing
    annotations:
      message: 'machine api operator metrics collection is failing. For more details: oc logs <machine-api-operator-pod-name> -n openshift-machine-api'
    expr: |
      mapi_mao_collector_up == 0
    for: 5m
    labels:
      severity: critical
- name: mcd-reboot-error
  rules:
  - alert: MCDRebootError
    annotations:
      message: Reboot failed on {{ $labels.node }} , update may be blocked
    expr: |
      mcd_reboot_err > 0
    labels:
      severity: critical
- alert: KubeStateMetricsShardingMismatch
  annotations:
    description: kube-state-metrics pods are running with different --total-shards configuration, some Kubernetes objects may be exposed multiple times or not exposed at all.
    summary: kube-state-metrics sharding is misconfigured.
  expr: |
    stdvar (kube_state_metrics_total_shards{job="kube-state-metrics"}) != 0
  for: 15m
  labels:
    severity: critical
- alert: KubeStateMetricsShardsMissing
  annotations:
    description: kube-state-metrics shards are missing, some Kubernetes objects are not being exposed.
    summary: kube-state-metrics shards are missing.
  expr: |
    2^max(kube_state_metrics_total_shards{job="kube-state-metrics"}) - 1 - sum( 2 ^ max by (shard_ordinal) (kube_state_metrics_shard_ordinal{job="kube-state-metrics"}) ) != 0
  for: 15m
  labels:
    severity: critical
The following alerts shipped by monitoring have no runbook_url:

$ oc -n openshift-monitoring get prometheusrule kube-state-metrics-rules -oyaml | grep -A10 -E 'KubeStateMetricsShardingMismatch|KubeStateMetricsShardsMissing'
    - alert: KubeStateMetricsShardingMismatch
      annotations:
        description: kube-state-metrics pods are running with different --total-shards configuration, some Kubernetes objects may be exposed multiple times or not exposed at all.
        summary: kube-state-metrics sharding is misconfigured.
      expr: |
        stdvar (kube_state_metrics_total_shards{job="kube-state-metrics"}) != 0
      for: 15m
      labels:
        severity: critical
    - alert: KubeStateMetricsShardsMissing
      annotations:
        description: kube-state-metrics shards are missing, some Kubernetes objects are not being exposed.
        summary: kube-state-metrics shards are missing.
      expr: |
        2^max(kube_state_metrics_total_shards{job="kube-state-metrics"}) - 1 - sum( 2 ^ max by (shard_ordinal) (kube_state_metrics_shard_ordinal{job="kube-state-metrics"}) ) != 0
      for: 15m
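For these two alerts, satisfying the requirement would mean adding runbook_url annotations along the following lines (sketch only; the runbook paths are hypothetical and assume per-alert pages exist in openshift/runbooks):

# Sketch only: hypothetical runbook paths, not actual repository contents.
- alert: KubeStateMetricsShardingMismatch
  annotations:
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/KubeStateMetricsShardingMismatch.md
- alert: KubeStateMetricsShardsMissing
  annotations:
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/KubeStateMetricsShardsMissing.md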
The 4.9 release branch has been cut, so the target release should be 4.10.0. We can backport to 4.9 later if needed.
I've opened a PR against the master branch (which will become 4.10) that drops these two alerts. We don't configure sharding, so there's no reason to maintain the alerts, runbooks, etc.

https://github.com/openshift/cluster-monitoring-operator/pull/1366
Tested with 4.10.0-0.nightly-2021-09-14-011939: the KubeStateMetricsShardingMismatch and KubeStateMetricsShardsMissing alerts are removed, and the other critical alerts shipped by CMO have a runbook_url annotation.

# oc -n openshift-monitoring get prometheusrule kube-state-metrics-rules -oyaml | grep -A10 -E 'KubeStateMetricsShardingMismatch|KubeStateMetricsShardsMissing'
(no output)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056