Bug 2000490
| Summary: | All critical alerts shipped by CMO should have links to a runbook | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Simon Pasquier <spasquie> |
| Component: | Monitoring | Assignee: | Brad Ison <brad.ison> |
| Status: | CLOSED ERRATA | QA Contact: | hongyan li <hongyli> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | CC: | amuller, anpicker, aos-bugs, erooth, hongyli, juzhao |
| Version: | 4.9 | Target Milestone: | --- |
| Target Release: | 4.10.0 | Hardware: | Unspecified |
| OS: | Unspecified | Whiteboard: | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Cloned To: | 2013148 (view as bug list) |
| Environment: | | Last Closed: | 2022-03-10 16:07:01 UTC |
| Type: | Bug | Regression: | --- |
| Mount Type: | --- | Documentation: | --- |
| CRM: | | Verified Versions: | |
| Category: | --- | oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | | Cloudforms Team: | --- |
| Target Upstream Version: | | Embargoed: | |
| Bug Depends On: | | Bug Blocks: | 2013148, 2014512 |
Description
Simon Pasquier
2021-09-02 09:22:48 UTC
Tested with payload 4.9.0-0.nightly-2021-09-05-192114.
The following critical alert rules have no runbook_url, but they are not shipped by the monitoring component, so they do not block closing this bug.
$ oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -oyaml|grep -B15 'severity: critical'
rules:
- alert: ClusterVersionOperatorDown
annotations:
description: The operator may be down or disabled. The cluster will not be kept
up to date and upgrades will not be possible. Inspect the openshift-cluster-version
namespace for events or changes to the cluster-version-operator deployment
or pods to diagnose and repair. {{ with $console_url := "console_url" | query
}}{{ if ne (len (label "url" (first $console_url ) ) ) 0}} For more information
refer to {{ label "url" (first $console_url ) }}/k8s/cluster/projects/openshift-cluster-version.{{
end }}{{ end }}
summary: Cluster version operator has disappeared from Prometheus target discovery.
expr: |
absent(up{job="cluster-version-operator"} == 1)
for: 10m
labels:
severity: critical
- alert: ClusterOperatorDown
annotations:
description: The {{ $labels.name }} operator may be down or disabled, and the
components it manages may be unavailable or degraded. Cluster upgrades may
not complete. For more information refer to 'oc get -o yaml clusteroperator
{{ $labels.name }}'{{ with $console_url := "console_url" | query }}{{ if ne
(len (label "url" (first $console_url ) ) ) 0}} or {{ label "url" (first $console_url
) }}/settings/cluster/{{ end }}{{ end }}.
summary: Cluster operator has not been available for 10 minutes.
expr: |
cluster_operator_up{job="cluster-version-operator"} == 0
for: 10m
labels:
severity: critical
- name: cluster-version
rules:
- alert: KubeControllerManagerDown
annotations:
message: KubeControllerManager has disappeared from Prometheus target discovery.
expr: |
absent(up{job="kube-controller-manager"} == 1)
for: 15m
labels:
severity: critical
- alert: HAProxyDown
annotations:
message: HAProxy metrics are reporting that HAProxy is down on pod {{ $labels.namespace
}} / {{ $labels.pod }}
expr: haproxy_up == 0
for: 5m
labels:
severity: critical
--
message: Extreme CPU pressure can cause slow serialization and poor performance
from the kube-apiserver and etcd. When this happens, there is a risk of clients
seeing non-responsive API requests which are issued again causing even more
CPU pressure. It can also cause failing liveness probes due to slow etcd responsiveness
on the backend. If one kube-apiserver fails under this condition, chances
are you will experience a cascade as the remaining kube-apiservers are also
under-provisioned. To fix this, increase the CPU and memory on your control
plane nodes.
summary: CPU utilization on a single control plane node is very high, more CPU
pressure is likely to cause a failover; increase available CPU.
expr: |
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) > 90 AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )
for: 5m
labels:
namespace: openshift-kube-apiserver
severity: critical
- alert: PodDisruptionBudgetLimit
annotations:
message: The pod disruption budget is below the minimum number allowed pods.
expr: |
max by (namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_current_healthy < kube_poddisruptionbudget_status_desired_healthy)
for: 15m
labels:
severity: critical
- name: cluster-version
rules:
- alert: KubeSchedulerDown
annotations:
message: KubeScheduler has disappeared from Prometheus target discovery.
expr: |
absent(up{job="scheduler"} == 1)
for: 15m
labels:
severity: critical
- name: machine-api-operator-metrics-collector-up
rules:
- alert: MachineAPIOperatorMetricsCollectionFailing
annotations:
message: 'machine api operator metrics collection is failing. For more details: oc
logs <machine-api-operator-pod-name> -n openshift-machine-api'
expr: |
mapi_mao_collector_up == 0
for: 5m
labels:
severity: critical
- name: mcd-reboot-error
rules:
- alert: MCDRebootError
annotations:
message: Reboot failed on {{ $labels.node }} , update may be blocked
expr: |
mcd_reboot_err > 0
labels:
severity: critical
- alert: KubeStateMetricsShardingMismatch
annotations:
description: kube-state-metrics pods are running with different --total-shards
configuration, some Kubernetes objects may be exposed multiple times or not
exposed at all.
summary: kube-state-metrics sharding is misconfigured.
expr: |
stdvar (kube_state_metrics_total_shards{job="kube-state-metrics"}) != 0
for: 15m
labels:
severity: critical
- alert: KubeStateMetricsShardsMissing
annotations:
description: kube-state-metrics shards are missing, some Kubernetes objects
are not being exposed.
summary: kube-state-metrics shards are missing.
expr: |
2^max(kube_state_metrics_total_shards{job="kube-state-metrics"}) - 1
-
sum( 2 ^ max by (shard_ordinal) (kube_state_metrics_shard_ordinal{job="kube-state-metrics"}) )
!= 0
for: 15m
labels:
severity: critical
The following alerts shipped by monitoring have no runbook_url:
$ oc -n openshift-monitoring get prometheusrule kube-state-metrics-rules -oyaml|grep -A10 -E 'KubeStateMetricsShardingMismatch|KubeStateMetricsShardsMissing'
- alert: KubeStateMetricsShardingMismatch
annotations:
description: kube-state-metrics pods are running with different --total-shards
configuration, some Kubernetes objects may be exposed multiple times or
not exposed at all.
summary: kube-state-metrics sharding is misconfigured.
expr: |
stdvar (kube_state_metrics_total_shards{job="kube-state-metrics"}) != 0
for: 15m
labels:
severity: critical
- alert: KubeStateMetricsShardsMissing
annotations:
description: kube-state-metrics shards are missing, some Kubernetes objects
are not being exposed.
summary: kube-state-metrics shards are missing.
expr: |
2^max(kube_state_metrics_total_shards{job="kube-state-metrics"}) - 1
-
sum( 2 ^ max by (shard_ordinal) (kube_state_metrics_shard_ordinal{job="kube-state-metrics"}) )
!= 0
for: 15m
The 4.9 release branch has been cut, so the target release should be 4.10.0. We can backport to 4.9 later, though.

I've opened a PR against the master branch, which will become 4.10, that drops these two alerts. We don't configure sharding, so there's no reason to maintain the alerts, runbooks, etc.: https://github.com/openshift/cluster-monitoring-operator/pull/1366

Tested with 4.10.0-0.nightly-2021-09-14-011939: the KubeStateMetricsShardingMismatch and KubeStateMetricsShardsMissing alerts are removed, and the other critical alerts shipped by CMO have a runbook_url field.

# oc -n openshift-monitoring get prometheusrule kube-state-metrics-rules -oyaml|grep -A10 -E 'KubeStateMetricsShardingMismatch|KubeStateMetricsShardsMissing'

No result.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056