Bug 2000490
| Summary: | All critical alerts shipped by CMO should have links to a runbook | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Simon Pasquier <spasquie> |
| Component: | Monitoring | Assignee: | Brad Ison <brad.ison> |
| Status: | CLOSED ERRATA | QA Contact: | hongyan li <hongyli> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | CC: | amuller, anpicker, aos-bugs, erooth, hongyli, juzhao |
| Version: | 4.9 | Target Milestone: | --- |
| Target Release: | 4.10.0 | Hardware: | Unspecified |
| OS: | Unspecified | Whiteboard: | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Cloned To: | 2013148 (view as bug list) |
| Environment: | | Last Closed: | 2022-03-10 16:07:01 UTC |
| Type: | Bug | Regression: | --- |
| Mount Type: | --- | Documentation: | --- |
| CRM: | | Verified Versions: | |
| Category: | --- | oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | | Cloudforms Team: | --- |
| Target Upstream Version: | | Embargoed: | |
| Bug Depends On: | | Bug Blocks: | 2013148, 2014512 |
Description
Simon Pasquier
2021-09-02 09:22:48 UTC
Tested with payload 4.9.0-0.nightly-2021-09-05-192114.
The following critical alert rules have no runbook_url, but they are not shipped by the monitoring component, so they do not block closing this bug.
$ oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -oyaml|grep -B15 'severity: critical'
rules:
- alert: ClusterVersionOperatorDown
annotations:
description: The operator may be down or disabled. The cluster will not be kept
up to date and upgrades will not be possible. Inspect the openshift-cluster-version
namespace for events or changes to the cluster-version-operator deployment
or pods to diagnose and repair. {{ with $console_url := "console_url" | query
}}{{ if ne (len (label "url" (first $console_url ) ) ) 0}} For more information
refer to {{ label "url" (first $console_url ) }}/k8s/cluster/projects/openshift-cluster-version.{{
end }}{{ end }}
summary: Cluster version operator has disappeared from Prometheus target discovery.
expr: |
absent(up{job="cluster-version-operator"} == 1)
for: 10m
labels:
severity: critical
- alert: ClusterOperatorDown
annotations:
description: The {{ $labels.name }} operator may be down or disabled, and the
components it manages may be unavailable or degraded. Cluster upgrades may
not complete. For more information refer to 'oc get -o yaml clusteroperator
{{ $labels.name }}'{{ with $console_url := "console_url" | query }}{{ if ne
(len (label "url" (first $console_url ) ) ) 0}} or {{ label "url" (first $console_url
) }}/settings/cluster/{{ end }}{{ end }}.
summary: Cluster operator has not been available for 10 minutes.
expr: |
cluster_operator_up{job="cluster-version-operator"} == 0
for: 10m
labels:
severity: critical
- name: cluster-version
rules:
- alert: KubeControllerManagerDown
annotations:
message: KubeControllerManager has disappeared from Prometheus target discovery.
expr: |
absent(up{job="kube-controller-manager"} == 1)
for: 15m
labels:
severity: critical
- alert: HAProxyDown
annotations:
message: HAProxy metrics are reporting that HAProxy is down on pod {{ $labels.namespace
}} / {{ $labels.pod }}
expr: haproxy_up == 0
for: 5m
labels:
severity: critical
--
message: Extreme CPU pressure can cause slow serialization and poor performance
from the kube-apiserver and etcd. When this happens, there is a risk of clients
seeing non-responsive API requests which are issued again causing even more
CPU pressure. It can also cause failing liveness probes due to slow etcd responsiveness
on the backend. If one kube-apiserver fails under this condition, chances
are you will experience a cascade as the remaining kube-apiservers are also
under-provisioned. To fix this, increase the CPU and memory on your control
plane nodes.
summary: CPU utilization on a single control plane node is very high, more CPU
pressure is likely to cause a failover; increase available CPU.
expr: |
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) > 90 AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )
for: 5m
labels:
namespace: openshift-kube-apiserver
severity: critical
- alert: PodDisruptionBudgetLimit
annotations:
message: The pod disruption budget is below the minimum number allowed pods.
expr: |
max by (namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_current_healthy < kube_poddisruptionbudget_status_desired_healthy)
for: 15m
labels:
severity: critical
- name: cluster-version
rules:
- alert: KubeSchedulerDown
annotations:
message: KubeScheduler has disappeared from Prometheus target discovery.
expr: |
absent(up{job="scheduler"} == 1)
for: 15m
labels:
severity: critical
- name: machine-api-operator-metrics-collector-up
rules:
- alert: MachineAPIOperatorMetricsCollectionFailing
annotations:
message: 'machine api operator metrics collection is failing. For more details: oc
logs <machine-api-operator-pod-name> -n openshift-machine-api'
expr: |
mapi_mao_collector_up == 0
for: 5m
labels:
severity: critical
- name: mcd-reboot-error
rules:
- alert: MCDRebootError
annotations:
message: Reboot failed on {{ $labels.node }} , update may be blocked
expr: |
mcd_reboot_err > 0
labels:
severity: critical
- alert: KubeStateMetricsShardingMismatch
annotations:
description: kube-state-metrics pods are running with different --total-shards
configuration, some Kubernetes objects may be exposed multiple times or not
exposed at all.
summary: kube-state-metrics sharding is misconfigured.
expr: |
stdvar (kube_state_metrics_total_shards{job="kube-state-metrics"}) != 0
for: 15m
labels:
severity: critical
- alert: KubeStateMetricsShardsMissing
annotations:
description: kube-state-metrics shards are missing, some Kubernetes objects
are not being exposed.
summary: kube-state-metrics shards are missing.
expr: |
2^max(kube_state_metrics_total_shards{job="kube-state-metrics"}) - 1
-
sum( 2 ^ max by (shard_ordinal) (kube_state_metrics_shard_ordinal{job="kube-state-metrics"}) )
!= 0
for: 15m
labels:
severity: critical
The following alerts shipped by monitoring have no runbook_url:
$ oc -n openshift-monitoring get prometheusrule kube-state-metrics-rules -oyaml|grep -A10 -E 'KubeStateMetricsShardingMismatch|KubeStateMetricsShardsMissing'
- alert: KubeStateMetricsShardingMismatch
annotations:
description: kube-state-metrics pods are running with different --total-shards
configuration, some Kubernetes objects may be exposed multiple times or
not exposed at all.
summary: kube-state-metrics sharding is misconfigured.
expr: |
stdvar (kube_state_metrics_total_shards{job="kube-state-metrics"}) != 0
for: 15m
labels:
severity: critical
- alert: KubeStateMetricsShardsMissing
annotations:
description: kube-state-metrics shards are missing, some Kubernetes objects
are not being exposed.
summary: kube-state-metrics shards are missing.
expr: |
2^max(kube_state_metrics_total_shards{job="kube-state-metrics"}) - 1
-
sum( 2 ^ max by (shard_ordinal) (kube_state_metrics_shard_ordinal{job="kube-state-metrics"}) )
!= 0
for: 15m
The 4.9 release branch has been cut, so the target release should be 4.10.0. We can backport to 4.9 later, though.

I've opened a PR against the master branch, which will become 4.10, that drops these two alerts. We don't configure sharding, so there's no reason to maintain the alerts, runbooks, etc.: https://github.com/openshift/cluster-monitoring-operator/pull/1366

Tested with 4.10.0-0.nightly-2021-09-14-011939: the KubeStateMetricsShardingMismatch and KubeStateMetricsShardsMissing alerts are removed, and the other critical alerts shipped by CMO have a runbook_url field.

# oc -n openshift-monitoring get prometheusrule kube-state-metrics-rules -oyaml|grep -A10 -E 'KubeStateMetricsShardingMismatch|KubeStateMetricsShardsMissing'

No result.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056