Bug 2001409 - All critical alerts should have links to a runbook [NEEDINFO]
Summary: All critical alerts should have links to a runbook
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 4.12.0
Assignee: Filip Krepinsky
QA Contact: zhou ying
URL:
Whiteboard: LifecycleStale
Depends On:
Blocks: 2114580
 
Reported: 2021-09-06 03:11 UTC by hongyan li
Modified: 2023-01-17 19:46 UTC
CC: 4 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: kube-controller-manager alerts (KubeControllerManagerDown, PodDisruptionBudgetAtLimit, PodDisruptionBudgetLimit, GarbageCollectorSyncFailed) now have links to GitHub runbooks. Reason: The runbooks help with understanding and debugging these alerts. * With this update, `kube-controller-manager` alerts (`KubeControllerManagerDown`, `PodDisruptionBudgetAtLimit`, `PodDisruptionBudgetLimit`, and `GarbageCollectorSyncFailed`) have links to GitHub runbooks. The runbooks help users to understand and debug these alerts. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2001409[*BZ#2001409*])
Clone Of:
Environment:
Last Closed: 2023-01-17 19:46:45 UTC
Target Upstream Version:
Embargoed:
mfojtik: needinfo?




Links
GitHub openshift/cluster-kube-controller-manager-operator pull 635 (open): Bug 2001409: add runbook urls to KCM-o alerts (last updated 2022-06-28 19:11:45 UTC)
GitHub openshift/runbooks pull 57 (Merged): Bug 2001409: Introduce runbooks for kube-controller-manager-operator alerts (last updated 2022-08-22 13:41:36 UTC)
Red Hat Product Errata RHSA-2022:7399 (last updated 2023-01-17 19:46:57 UTC)

Description hongyan li 2021-09-06 03:11:59 UTC
Description of problem:
All critical alerts shipped as part of OpenShift need a proper runbook in [1] and a "runbook_url" annotation should be present in the alert definition as per [2].

[1] https://github.com/openshift/runbooks
[2] https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#documentation-required
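Per the documentation requirement in [2], each critical alert should carry a "runbook_url" annotation pointing into [1]. A minimal sketch of a compliant rule, modeled on the PrometheusRule output shown below (the alert name, expression, and runbook path here are illustrative, not from this component):

```yaml
# Sketch of an alert definition meeting the documentation requirement.
# ExampleCriticalAlert and its runbook path are placeholders; real
# runbooks live under the alerts/ tree of openshift/runbooks.
- alert: ExampleCriticalAlert
  annotations:
    summary: Example target has disappeared from Prometheus target discovery.
    description: Longer explanation of the failure and its likely impact.
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/example/ExampleCriticalAlert.md
  expr: |
    absent(up{job="example"} == 1)
  for: 15m
  labels:
    severity: critical
```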

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-09-05-192114

How reproducible:
Always

Steps to Reproduce:
1. Look for all alerting rules with severity=critical shipped by cluster-monitoring-operator.
2. $ oc -n openshift-kube-controller-manager-operator get prometheusrules kube-controller-manager-operator -oyaml|grep -B10 critical
  groups:
  - name: cluster-version
    rules:
    - alert: KubeControllerManagerDown
      annotations:
        message: KubeControllerManager has disappeared from Prometheus target discovery.
      expr: |
        absent(up{job="kube-controller-manager"} == 1)
      for: 15m
      labels:
        severity: critical
--
      for: 60m
      labels:
        severity: warning
    - alert: PodDisruptionBudgetLimit
      annotations:
        message: The pod disruption budget is below the minimum number allowed pods.
      expr: |
        max by (namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_current_healthy < kube_poddisruptionbudget_status_desired_healthy)
      for: 15m
      labels:
        severity: critical



Actual results:
"runbook_url" annotation links are missing. 

Expected results:
All critical alerts have a proper "runbook_url" annotation.
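One way to audit this expectation (a sketch, not part of the original report) is to walk the rule groups and flag critical alerts with no "runbook_url" annotation. The inline sample below stands in for real cluster output; in practice you would feed it the JSON from `oc get prometheusrules -A -o json` and adjust the top-level paths to match:

```python
# Sketch: flag critical alerts that lack a runbook_url annotation.
# The `sample` document is a hypothetical stand-in for cluster output.
import json

sample = """
{"groups": [{"rules": [
  {"alert": "GoodAlert",
   "labels": {"severity": "critical"},
   "annotations": {"runbook_url": "https://example.invalid/runbook.md"}},
  {"alert": "BadAlert",
   "labels": {"severity": "critical"},
   "annotations": {}}
]}]}
"""

def missing_runbooks(doc: dict) -> list[str]:
    """Return names of critical alerts with no runbook_url annotation."""
    return [
        rule["alert"]
        for group in doc.get("groups", [])
        for rule in group.get("rules", [])
        if rule.get("labels", {}).get("severity") == "critical"
        and "runbook_url" not in rule.get("annotations", {})
    ]

print(missing_runbooks(json.loads(sample)))  # → ['BadAlert']
```

An empty list from such a check corresponds to the expected result above: every critical alert carries a runbook link.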

Additional info:

Comment 1 Michal Fojtik 2021-10-22 11:01:58 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 2 Michal Fojtik 2021-11-25 12:58:56 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 3 Michal Fojtik 2021-12-25 14:22:28 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 4 Filip Krepinsky 2022-04-04 20:53:52 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 6 Filip Krepinsky 2022-05-16 22:02:35 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 8 zhou ying 2022-08-16 02:48:10 UTC
oc get clusterversion 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-08-15-150248   True        False         24m     Cluster version is 4.12.0-0.nightly-2022-08-15-150248

oc -n openshift-kube-controller-manager-operator get prometheusrules kube-controller-manager-operator -oyaml|grep -B10 critical
      annotations:
        description: KubeControllerManager has disappeared from Prometheus target
          discovery.
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-kube-controller-manager-operator/KubeControllerManagerDown.md
        summary: Target disappeared from Prometheus target discovery.
      expr: |
        absent(up{job="kube-controller-manager"} == 1)
      for: 15m
      labels:
        namespace: openshift-kube-controller-manager
        severity: critical
--
      annotations:
        description: The pod disruption budget is below the minimum disruptions allowed
          level and is not satisfied. The number of current healthy pods is less than
          the desired healthy pods.
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-kube-controller-manager-operator/PodDisruptionBudgetLimit.md
        summary: The pod disruption budget registers insufficient amount of pods.
      expr: |
        max by (namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_current_healthy < kube_poddisruptionbudget_status_desired_healthy)
      for: 15m
      labels:
        severity: critical

Comment 11 errata-xmlrpc 2023-01-17 19:46:45 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

