Bug 2001411

Summary: All critical alerts should have links to a runbook
Product: OpenShift Container Platform
Reporter: hongyan li <hongyli>
Component: Machine Config Operator
Assignee: MCO Team <team-mco>
Machine Config Operator sub component: Machine Config Operator
QA Contact: Rio Liu <rioliu>
Status: CLOSED NOTABUG
Docs Contact:
Severity: medium
Priority: unspecified
CC: aos-bugs, mkenigsb, mkrejci
Version: 4.9   
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2021-11-05 22:16:06 UTC
Type: Bug

Description hongyan li 2021-09-06 03:24:47 UTC
Description of problem:
All critical alerts shipped as part of OpenShift need a proper runbook in [1], and each alert definition should carry a "runbook_url" annotation, as required by [2].

[1] https://github.com/openshift/runbooks
[2] https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#documentation-required

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-09-05-192114

How reproducible:
Always

Steps to Reproduce:
1. Look for all alerting rules with severity=critical
2. $ oc -n openshift-machine-api get prometheusrules machine-api-operator-prometheus-rules -oyaml | grep -B10 critical
  - name: machine-api-operator-metrics-collector-up
    rules:
    - alert: MachineAPIOperatorMetricsCollectionFailing
      annotations:
        message: 'machine api operator metrics collection is failing. For more details:  oc
          logs <machine-api-operator-pod-name> -n openshift-machine-api'
      expr: |
        mapi_mao_collector_up == 0
      for: 5m
      labels:
        severity: critical

3. $ oc get prometheusrule machine-config-daemon -n openshift-machine-config-operator -oyaml | grep -B10 critical
spec:
  groups:
  - name: mcd-reboot-error
    rules:
    - alert: MCDRebootError
      annotations:
        message: Reboot failed on {{ $labels.node }} , update may be blocked
      expr: |
        mcd_reboot_err > 0
      labels:
        severity: critical
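
The grep in steps 2 and 3 checks one PrometheusRule object at a time. As a rough sketch of step 1 across the whole cluster, the Python below walks the JSON produced by `oc get prometheusrules -A -o json` and reports every critical alert without a "runbook_url" annotation. The function name is hypothetical and not part of any OpenShift tooling; it assumes the usual PrometheusRule layout (spec.groups[].rules[]) seen in the output above.

```python
def critical_alerts_missing_runbook(rule_list):
    """Return (namespace, object name, alert name) for each critical alert
    that has no non-empty runbook_url annotation.

    `rule_list` is the parsed JSON from `oc get prometheusrules -A -o json`
    (a List with "items") or a single PrometheusRule object.
    """
    items = rule_list.get("items", [rule_list])
    missing = []
    for item in items:
        meta = item.get("metadata", {})
        for group in item.get("spec", {}).get("groups", []):
            for rule in group.get("rules", []):
                if "alert" not in rule:
                    continue  # recording rule, not an alert
                labels = rule.get("labels") or {}
                if labels.get("severity") != "critical":
                    continue
                annotations = rule.get("annotations") or {}
                if not annotations.get("runbook_url"):
                    missing.append(
                        (meta.get("namespace"), meta.get("name"), rule["alert"])
                    )
    return missing
```

To run it against a cluster, one option is to pipe the `oc` output into a small script that calls `critical_alerts_missing_runbook(json.load(sys.stdin))` and prints each tuple.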


Actual results:
The "runbook_url" annotation is missing from both critical alerts shown above.

Expected results:
All critical alerts have a proper "runbook_url" annotation.
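
One way to read "proper" here, assuming it means the annotation should point into the runbooks repository from [1], is sketched below. The helper name and the example URL are placeholders for illustration only; the actual runbook path for MCDRebootError is not given in this report.

```python
from urllib.parse import urlparse

# Assumption: runbooks live under the repository referenced as [1].
RUNBOOKS_HOST = "github.com"
RUNBOOKS_PATH_PREFIX = "/openshift/runbooks/"


def points_to_runbooks_repo(rule):
    """Return True if the rule's runbook_url annotation is an https link
    into github.com/openshift/runbooks, False otherwise (including when
    the annotation is absent)."""
    annotations = rule.get("annotations") or {}
    parsed = urlparse(annotations.get("runbook_url", ""))
    return (
        parsed.scheme == "https"
        and parsed.netloc == RUNBOOKS_HOST
        and parsed.path.startswith(RUNBOOKS_PATH_PREFIX)
    )


# Hypothetical shape of a compliant rule; the URL path is a placeholder.
example_rule = {
    "alert": "MCDRebootError",
    "annotations": {
        "message": "Reboot failed on {{ $labels.node }} , update may be blocked",
        "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/<path-to-runbook>.md",
    },
    "labels": {"severity": "critical"},
}
```

This only checks where the link points; whether the runbook file exists and is useful still needs a human review.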

Additional info:

Comment 1 Michelle Krejci 2021-09-08 17:25:12 UTC
The first alert issue belongs to the Machine API team. @hongyli could you create a separate issue for that team?
The second alert pertaining to the machine-config-daemon will be reviewed for 4.10.

Comment 2 hongyan li 2021-09-09 14:18:11 UTC
@mkrejci Thanks. I could not find a Machine API component to file it under.

Comment 3 mkenigsb 2021-11-05 22:16:06 UTC
This should be addressed as part of https://issues.redhat.com/browse/MCO-1. Closing for now, as this is more of a feature request than a bug.