Bug 2010368 - OpenShift Alerting Rules Style-Guide Compliance
Summary: OpenShift Alerting Rules Style-Guide Compliance
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.10
Hardware: Unspecified
OS: Unspecified
unspecified
low
Target Milestone: ---
: 4.10.0
Assignee: Michael McCune
QA Contact: Huali Liu
URL:
Whiteboard:
: 2042231 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-10-04 14:02 UTC by Brad Ison
Modified: 2022-08-31 12:54 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:16:32 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-api-operator pull 929 0 None open Bug 2010368: modify alerts to contain summary and description 2021-10-08 20:57:19 UTC
Github openshift machine-api-operator pull 942 0 None open Bug 2010368: fix mispelled field in alert rules 2021-10-26 15:25:13 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:16:43 UTC

Description Brad Ison 2021-10-04 14:02:30 UTC
Hello,

The OpenShift Monitoring Team has published a set guidelines for
writing alerting rules in OpenShift, including a basic style guide.
You can find these here:

  https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md
  https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide

A subset of these are now being enforced in OpenShift End-to-End
tests [1], with temporary exceptions for existing non-compliant rules.

This component was found to have the following issues:

* Alerts without summary and/or description annotations:

  - MachineAPIOperatorMetricsCollectionFailing
  - MachineAPIOperatorMetricsCollectionFailing
  - MachineHealthCheckUnterminatedShortCircuit
  - MachineNotYetDeleted
  - MachineWithNoRunningPhase
  - MachineWithoutValidNode

Alerts MUST include summary and description annotations.

Think of summary as the first line of a commit message, or an email
subject line. It should be brief but informative. The description is
the longer, more detailed explanation of the alert.

The enhancement document linked above has examples of alerts with
these annotations.


* Alerts found to not include a namespace label:

  - MachineAPIOperatorDown

Alerts SHOULD include a namespace label indicating the alert's source.

This requirement originally comes from our SRE team, as they use the
namespace label as the first means of routing alerts. Many alerts
already include a namespace label as a result of the PromQL
expressions used, others may require a static label.

Example of a change to PromQL to include a namespace label:

  https://github.com/openshift/cluster-monitoring-operator/commit/52d1f05#diff-9024dcef0fd244c0267c46858da24fbd1f45633515fafae0f98781b20805ff1dL22-R22

Example of adding a static namespace label:

  https://github.com/openshift/cluster-monitoring-operator/commit/52d1f05#diff-352702e71122d34a1be04c0588356cd8cb8a10df547f1c3c39fec18fa75b1593R304

If you have questions about how to best to modify your alerting rules
to include a namespace label, please reach out to the OpenShift
Monitoring Team in the #forum-monitoring channel on Slack, or on our
mailing list: team-monitoring

Thank you!

Repo: openshift/machine-api-operator

[1]: https://github.com/openshift/origin/commit/097e7a6

Comment 1 Michael McCune 2021-10-05 20:26:23 UTC
@brad.ison i was looking at the enhancement and it's not clear to me if we should be keeping the `annotations.message` field that we currently have?

Comment 2 Brad Ison 2021-10-07 12:49:27 UTC
You can drop the `message` annotation. In most cases you can probably just rename `message` to one of `description` or `summary` based on how detailed the current message is, then add the other annotation. We just need to standardize on the `description` and `summary` annotations so that things like the console can display them intelligently.

Comment 3 Michael McCune 2021-10-07 13:08:06 UTC
awesome, thanks for the update Brad =)

Comment 4 Michael McCune 2021-10-08 20:59:36 UTC
added a PR for the summary/description change, but we removed the MachineAPIOperatorDown alert in this pr https://github.com/openshift/machine-api-operator/pull/826

Comment 8 Huali Liu 2021-10-14 10:17:28 UTC
I checked the five alerts in 4.10.0-0.nightly-2021-10-13-081040,

  - MachineAPIOperatorMetricsCollectionFailing
  - MachineHealthCheckUnterminatedShortCircuit
  - MachineNotYetDeleted
  - MachineWithNoRunningPhase
  - MachineWithoutValidNode

But there are three problems. Please help to check it. Thanks.

1. for MachineHealthCheckUnterminatedShortCircuit, there is still not summary and description annotations although I can see them in the merge code. I don't know why there is not on the cluster.

liuhuali@Lius-MacBook-Pro openshift-test % oc get prometheusrule machine-api-operator-prometheus-rules -o yaml |grep -A 10 "alert: MachineHealthCheckUnterminatedShortCircuit"
    - alert: MachineHealthCheckUnterminatedShortCircuit
      expr: |
        mapi_machinehealthcheck_short_circuit == 1
      for: 30m
      labels:
        severity: warning
liuhuali@Lius-MacBook-Pro openshift-test % 

2. for MachineAPIOperatorMetricsCollectionFailing, it's a critical alert, according to 

https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide

All critical alerts MUST include a runbook_url annotation.

but the alert doesn't include a runbook_url annotation.

3. according to 

https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide

Alerts SHOULD include a namespace label indicating the alert's source.

but all the five alerts don't include a namespace label.

Comment 9 Michael McCune 2021-10-14 20:12:18 UTC
(In reply to Huali Liu from comment #8)
> I checked the five alerts in 4.10.0-0.nightly-2021-10-13-081040,
> 
>   - MachineAPIOperatorMetricsCollectionFailing
>   - MachineHealthCheckUnterminatedShortCircuit
>   - MachineNotYetDeleted
>   - MachineWithNoRunningPhase
>   - MachineWithoutValidNode
> 
> But there are three problems. Please help to check it. Thanks.
> 
> 1. for MachineHealthCheckUnterminatedShortCircuit, there is still not
> summary and description annotations although I can see them in the merge
> code. I don't know why there is not on the cluster.
> 
> liuhuali@Lius-MacBook-Pro openshift-test % oc get prometheusrule
> machine-api-operator-prometheus-rules -o yaml |grep -A 10 "alert:
> MachineHealthCheckUnterminatedShortCircuit"
>     - alert: MachineHealthCheckUnterminatedShortCircuit
>       expr: |
>         mapi_machinehealthcheck_short_circuit == 1
>       for: 30m
>       labels:
>         severity: warning
> liuhuali@Lius-MacBook-Pro openshift-test % 

that's really odd, not sure why that might happen. perhaps Brad has some ideas?

> 
> 2. for MachineAPIOperatorMetricsCollectionFailing, it's a critical alert,
> according to 
> 
> https://github.com/openshift/enhancements/blob/master/enhancements/
> monitoring/alerting-consistency.md#style-guide
> 
> All critical alerts MUST include a runbook_url annotation.
> 
> but the alert doesn't include a runbook_url annotation.
> 

ack, thanks. i will look to see if i can find the run book for that alert.

> 3. according to 
> 
> https://github.com/openshift/enhancements/blob/master/enhancements/
> monitoring/alerting-consistency.md#style-guide
> 
> Alerts SHOULD include a namespace label indicating the alert's source.
> 
> but all the five alerts don't include a namespace label.

there is a namespace in the top of the manifest[0] for all the alert rules, do i need to specify them individually as well?

[0] https://github.com/openshift/machine-api-operator/blob/master/install/0000_90_machine-api-operator_04_alertrules.yaml#L8

Comment 10 Michael McCune 2021-10-14 21:21:33 UTC
i check the linked repository of runbooks[0], but i did not see one for the `MachineAPIOperatorMetricsCollectionFailing` alert. is there another source i should check?

Comment 11 Michael McCune 2021-10-14 21:22:30 UTC
i check the linked repository of runbooks[0], but i did not see one for the `MachineAPIOperatorMetricsCollectionFailing` alert. is there another source i should check?

[0] https://github.com/openshift/runbooks

Comment 12 Huali Liu 2021-10-15 02:11:39 UTC
For problem 3, I'm sorry, it's my mistake. I just saw there is a namespace in the top of the manifest[0] for all the alert rules, that's ok, and you don't need specify them individually.

For problem 2, as according to the guidelines "All critical alerts MUST include a runbook_url annotation." but we cannot see one runbook for the `MachineAPIOperatorMetricsCollectionFailing` critical alert. Maybe you should check with the OpenShift Monitoring Team for how to deal with this situation.

Comment 13 Michael McCune 2021-10-15 18:21:47 UTC
(In reply to Huali Liu from comment #12)
> For problem 2, as according to the guidelines "All critical alerts MUST
> include a runbook_url annotation." but we cannot see one runbook for the
> `MachineAPIOperatorMetricsCollectionFailing` critical alert. Maybe you
> should check with the OpenShift Monitoring Team for how to deal with this
> situation.

sounds good. i will wait for Brad to respond and then if need be i can escalate with the monitoring team.

Comment 14 Huali Liu 2021-10-18 05:27:45 UTC
Thank you Michael.

Comment 15 Brad Ison 2021-10-26 10:42:47 UTC
Sorry, just getting back from PTO and seeing this.

> 1. for MachineHealthCheckUnterminatedShortCircuit, there is still not summary and description annotations...

Looks like you have a typo in the manifest. It should be `annotations` plural:

https://github.com/openshift/machine-api-operator/blob/master/install/0000_90_machine-api-operator_04_alertrules.yaml#L75

Eventually CI will catch this, but we have too many exceptions in the tests at the moment.


> 2. ... All critical alerts MUST include a runbook_url annotation ...

We aren't really enforcing this yet as there's still some debate about where they should live etc. But, the idea is that for any critical alert, the team owning the alert needs to write a runbook with details on diagnosing and solving the problem. Currently these are here:

https://github.com/openshift/runbooks/tree/master/alerts

It would be great if you could add runbooks for your critical alerts, but like I said, we aren't enforcing that yet.


> 3. ... Alerts SHOULD include a namespace label indicating the alert's source ...

This probably needs further clarification in the enhancement document. This is about triggered alerts having a namespace label. Most alerts already have this because the PromQL expressions they use result in a namespace label in the results. For alerts that don't, usually you can just rewrite the PromQL to include it. In some cases you may have to add a static namespace label in the alert rule definition. There's unfortunately no way to verify this easily via static analysis, so we aren't enforcing it in CI. If you want to trigger your alerts on a test cluster and verify they have a namespace, that's great, but it won't cause you to fail CI or anything if they don't.

Hope this helps.

Comment 16 Michael McCune 2021-10-26 15:27:02 UTC
(In reply to Brad Ison from comment #15)
> Sorry, just getting back from PTO and seeing this.
> 

no worries, hope it was nice pto =)

> > 1. for MachineHealthCheckUnterminatedShortCircuit, there is still not summary and description annotations...
> 
> Looks like you have a typo in the manifest. It should be `annotations`
> plural:
> 
> https://github.com/openshift/machine-api-operator/blob/master/install/
> 0000_90_machine-api-operator_04_alertrules.yaml#L75
> 

doh!, thanks. i've pushed a new PR that fixes this.

> Eventually CI will catch this, but we have too many exceptions in the tests
> at the moment.
> 
> 
> > 2. ... All critical alerts MUST include a runbook_url annotation ...
> 
> We aren't really enforcing this yet as there's still some debate about where
> they should live etc. But, the idea is that for any critical alert, the team
> owning the alert needs to write a runbook with details on diagnosing and
> solving the problem. Currently these are here:
> 
> https://github.com/openshift/runbooks/tree/master/alerts
> 
> It would be great if you could add runbooks for your critical alerts, but
> like I said, we aren't enforcing that yet.
> 

given that this is not enforced yet, would it be alright if i just make a jira card to capture that our team will create the runbook?


> 
> Hope this helps.

it did, thanks!

Comment 17 Brad Ison 2021-10-26 16:16:27 UTC
> given that this is not enforced yet, would it be alright if i just make a jira card to capture that our team will create the runbook?

Yep, that's perfect. Thanks.

Comment 19 Michael McCune 2021-10-27 13:54:31 UTC
i have created https://issues.redhat.com/browse/OCPCLOUD-1313 to capture the runbook work

Comment 21 Huali Liu 2021-10-28 10:21:53 UTC
Verified on 4.10.0-0.nightly-2021-10-27-230233

liuhuali@Lius-MacBook-Pro openshift-test % oc get prometheusrule -n openshift-machine-api machine-api-operator-prometheus-rules -o yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2021-10-28T09:26:54Z"
  generation: 1
  labels:
    prometheus: k8s
    role: alert-rules
  name: machine-api-operator-prometheus-rules
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 45cd1579-f169-4524-9789-58535172f266
  resourceVersion: "1947"
  uid: cfd0e0ea-c67d-4b2e-b44b-70a4bd9eeb07
spec:
  groups:
  - name: machine-without-valid-node-ref
    rules:
    - alert: MachineWithoutValidNode
      annotations:
        description: |
          If the machine never became a node, you should diagnose the machine related failures.
          If the node was deleted from the API, you may delete the machine if appropriate.
        summary: machine {{ $labels.name }} does not have valid node reference
      expr: |
        (mapi_machine_created_timestamp_seconds unless on(node) kube_node_info) > 0
      for: 60m
      labels:
        severity: warning
  - name: machine-with-no-running-phase
    rules:
    - alert: MachineWithNoRunningPhase
      annotations:
        description: |
          The machine has been without a Running or Deleting phase for more than 60 minutes.
          The machine may not have been provisioned properly from the infrastructure provider, or
          it might have issues with CertificateSigningRequests being approved.
        summary: 'machine {{ $labels.name }} is in phase: {{ $labels.phase }}'
      expr: |
        (mapi_machine_created_timestamp_seconds{phase!~"Running|Deleting"}) > 0
      for: 60m
      labels:
        severity: warning
  - name: machine-not-yet-deleted
    rules:
    - alert: MachineNotYetDeleted
      annotations:
        description: |
          The machine is not properly deleting, this may be due to a configuration issue with the
          infrastructure provider, or because workloads on the node have PodDisruptionBudgets or
          long termination periods which are preventing deletion.
        summary: machine {{ $labels.name }} has been in Deleting phase for more than
          6 hours
      expr: |
        sum by (name, namespace) (avg_over_time(mapi_machine_created_timestamp_seconds{phase="Deleting"}[15m])) > 0
      for: 360m
      labels:
        severity: warning
  - name: machine-api-operator-metrics-collector-up
    rules:
    - alert: MachineAPIOperatorMetricsCollectionFailing
      annotations:
        description: 'For more details:  oc logs <machine-api-operator-pod-name> -n
          openshift-machine-api'
        summary: machine api operator metrics collection is failing.
      expr: |
        mapi_mao_collector_up == 0
      for: 5m
      labels:
        severity: critical
  - name: machine-health-check-unterminated-short-circuit
    rules:
    - alert: MachineHealthCheckUnterminatedShortCircuit
      annotations:
        description: |
          The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check, you should check
          the status of machines in the cluster.
        summary: machine health check {{ $labels.name }} has been disabled by short
          circuit for more than 30 minutes
      expr: |
        mapi_machinehealthcheck_short_circuit == 1
      for: 30m
      labels:
        severity: warning

Comment 24 errata-xmlrpc 2022-03-10 16:16:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 25 Jack Ottofaro 2022-08-31 12:54:32 UTC
*** Bug 2042231 has been marked as a duplicate of this bug. ***


Note You need to log in before you can comment on or make changes to this bug.