Hello, The OpenShift Monitoring Team has published a set guidelines for writing alerting rules in OpenShift, including a basic style guide. You can find these here: https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide A subset of these are now being enforced in OpenShift End-to-End tests [1], with temporary exceptions for existing non-compliant rules. This component was found to have the following issues: * Alerts without summary and/or description annotations: - MachineAPIOperatorMetricsCollectionFailing - MachineAPIOperatorMetricsCollectionFailing - MachineHealthCheckUnterminatedShortCircuit - MachineNotYetDeleted - MachineWithNoRunningPhase - MachineWithoutValidNode Alerts MUST include summary and description annotations. Think of summary as the first line of a commit message, or an email subject line. It should be brief but informative. The description is the longer, more detailed explanation of the alert. The enhancement document linked above has examples of alerts with these annotations. * Alerts found to not include a namespace label: - MachineAPIOperatorDown Alerts SHOULD include a namespace label indicating the alert's source. This requirement originally comes from our SRE team, as they use the namespace label as the first means of routing alerts. Many alerts already include a namespace label as a result of the PromQL expressions used, others may require a static label. Example of a change to PromQL to include a namespace label: https://github.com/openshift/cluster-monitoring-operator/commit/52d1f05#diff-9024dcef0fd244c0267c46858da24fbd1f45633515fafae0f98781b20805ff1dL22-R22 Example of adding a static namespace label: https://github.com/openshift/cluster-monitoring-operator/commit/52d1f05#diff-352702e71122d34a1be04c0588356cd8cb8a10df547f1c3c39fec18fa75b1593R304 If you have questions about how to best to modify your alerting rules to include a namespace label, please reach out to the OpenShift Monitoring Team in the #forum-monitoring channel on Slack, or on our mailing list: team-monitoring Thank you! Repo: openshift/machine-api-operator [1]: https://github.com/openshift/origin/commit/097e7a6
@brad.ison i was looking at the enhancement and it's not clear to me if we should be keeping the `annotations.message` field that we currently have?
You can drop the `message` annotation. In most cases you can probably just rename `message` to one of `description` or `summary` based on how detailed the current message is, then add the other annotation. We just need to standardize on the `description` and `summary` annotations so that things like the console can display them intelligently.
awesome, thanks for the update Brad =)
added a PR for the summary/description change, but we removed the MachineAPIOperatorDown alert in this pr https://github.com/openshift/machine-api-operator/pull/826
I checked the five alerts in 4.10.0-0.nightly-2021-10-13-081040, - MachineAPIOperatorMetricsCollectionFailing - MachineHealthCheckUnterminatedShortCircuit - MachineNotYetDeleted - MachineWithNoRunningPhase - MachineWithoutValidNode But there are three problems. Please help to check it. Thanks. 1. for MachineHealthCheckUnterminatedShortCircuit, there is still not summary and description annotations although I can see them in the merge code. I don't know why there is not on the cluster. liuhuali@Lius-MacBook-Pro openshift-test % oc get prometheusrule machine-api-operator-prometheus-rules -o yaml |grep -A 10 "alert: MachineHealthCheckUnterminatedShortCircuit" - alert: MachineHealthCheckUnterminatedShortCircuit expr: | mapi_machinehealthcheck_short_circuit == 1 for: 30m labels: severity: warning liuhuali@Lius-MacBook-Pro openshift-test % 2. for MachineAPIOperatorMetricsCollectionFailing, it's a critical alert, according to https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide All critical alerts MUST include a runbook_url annotation. but the alert doesn't include a runbook_url annotation. 3. according to https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide Alerts SHOULD include a namespace label indicating the alert's source. but all the five alerts don't include a namespace label.
(In reply to Huali Liu from comment #8) > I checked the five alerts in 4.10.0-0.nightly-2021-10-13-081040, > > - MachineAPIOperatorMetricsCollectionFailing > - MachineHealthCheckUnterminatedShortCircuit > - MachineNotYetDeleted > - MachineWithNoRunningPhase > - MachineWithoutValidNode > > But there are three problems. Please help to check it. Thanks. > > 1. for MachineHealthCheckUnterminatedShortCircuit, there is still not > summary and description annotations although I can see them in the merge > code. I don't know why there is not on the cluster. > > liuhuali@Lius-MacBook-Pro openshift-test % oc get prometheusrule > machine-api-operator-prometheus-rules -o yaml |grep -A 10 "alert: > MachineHealthCheckUnterminatedShortCircuit" > - alert: MachineHealthCheckUnterminatedShortCircuit > expr: | > mapi_machinehealthcheck_short_circuit == 1 > for: 30m > labels: > severity: warning > liuhuali@Lius-MacBook-Pro openshift-test % that's really odd, not sure why that might happen. perhaps Brad has some ideas? > > 2. for MachineAPIOperatorMetricsCollectionFailing, it's a critical alert, > according to > > https://github.com/openshift/enhancements/blob/master/enhancements/ > monitoring/alerting-consistency.md#style-guide > > All critical alerts MUST include a runbook_url annotation. > > but the alert doesn't include a runbook_url annotation. > ack, thanks. i will look to see if i can find the run book for that alert. > 3. according to > > https://github.com/openshift/enhancements/blob/master/enhancements/ > monitoring/alerting-consistency.md#style-guide > > Alerts SHOULD include a namespace label indicating the alert's source. > > but all the five alerts don't include a namespace label. there is a namespace in the top of the manifest[0] for all the alert rules, do i need to specify them individually as well? [0] https://github.com/openshift/machine-api-operator/blob/master/install/0000_90_machine-api-operator_04_alertrules.yaml#L8
i check the linked repository of runbooks[0], but i did not see one for the `MachineAPIOperatorMetricsCollectionFailing` alert. is there another source i should check?
i check the linked repository of runbooks[0], but i did not see one for the `MachineAPIOperatorMetricsCollectionFailing` alert. is there another source i should check? [0] https://github.com/openshift/runbooks
For problem 3, I'm sorry, it's my mistake. I just saw there is a namespace in the top of the manifest[0] for all the alert rules, that's ok, and you don't need specify them individually. For problem 2, as according to the guidelines "All critical alerts MUST include a runbook_url annotation." but we cannot see one runbook for the `MachineAPIOperatorMetricsCollectionFailing` critical alert. Maybe you should check with the OpenShift Monitoring Team for how to deal with this situation.
(In reply to Huali Liu from comment #12) > For problem 2, as according to the guidelines "All critical alerts MUST > include a runbook_url annotation." but we cannot see one runbook for the > `MachineAPIOperatorMetricsCollectionFailing` critical alert. Maybe you > should check with the OpenShift Monitoring Team for how to deal with this > situation. sounds good. i will wait for Brad to respond and then if need be i can escalate with the monitoring team.
Thank you Michael.
Sorry, just getting back from PTO and seeing this. > 1. for MachineHealthCheckUnterminatedShortCircuit, there is still not summary and description annotations... Looks like you have a typo in the manifest. It should be `annotations` plural: https://github.com/openshift/machine-api-operator/blob/master/install/0000_90_machine-api-operator_04_alertrules.yaml#L75 Eventually CI will catch this, but we have too many exceptions in the tests at the moment. > 2. ... All critical alerts MUST include a runbook_url annotation ... We aren't really enforcing this yet as there's still some debate about where they should live etc. But, the idea is that for any critical alert, the team owning the alert needs to write a runbook with details on diagnosing and solving the problem. Currently these are here: https://github.com/openshift/runbooks/tree/master/alerts It would be great if you could add runbooks for your critical alerts, but like I said, we aren't enforcing that yet. > 3. ... Alerts SHOULD include a namespace label indicating the alert's source ... This probably needs further clarification in the enhancement document. This is about triggered alerts having a namespace label. Most alerts already have this because the PromQL expressions they use result in a namespace label in the results. For alerts that don't, usually you can just rewrite the PromQL to include it. In some cases you may have to add a static namespace label in the alert rule definition. There's unfortunately no way to verify this easily via static analysis, so we aren't enforcing it in CI. If you want to trigger your alerts on a test cluster and verify they have a namespace, that's great, but it won't cause you to fail CI or anything if they don't. Hope this helps.
(In reply to Brad Ison from comment #15) > Sorry, just getting back from PTO and seeing this. > no worries, hope it was nice pto =) > > 1. for MachineHealthCheckUnterminatedShortCircuit, there is still not summary and description annotations... > > Looks like you have a typo in the manifest. It should be `annotations` > plural: > > https://github.com/openshift/machine-api-operator/blob/master/install/ > 0000_90_machine-api-operator_04_alertrules.yaml#L75 > doh!, thanks. i've pushed a new PR that fixes this. > Eventually CI will catch this, but we have too many exceptions in the tests > at the moment. > > > > 2. ... All critical alerts MUST include a runbook_url annotation ... > > We aren't really enforcing this yet as there's still some debate about where > they should live etc. But, the idea is that for any critical alert, the team > owning the alert needs to write a runbook with details on diagnosing and > solving the problem. Currently these are here: > > https://github.com/openshift/runbooks/tree/master/alerts > > It would be great if you could add runbooks for your critical alerts, but > like I said, we aren't enforcing that yet. > given that this is not enforced yet, would it be alright if i just make a jira card to capture that our team will create the runbook? > > Hope this helps. it did, thanks!
> given that this is not enforced yet, would it be alright if i just make a jira card to capture that our team will create the runbook? Yep, that's perfect. Thanks.
i have created https://issues.redhat.com/browse/OCPCLOUD-1313 to capture the runbook work
Verified on 4.10.0-0.nightly-2021-10-27-230233 liuhuali@Lius-MacBook-Pro openshift-test % oc get prometheusrule -n openshift-machine-api machine-api-operator-prometheus-rules -o yaml apiVersion: monitoring.coreos.com/v1 kind: PrometheusRule metadata: annotations: exclude.release.openshift.io/internal-openshift-hosted: "true" include.release.openshift.io/self-managed-high-availability: "true" include.release.openshift.io/single-node-developer: "true" creationTimestamp: "2021-10-28T09:26:54Z" generation: 1 labels: prometheus: k8s role: alert-rules name: machine-api-operator-prometheus-rules namespace: openshift-machine-api ownerReferences: - apiVersion: config.openshift.io/v1 kind: ClusterVersion name: version uid: 45cd1579-f169-4524-9789-58535172f266 resourceVersion: "1947" uid: cfd0e0ea-c67d-4b2e-b44b-70a4bd9eeb07 spec: groups: - name: machine-without-valid-node-ref rules: - alert: MachineWithoutValidNode annotations: description: | If the machine never became a node, you should diagnose the machine related failures. If the node was deleted from the API, you may delete the machine if appropriate. summary: machine {{ $labels.name }} does not have valid node reference expr: | (mapi_machine_created_timestamp_seconds unless on(node) kube_node_info) > 0 for: 60m labels: severity: warning - name: machine-with-no-running-phase rules: - alert: MachineWithNoRunningPhase annotations: description: | The machine has been without a Running or Deleting phase for more than 60 minutes. The machine may not have been provisioned properly from the infrastructure provider, or it might have issues with CertificateSigningRequests being approved. summary: 'machine {{ $labels.name }} is in phase: {{ $labels.phase }}' expr: | (mapi_machine_created_timestamp_seconds{phase!~"Running|Deleting"}) > 0 for: 60m labels: severity: warning - name: machine-not-yet-deleted rules: - alert: MachineNotYetDeleted annotations: description: | The machine is not properly deleting, this may be due to a configuration issue with the infrastructure provider, or because workloads on the node have PodDisruptionBudgets or long termination periods which are preventing deletion. summary: machine {{ $labels.name }} has been in Deleting phase for more than 6 hours expr: | sum by (name, namespace) (avg_over_time(mapi_machine_created_timestamp_seconds{phase="Deleting"}[15m])) > 0 for: 360m labels: severity: warning - name: machine-api-operator-metrics-collector-up rules: - alert: MachineAPIOperatorMetricsCollectionFailing annotations: description: 'For more details: oc logs <machine-api-operator-pod-name> -n openshift-machine-api' summary: machine api operator metrics collection is failing. expr: | mapi_mao_collector_up == 0 for: 5m labels: severity: critical - name: machine-health-check-unterminated-short-circuit rules: - alert: MachineHealthCheckUnterminatedShortCircuit annotations: description: | The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check, you should check the status of machines in the cluster. summary: machine health check {{ $labels.name }} has been disabled by short circuit for more than 30 minutes expr: | mapi_machinehealthcheck_short_circuit == 1 for: 30m labels: severity: warning
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056
*** Bug 2042231 has been marked as a duplicate of this bug. ***