Hello,

The OpenShift Monitoring Team has published a set of guidelines for writing alerting rules in OpenShift, including a basic style guide. You can find these here:

https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md
https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide

A subset of these are now being enforced in OpenShift End-to-End tests [1], with temporary exceptions for existing non-compliant rules.

This component was found to have the following issues:

* Alerts found to not include a namespace label:
  - ClusterNotUpgradeable
  - ClusterOperatorDegraded

Alerts SHOULD include a namespace label indicating the alert's source. This requirement originally comes from our SRE team, as they use the namespace label as the first means of routing alerts. Many alerts already include a namespace label as a result of the PromQL expressions used; others may require a static label.

Example of a change to PromQL to include a namespace label:
https://github.com/openshift/cluster-monitoring-operator/commit/52d1f05#diff-9024dcef0fd244c0267c46858da24fbd1f45633515fafae0f98781b20805ff1dL22-R22

Example of adding a static namespace label:
https://github.com/openshift/cluster-monitoring-operator/commit/52d1f05#diff-352702e71122d34a1be04c0588356cd8cb8a10df547f1c3c39fec18fa75b1593R304

If you have questions about how best to modify your alerting rules to include a namespace label, please reach out to the OpenShift Monitoring Team in the #forum-monitoring channel on Slack, or on our mailing list: team-monitoring

Thank you!

Repo: openshift/cluster-version-operator

[1]: https://github.com/openshift/origin/commit/097e7a6
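As a rough illustration of the "static label" option mentioned in the description, the Python sketch below adds a fixed namespace label to a rule that does not already declare one. The rule is modelled as a plain dict with alert, expr and labels fields, loosely mirroring a Prometheus alerting rule. The helper name `with_static_namespace` and the placeholder expression are assumptions made for this example only; this is not the actual cluster-version-operator manifest or the change that fixed this bug.

```python
from copy import deepcopy


def with_static_namespace(rule: dict, namespace: str) -> dict:
    """Return a copy of an alerting rule that carries a static namespace label.

    A namespace label that is already declared on the rule is left untouched.
    """
    patched = deepcopy(rule)
    labels = patched.setdefault("labels", {})
    labels.setdefault("namespace", namespace)
    return patched


# Hypothetical rule: the expression is a placeholder, not the real CVO rule.
rule = {
    "alert": "ClusterNotUpgradeable",
    "expr": "placeholder_metric > 0",
    "labels": {"severity": "info"},
}

patched = with_static_namespace(rule, "openshift-cluster-version")
print(patched["labels"])
```

A static label only covers rules whose PromQL expression does not already yield a namespace label; where the expression can preserve the label, the PromQL change linked in the description is the other option.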
*** Bug 2021130 has been marked as a duplicate of this bug. ***
> * Alerts found to not include a namespace label:
>   - ClusterNotUpgradeable
>   - ClusterOperatorDegraded

Tried to reproduce on v4.10.26.

1. Trigger ClusterOperatorDegraded alert

# curl -s -k -H "Authorization: Bearer $token" https://$route/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded").labels'
{
  "alertname": "ClusterOperatorDegraded",
  "condition": "Degraded",
  "endpoint": "metrics",
  "instance": "10.0.0.6:9099",
  "job": "cluster-version-operator",
  "name": "authentication",
  "namespace": "openshift-cluster-version",    // namespace label was already included in ClusterOperatorDegraded alert
  "pod": "cluster-version-operator-64bb7d76f4-bn2hx",
  "reason": "OAuthServerConfigObservation_Error",
  "service": "cluster-version-operator",
  "severity": "warning"
}

2. Trigger ClusterNotUpgradeable alert

# curl -s -k -H "Authorization: Bearer $token" https://$route/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterNotUpgradeable").labels'
{
  "alertname": "ClusterNotUpgradeable",
  "condition": "Upgradeable",
  "endpoint": "metrics",
  "name": "version",
  "severity": "info"
}
// The namespace label is missing from the ClusterNotUpgradeable alert

3. Trigger ClusterOperatorDown alert

# curl -s -k -H "Authorization: Bearer $token" https://$route/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDown").labels'
{
  "alertname": "ClusterOperatorDown",
  "endpoint": "metrics",
  "instance": "10.0.0.6:9099",
  "job": "cluster-version-operator",
  "name": "machine-config",
  "namespace": "openshift-cluster-version",    // namespace label was already included in ClusterOperatorDown alert
  "pod": "cluster-version-operator-64bb7d76f4-bn2hx",
  "service": "cluster-version-operator",
  "severity": "critical",
  "version": "4.10.26"
}

4. Trigger CannotRetrieveUpdates alert

# curl -s -k -H "Authorization: Bearer $token" https://$route/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "CannotRetrieveUpdates")|.labels'
{
  "alertname": "CannotRetrieveUpdates",
  "endpoint": "metrics",
  "instance": "10.0.0.6:9099",
  "job": "cluster-version-operator",
  "namespace": "openshift-cluster-version",    // namespace label was already included in CannotRetrieveUpdates alert
  "pod": "cluster-version-operator-64bb7d76f4-bn2hx",
  "service": "cluster-version-operator",
  "severity": "warning"
}

Based on the reproduction above, only the ClusterNotUpgradeable alert needs a namespace label added.

@Brad Could you confirm the ClusterOperatorDegraded alert issue from the bug description? QE can only reproduce it for the ClusterNotUpgradeable alert.
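The four per-alert checks above could also be done in one pass over a single /api/v1/alerts response. The following Python sketch is one possible way to do that, not part of the QE procedure used here: the function name `alerts_missing_namespace` is hypothetical, and the sample payload is an abridged, hand-assembled combination of the label sets quoted above rather than a captured response.

```python
import json
from typing import List


def alerts_missing_namespace(payload: dict) -> List[str]:
    """Return the sorted names of alerts whose labels lack a non-empty namespace."""
    missing = set()
    for alert in payload.get("data", {}).get("alerts", []):
        labels = alert.get("labels", {})
        if not labels.get("namespace"):
            missing.add(labels.get("alertname", "<unnamed>"))
    return sorted(missing)


# Abridged sample shaped like the /api/v1/alerts output quoted in this comment.
sample = json.loads("""
{
  "status": "success",
  "data": {"alerts": [
    {"labels": {"alertname": "ClusterOperatorDegraded",
                "namespace": "openshift-cluster-version", "severity": "warning"}},
    {"labels": {"alertname": "ClusterNotUpgradeable",
                "condition": "Upgradeable", "severity": "info"}},
    {"labels": {"alertname": "ClusterOperatorDown",
                "namespace": "openshift-cluster-version", "severity": "critical"}},
    {"labels": {"alertname": "CannotRetrieveUpdates",
                "namespace": "openshift-cluster-version", "severity": "warning"}}
  ]}
}
""")

print(alerts_missing_namespace(sample))  # prints ['ClusterNotUpgradeable']
```

Against a live cluster, the saved output of the curl command (without the jq filter) could be passed through `json.loads` and handed to the same function. Only alerts that are pending or firing at that moment appear in the response, so each alert still has to be triggered first.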
The reporter, Brad Ison from the Monitoring team, no longer seems to be available (deactivated account), so QE plans to verify only the ClusterNotUpgradeable alert, since it turned out to be the only one missing a namespace label.
Verified on 4.12.0-0.nightly-2022-08-17-053740

# curl -s -k -H "Authorization: Bearer $token" https://$route/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterNotUpgradeable")|.labels'
{
  "alertname": "ClusterNotUpgradeable",
  "condition": "Upgradeable",
  "endpoint": "metrics",
  "name": "version",
  "namespace": "openshift-cluster-version",
  "severity": "info"
}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399