Created attachment 1850565
ThanosRuleNoEvaluationFor10Intervals alert expr result in prometheus

Description of problem:
Came across the ThanosRuleNoEvaluationFor10Intervals alert. Its description is "Thanos Rule thanos-ruler in openshift-user-workload-monitoring has 1.6G% rule groups that did not evaluate for at least 10x of their expected interval."

Checked the alert detail:
1. {{$value | humanize}}% should be {{$value | humanize}}
2. the expr seems weird, not sure if it is right

- alert: ThanosRuleNoEvaluationFor10Intervals
  annotations:
    description: Thanos Rule {{$labels.job}} in {{$labels.namespace}} has {{$value | humanize}}% rule groups that did not evaluate for at least 10x of their expected interval.
    summary: Thanos Rule has rule groups that did not evaluate for 10 intervals.
  expr: |
    time() - max by (namespace, job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job="thanos-ruler"})
    >
    10 * max by (namespace, job, instance, group) (prometheus_rule_group_interval_seconds{job="thanos-ruler"})
  for: 5m
  labels:
    severity: info

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-01-11-065245

How reproducible:
always

Steps to Reproduce:
1. see the description

Actual results:

Expected results:

Additional info:
I think the alert rule itself has no issue, so the expr is not a bug.
The description of the alert does have an issue.
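To illustrate why the "%" in the description is misleading, here is a minimal Python sketch of what the alert expr computes. The left-hand side of the comparison is an elapsed time in seconds, not a ratio of rule groups. The function names are hypothetical, `humanize` below is only a simplified stand-in for the template function of the same name, and the last-evaluation timestamp of 0 (a group that has never evaluated) is an assumption that would explain a value around 1.6G, not something confirmed in this report.

```python
def stale_seconds(now, last_evaluation_ts):
    """Left-hand side of the expr: seconds since the group last evaluated."""
    return now - last_evaluation_ts


def is_firing(now, last_evaluation_ts, interval_seconds):
    """The alert condition: stale for more than 10x the group interval."""
    return stale_seconds(now, last_evaluation_ts) > 10 * interval_seconds


def humanize(value):
    """Simplified SI-prefix formatter (assumed approximation of `humanize`)."""
    prefixes = ["", "k", "M", "G", "T", "P", "E", "Z", "Y"]
    v = float(value)
    idx = 0
    while abs(v) >= 1000 and idx < len(prefixes) - 1:
        v /= 1000
        idx += 1
    return "%.4g%s" % (v, prefixes[idx])


# Hypothetical inputs: a Unix time from January 2022 and a rule group
# whose last-evaluation timestamp is 0 (assumed: never evaluated).
now = 1_642_000_000
value = stale_seconds(now, 0)

print(is_firing(now, 0, 30))   # the condition holds for a 30s interval
print(humanize(value) + "%")   # seconds rendered with a stray percent sign
```

With these assumed inputs, the rendered annotation value is a humanized number of seconds with a percent sign appended, which matches the shape of the "1.6G%" seen in the alert description.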
Created upstream PR https://github.com/thanos-io/thanos/pull/5105
Changes have been pulled into CMO with this PR https://github.com/openshift/cluster-monitoring-operator/pull/1556/
Tested with 4.11.0-0.nightly-2022-06-04-014713, the ThanosRuleNoEvaluationFor10Intervals definition is updated to:

- alert: ThanosRuleNoEvaluationFor10Intervals
  annotations:
    description: Thanos Rule {{$labels.job}} in {{$labels.namespace}} has rule groups that did not evaluate for at least 10x of their expected interval.
    summary: Thanos Rule has rule groups that did not evaluate for 10 intervals.
  expr: |
    time() - max by (namespace, job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job="thanos-ruler"})
    >
    10 * max by (namespace, job, instance, group) (prometheus_rule_group_interval_seconds{job="thanos-ruler"})
  for: 5m
  labels:
    severity: info
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069