Bug 2040277 - ThanosRuleNoEvaluationFor10Intervals alert description is wrong
Summary: ThanosRuleNoEvaluationFor10Intervals alert description is wrong
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.10
Hardware: Unspecified
OS: Unspecified
medium
low
Target Milestone: ---
Target Release: 4.11.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-01-13 10:58 UTC by Junqi Zhao
Modified: 2022-08-10 10:42 UTC
4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:42:08 UTC
Target Upstream Version:
Embargoed:


Attachments
ThanosRuleNoEvaluationFor10Intervals alert expr result in prometheus (104.75 KB, image/png)
2022-01-13 10:58 UTC, Junqi Zhao
no flags


Links
System | ID | Private | Priority | Status | Summary | Last Updated
GitHub | thanos-io thanos pull 5105 | 0 | None | Merged | fix: changing description to not include rule groups | 2022-03-01 09:47:31 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 10:42:38 UTC

Description Junqi Zhao 2022-01-13 10:58:32 UTC
Created attachment 1850565
ThanosRuleNoEvaluationFor10Intervals alert expr result in prometheus

Description of problem:
I came across the ThanosRuleNoEvaluationFor10Intervals alert; its description is
"Thanos Rule thanos-ruler in openshift-user-workload-monitoring has 1.6G% rule groups that did not evaluate for at least 10x of their expected interval."
Checking the alert definition:
1. {{$value | humanize}}% should be {{$value | humanize}}
2. The expr looks odd; I am not sure whether it is right.

        - alert: ThanosRuleNoEvaluationFor10Intervals
          annotations:
            description: Thanos Rule {{$labels.job}} in {{$labels.namespace}} has {{$value
              | humanize}}% rule groups that did not evaluate for at least 10x of their
              expected interval.
            summary: Thanos Rule has rule groups that did not evaluate for 10 intervals.
          expr: |
            time() -  max by (namespace, job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job="thanos-ruler"})
            >
            10 * max by (namespace, job, instance, group) (prometheus_rule_group_interval_seconds{job="thanos-ruler"})
          for: 5m
          labels:
            severity: info
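The two points above can be illustrated with a small Python sketch. It is not Prometheus code: `humanize` is a simplified re-implementation of the Prometheus template function of the same name (scale by powers of 1000, append an SI prefix, keep four significant digits), and `max_by` and `no_evaluation_for_10_intervals` are hypothetical helpers that mimic the rule's `expr` at a single evaluation instant. Because the expr is a filtering comparison, each firing series keeps its left-hand-side value, so `$value` is the number of seconds since the group was last evaluated, not a percentage of rule groups. All inputs are hypothetical; the last-evaluation timestamp of 0 is only an assumed way to produce a lag of 1.6e9 seconds and is not taken from the attachment.

```python
import math

# Label set used by the rule's `max by (...)` clauses.
BY = ("namespace", "job", "instance", "group")


def humanize(v: float) -> str:
    """Simplified sketch of Prometheus's `humanize` template function:
    scale by powers of 1000, append an SI prefix, keep 4 significant digits."""
    if v == 0 or not math.isfinite(v):
        return f"{v:.4g}"
    prefix = ""
    if abs(v) >= 1:
        for p in ("k", "M", "G", "T", "P", "E", "Z", "Y"):
            if abs(v) < 1000:
                break
            prefix, v = p, v / 1000
    else:
        for p in ("m", "u", "n", "p", "f", "a", "z", "y"):
            if abs(v) >= 1:
                break
            prefix, v = p, v * 1000
    return f"{v:.4g}{prefix}"


def max_by(samples):
    """Mimic `max by (namespace, job, instance, group) (...)` over
    (labels_dict, value) pairs; other labels are dropped."""
    out = {}
    for labels, value in samples:
        key = tuple(labels.get(name, "") for name in BY)
        out[key] = max(value, out.get(key, value))
    return out


def no_evaluation_for_10_intervals(now, last_eval, interval):
    """Mimic the alert expr at one instant. Like a PromQL filtering
    comparison, each surviving series keeps its left-hand-side value:
    seconds since the group's last evaluation (this becomes $value)."""
    firing = {}
    for key, ts in last_eval.items():
        if key not in interval:
            continue  # no matching series on the right-hand side
        lag = now - ts
        if lag > 10 * interval[key]:
            firing[key] = lag
    return firing


# Hypothetical inputs: a time() value and two rule groups evaluated every 30s.
now = 1_600_000_000.0
base = {"namespace": "openshift-user-workload-monitoring",
        "job": "thanos-ruler", "instance": "thanos-ruler-0"}  # instance is hypothetical
last_eval = max_by([
    ({**base, "group": "healthy"}, now - 20),
    ({**base, "group": "stale"}, 0.0),  # assumed: no recorded evaluation
])
interval = max_by([
    ({**base, "group": "healthy"}, 30.0),
    ({**base, "group": "stale"}, 30.0),
])

for key, value in no_evaluation_for_10_intervals(now, last_eval, interval).items():
    print(f"group={dict(zip(BY, key))['group']} $value={humanize(value)}")
# prints: group=stale $value=1.6G
```

With these inputs only the stale group fires, and a lag of that order renders as `1.6G`; appending `%` and "rule groups" to such a duration is what makes the original description misleading, even though the comparison itself behaves as intended.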

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-01-11-065245

How reproducible:
always

Steps to Reproduce:
1. See the description above.

Actual results:


Expected results:


Additional info:

Comment 1 hongyan li 2022-01-13 11:20:21 UTC
I think the alert rule itself is fine; this is not a bug.

Comment 2 hongyan li 2022-01-13 11:33:51 UTC
The alert's description does have an issue, though.

Comment 3 Prashant Balachandran 2022-01-27 11:34:36 UTC
Created upstream PR https://github.com/thanos-io/thanos/pull/5105

Comment 4 Prashant Balachandran 2022-03-01 11:42:06 UTC
The changes have been pulled into CMO with this PR:
https://github.com/openshift/cluster-monitoring-operator/pull/1556/

Comment 7 Junqi Zhao 2022-06-06 03:09:27 UTC
Tested with 4.11.0-0.nightly-2022-06-04-014713; the ThanosRuleNoEvaluationFor10Intervals definition is now:
        - alert: ThanosRuleNoEvaluationFor10Intervals
          annotations:
            description: Thanos Rule {{$labels.job}} in {{$labels.namespace}} has rule groups
              that did not evaluate for at least 10x of their expected interval.
            summary: Thanos Rule has rule groups that did not evaluate for 10 intervals.
          expr: |
            time() -  max by (namespace, job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job="thanos-ruler"})
            >
            10 * max by (namespace, job, instance, group) (prometheus_rule_group_interval_seconds{job="thanos-ruler"})
          for: 5m
          labels:
            severity: info

Comment 11 errata-xmlrpc 2022-08-10 10:42:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

