Bug 2040277

Summary: ThanosRuleNoEvaluationFor10Intervals alert description is wrong
Product: OpenShift Container Platform
Reporter: Junqi Zhao <juzhao>
Component: Monitoring
Assignee: Simon Pasquier <spasquie>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: low
Docs Contact:
Priority: medium
Version: 4.10
CC: amuller, anpicker, aos-bugs, hongyli
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-10 10:42:08 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description: ThanosRuleNoEvaluationFor10Intervals alert expr result in prometheus
Flags: none

Description Junqi Zhao 2022-01-13 10:58:32 UTC
Created attachment 1850565
ThanosRuleNoEvaluationFor10Intervals alert expr result in prometheus

Description of problem:
Came across the ThanosRuleNoEvaluationFor10Intervals alert; its description reads:
"Thanos Rule thanos-ruler in openshift-user-workload-monitoring has 1.6G% rule groups that did not evaluate for at least 10x of their expected interval."
Checked the alert definition:
1. {{$value | humanize}}% should be {{$value | humanize}}
2. The expr seems odd; not sure if it is right.

        - alert: ThanosRuleNoEvaluationFor10Intervals
          annotations:
            description: Thanos Rule {{$labels.job}} in {{$labels.namespace}} has {{$value
              | humanize}}% rule groups that did not evaluate for at least 10x of their
              expected interval.
            summary: Thanos Rule has rule groups that did not evaluate for 10 intervals.
          expr: |
            time() -  max by (namespace, job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job="thanos-ruler"})
            >
            10 * max by (namespace, job, instance, group) (prometheus_rule_group_interval_seconds{job="thanos-ruler"})
          for: 5m
          labels:
            severity: info
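
To illustrate where a figure like "1.6G%" can come from, here is a minimal Python sketch of the expression's semantics. A PromQL comparison without `bool` keeps the left-hand value, so `$value` is the number of seconds since the last evaluation, not a percentage. The sketch assumes the rule group had never been evaluated (last-evaluation timestamp of 0) and a 30-second interval; `elapsed_if_stale` and `humanize_si` are hypothetical helpers, and `humanize_si` is only a simplified stand-in for the `humanize` template function, not its actual implementation.

```python
from datetime import datetime, timezone


def elapsed_if_stale(now, last_eval_ts, interval_s, factor=10):
    """Mimic `time() - last_eval > factor * interval` for one rule group.

    Returns the elapsed seconds (the left-hand side, which becomes $value)
    when the condition holds, otherwise None (no alert for this group).
    """
    elapsed = now - last_eval_ts
    return elapsed if elapsed > factor * interval_s else None


def humanize_si(value, digits=2):
    """Simplified SI-prefix formatter (hypothetical stand-in for `humanize`)."""
    prefix = ""
    for p in ("k", "M", "G", "T"):
        if abs(value) < 1000:
            break
        value /= 1000.0
        prefix = p
    return f"{value:.{digits}g}{prefix}"


# Evaluation time taken from this report's timestamp.
now = datetime(2022, 1, 13, 10, 58, 32, tzinfo=timezone.utc).timestamp()

# Assumption: the group never evaluated, so its last-evaluation timestamp is 0.
never_evaluated = elapsed_if_stale(now, last_eval_ts=0, interval_s=30)
# A group evaluated 30 seconds ago does not satisfy the condition.
healthy = elapsed_if_stale(now, last_eval_ts=now - 30, interval_s=30)

print(humanize_si(never_evaluated) + "%")
print(healthy)
```

Under these assumptions the elapsed time is roughly the current Unix time, which is why the rendered value lands in the "G" range and the trailing "%" is misleading.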

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-01-11-065245

How reproducible:
always

Steps to Reproduce:
1. see the description

Actual results:


Expected results:


Additional info:

Comment 1 hongyan li 2022-01-13 11:20:21 UTC
I think the alert rule itself is fine; this is not a bug.

Comment 2 hongyan li 2022-01-13 11:33:51 UTC
The description of the alert does have an issue.

Comment 3 Prashant Balachandran 2022-01-27 11:34:36 UTC
Created an upstream PR: https://github.com/thanos-io/thanos/pull/5105

Comment 4 Prashant Balachandran 2022-03-01 11:42:06 UTC
The changes have been pulled into CMO with this PR:
https://github.com/openshift/cluster-monitoring-operator/pull/1556/

Comment 7 Junqi Zhao 2022-06-06 03:09:27 UTC
Tested with 4.11.0-0.nightly-2022-06-04-014713; the ThanosRuleNoEvaluationFor10Intervals definition is updated to:
        - alert: ThanosRuleNoEvaluationFor10Intervals
          annotations:
            description: Thanos Rule {{$labels.job}} in {{$labels.namespace}} has rule groups
              that did not evaluate for at least 10x of their expected interval.
            summary: Thanos Rule has rule groups that did not evaluate for 10 intervals.
          expr: |
            time() -  max by (namespace, job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job="thanos-ruler"})
            >
            10 * max by (namespace, job, instance, group) (prometheus_rule_group_interval_seconds{job="thanos-ruler"})
          for: 5m
          labels:
            severity: info
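
A check like the one above could be scripted roughly as follows. This is only a sketch: the two description strings are copied from the rule definitions quoted in this bug, and `template_vars` is a hypothetical helper that lists the variables referenced inside `{{ ... }}` template actions using a plain regular expression rather than a real template parser.

```python
import re

# Description text from the original rule (4.10 nightly) quoted in this bug.
OLD_DESCRIPTION = (
    "Thanos Rule {{$labels.job}} in {{$labels.namespace}} has {{$value\n"
    "  | humanize}}% rule groups that did not evaluate for at least 10x of their\n"
    "  expected interval."
)

# Description text from the updated rule (4.11 nightly) quoted in this bug.
NEW_DESCRIPTION = (
    "Thanos Rule {{$labels.job}} in {{$labels.namespace}} has rule groups\n"
    "  that did not evaluate for at least 10x of their expected interval."
)


def template_vars(text):
    """Return the sorted set of $variables used inside {{ ... }} actions."""
    actions = re.findall(r"\{\{(.*?)\}\}", text, flags=re.S)
    found = set()
    for action in actions:
        found.update(re.findall(r"\$[A-Za-z_][\w.]*", action))
    return sorted(found)


print(template_vars(OLD_DESCRIPTION))
print(template_vars(NEW_DESCRIPTION))
```

The updated description should reference only the label variables, with no `$value` left to be rendered as a percentage.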

Comment 11 errata-xmlrpc 2022-08-10 10:42:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069