Bug 2040277

Summary:

ThanosRuleNoEvaluationFor10Intervals alert description is wrong

Product:

OpenShift Container Platform

Reporter:

Junqi Zhao <juzhao>

Component:

Monitoring

Assignee:

Simon Pasquier <spasquie>

Status:

CLOSED ERRATA

QA Contact:

Junqi Zhao <juzhao>

Severity:

low

Priority:

medium

Version:

4.10

CC:

amuller, anpicker, aos-bugs, hongyli

Target Release:

4.11.0

Hardware:

Unspecified

OS:

Unspecified

Doc Type:

No Doc Update

Last Closed:

2022-08-10 10:42:08 UTC

Type:

Bug

Attachments:

Description	Flags
ThanosRuleNoEvaluationFor10Intervals alert expr result in prometheus	none

Description Junqi Zhao 2022-01-13 10:58:32 UTC

Created attachment 1850565
ThanosRuleNoEvaluationFor10Intervals alert expr result in prometheus

Description of problem:
I came across the ThanosRuleNoEvaluationFor10Intervals alert. Its description reads:
"Thanos Rule thanos-ruler in openshift-user-workload-monitoring has 1.6G% rule groups that did not evaluate for at least 10x of their expected interval."
I checked the alert definition and found two issues:
1. {{$value | humanize}}% should be {{$value | humanize}}: the expr yields a number of seconds, not a percentage.
2. The expr looks odd; I am not sure it is right.

        - alert: ThanosRuleNoEvaluationFor10Intervals
          annotations:
            description: Thanos Rule {{$labels.job}} in {{$labels.namespace}} has {{$value
              | humanize}}% rule groups that did not evaluate for at least 10x of their
              expected interval.
            summary: Thanos Rule has rule groups that did not evaluate for 10 intervals.
          expr: |
            time() -  max by (namespace, job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job="thanos-ruler"})
            >
            10 * max by (namespace, job, instance, group) (prometheus_rule_group_interval_seconds{job="thanos-ruler"})
          for: 5m
          labels:
            severity: info
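
To make the two points above concrete, here is a minimal Python sketch of what the expr computes. It is an illustration only, not the Thanos or Prometheus implementation: the function names, the 30-second interval, and the assumption that the affected rule group reported a last-evaluation timestamp of 0 are all hypothetical. Under that assumption the alert value is simply the current Unix time in seconds, which is on the order of 1.6e9; that would explain the G suffix and why a % sign after it is meaningless. humanize_si is a simplified stand-in for the humanize template function, not its actual code.

```python
from datetime import datetime, timezone


def seconds_since_last_evaluation(now: float, last_evaluation: float) -> float:
    """Left-hand side of the expr: time() minus the last evaluation timestamp."""
    return now - last_evaluation


def is_stalled(now: float, last_evaluation: float, interval: float) -> bool:
    """Alert condition: no evaluation for more than 10x the group interval."""
    return seconds_since_last_evaluation(now, last_evaluation) > 10 * interval


def humanize_si(value: float) -> str:
    """Simplified SI-prefix formatting (an approximation for illustration)."""
    prefix = ""
    for candidate in ("k", "M", "G", "T"):
        if abs(value) < 1000:
            break
        value /= 1000
        prefix = candidate
    return f"{value:.4g}{prefix}"


# Time at which the bug was reported, as a Unix timestamp in seconds.
report_time = datetime(2022, 1, 13, 10, 58, 32, tzinfo=timezone.utc).timestamp()

# Assumption: the rule group reported a last-evaluation timestamp of 0.
value = seconds_since_last_evaluation(report_time, 0.0)

# Hypothetical 30-second evaluation interval.
print("alert value (seconds):", value)
print("rendered value:", humanize_si(value))
print("condition holds:", is_stalled(report_time, 0.0, 30.0))
```

By the same logic, a group that last evaluated 60 seconds ago with the same 30-second interval would not satisfy the condition, because 60 is below 10 * 30.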

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-01-11-065245

How reproducible:
always

Steps to Reproduce:
1. see the description

Comment 1 hongyan li 2022-01-13 11:20:21 UTC

I think the alert rule itself is fine; this is not a bug.

Comment 2 hongyan li 2022-01-13 11:33:51 UTC

The alert description does have an issue.

Comment 3 Prashant Balachandran 2022-01-27 11:34:36 UTC

Created upstream PR https://github.com/thanos-io/thanos/pull/5105

Comment 4 Prashant Balachandran 2022-03-01 11:42:06 UTC

The changes have been pulled into CMO with this PR:
https://github.com/openshift/cluster-monitoring-operator/pull/1556/

Comment 7 Junqi Zhao 2022-06-06 03:09:27 UTC

Tested with 4.11.0-0.nightly-2022-06-04-014713; the ThanosRuleNoEvaluationFor10Intervals definition is now:
        - alert: ThanosRuleNoEvaluationFor10Intervals
          annotations:
            description: Thanos Rule {{$labels.job}} in {{$labels.namespace}} has rule groups
              that did not evaluate for at least 10x of their expected interval.
            summary: Thanos Rule has rule groups that did not evaluate for 10 intervals.
          expr: |
            time() -  max by (namespace, job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job="thanos-ruler"})
            >
            10 * max by (namespace, job, instance, group) (prometheus_rule_group_interval_seconds{job="thanos-ruler"})
          for: 5m
          labels:
            severity: info

Comment 11 errata-xmlrpc 2022-08-10 10:42:08 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069