Bug 2040277
| Summary: | ThanosRuleNoEvaluationFor10Intervals alert description is wrong | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> |
| Component: | Monitoring | Assignee: | Simon Pasquier <spasquie> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | low | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.10 | CC: | amuller, anpicker, aos-bugs, hongyli |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-10 10:42:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
I think the alert rule itself has no issue, so that part is not a bug. The problem is in the alert description.

Created upstream PR https://github.com/thanos-io/thanos/pull/5105

The changes have been pulled into CMO with this PR: https://github.com/openshift/cluster-monitoring-operator/pull/1556/

Tested with 4.11.0-0.nightly-2022-06-04-014713; the ThanosRuleNoEvaluationFor10Intervals definition is updated to:
- alert: ThanosRuleNoEvaluationFor10Intervals
  annotations:
    description: Thanos Rule {{$labels.job}} in {{$labels.namespace}} has rule groups
      that did not evaluate for at least 10x of their expected interval.
    summary: Thanos Rule has rule groups that did not evaluate for 10 intervals.
  expr: |
    time() - max by (namespace, job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job="thanos-ruler"})
    >
    10 * max by (namespace, job, instance, group) (prometheus_rule_group_interval_seconds{job="thanos-ruler"})
  for: 5m
  labels:
    severity: info
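To make the condition in `expr` easier to reason about, here is a minimal Python sketch of the same comparison for a set of rule groups. The function names `seconds_since_last_eval` and `stalled_groups` and the sample numbers are hypothetical, chosen only for illustration. The sketch ignores the `for: 5m` clause and the `max by (...)` aggregation, and only mirrors the fact that a PromQL comparison without `bool` keeps the left-hand value for series where the comparison holds.

```python
from typing import Dict, Optional, Tuple


def seconds_since_last_eval(
    now: float, last_eval_ts: float, interval_s: float, factor: float = 10.0
) -> Optional[float]:
    """Mimic `time() - last_eval > 10 * interval` for one rule group.

    Returns the left-hand side (seconds since the last evaluation) when the
    comparison holds, otherwise None (the series would be filtered out).
    """
    elapsed = now - last_eval_ts
    return elapsed if elapsed > factor * interval_s else None


def stalled_groups(
    now: float, groups: Dict[str, Tuple[float, float]]
) -> Dict[str, float]:
    """groups maps a group name to (last_eval_ts, interval_s)."""
    result = {}
    for name, (last_eval_ts, interval_s) in groups.items():
        value = seconds_since_last_eval(now, last_eval_ts, interval_s)
        if value is not None:
            result[name] = value
    return result


# Hypothetical sample data: group "a" evaluated 10s ago, group "b" 400s ago,
# both with a 30s interval (so the threshold is 300s).
sample = {"a": (990.0, 30.0), "b": (600.0, 30.0)}
firing = stalled_groups(1000.0, sample)
```

In this sketch only group `b` exceeds the threshold, and the value carried with it is a duration in seconds.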
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
Created attachment 1850565
ThanosRuleNoEvaluationFor10Intervals alert expr result in prometheus

Description of problem:
Came across the ThanosRuleNoEvaluationFor10Intervals alert. Its description is:
"Thanos Rule thanos-ruler in openshift-user-workload-monitoring has 1.6G% rule groups that did not evaluate for at least 10x of their expected interval."

Checked the alert detail:
1. {{$value | humanize}}% should be {{$value | humanize}}
2. The expr looks odd; not sure if it is right.

- alert: ThanosRuleNoEvaluationFor10Intervals
  annotations:
    description: Thanos Rule {{$labels.job}} in {{$labels.namespace}} has {{$value | humanize}}% rule groups that did not evaluate for at least 10x of their expected interval.
    summary: Thanos Rule has rule groups that did not evaluate for 10 intervals.
  expr: |
    time() - max by (namespace, job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job="thanos-ruler"})
    >
    10 * max by (namespace, job, instance, group) (prometheus_rule_group_interval_seconds{job="thanos-ruler"})
  for: 5m
  labels:
    severity: info

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-01-11-065245

How reproducible:
always

Steps to Reproduce:
1. see the description
2.
3.

Actual results:

Expected results:

Additional info:
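As a rough illustration of why a description such as "1.6G%" is misleading: for this expr, `$value` is the left-hand side of the comparison, that is, the number of seconds since the last evaluation, not a percentage. The Python function below is a simplified stand-in for the Prometheus `humanize` template function, written here as an assumption about its behavior for values of 1 or more (SI prefix, four significant digits); it is not the actual Prometheus code, and the input `1.6e9` is a hypothetical value picked to match the reported text.

```python
def humanize(value: float) -> str:
    """Simplified sketch of SI-prefix formatting for values >= 1.

    Assumption: values below 1 are formatted without a prefix here; the
    real template function handles small values differently.
    """
    if value == 0 or abs(value) < 1:
        return f"{value:.4g}"
    prefix = ""
    for p in ["k", "M", "G", "T", "P", "E", "Z", "Y"]:
        if abs(value) < 1000:
            break
        prefix = p
        value /= 1000
    return f"{value:.4g}{prefix}"


# Hypothetical $value: about 1.6 billion seconds since the last evaluation.
rendered = humanize(1.6e9)
old_description_fragment = f"has {rendered}% rule groups"
```

A value in the billions of seconds would arise if, for example, the last-evaluation timestamp were 0, so that `time() - 0` is simply the current Unix time. That is one possible reading, not something the report confirms.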