Bug 2262943 - PrometheusRule evaluation failing for pool-quota.rules
Summary: PrometheusRule evaluation failing for pool-quota.rules
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: arun kumar mohan
QA Contact: Filip Balák
URL:
Whiteboard:
Depends On:
Blocks: 2260844 2266316
 
Reported: 2024-02-06 07:23 UTC by umanga
Modified: 2024-07-17 13:13 UTC
CC: 6 users

Fixed In Version: 4.16.0-102
Doc Type: Bug Fix
Doc Text:
.PrometheusRule evaluation failing for pool-quota rules
Previously, in a multi-cluster setup, none of the Ceph pool quota alerts were displayed because the `PrometheusRuleFailures` alert was fired for the `pool-quota` rules: the queries in the `pool-quota` section could not distinguish which cluster an alert was fired from. With this fix, a `managedBy` label is added to all the queries in `pool-quota` so that each cluster generates unique results. As a result, the `PrometheusRuleFailures` alert is no longer seen and all the alerts in `pool-quota` work as expected.
Clone Of:
Environment:
Last Closed: 2024-07-17 13:13:18 UTC
Embargoed:


Attachments: PrometheusRuleFailures alert details (attachment 2015332)


Links:
- GitHub red-hat-storage/ocs-operator pull 2596 (open): Fixed 'pool-quota.rules' queries for multicluster mode (last updated 2024-05-03 08:31:22 UTC)
- GitHub red-hat-storage/ocs-operator pull 2608 (open): Bug 2262943: [release-4.16] Make all alerts/rules compatible with multicluster mode (last updated 2024-05-14 09:18:30 UTC)
- Red Hat Product Errata RHSA-2024:4591 (last updated 2024-07-17 13:13:31 UTC)

Internal Links: 2291298

Description umanga 2024-02-06 07:23:31 UTC
Created attachment 2015332 [details]
PrometheusRuleFailures alert details

Description of problem (please be as detailed as possible and provide log
snippets):

Prometheus is continuously raising an alert "PrometheusRuleFailures" for "rule_group = /etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-storage-prometheus-ceph-rules-57d3a51f-f3ed-4c66-ab52-8f74a5c6b2b9.yaml;pool-quota.rules"

Alert details:
```
Name: PrometheusRuleFailures

Severity: Warning

Description: Prometheus Namespace openshift-monitoring/Pod prometheus-k8s-1 has failed to evaluate 20 rules in the last 5m.

Summary: Prometheus is failing rule evaluations.

Runbook: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/PrometheusRuleFailures.md
```

These rules come from ocs-operator: https://github.com/red-hat-storage/ocs-operator/blob/73ef95adeacf878338ffa1452d3857b410ee1172/controllers/storagecluster/prometheus/localcephrules.yaml#L340-L365
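
The failure mode follows from how those rules are written: the pool-quota expressions join pool metrics to `ceph_pool_metadata`, typically on `pool_id` alone. When a single Prometheus scrapes more than one Ceph cluster, the same `pool_id` can appear once per cluster, the join becomes many-to-many, and Prometheus rejects the whole rule group. A minimal before/after sketch of the query shape, based on the `managedBy` fix described in the Doc Text; these are illustrative expressions, not the exact shipped rules:
```
# Before (sketch): matching on pool_id only. With two clusters exporting
# the same pool_id, the join is many-to-many and evaluation fails.
(ceph_pool_stored_raw * on (pool_id) group_left (name) ceph_pool_metadata)
  / ((ceph_pool_quota_bytes * on (pool_id) group_left (name) ceph_pool_metadata) > 0)
  > 0.70

# After (sketch): adding managedBy to the match keys makes each
# (pool_id, managedBy) pair unique per cluster.
(ceph_pool_stored_raw * on (pool_id, managedBy) group_left (name) ceph_pool_metadata)
  / ((ceph_pool_quota_bytes * on (pool_id, managedBy) group_left (name) ceph_pool_metadata) > 0)
  > 0.70
```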

Version of all relevant components (if applicable):
ODF 4.15.0
OCP 4.15.0-rc.4

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
Probably. This alert was not seen before.

Steps to Reproduce:
1. Deploy ODF
2. Create StorageCluster
3. Check Observe > Alerting > Alerts or the Status card in Home > Overview (a metrics-based check is sketched below)
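
For a more direct signal than the console cards in step 3, Prometheus' own rule-evaluation metrics can be queried under Observe > Metrics. A small sketch; the `rule_group` regex is an assumption based on the alert details above:
```
# Non-zero results mean rules in the pool-quota group failed to
# evaluate within the last 5 minutes.
increase(prometheus_rule_evaluation_failures_total{rule_group=~".*pool-quota.rules"}[5m]) > 0
```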


Actual results:
PrometheusRuleFailures alert is raised.

Expected results:
PrometheusRule evaluations should not fail.

Additional info:
Attached screenshots

Comment 5 arun kumar mohan 2024-05-03 08:31:23 UTC
Added PR: https://github.com/red-hat-storage/ocs-operator/pull/2596

Comment 8 arun kumar mohan 2024-05-14 06:34:36 UTC
The PR, https://github.com/red-hat-storage/ocs-operator/pull/2596, is merged now...

Comment 14 arun kumar mohan 2024-05-29 15:17:56 UTC
Adding RDT details; please take a look.

Comment 15 Filip Balák 2024-05-30 09:36:29 UTC
The alert is still present with:
AWS IPI KMS THALES 1AZ RHCOS 3M 3W Cluster (https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/11930/)

For more info https://bugzilla.redhat.com/show_bug.cgi?id=2266316

Tested with ODF 4.16.0-108

Comment 16 arun kumar mohan 2024-06-04 08:06:31 UTC
Hi Filip, 
I'm unable to repro this on a normal AWS cluster (without any KMS THALES configuration).
Since both the related BZs (BZ#2262943, this one, and BZ#2266316) are happening on this specific KMS Thales configuration, can we open up a new BZ only for this particular combo and close these BZs?
Please let me know what you think.

Comment 17 Filip Balák 2024-06-10 10:13:22 UTC
With the new regression test results it looks like there is no progress on the issue. The alert is present for most IPI deployments. We can close some of these BZs to get some fix in (it looks to help in one instance, but this might be a coincidence), but the issue is still present: https://docs.google.com/spreadsheets/d/1akrwspvWglSs905x2JcydJNH08WO6Ptri-hrkZ2VO80/edit#gid=40270420&range=F705

Comment 18 Mudit Agarwal 2024-06-10 16:08:11 UTC
Arun/Filip, what are the next steps for this BZ and BZ#2266316?
Can we please take a decision?

Comment 19 arun kumar mohan 2024-06-10 17:20:30 UTC
Hi Mudit,
Had a chat with Filip and (most probably) will get a setup by tomorrow.
Will update with more details (on the fix).

Comment 25 errata-xmlrpc 2024-07-17 13:13:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

