Created attachment 2015332 [details]
PrometheusRuleFailures alert details

Description of problem (please be detailed as possible and provide log snippets):

Prometheus is continuously raising the alert "PrometheusRuleFailures" for "rule_group = /etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-storage-prometheus-ceph-rules-57d3a51f-f3ed-4c66-ab52-8f74a5c6b2b9.yaml;pool-quota.rules".

Alert details:
```
Name: PrometheusRuleFailures
Severity: Warning
Description: Prometheus Namespace openshift-monitoring/Pod prometheus-k8s-1 has failed to evaluate 20 rules in the last 5m.
Summary: Prometheus is failing rule evaluations.
Runbook: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/PrometheusRuleFailures.md
```

These rules come from ocs-operator:
https://github.com/red-hat-storage/ocs-operator/blob/73ef95adeacf878338ffa1452d3857b410ee1172/controllers/storagecluster/prometheus/localcephrules.yaml#L340-L365

Version of all relevant components (if applicable):
ODF 4.15.0
OCP 4.15.0-rc.4

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
Probably; this alert was not seen before.

Steps to Reproduce:
1. Deploy ODF
2. Create a StorageCluster
3. Check Observe > Alerting > Alerts, or the Status card in Home > Overview

Actual results:
The PrometheusRuleFailures alert is raised.

Expected results:
PrometheusRule evaluations should not fail.

Additional info:
Screenshots attached.
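To narrow down which rules fail and why, a minimal sketch of the checks one might run (assuming cluster-admin access via `oc`; the pod name and rule file path below are taken from the alert in this report and will differ between clusters):

```
# Copy the generated rule file out of the Prometheus pod and validate it with
# promtool; the file name comes from the rule_group label of the alert.
oc exec -n openshift-monitoring prometheus-k8s-1 -c prometheus -- \
  cat /etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-storage-prometheus-ceph-rules-57d3a51f-f3ed-4c66-ab52-8f74a5c6b2b9.yaml \
  > ceph-rules.yaml
promtool check rules ceph-rules.yaml

# The actual evaluation errors (bad expressions, vector matching problems, etc.)
# are logged by the Prometheus container.
oc logs -n openshift-monitoring prometheus-k8s-1 -c prometheus | grep -i 'pool-quota'
```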
Added PR: https://github.com/red-hat-storage/ocs-operator/pull/2596
The PR, https://github.com/red-hat-storage/ocs-operator/pull/2596, is merged now...
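Once a build containing the PR is installed, a quick sanity check that the updated rules actually reached the cluster could look like the sketch below (the PrometheusRule name `prometheus-ceph-rules` is inferred from the rule file name in the alert and may differ):

```
# Dump the deployed PrometheusRule and inspect the pool-quota group, then compare
# the expressions with the ones changed in the PR.
oc get prometheusrule prometheus-ceph-rules -n openshift-storage -o yaml \
  | grep -n -A 20 'pool-quota'
```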
Adding RDT details, please take a look.
The alert is still present with an AWS IPI KMS THALES 1AZ RHCOS 3M 3W cluster (https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/11930/).
For more info: https://bugzilla.redhat.com/show_bug.cgi?id=2266316
Tested with ODF 4.16.0-108
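A quick way to confirm which rule groups are still failing on such a cluster (a sketch; assumes the standard thanos-querier route in openshift-monitoring and a logged-in `oc` session):

```
# Query the in-cluster monitoring stack for rule groups with evaluation failures
# in the last 5 minutes; prometheus_rule_evaluation_failures_total is the metric
# the PrometheusRuleFailures alert is based on.
TOKEN=$(oc whoami -t)
HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
curl -skH "Authorization: Bearer ${TOKEN}" "https://${HOST}/api/v1/query" \
  --data-urlencode 'query=sum by (rule_group) (increase(prometheus_rule_evaluation_failures_total[5m])) > 0'
```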
Hi Filip, I'm unable to reproduce this on a regular AWS cluster (without any KMS Thales configuration). Since the related BZs (this one, BZ#2262943, and BZ#2266316) all occur on this specific KMS Thales configuration, can we open a new BZ just for that particular combination and close these BZs? Please let me know what you think.
With the new regression test results, it looks like there has been no progress on the issue. The alert is present for most IPI deployments. We can close some of these BZs to get a fix in (it appears to help in one instance, but that might be a coincidence), but the issue is still present: https://docs.google.com/spreadsheets/d/1akrwspvWglSs905x2JcydJNH08WO6Ptri-hrkZ2VO80/edit#gid=40270420&range=F705
Arun/Filip, what are the next steps for this BZ and BZ#2266316? Can we please take a decision?
Hi Mudit, I had a chat with Filip and will (most probably) get a setup by tomorrow. I will update with more details on the fix.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591