Bug 2262943

Summary: PrometheusRule evaluation failing for pool-quota.rules
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: umanga <uchapaga>
Component: ceph-monitoring Assignee: arun kumar mohan <amohan>
Status: CLOSED ERRATA QA Contact: Filip Balák <fbalak>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.15 CC: amohan, fbalak, kbg, muagarwa, nthomas, odf-bz-bot
Target Milestone: ---   
Target Release: ODF 4.16.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.16.0-102 Doc Type: Bug Fix
Doc Text:
.PrometheusRule evaluation failing for pool-quota rule
Previously, none of the Ceph pool quota alerts were displayed because, in a multi-cluster setup, the `PrometheusRuleFailures` alert was fired due to the `pool-quota` rules. The queries in the `pool-quota` section could not distinguish which cluster an alert was fired from in a multi-cluster setup. With this fix, a `managedBy` label is added to all the queries in the `pool-quota` section so that each cluster generates unique results. As a result, the `PrometheusRuleFailures` alert is no longer seen and all the alerts in `pool-quota` work as expected.
Story Points: ---
Clone Of: Environment:
Last Closed: 2024-07-17 13:13:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2260844, 2266316    

Description umanga 2024-02-06 07:23:31 UTC
Created attachment 2015332 [details]
PrometheusRuleFailures alert details

Description of problem (please be as detailed as possible and provide log
snippets):

Prometheus is continuously raising an alert "PrometheusRuleFailures" for "rule_group = /etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-storage-prometheus-ceph-rules-57d3a51f-f3ed-4c66-ab52-8f74a5c6b2b9.yaml;pool-quota.rules"

Alert details:
```
Name: PrometheusRuleFailures

Severity: Warning

Description: Prometheus Namespace openshift-monitoring/Pod prometheus-k8s-1 has failed to evaluate 20 rules in the last 5m.

Summary: Prometheus is failing rule evaluations.

Runbook: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/PrometheusRuleFailures.md
```

These rules come from ocs-operator: https://github.com/red-hat-storage/ocs-operator/blob/73ef95adeacf878338ffa1452d3857b410ee1172/controllers/storagecluster/prometheus/localcephrules.yaml#L340-L365
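
For context, here is a minimal sketch of the shape of these rules; the alert name, threshold, and exact label joins are assumptions for illustration, and the verbatim expressions are at the link above:

```
# Illustrative sketch only -- see localcephrules.yaml (pool-quota.rules) above
# for the real expressions. Alert name and threshold are assumed.
groups:
- name: pool-quota.rules
  rules:
  - alert: CephPoolQuotaBytesNearExhaustion
    expr: |
      (ceph_pool_stored_raw * on (pool_id) group_left(name) ceph_pool_metadata)
        / (ceph_pool_quota_bytes * on (pool_id) group_left(name) ceph_pool_metadata > 0)
        > 0.70
    for: 1m
    labels:
      severity: warning
```

The relevant detail is the `on (pool_id)` vector match: when more than one Ceph cluster reports these series to the same Prometheus, the same `pool_id` appears on each side more than once, the match group contains duplicate series, and the whole rule group fails to evaluate, which is what PrometheusRuleFailures reports.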

Version of all relevant components (if applicable):
ODF 4.15.0
OCP 4.15.0-rc.4

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
Probably; this alert was not seen before.

Steps to Reproduce:
1. Deploy ODF
2. Create StorageCluster
3. Check Observe > Alerting > Alerts or Status card in Home > Overview
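
A quick way to confirm which rule group is failing is to query Prometheus directly (Observe > Metrics). `prometheus_rule_evaluation_failures_total` is the standard Prometheus metric behind this alert; the exact expression used by cluster monitoring may differ slightly:

```
# Failed rule evaluations over the last 5 minutes, per rule group.
# A non-zero result for the ...pool-quota.rules group reproduces the
# condition reported by PrometheusRuleFailures.
sum by (rule_group) (increase(prometheus_rule_evaluation_failures_total[5m])) > 0
```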


Actual results:
PrometheusRuleFailures alert is raised.

Expected results:
PrometheusRule evaluations should not fail.

Additional info:
Attached screenshots

Comment 5 arun kumar mohan 2024-05-03 08:31:23 UTC
Added PR: https://github.com/red-hat-storage/ocs-operator/pull/2596
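
Going by the Doc Text above, the fix carries the `managedBy` label (identifying the owning StorageCluster) through the joins in the pool-quota queries. Below is a rough sketch of the fixed expression shape, reusing the illustrative rule from the description; the actual diff is in the PR:

```
# Sketch only -- see the PR for the real change. Including managedBy in the
# match group keeps series from different StorageClusters in separate groups,
# so the join no longer finds duplicate series.
expr: |
  (ceph_pool_stored_raw * on (pool_id, managedBy) group_left(name) ceph_pool_metadata)
    / (ceph_pool_quota_bytes * on (pool_id, managedBy) group_left(name) ceph_pool_metadata > 0)
    > 0.70
```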

Comment 8 arun kumar mohan 2024-05-14 06:34:36 UTC
The PR, https://github.com/red-hat-storage/ocs-operator/pull/2596, is merged now...

Comment 14 arun kumar mohan 2024-05-29 15:17:56 UTC
Adding RDT details, please take a look.

Comment 15 Filip Balák 2024-05-30 09:36:29 UTC
The alert is still present with:
AWS IPI KMS THALES 1AZ RHCOS 3M 3W Cluster (https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/11930/)

For more info https://bugzilla.redhat.com/show_bug.cgi?id=2266316

Tested with ODF 4.16.0-108

Comment 16 arun kumar mohan 2024-06-04 08:06:31 UTC
Hi Filip, 
I'm unable to repro this on a normal AWS cluster (without any KMS THALES configuration).
Since both related BZs (BZ#2262943, this one, and BZ#2266316) occur only with this specific KMS Thales configuration, can we open a new BZ just for that particular combination and close these BZs?
Please let me know what you think.

Comment 17 Filip Balák 2024-06-10 10:13:22 UTC
Based on the new regression test results, it looks like there is no progress on the issue. The alert is present for most IPI deployments. We can close some of these BZs to get the current fix in (it appears to help in one instance, but that might be a coincidence), but the issue is still present: https://docs.google.com/spreadsheets/d/1akrwspvWglSs905x2JcydJNH08WO6Ptri-hrkZ2VO80/edit#gid=40270420&range=F705

Comment 18 Mudit Agarwal 2024-06-10 16:08:11 UTC
Arun/Filip, what are the next steps for this BZ and BZ#2266316?
Can we please make a decision?

Comment 19 arun kumar mohan 2024-06-10 17:20:30 UTC
Hi Mudit,
Had a chat with Filip and (most probably) will get a setup by tomorrow.
Will update with more details (on the fix).

Comment 25 errata-xmlrpc 2024-07-17 13:13:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591