Created attachment 2015332 [details]
PrometheusRuleFailures alert details

Description of problem (please be detailed as possible and provide log snippets):

Prometheus is continuously raising the alert "PrometheusRuleFailures" for "rule_group = /etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-storage-prometheus-ceph-rules-57d3a51f-f3ed-4c66-ab52-8f74a5c6b2b9.yaml;pool-quota.rules".

Alert details:
```
Name: PrometheusRuleFailures
Severity: Warning
Description: Prometheus Namespace openshift-monitoring/Pod prometheus-k8s-1 has failed to evaluate 20 rules in the last 5m.
Summary: Prometheus is failing rule evaluations.
Runbook: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/PrometheusRuleFailures.md
```

These rules come from ocs-operator:
https://github.com/red-hat-storage/ocs-operator/blob/73ef95adeacf878338ffa1452d3857b410ee1172/controllers/storagecluster/prometheus/localcephrules.yaml#L340-L365

Version of all relevant components (if applicable):
ODF 4.15.0
OCP 4.15.0-rc.4

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
Probably; this alert was not seen before.

Steps to Reproduce:
1. Deploy ODF
2. Create a StorageCluster
3. Check Observe > Alerting > Alerts, or the Status card in Home > Overview

Actual results:
The PrometheusRuleFailures alert is raised.

Expected results:
PrometheusRule evaluations should not fail.

Additional info:
Screenshots attached.
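To narrow down which rules fail and why, a minimal sketch of the checks one might run (assuming cluster-admin access via `oc`; the pod name and rule file path below are taken from the alert in this report and will differ between clusters):

```
# Copy the generated rule file out of the Prometheus pod and validate it with
# promtool; the file name comes from the rule_group label of the alert.
oc exec -n openshift-monitoring prometheus-k8s-1 -c prometheus -- \
  cat /etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-storage-prometheus-ceph-rules-57d3a51f-f3ed-4c66-ab52-8f74a5c6b2b9.yaml \
  > ceph-rules.yaml
promtool check rules ceph-rules.yaml

# The actual evaluation errors (bad expressions, vector matching problems, etc.)
# are logged by the Prometheus container.
oc logs -n openshift-monitoring prometheus-k8s-1 -c prometheus | grep -i 'pool-quota'
```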
Added PR: https://github.com/red-hat-storage/ocs-operator/pull/2596
The PR, https://github.com/red-hat-storage/ocs-operator/pull/2596, is merged now...
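Once a build containing the PR is installed, a quick sanity check that the updated rules actually reached the cluster could look like the sketch below (the PrometheusRule name `prometheus-ceph-rules` is inferred from the rule file name in the alert and may differ):

```
# Dump the deployed PrometheusRule and inspect the pool-quota group, then compare
# the expressions with the ones changed in the PR.
oc get prometheusrule prometheus-ceph-rules -n openshift-storage -o yaml \
  | grep -n -A 20 'pool-quota'
```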
Adding RDT details, please take a look.
The alert is still present with an AWS IPI KMS THALES 1AZ RHCOS 3M 3W cluster (https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/11930/).
For more info: https://bugzilla.redhat.com/show_bug.cgi?id=2266316
Tested with ODF 4.16.0-108
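A quick way to confirm which rule groups are still failing on such a cluster (a sketch; assumes the standard thanos-querier route in openshift-monitoring and a logged-in `oc` session):

```
# Query the in-cluster monitoring stack for rule groups with evaluation failures
# in the last 5 minutes; prometheus_rule_evaluation_failures_total is the metric
# the PrometheusRuleFailures alert is based on.
TOKEN=$(oc whoami -t)
HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
curl -skH "Authorization: Bearer ${TOKEN}" "https://${HOST}/api/v1/query" \
  --data-urlencode 'query=sum by (rule_group) (increase(prometheus_rule_evaluation_failures_total[5m])) > 0'
```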
Hi Filip, I'm unable to reproduce this on a regular AWS cluster (without any KMS Thales configuration). Since the related BZs (this one, BZ#2262943, and BZ#2266316) all occur on this specific KMS Thales configuration, can we open a new BZ just for that particular combination and close these BZs? Please let me know what you think.
With the new regression test results, it looks like there has been no progress on the issue. The alert is present for most IPI deployments. We can close some of these BZs to get a fix in (it appears to help in one instance, but that might be a coincidence), but the issue is still present: https://docs.google.com/spreadsheets/d/1akrwspvWglSs905x2JcydJNH08WO6Ptri-hrkZ2VO80/edit#gid=40270420&range=F705
Arun/Filip, what are the next steps for this BZ and BZ#2266316? Can we please take a decision?
Hi Mudit, I had a chat with Filip and will (most probably) get a setup by tomorrow. I will update with more details on the fix.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591