Bug 2266316
| Summary: | PrometheusRuleFailures alert after installation or upgrade | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Filip Balák <fbalak> |
| Component: | ceph-monitoring | Assignee: | arun kumar mohan <amohan> |
| Status: | CLOSED ERRATA | QA Contact: | Filip Balák <fbalak> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.15 | CC: | amohan, kbg, muagarwa, nthomas, odf-bz-bot |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | ODF 4.16.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-07-17 13:14:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2262943 | | |
| Bug Blocks: | 2260844 | | |

Doc Text:

.PrometheusRuleFailures alert after installation or upgrade

Previously, Ceph quorum-related alerts were not seen; instead, the `PrometheusRuleFailures` alert was fired, which usually fires when queries produce ambiguous results. In a multi-cluster scenario, the queries in the `quorum-alert` rules returned indistinguishable results because Prometheus could not identify which cluster the quorum alerts came from.

With this fix, a unique `managedBy` label is added to each query in the quorum rules so that the query results include the name of the cluster they came from. As a result, the Prometheus failure alert is not fired and the clusters are able to trigger all the Ceph mon quorum-related alerts.
Description
Filip Balák, 2024-02-27 15:11:32 UTC

Not a blocker. Moving to 4.16.

The alert started to appear with other configurations, including AWS IPI, Azure IPI, GCP IPI, and clusters without encryption, e.g.:
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/632/19369/941049/941073/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/632/19359/940442/940473/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/632/19203/932821/932845/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/632/18899/918955/918986/log

@filip, we have a BZ reported, https://bugzilla.redhat.com/show_bug.cgi?id=2262943, for a similar issue, but we are unable to reproduce it. Can you please check whether this issue is happening with the latest 4.15 or 4.16 clusters? If it is still happening, can you please provide me a setup to take a look.

Thanks,
Arun

PR: https://github.com/red-hat-storage/ocs-operator/pull/2596 should fix any issue raised because of multiple clusters and should prevent any 'PrometheusRuleFailures' errors. The same fix is provided for BZ https://bugzilla.redhat.com/show_bug.cgi?id=2262943, which has similar traits.

@Filip, can you please comment on whether we had multiple clusters at play when this error/failure happened.

Thanks,
Arun

(In reply to arun kumar mohan from comment #6)
> PR: https://github.com/red-hat-storage/ocs-operator/pull/2596
> should fix any issue raised because of multiple clusters and should prevent
> any 'PrometheusRuleFailures' errors.
> The same fix is provided for BZ:
> https://bugzilla.redhat.com/show_bug.cgi?id=2262943, which has similar
> traits.
>
> @Filip, can you please comment whether we had multiple clusters at play when
> this error/failure happened.
>
> Thanks,
> Arun

I am looking at the regression run list for the test_prometheus_rule_failures test case: https://docs.google.com/spreadsheets/d/1akrwspvWglSs905x2JcydJNH08WO6Ptri-hrkZ2VO80/edit#gid=40270420&range=F703
I don't see a problem with multicluster, but it looks to me that all IPI instances (both AWS and vSphere) have the issue, while none of the UPI instances do. Encryption (in-transit, Thales KMS) can also cause the issue, but encryption at rest doesn't seem to trigger it (though this could be a coincidence due to job configurations).

In the above PR#2596, we have updated the queries. Moving this to MODIFIED for now. Once the bug is triaged for verification, we can confirm the fix (or not).

4.16 backport PR: https://github.com/red-hat-storage/ocs-operator/pull/2608

Added RDT details, please take a look...

The fix is only partial. It helped to fix the issue with the following configuration:
VSPHERE IPI IN-TRANSIT ENCRYPTION 1AZ RHCOS VSAN COMPACT MODE 3M 0W Cluster (https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/11835/)
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/678/21345/1018347/1018381/log
(although it could be a coincidence; the ODF version there is 4.16.0-99)

But we still see it with this configuration:
AWS IPI KMS THALES 1AZ RHCOS 3M 3W Cluster (https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/11930/)
Tested with ODF 4.16.0-108.

Even though the issue happened in two different parts of the same prometheus-rule file for BZ#2266316 and BZ#2262943, the underlying cause is the same. We are currently triaging the issue through BZ#2262943, and once we have a setup that reproduces the issue for further analysis, we will mark these BZs as duplicates.
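For context, here is a minimal sketch of the technique described in the Doc Text and in PR#2596: aggregating the mon-quorum query by a per-cluster `managedBy` label so that results from different StorageClusters stay distinguishable. This is not the actual ocs-operator change; the group name, alert name, PromQL expression, and cluster name are illustrative assumptions, and the prometheus-operator Go API is used only to show where such a rule would be declared.

```go
package main

import (
	"fmt"

	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// quorumRuleGroup builds one rule group whose query aggregates by the
// per-cluster "managedBy" label, so a Prometheus instance that evaluates
// rules for several StorageClusters keeps one distinguishable series per
// cluster instead of producing the ambiguous results that trip
// PrometheusRuleFailures.
func quorumRuleGroup(clusterName string) monitoringv1.RuleGroup {
	// Illustrative expression only; the rule actually shipped by ocs-operator differs.
	expr := fmt.Sprintf(
		`count(ceph_mon_quorum_status{managedBy=%q} == 1) by (managedBy) <= (floor(count(ceph_mon_metadata{managedBy=%q}) by (managedBy) / 2) + 1)`,
		clusterName, clusterName)

	return monitoringv1.RuleGroup{
		Name: "ceph-mon-quorum.rules", // hypothetical group name
		Rules: []monitoringv1.Rule{{
			Alert: "CephMonQuorumAtRisk", // hypothetical alert name
			Expr:  intstr.FromString(expr),
			Labels: map[string]string{
				"severity":  "critical",
				"managedBy": clusterName, // carries the cluster identity on the alert itself
			},
		}},
	}
}

func main() {
	g := quorumRuleGroup("ocs-storagecluster")
	fmt.Println(g.Rules[0].Alert, "=>", g.Rules[0].Expr.String())
}
```

Grouping both sides of the comparison by `managedBy` keeps the match one-to-one per cluster, which is why the rule evaluation no longer yields duplicate or many-to-many results when several clusters report the same metric names.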
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591
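For anyone re-checking a cluster for this symptom, the sketch below queries Prometheus directly for recent rule-evaluation failures, which is the condition the `PrometheusRuleFailures` alert reports on. The address (a local port-forward) and the lack of token-based auth are assumptions for illustration only; a real check against OpenShift monitoring would need the cluster's route and a bearer token.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumes `oc port-forward` (or similar) exposes Prometheus locally.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Rule-evaluation failures over the last five minutes; a non-empty
	// result is the kind of condition that drives PrometheusRuleFailures.
	result, warnings, err := promv1.NewAPI(client).Query(ctx,
		`increase(prometheus_rule_evaluation_failures_total[5m]) > 0`, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result) // an empty vector means no recent rule-evaluation failures
}
```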