Description of problem (please be as detailed as possible and provide log snippets):

The alert is raised after installation only for a cluster with the configuration VSPHERE IPI IN-TRANSIT ENCRYPTION 1AZ RHCOS VSAN COMPACT MODE 3M 0W.

Used ocs-ci configuration:
https://github.com/red-hat-storage/ocs-ci/blob/master/conf/deployment/vsphere/ipi_1az_rhcos_vsan_compact_mode_3m_0w_intransit_encryption.yaml

The alert:
{'labels': {'alertname': 'PrometheusRuleFailures',
            'container': 'kube-rbac-proxy',
            'endpoint': 'metrics',
            'instance': '10.128.0.89:9092',
            'job': 'prometheus-k8s',
            'namespace': 'openshift-monitoring',
            'pod': 'prometheus-k8s-1',
            'rule_group': '/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-storage-prometheus-ceph-rules-b892e77b-d533-4511-8332-36159b2d5ce4.yaml;quorum-alert.rules',
            'service': 'prometheus-k8s',
            'severity': 'warning'},
 'annotations': {'description': 'Prometheus openshift-monitoring/prometheus-k8s-1 has failed to evaluate 5 rules in the last 5m.',
                 'runbook_url': 'https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/PrometheusRuleFailures.md',
                 'summary': 'Prometheus is failing rule evaluations.'},
 'state': 'pending',
 'activeAt': '2024-02-26T20:19:58.400392087Z',
 'value': '5.16962962962963e+00'}

Version of all relevant components (if applicable): OCS 4.15.0-149

Is this issue reproducible? Yes, in all test runs with this in-transit encryption 1AZ configuration:
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/632/18847/916378/916405/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/632/18847/916378/916405/log?logParams=history%3D895246%26page.page%3D1
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/632/18847/916378/916405/log?logParams=history%3D889041%26page.page%3D1

Steps to Reproduce:
1. Install a cluster on vSphere IPI with the following configuration:
   platform: 'vsphere'
   deployment_type: 'ipi'
   worker_replicas: 0
   master_replicas: 3
   master_num_cpus: '16'
   master_memory: '65536'
   fio_storageutilization_min_mbps: 10.0
   in_transit_encryption: true
2. Check alerts.

Actual results:
There is a PrometheusRuleFailures alert for rule_group /etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-storage-prometheus-ceph-rules-b892e77b-d533-4511-8332-36159b2d5ce4.yaml

Expected results:
There is no PrometheusRuleFailures alert.

Additional info:
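The "Check alerts" step can be sketched as a small filter over the alert dicts returned by the Prometheus alerts API. This is a minimal illustration, not the actual ocs-ci test code; the `storage_rule_failures` function is hypothetical, and the sample alert below is the one pasted in this report (trimmed to the relevant labels).

```python
# Minimal sketch: find PrometheusRuleFailures alerts whose failing rule
# group belongs to the openshift-storage prometheus-ceph-rules file.
# In practice the alert list would come from the Prometheus
# /api/v1/alerts endpoint; here we use a sample dict from this report.

def storage_rule_failures(alerts):
    """Return PrometheusRuleFailures alerts pointing at the
    openshift-storage ceph rule file (hypothetical helper)."""
    return [
        a for a in alerts
        if a["labels"].get("alertname") == "PrometheusRuleFailures"
        and "openshift-storage-prometheus-ceph-rules"
            in a["labels"].get("rule_group", "")
    ]

# Sample alert taken from the report above (trimmed).
sample = {
    "labels": {
        "alertname": "PrometheusRuleFailures",
        "namespace": "openshift-monitoring",
        "rule_group": (
            "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/"
            "openshift-storage-prometheus-ceph-rules-b892e77b-d533-4511-"
            "8332-36159b2d5ce4.yaml;quorum-alert.rules"
        ),
        "severity": "warning",
    },
    "state": "pending",
}

print(len(storage_rule_failures([sample])))  # → 1
```

With the expected results above, this filter should return an empty list on a healthy cluster; any hit reproduces the failure described here.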
Not a blocker. Moving to 4.16
The alert started to appear with other configurations as well, including AWS IPI, Azure IPI, GCP IPI, and clusters without encryption, e.g.:
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/632/19369/941049/941073/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/632/19359/940442/940473/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/632/19203/932821/932845/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/632/18899/918955/918986/log
@filip, we have a BZ reported for a similar issue (https://bugzilla.redhat.com/show_bug.cgi?id=2262943), but we are unable to reproduce it. Can you please check whether this issue is happening with the latest 4.15 or 4.16 clusters? If it is still happening, can you please provide me a setup to take a look at?

Thanks,
Arun
PR: https://github.com/red-hat-storage/ocs-operator/pull/2596 should fix any issue raised because of multiple clusters and should prevent any 'PrometheusRuleFailures' errors. The same fix is provided for BZ https://bugzilla.redhat.com/show_bug.cgi?id=2262943, which has similar traits.

@Filip, can you please comment on whether we had multiple clusters at play when this error/failure happened?

Thanks,
Arun
(In reply to arun kumar mohan from comment #6)
> PR: https://github.com/red-hat-storage/ocs-operator/pull/2596
> should fix any issue raised because of multiple clusters and should prevent
> any 'PrometheusRuleFailures' errors.
> The same fix is provided for BZ:
> https://bugzilla.redhat.com/show_bug.cgi?id=2262943, which has similar
> traits.
>
> @Filip, can you please comment whether we had multiple clusters at play when
> this error/failure happened.
>
> Thanks,
> Arun

I am looking at the regression run list for the test_prometheus_rule_failures test case:
https://docs.google.com/spreadsheets/d/1akrwspvWglSs905x2JcydJNH08WO6Ptri-hrkZ2VO80/edit#gid=40270420&range=F703

I don't see a problem with multicluster, but it looks to me that all IPI instances (both AWS and vSphere) have the issue while none of the UPI instances do. Encryption (in-transit, Thales KMS) can also cause this issue, but encryption at rest doesn't seem to trigger it (though this could be a coincidence due to job configurations).
In the above PR #2596, we have updated the queries. Moving this to MODIFIED for now. Once the bug is triaged for verification, we can confirm the fix (or not).
4.16 backport PR: https://github.com/red-hat-storage/ocs-operator/pull/2608
Added RDT details, please take a look...
The fix is only partial. It resolved the issue with the following configuration:

VSPHERE IPI IN-TRANSIT ENCRYPTION 1AZ RHCOS VSAN COMPACT MODE 3M 0W Cluster
(https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/11835/)
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/678/21345/1018347/1018381/log
(although it could be a coincidence; the ODF version there is 4.16.0-99)

but we still see it with the configuration:

AWS IPI KMS THALES 1AZ RHCOS 3M 3W Cluster
(https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/11930/)

Tested with ODF 4.16.0-108.
Even though the issue appeared in two different parts (of the same prometheus-rule file) for BZ#2266316 and BZ#2262943, the underlying cause is the same. We are currently triaging the issue through BZ#2262943; once we have a setup that reproduces the issue for further analysis, we will mark these BZs as duplicates.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591