Bug 2266316
| Summary: | PrometheusRuleFailures alert after installation or upgrade | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Filip Balák <fbalak> |
| Component: | ceph-monitoring | Assignee: | arun kumar mohan <amohan> |
| Status: | CLOSED ERRATA | QA Contact: | Filip Balák <fbalak> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.15 | CC: | amohan, kbg, muagarwa, nthomas, odf-bz-bot |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | ODF 4.16.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-07-17 13:14:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2262943 | | |
| Bug Blocks: | 2260844 | | |

Doc Text:

.PrometheusRuleFailures alert after installation or upgrade

Previously, Ceph quorum-related alerts were not seen; instead, the `PrometheusRuleFailures` alert was fired, which usually fires when queries produce ambiguous results. In a multi-cluster scenario, the queries in the `quorum-alert` rules returned indistinguishable results because Prometheus could not identify which cluster the quorum alerts came from.

With this fix, a unique `managedBy` label is added to each query in the quorum rules so that the query results include the name of the cluster they came from. As a result, the Prometheus failure alert is not fired and the clusters are able to trigger all the Ceph mon quorum-related alerts.
Description
Filip Balák, 2024-02-27 15:11:32 UTC

Not a blocker. Moving to 4.16.

The alert started to appear with other configurations, including AWS IPI, Azure IPI, GCP IPI, and clusters without encryption, e.g.:
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/632/19369/941049/941073/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/632/19359/940442/940473/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/632/19203/932821/932845/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/632/18899/918955/918986/log

@filip, we have a BZ reported, https://bugzilla.redhat.com/show_bug.cgi?id=2262943, for a similar issue, but we are unable to reproduce it. Can you please check whether this issue is happening with the latest 4.15 or 4.16 clusters? If it is still happening, can you please provide me a setup to take a look.

Thanks,
Arun

PR: https://github.com/red-hat-storage/ocs-operator/pull/2596 should fix any issue raised because of multiple clusters and should prevent any 'PrometheusRuleFailures' errors. The same fix is provided for BZ https://bugzilla.redhat.com/show_bug.cgi?id=2262943, which has similar traits.

@Filip, can you please comment on whether we had multiple clusters at play when this error/failure happened.

Thanks,
Arun

(In reply to arun kumar mohan from comment #6)
> PR: https://github.com/red-hat-storage/ocs-operator/pull/2596
> should fix any issue raised because of multiple clusters and should prevent
> any 'PrometheusRuleFailures' errors.
> The same fix is provided for BZ:
> https://bugzilla.redhat.com/show_bug.cgi?id=2262943, which has similar
> traits.
>
> @Filip, can you please comment whether we had multiple clusters at play when
> this error/failure happened.
>
> Thanks,
> Arun

I am looking at the regression run list for the test_prometheus_rule_failures test case: https://docs.google.com/spreadsheets/d/1akrwspvWglSs905x2JcydJNH08WO6Ptri-hrkZ2VO80/edit#gid=40270420&range=F703
I don't see a problem with multicluster, but it looks to me that all IPI instances (both AWS and vSphere) have the issue, while none of the UPI instances do. Encryption (in-transit, Thales KMS) can also cause the issue, but encryption at rest doesn't seem to trigger it (though this could be a coincidence due to job configurations).

In the above PR#2596, we have updated the queries. Moving this to MODIFIED for now. Once the bug is triaged for verification, we can confirm the fix (or not).

4.16 backport PR: https://github.com/red-hat-storage/ocs-operator/pull/2608

Added RDT details, please take a look...

The fix is only partial. It helped to fix the issue with the following configuration:
VSPHERE IPI IN-TRANSIT ENCRYPTION 1AZ RHCOS VSAN COMPACT MODE 3M 0W Cluster (https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/11835/)
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/678/21345/1018347/1018381/log
(although it could be a coincidence; the ODF version there is 4.16.0-99)

But we still see it with this configuration:
AWS IPI KMS THALES 1AZ RHCOS 3M 3W Cluster (https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/11930/)
Tested with ODF 4.16.0-108.

Even though the issue happened in two different parts of the same prometheus-rule file for BZ#2266316 and BZ#2262943, the underlying cause is the same. We are currently triaging the issue through BZ#2262943, and once we have a setup that reproduces the issue for further analysis, we will mark these BZs as duplicates.
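For context, here is a minimal sketch of the technique described in the Doc Text and in PR#2596: aggregating the mon-quorum query by a per-cluster `managedBy` label so that results from different StorageClusters stay distinguishable. This is not the actual ocs-operator change; the group name, alert name, PromQL expression, and cluster name are illustrative assumptions, and the prometheus-operator Go API is used only to show where such a rule would be declared.

```go
package main

import (
	"fmt"

	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// quorumRuleGroup builds one rule group whose query aggregates by the
// per-cluster "managedBy" label, so a Prometheus instance that evaluates
// rules for several StorageClusters keeps one distinguishable series per
// cluster instead of producing the ambiguous results that trip
// PrometheusRuleFailures.
func quorumRuleGroup(clusterName string) monitoringv1.RuleGroup {
	// Illustrative expression only; the rule actually shipped by ocs-operator differs.
	expr := fmt.Sprintf(
		`count(ceph_mon_quorum_status{managedBy=%q} == 1) by (managedBy) <= (floor(count(ceph_mon_metadata{managedBy=%q}) by (managedBy) / 2) + 1)`,
		clusterName, clusterName)

	return monitoringv1.RuleGroup{
		Name: "ceph-mon-quorum.rules", // hypothetical group name
		Rules: []monitoringv1.Rule{{
			Alert: "CephMonQuorumAtRisk", // hypothetical alert name
			Expr:  intstr.FromString(expr),
			Labels: map[string]string{
				"severity":  "critical",
				"managedBy": clusterName, // carries the cluster identity on the alert itself
			},
		}},
	}
}

func main() {
	g := quorumRuleGroup("ocs-storagecluster")
	fmt.Println(g.Rules[0].Alert, "=>", g.Rules[0].Expr.String())
}
```

Grouping both sides of the comparison by `managedBy` keeps the match one-to-one per cluster, which is why the rule evaluation no longer yields duplicate or many-to-many results when several clusters report the same metric names.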
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591
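For anyone re-checking a cluster for this symptom, the sketch below queries Prometheus directly for recent rule-evaluation failures, which is the condition the `PrometheusRuleFailures` alert reports on. The address (a local port-forward) and the lack of token-based auth are assumptions for illustration only; a real check against OpenShift monitoring would need the cluster's route and a bearer token.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumes `oc port-forward` (or similar) exposes Prometheus locally.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Rule-evaluation failures over the last five minutes; a non-empty
	// result is the kind of condition that drives PrometheusRuleFailures.
	result, warnings, err := promv1.NewAPI(client).Query(ctx,
		`increase(prometheus_rule_evaluation_failures_total[5m]) > 0`, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result) // an empty vector means no recent rule-evaluation failures
}
```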