Bug 1970354
| Summary: | Handle empty ceph_version in ceph_mon_metadata to avoid raising misleading alert | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Martin Bukatovic <mbukatov> |
| Component: | rook | Assignee: | gowtham <gshanmug> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Filip Balák <fbalak> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.6 | CC: | madam, muagarwa, nthomas, ocs-bugs, odf-bz-bot, ratamir |
| Target Milestone: | --- | | |
| Target Release: | ODF 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | v4.9.0-193.ci | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-01-07 17:46:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

We have two BZs to track the issue:

https://bugzilla.redhat.com/show_bug.cgi?id=1786696
https://bugzilla.redhat.com/show_bug.cgi?id=1773594

Both of these depend on the ceph BZ 1811027.

Is there a different issue you are trying to address here?

(In reply to Nishanth Thomas from comment #2)
> We have two Bzs to track the issue:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1786696
> https://bugzilla.redhat.com/show_bug.cgi?id=1773594
>
> Both of these depends on the ceph BZ 1811027 .
>
> Is there a different issue you are trying to address here?

This BZ 1970354 is related to both bugs you reference, as it belongs to the same area of OCS alerting. But these bugs were opened a long time ago and are still not fully resolved, as there are a lot of problems involved, including the ceph BZ 1811027.

This BZ 1970354 represents a simple change which could be implemented outright to limit a false positive alert in a particular case. It is also independent of ceph BZ 1811027: both bugs need to be addressed. This bug will fix neither bz 1786696 nor bz 1773594, as the list of issues with OCS alerting in this area is just too long.

For the reasoning and the change, see the description of this bug.

I agree that it's not a blocker, and that it should be fixed in 4.9.

Created attachment 1817552
screenshot #1: example of the problem reproduced with OCS 4.8
Attaching screenshot of the problem, reproduced with:
- OCP 4.8.0-0.nightly-2021-08-23-125038
- LSO 4.8.0-202107291502
- OCS 4.8.1-177.ci

I can't reproduce this alert, but I can see that ceph_version goes blank when I reduce the mon deployment count. After 5-10 minutes a new deployment is created automatically and a new mon comes up, and the ceph version is missing from the old mon's metadata for around 2 minutes.
I can filter out this mon by slightly editing the alerting rule:
count(count by(ceph_version) (ceph_mon_metadata{job="rook-ceph-mgr", ceph_version !=""})) > 1
So this high-level testing gives confidence that this solution works.
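
To illustrate why the extra matcher helps, here is a minimal Python sketch that mimics what `count(count by(ceph_version) (...)) > 1` computes over a set of series. It is only a simulation, not how Prometheus evaluates rules. The label sets, the version string, and the function names are hypothetical, and the unfiltered variant is assumed to be the same expression without the `ceph_version !=""` matcher.

```python
# Hypothetical label sets for three mon series; mon.c has lost its ceph_version.
SERIES = [
    {"ceph_daemon": "mon.a", "ceph_version": "hypothetical-version-A"},
    {"ceph_daemon": "mon.b", "ceph_version": "hypothetical-version-A"},
    {"ceph_daemon": "mon.c", "ceph_version": ""},
]


def distinct_version_count(series, drop_empty):
    """Count the groups that `count by(ceph_version)` would produce.

    A missing label is treated like an empty one, as Prometheus does.
    With drop_empty=True, series with an empty ceph_version are skipped,
    which mirrors the `ceph_version != ""` matcher.
    """
    groups = set()
    for labels in series:
        version = labels.get("ceph_version", "")
        if drop_empty and version == "":
            continue
        groups.add(version)
    return len(groups)


def alert_fires(series, drop_empty):
    """Mirror the `> 1` comparison at the end of the rule expression."""
    return distinct_version_count(series, drop_empty) > 1


print(alert_fires(SERIES, drop_empty=False))  # unfiltered rule
print(alert_fires(SERIES, drop_empty=True))   # rule with the extra matcher
```

With this sample, the unfiltered variant sees two groups (the real version and the empty one) and fires, while the filtered variant sees a single group and stays quiet.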

Will move to MODIFIED once it is merged downstream.

Mitigation plan: moving to VERIFIED (low-severity bug).

Description of problem
======================

When at least one metric in the result of the `ceph_mon_metadata{job="rook-ceph-mgr"}` query is missing a value for ceph_version, the alert "There are 2 different versions of Ceph Mon components running." is raised even though this is not the case.

Reported based on:

- My observation noted in comment: https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c16
- Suggestion from Boris Ranto to update alerting rules to handle such case: https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c17
- Request in https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c23

Version-Release number of selected component
============================================

OCP 4.6.0-0.nightly-2020-12-13-230909
OCS v4.6.0-195.ci

How reproducible
================

Don't know. That said, it should be reproducible whenever the edge case (as explained) happens.

Steps to Reproduce
==================

1. The result of the `ceph_mon_metadata{job="rook-ceph-mgr"}` query contains at least one metric with an empty ceph_version value (I don't know how to reproduce such a scenario).
2. Make sure that all components use the same ceph version.

Actual results
==============

Alert "There are 2 different versions of Ceph Mon components running." is raised, even though this is not the case.

Expected results
================

Based on Boris Ranto's suggestion:

We should just filter out these results in the alerting rule to avoid the version mismatch alert. Alternatively, we could stop showing incomplete metadata metrics in the prometheus module. This could hide some other issues and have some unintended consequences.
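
To check whether the edge case from step 1 is present at a given moment, one could inspect the result of the `ceph_mon_metadata{job="rook-ceph-mgr"}` query for series without a ceph_version. The Python sketch below assumes the JSON shape returned by the Prometheus HTTP API instant query endpoint (`/api/v1/query`). The sample payload, the version string, the use of the `ceph_daemon` label to name the mon, and the function name are all hypothetical and serve only as illustration.

```python
import json

# Hypothetical payload shaped like a Prometheus /api/v1/query response.
# Prometheus treats an empty label value as a missing label, so the
# affected series (mon.c) simply has no "ceph_version" key here.
SAMPLE_RESPONSE = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "ceph_mon_metadata", "ceph_daemon": "mon.a",
                  "ceph_version": "hypothetical-version-A"},
       "value": [1629800000, "1"]},
      {"metric": {"__name__": "ceph_mon_metadata", "ceph_daemon": "mon.b",
                  "ceph_version": "hypothetical-version-A"},
       "value": [1629800000, "1"]},
      {"metric": {"__name__": "ceph_mon_metadata", "ceph_daemon": "mon.c"},
       "value": [1629800000, "1"]}
    ]
  }
}
""")


def mons_missing_version(response):
    """Return the ceph_daemon label of every series lacking a ceph_version."""
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    missing = []
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        # Treat an absent label and an empty string the same way.
        if labels.get("ceph_version", "") == "":
            missing.append(labels.get("ceph_daemon", "<unknown>"))
    return missing


print(mons_missing_version(SAMPLE_RESPONSE))
```

A non-empty list means at least one mon currently reports incomplete metadata, which is the condition under which the misleading alert can be raised.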