Bug 1970354 - Handle empty ceph_version in ceph_mon_metadata to avoid raising misleading alert
Summary: Handle empty ceph_version in ceph_mon_metadata to avoid raising misleading alert
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ODF 4.9.0
Assignee: gowtham
QA Contact: Filip Balák
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-06-10 10:52 UTC by Martin Bukatovic
Modified: 2023-08-09 17:03 UTC

Fixed In Version: v4.9.0-193.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-07 17:46:31 UTC
Embargoed:


Attachments
screenshot #1: example of the problem reproduced with OCS 4.8 (288.30 KB, image/png)
2021-08-25 18:25 UTC, Martin Bukatovic


Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage rook pull 304 0 None open Bug 1970354: ceph: handle empty ceph_version in ceph_mon_metadata 2021-10-18 09:25:16 UTC
Github rook rook pull 8947 0 None open Handle empty ceph_version in ceph_mon_metadata to avoid raising misleading alert 2021-10-10 14:38:24 UTC

Description Martin Bukatovic 2021-06-10 10:52:49 UTC
Description of problem
======================

When at least one metric in the result of the `ceph_mon_metadata{job="rook-ceph-mgr"}`
query has an empty ceph_version value, the alert "There are 2 different versions
of Ceph Mon components running." is raised even though all mons run the same
version.
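
A minimal sketch of why this happens, assuming the alert counts the distinct
ceph_version label values across the mon metadata series (the rule itself is
not quoted in this report). Python stands in for the Prometheus evaluation;
the sample label sets and the function names are hypothetical.

```python
# Hypothetical label sets standing in for ceph_mon_metadata series.
# The daemon names and version strings are made up for illustration.
SERIES = [
    {"ceph_daemon": "mon.a", "ceph_version": "version-A"},
    {"ceph_daemon": "mon.b", "ceph_version": "version-A"},
    {"ceph_daemon": "mon.c", "ceph_version": ""},  # version not reported
]


def distinct_versions(series):
    """Count distinct ceph_version values, one group per value.

    This mirrors what count(count by (ceph_version) (...)) computes.
    The empty string forms a group of its own.
    """
    return len({s.get("ceph_version", "") for s in series})


def alert_fires(series):
    """Assumed alert condition: more than one distinct version."""
    return distinct_versions(series) > 1


if __name__ == "__main__":
    # Two real mons agree on the version, yet the blank value adds a group.
    print(distinct_versions(SERIES), alert_fires(SERIES))
```

With these sample series the blank value is grouped separately, so the
condition holds although every mon that reports a version reports the same one.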

Reported based on:

- My observation noted in comment:
  https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c16
- Suggestion from Boris Ranto to update alerting rules to handle such case:
  https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c17
- Request in https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c23

Version-Release number of selected component
============================================

OCP 4.6.0-0.nightly-2020-12-13-230909
OCS v4.6.0-195.ci

How reproducible
================

Don't know.

That said, should be reproducible when the edge case (as explained) happens.

Steps to Reproduce
==================

1. The result of the `ceph_mon_metadata{job="rook-ceph-mgr"}` query contains at least
   one metric with an empty ceph_version value (I don't know how to reproduce
   such a scenario).
2. Make sure that all components use the same ceph version.

Actual results
==============

Alert "There are 2 different versions of Ceph Mon components running." is
raised, even though this is not the case.

Expected results
================

Based on Boris Ranto's suggestion:

We should just filter out these results in the alerting rule to avoid the
version mismatch alert.

Alternatively, we could stop showing incomplete metadata metrics in the
prometheus module. This could hide some other issues and have some unintended
consequences.

Comment 2 Nishanth Thomas 2021-06-10 15:51:38 UTC
We have two Bzs to track the issue:

https://bugzilla.redhat.com/show_bug.cgi?id=1786696
https://bugzilla.redhat.com/show_bug.cgi?id=1773594

Both of these depend on the ceph BZ 1811027.

Is there a different issue you are trying to address here?

Comment 4 Martin Bukatovic 2021-06-11 09:38:11 UTC
(In reply to Nishanth Thomas from comment #2)
> We have two Bzs to track the issue:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1786696
> https://bugzilla.redhat.com/show_bug.cgi?id=1773594
> 
> Both of these depend on the ceph BZ 1811027.
> 
> Is there a different issue you are trying to address here?

This bz 1970354 is related to both bugs you reference, as it belongs
to the same area of ocs alerting. But these bugs were opened a long time
ago and are still not fully resolved, as there are a lot of problems
involved, including the ceph BZ 1811027.

This BZ 1970354 represents a simple change which could be implemented
outright to limit a false positive alert in a particular case. It's also
not related to ceph BZ 1811027 - both bugs need to be addressed.
This bug will fix neither bz 1786696 nor bz 1773594, as the list
of issues with ocs alerting in this area is just too long.

For the reasoning and the change, see the description of this bug.

I agree that it's not a blocker, and that it should be fixed in 4.9.

Comment 5 Martin Bukatovic 2021-08-25 18:25:46 UTC
Created attachment 1817552 [details]
screenshot #1: example of the problem reproduced with OCS 4.8

Attaching screenshot of the problem, reproduced with:

- OCP 4.8.0-0.nightly-2021-08-23-125038
- LSO 4.8.0-202107291502
- OCS 4.8.1-177.ci

Comment 7 gowtham 2021-09-30 12:50:07 UTC
I can't reproduce this alert, but I can see ceph_version go blank when I reduce the mon deployment count. After 5 - 10 minutes a new deployment is created automatically and a new mon comes up, and the ceph version is missing from the old mon's metadata for around 2 minutes.

I can filter out this mon by slightly editing the alerting rule:
  count(count by(ceph_version) (ceph_mon_metadata{job="rook-ceph-mgr", ceph_version!=""})) > 1

So this high-level testing gives confidence that this solution works.
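
The effect of the `ceph_version!=""` matcher can be sketched in Python as
follows. This is an illustration only, not the rule as merged: the sample
label sets and function names are hypothetical. In PromQL a missing label is
treated like an empty one, so the matcher also drops series that lack
ceph_version entirely.

```python
def has_version(labels):
    """Mimic the ceph_version!="" matcher; a missing label counts as empty."""
    return labels.get("ceph_version", "") != ""


def mismatch_alert(series):
    """Mimic count(count by (ceph_version) (<filtered series>)) > 1."""
    versions = {s["ceph_version"] for s in series if has_version(s)}
    return len(versions) > 1


# Hypothetical label sets; daemon names and version strings are made up.
same_version_one_blank = [
    {"ceph_daemon": "mon.a", "ceph_version": "version-A"},
    {"ceph_daemon": "mon.b", "ceph_version": "version-A"},
    {"ceph_daemon": "mon.c", "ceph_version": ""},  # old mon, version blank
]
real_mismatch = [
    {"ceph_daemon": "mon.a", "ceph_version": "version-A"},
    {"ceph_daemon": "mon.b", "ceph_version": "version-B"},
    {"ceph_daemon": "mon.c"},  # label missing altogether
]

if __name__ == "__main__":
    print(mismatch_alert(same_version_one_blank))  # blank value is ignored
    print(mismatch_alert(real_mismatch))           # genuine mismatch remains
```

The blank entry no longer forms its own group, while a genuine mismatch
between two reported versions still satisfies the condition.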

Comment 17 Mudit Agarwal 2021-10-13 08:42:56 UTC
Will move to MODIFIED once it is merged in downstream

Comment 19 Raz Tamir 2021-11-23 09:47:34 UTC
Mitigation plan: moving to VERIFIED, as this is a low-severity bug.

