Bug 1970354

Summary: Handle empty ceph_version in ceph_mon_metadata to avoid raising misleading alert
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Martin Bukatovic <mbukatov>
Component: rook Assignee: gowtham <gshanmug>
Status: CLOSED CURRENTRELEASE QA Contact: Filip Balák <fbalak>
Severity: low Docs Contact:
Priority: unspecified    
Version: 4.6 CC: madam, muagarwa, nthomas, ocs-bugs, odf-bz-bot, ratamir
Target Milestone: ---   
Target Release: ODF 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: v4.9.0-193.ci Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-01-07 17:46:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
screenshot #1: example of the problem reproduced with OCS 4.8 none

Description Martin Bukatovic 2021-06-10 10:52:49 UTC
Description of problem
======================

When at least one metric in the result of the
`ceph_mon_metadata{job="rook-ceph-mgr"}` query is missing a value for
ceph_version, the alert "There are 2 different versions of Ceph Mon components
running." is raised even though this is not the case.

Reported based on:

- My observation noted in comment:
  https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c16
- Suggestion from Boris Ranto to update alerting rules to handle such case:
  https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c17
- Request in https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c23

Version-Release number of selected component
============================================

OCP 4.6.0-0.nightly-2020-12-13-230909
OCS v4.6.0-195.ci

How reproducible
================

Don't know.

That said, it should be reproducible whenever the edge case described above
occurs.

Steps to Reproduce
==================

1. The result of the `ceph_mon_metadata{job="rook-ceph-mgr"}` query contains at
   least one metric with an empty ceph_version value (I don't know how to
   reproduce such a scenario)
2. Make sure that all components use the same ceph version

Actual results
==============

Alert "There are 2 different versions of Ceph Mon components running." is
raised, even though this is not the case.

Expected results
================

Based on Boris Ranto's suggestion:

We should just filter out these results in the alerting rule to avoid the
version mismatch alert.

Alternatively, we could stop showing incomplete metadata metrics in the
prometheus module. This could hide some other issues and have some unintended
consequences.

Comment 2 Nishanth Thomas 2021-06-10 15:51:38 UTC
We have two Bzs to track the issue:

https://bugzilla.redhat.com/show_bug.cgi?id=1786696
https://bugzilla.redhat.com/show_bug.cgi?id=1773594

Both of these depend on the ceph BZ 1811027.

Is there a different issue you are trying to address here?

Comment 4 Martin Bukatovic 2021-06-11 09:38:11 UTC
(In reply to Nishanth Thomas from comment #2)
> We have two Bzs to track the issue:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1786696
> https://bugzilla.redhat.com/show_bug.cgi?id=1773594
> 
> Both of these depends on the ceph BZ 1811027 . 
> 
> Is there a different issue you are trying to address here?

This bz 1970354 is related to both bugs you reference, as it belongs
to the same area of ocs alerting. But these bugs were opened a long time
ago and are still not fully resolved, as there are a lot of problems
involved, including the ceph BZ 1811027.

This BZ 1970354 represents a simple change which could be implemented
outright to limit a false positive alert in a particular case. It's also
not related to ceph BZ 1811027: both bugs need to be addressed.
This bug won't fix either bz 1786696 or bz 1773594, as the list
of issues with ocs alerting in this area is just too long.

For the reasoning and the change, see the description of this bug.

I agree that it's not a blocker, and that it should be fixed in 4.9.

Comment 5 Martin Bukatovic 2021-08-25 18:25:46 UTC
Created attachment 1817552 [details]
screenshot #1: example of the problem reproduced with OCS 4.8

Attaching screenshot of the problem, reproduced with:

- OCP 4.8.0-0.nightly-2021-08-23-125038
- LSO 4.8.0-202107291502
- OCS 4.8.1-177.ci

Comment 7 gowtham 2021-09-30 12:50:07 UTC
I can't reproduce this alert, but I can see ceph_version go blank when I degrade the mon deployment count. After 5 - 10 mins a new deployment is created automatically and a new mon comes up, and the ceph version is missing from the metadata of the old mon for around 2 mins.

I can filter this mon by editing an alerting rule a little bit: 
  count(count by(ceph_version) (ceph_mon_metadata{job="rook-ceph-mgr", ceph_version  !=""})) > 1

So this high-level testing gives confidence that this solution works.
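
To illustrate why the extra `ceph_version != ""` matcher helps, here is a minimal Python sketch that mimics the `count(count by(ceph_version) (...)) > 1` logic on a hand-written list of label sets. It is not the actual rule evaluation: the sample series, the mon names and the placeholder version string are made up for illustration, and the unfiltered variant is only assumed to be the shape of the rule before the change.

```python
from typing import Dict, List

# Hypothetical label sets for ceph_mon_metadata series. The daemon names and
# the version string are placeholders, not values from a real cluster.
SERIES: List[Dict[str, str]] = [
    {"job": "rook-ceph-mgr", "ceph_daemon": "mon.a", "ceph_version": "ceph version X"},
    {"job": "rook-ceph-mgr", "ceph_daemon": "mon.b", "ceph_version": "ceph version X"},
    # Old mon whose metadata temporarily lacks a version (the edge case).
    {"job": "rook-ceph-mgr", "ceph_daemon": "mon.c", "ceph_version": ""},
]


def distinct_versions(series: List[Dict[str, str]], skip_empty: bool) -> int:
    """Mimic count(count by(ceph_version) (ceph_mon_metadata{...}))."""
    versions = set()
    for labels in series:
        if labels.get("job") != "rook-ceph-mgr":
            continue
        # In PromQL a missing label behaves like an empty one.
        version = labels.get("ceph_version", "")
        if skip_empty and version == "":
            # Corresponds to the ceph_version != "" matcher.
            continue
        versions.add(version)
    return len(versions)


def alert_fires(series: List[Dict[str, str]], skip_empty: bool) -> bool:
    """The alert condition: more than one distinct version group."""
    return distinct_versions(series, skip_empty) > 1


if __name__ == "__main__":
    print("without filter:", alert_fires(SERIES, skip_empty=False))
    print("with filter:   ", alert_fires(SERIES, skip_empty=True))
```

Without the filter, the empty value forms its own group next to the real version, so the count is 2 and the condition holds. With the filter, only the real version remains and the condition does not hold.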

Comment 17 Mudit Agarwal 2021-10-13 08:42:56 UTC
Will move to MODIFIED once it is merged in downstream

Comment 19 Raz Tamir 2021-11-23 09:47:34 UTC
Mitigation plan: moving to VERIFIED low sev bug