Bug 2101497 - ceph_mon_metadata metrics are not collected properly
Summary: ceph_mon_metadata metrics are not collected properly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ODF 4.13.0
Assignee: Prashant Dhange
QA Contact: Filip Balák
URL:
Whiteboard:
Depends On: 2008524
Blocks: 1786696
 
Reported: 2022-06-27 15:45 UTC by arun kumar mohan
Modified: 2023-08-09 16:37 UTC
CC List: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-21 15:22:14 UTC
Embargoed:




Links:
Red Hat Product Errata RHBA-2023:3742 (last updated 2023-06-21 15:23:07 UTC)

Description arun kumar mohan 2022-06-27 15:45:44 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
ceph_mon_metadata metrics are not collected correctly. This was noticed when the CephMonVersionMismatch alert did not fire after the image of one of the mons was changed.

Here we can see that `ceph mon versions` shows that one of the mons is not running the same version as the other two.
```
sh-4.4$ ceph mon versions
{
    "ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)": 2,
    "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 1
}
```
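
For reference, the output above appears to be taken from a shell inside the Ceph toolbox pod. A minimal sketch of running the same check non-interactively, assuming the rook-ceph toolbox deployment is enabled and named `rook-ceph-tools`:
```
# Assumes the rook-ceph toolbox deployment is enabled in openshift-storage.
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph mon versions
```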

But for the `ceph_mon_metadata` query below,

```
count by (ceph_daemon, namespace, ceph_version) (ceph_mon_metadata{job="rook-ceph-mgr", ceph_version != ""})
```

> mon.a	ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)	openshift-storage	1
> mon.b	ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)	openshift-storage	1
> mon.c	ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)	openshift-storage	1

the `ceph_mon_metadata` query reports that all three mons are running the same Ceph version, contradicting the `ceph mon versions` output above.
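
The CephMonVersionMismatch alert is driven by this metric, so stale `ceph_version` labels suppress it. As an illustration (not necessarily the exact expression shipped with ODF), a mismatch check of this kind only fires when the metric reports more than one distinct version:

```
# Sketch only: fires when ceph_mon_metadata reports more than one distinct mon version.
count(count by (ceph_version) (ceph_mon_metadata{job="rook-ceph-mgr", ceph_version!=""})) > 1
```

With the stale metric above, the inner count yields a single version, so the `> 1` condition is never met and the alert stays silent, even though `ceph mon versions` reports two versions.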

The opposite misreporting occurs when the image of an OSD is changed: `ceph_mon_metadata` then shows multiple mon versions, even though the mon images were not touched and `ceph versions` clearly shows that all mons are on the same version. This is described in BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1786696

Version of all relevant components (if applicable):
OCP :     4.11.0-0.nightly-2022-06-15-222801
ODF :     4.10.4-2
Ceph:     16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
Not sure

Steps to Reproduce:
1. Created an AWS OpenShift cluster, version 4.11 stable
2. Installed the ODF operator through OperatorHub (the default version available in the hub)
3. Created a StorageCluster with the default configuration
4. From the command line, changed one of the mon images to an older one (a verification sketch follows these steps)
```
oc set -n openshift-storage image deployment/rook-ceph-mon-a mon=quay.io/rhceph-dev/rhceph@sha256:e909b345d88459d49b691b7d484f604653fcba53b37bbc00e86fb09b26ed5205
```
5. Once that completed, checked the `ceph_mon_metadata` query through the OCP console UI (Observe -> Metrics)
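
As a sanity check before querying the metric (a sketch using the deployment and container names from the `oc set image` command in step 4), the currently deployed mon image can be read back:
```
# Print the image currently set on the mon-a deployment's "mon" container.
oc -n openshift-storage get deployment rook-ceph-mon-a \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="mon")].image}{"\n"}'
```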

Actual results:
The `ceph_mon_metadata` query gives a false result, reporting that all mons are running the same version.

Expected results:
`ceph_mon_metadata` should report the exact version of each mon, including the mon whose image was changed.

Additional info:

Comment 3 Mudit Agarwal 2022-07-05 13:20:43 UTC
Not a 4.11 blocker

Comment 5 Mudit Agarwal 2022-10-26 03:33:45 UTC
Providing dev ack with the conditional flag, as the Ceph BZ is targeted for 5.3z1

Comment 7 Mudit Agarwal 2022-11-03 02:31:32 UTC
Moving it out of 4.12; if BZ #2008524 is fixed in 5.3z1, it can be brought back to 4.12

Comment 21 Mudit Agarwal 2023-03-23 05:51:25 UTC
4.12 consumes 5.3z1 so I can move it back to 4.12

This also means that the fix is present in 6.1, right? Then it can also be targeted for 4.13 which means we can keep this BZ for 4.13 and create a clone for 4.12
I am providing a devel_ack for 4.13 and will create a clone for 4.12

Comment 22 Mudit Agarwal 2023-03-23 05:57:03 UTC
Sunil, can we create a 4.12 clone for this?

Comment 24 Prashant Dhange 2023-03-23 21:01:14 UTC
Thanks Mudit.

(In reply to Mudit Agarwal from comment #21)
> 4.12 consumes 5.3z1 so I can move it back to 4.12
> 
> This also means that the fix is present in 6.1, right? Then it can also be
> targeted for 4.13 which means we can keep this BZ for 4.13 and create a
> clone for 4.12
Yes, that's right. The fix is in RHCS 6.1. Moving this BZ to 4.13. 

> I am providing a devel_ack for 4.13 and will create a clone for 4.12

Comment 26 Mudit Agarwal 2023-04-03 10:51:17 UTC
Fixed in version: 4.13.0-121

Comment 30 Filip Balák 2023-05-03 09:01:24 UTC
Alert and metrics are updated correctly. --> VERIFIED

Tested with ODF 4.13.0-179

Comment 31 errata-xmlrpc 2023-06-21 15:22:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742

