Bug 2181119

Summary: ceph_mon_metadata metrics are not collected properly
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Sunil Kumar Acharya <sheggodu>
Component: ceph Assignee: Neha Ojha <nojha>
ceph sub component: RADOS QA Contact: Elad <ebenahar>
Status: CLOSED CURRENTRELEASE Docs Contact:
Severity: unspecified    
Priority: unspecified CC: bniver, kramdoss, muagarwa, ocs-bugs, odf-bz-bot, pdhange, sostapov
Version: 4.11   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-04-13 11:11:30 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Sunil Kumar Acharya 2023-03-23 06:51:10 UTC
This bug was initially created as a copy of Bug #2101497

I am copying this bug because: 



Description of problem (please be as detailed as possible and provide log
snippets):
ceph_mon_metadata metrics are not collected correctly. This was noticed when the CephMonVersionMismatch alert did not fire after the image of one of the mons was changed.

Here the output of 'ceph mon versions' shows that one mon's version differs from that of the other two.
```
sh-4.4$ ceph mon versions
{
    "ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)": 2,
    "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 1
}
```
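
For anyone scripting this check, a minimal sketch of detecting the mismatch from the `ceph mon versions` output above. The function names are made up for illustration; the JSON is copied from this report.

```python
import json

# Output of `ceph mon versions`, copied verbatim from this report.
MON_VERSIONS_JSON = """
{
    "ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)": 2,
    "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 1
}
"""


def mon_version_counts(raw):
    """Parse the JSON text and return {version string: number of mons}."""
    return {version: int(count) for version, count in json.loads(raw).items()}


def has_version_mismatch(counts):
    """A mismatch exists when the mons report more than one distinct version."""
    return len(counts) > 1


counts = mon_version_counts(MON_VERSIONS_JSON)
mismatch = has_version_mismatch(counts)
print("mons:", sum(counts.values()), "mismatch:", mismatch)
```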

But for the `ceph_mon_metadata` query below,

```
count by (ceph_daemon, namespace, ceph_version) (ceph_mon_metadata{job="rook-ceph-mgr", ceph_version != ""})
```

> mon.a	ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)	openshift-storage	1
> mon.b	ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)	openshift-storage	1
> mon.c	ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)	openshift-storage	1

we see that all the mons are reported with the same Ceph version (according to the 'ceph_mon_metadata' query).
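
The discrepancy can be made explicit by comparing the two sources. This is a rough sketch using only the values quoted in this report; the helper names are hypothetical. It assumes the alert fires when the metric exposes more than one distinct mon version, which is my reading of CephMonVersionMismatch and not something confirmed here.

```python
from collections import Counter

V112 = "ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)"
V76 = "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)"

# Per-daemon versions as returned by the ceph_mon_metadata query in this report.
metric_rows = {"mon.a": V112, "mon.b": V112, "mon.c": V112}

# Per-version mon counts as printed by `ceph mon versions` in this report.
cli_counts = {V112: 2, V76: 1}


def metric_counts(rows):
    """Collapse {daemon: version} into {version: number of daemons}."""
    return dict(Counter(rows.values()))


def discrepancies(metric, cli):
    """Return {version: (metric_count, cli_count)} where the two sources disagree."""
    out = {}
    for version in sorted(set(metric) | set(cli)):
        m, c = metric.get(version, 0), cli.get(version, 0)
        if m != c:
            out[version] = (m, c)
    return out


diff = discrepancies(metric_counts(metric_rows), cli_counts)

# Assumption: the alert needs more than one distinct version in the metric.
alert_would_fire = len(metric_counts(metric_rows)) > 1

for version, (m, c) in diff.items():
    print(f"{version}: metric={m} cli={c}")
print("alert would fire:", alert_would_fire)
```

With the data above, the metric over-counts the 16.2.7-112 mons and misses the 16.2.7-76 mon entirely, so an alert based on distinct versions has nothing to trigger on.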

Another misreport occurs when we change the image of an OSD: 'ceph_mon_metadata' then shows multiple mon versions, even though the mon images were not touched and 'ceph versions' clearly shows that all the mon versions are the same. This is described in BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1786696

Version of all relevant components (if applicable):
OCP :     4.11.0-0.nightly-2022-06-15-222801
ODF :     4.10.4-2
Ceph:     16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
yes

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
Not sure

Steps to Reproduce:
1. Created an AWS OpenShift cluster, version 4.11 stable
2. Installed the ODF operator through OperatorHub (the default version offered there)
3. Created a storagecluster with default configs
4. Through the command line, changed the image of one of the mons to an older one
```
oc set -n openshift-storage image deployment/rook-ceph-mon-a mon=quay.io/rhceph-dev/rhceph@sha256:e909b345d88459d49b691b7d484f604653fcba53b37bbc00e86fb09b26ed5205
```
5. Once that was complete, went to OCP Console UI -> Observe -> Metrics and ran the `ceph_mon_metadata` query
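
Step 5 can also be checked outside the console. A minimal sketch, assuming a Prometheus-compatible endpoint is reachable: the base URL `https://prometheus.example` is hypothetical, authentication is omitted, and the response below is a hand-written sample in the shape of a Prometheus HTTP API instant-query result, filled in with the values from this report. The function names are made up for illustration.

```python
from urllib.parse import urlencode

# The query used in this report.
QUERY = (
    "count by (ceph_daemon, namespace, ceph_version) "
    '(ceph_mon_metadata{job="rook-ceph-mgr", ceph_version != ""})'
)


def instant_query_url(base_url, query=QUERY):
    """Build the URL for a Prometheus instant query (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def versions_by_daemon(response):
    """Extract {ceph_daemon: ceph_version} from an instant-query response."""
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    return {
        sample["metric"]["ceph_daemon"]: sample["metric"]["ceph_version"]
        for sample in response["data"]["result"]
    }


V112 = "ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)"

# Hand-written sample response mirroring the rows shown in this report.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "ceph_daemon": daemon,
                    "ceph_version": V112,
                    "namespace": "openshift-storage",
                },
                "value": [0, "1"],
            }
            for daemon in ("mon.a", "mon.b", "mon.c")
        ],
    },
}

url = instant_query_url("https://prometheus.example/")  # hypothetical host
versions = versions_by_daemon(sample_response)
print(url)
print(versions)
```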

Actual results:
The ceph_mon_metadata query returns an incorrect result, reporting that all the mons are at the same version

Expected results:
ceph_mon_metadata should report accurate version information, including that of the changed mon

Additional info:

Comment 1 Prashant Dhange 2023-03-23 21:02:40 UTC
BZ #2008524 has been fixed in the RHCS 5.3z1 (ceph-16.2.10-138) release (see erratum https://access.redhat.com/errata/RHSA-2023:0980 for more details). Moving this BZ to the 4.12 release.

Comment 3 krishnaram Karthick 2023-04-13 11:11:30 UTC
The QE effort here is regression testing only, and RHCS 5.3z1 was already shipped in 4.12.1,
so closing the bug as CLOSED CURRENTRELEASE.