Description of problem (please be as detailed as possible and provide log snippets):

ceph_mon_metadata metrics are not collected correctly. This was noticed when the CephMonVersionMismatch alert did not fire after one of the mon images was changed.

Here 'ceph versions' shows that one of the mons is on a different version than the other two:

```
sh-4.4$ ceph mon versions
{
    "ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)": 2,
    "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 1
}
```

But the `ceph_mon_metadata` query below,

```
count by (ceph_daemon, namespace, ceph_version) (ceph_mon_metadata{job="rook-ceph-mgr", ceph_version != ""})
```

returns:

> mon.a ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable) openshift-storage 1
> mon.b ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable) openshift-storage 1
> mon.c ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable) openshift-storage 1

i.e. all the mons are reported as being on the same ceph version.

Another misreporting was noticed when changing the image of an OSD: 'ceph_mon_metadata' then shows multiple mon versions, even though the mon images were not touched and 'ceph versions' clearly shows all mons on the same version. This is described in BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1786696

Version of all relevant components (if applicable):
OCP: 4.11.0-0.nightly-2022-06-15-222801
ODF: 4.10.4-2
Ceph: 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
Not sure

Steps to Reproduce:
1. Created an AWS OpenShift cluster, version 4.11 stable
2. Installed the ODF operator through the OperatorHub (default version available in the hub)
3. Created a storagecluster with default configs
4. Through the command line, changed one of the mon images to an older one:

```
oc set -n openshift-storage image deployment/rook-ceph-mon-a mon=quay.io/rhceph-dev/rhceph@sha256:e909b345d88459d49b691b7d484f604653fcba53b37bbc00e86fb09b26ed5205
```

5. Once that was complete, checked the `ceph_mon_metadata` query through the OCP console UI -> Observe -> Metrics

Actual results:
The ceph_mon_metadata query gives a false result, stating that all the mons are on the same version.

Expected results:
ceph_mon_metadata should provide exact version information, including that of the changed mon.

Additional info:
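For comparison, a PromQL sketch of a mismatch check (not necessarily the exact expression used by the shipped CephMonVersionMismatch alert; the `job` and label names are assumed to match the `ceph_mon_metadata` series shown above) that counts distinct mon versions:

```
count(count by (ceph_version) (ceph_mon_metadata{job="rook-ceph-mgr", ceph_version != ""})) > 1
```

Because the metric reports all three mons on the same version, such an expression evaluates to false even while 'ceph versions' shows a mismatch, which is consistent with the alert not firing.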
Not a 4.11 blocker
Providing dev ack with the conditional flag as the ceph bz is targeted for 5.3z1
Moving it out of 4.12, if BZ #2008524 is fixed in 5.3z1 then it can be brought back to 4.12
4.12 consumes 5.3z1, so I can move it back to 4.12.

This also means that the fix is present in 6.1, right? Then it can also be targeted for 4.13, which means we can keep this BZ for 4.13 and create a clone for 4.12.

I am providing a devel_ack for 4.13 and will create a clone for 4.12.
Sunil, can we create a 4.12 clone for this?
Thanks Mudit.

(In reply to Mudit Agarwal from comment #21)
> 4.12 consumes 5.3z1 so I can move it back to 4.12
>
> This also means that the fix is present in 6.1, right? Then it can also be
> targeted for 4.13 which means we can keep this BZ for 4.13 and create a
> clone for 4.12

Yes, that's right. The fix is in RHCS 6.1. Moving this BZ to 4.13.

> I am providing a devel_ack for 4.13 and will create a clone for 4.12
Fixed in version: 4.13.0-121
Alert and metrics are updated correctly. --> VERIFIED

Tested with ODF 4.13.0-179
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742