Bug 1786696
| Summary: | UI->Dashboards->Overview->Alerts shows MON components are at different versions, though they are NOT | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Neha Berry <nberry> |
| Component: | ceph-monitoring | Assignee: | arun kumar mohan <amohan> |
| Status: | CLOSED ERRATA | QA Contact: | Shrivaibavi Raghaventhiran <sraghave> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 4.2 | CC: | aberner, amohan, fbalak, hnallurv, jefbrown, jolmomar, mbukatov, muagarwa, nthomas, ocs-bugs, odf-bz-bot, swilson, tdesala |
| Target Milestone: | --- | Keywords: | UpcomingSprint |
| Target Release: | ODF 4.13.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | monitoring | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-06-21 15:22:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1811027, 2101497 | | |
| Bug Blocks: | | | |
| Attachments: | mon_metadata metrics | | |
Description Neha Berry 2019-12-27 10:18:25 UTC
What's the next step here?

After trying to reproduce this issue, I observed that it was happening due to metrics not being correctly updated by ceph-mgr. I have created https://bugzilla.redhat.com/show_bug.cgi?id=1811027 to track the issue, and have set it as a blocker for this bug.

Cannot be solved yet, as the real issue, https://bugzilla.redhat.com/show_bug.cgi?id=1811027, is still reproducible. Should be moved to OCS 4.6.

Created attachment 1711860 [details]: mon_metadata metrics

(In reply to Anmol Sachan from comment #5)
> Cannot be solved yet as the real issue
> https://bugzilla.redhat.com/show_bug.cgi?id=1811027 is still replicable.
> Should be moved for OCS 4.6

+1 Anmol

In an OCS 4.5.0-rc1 (4.5.0-49.ci) setup, I performed an OCP upgrade which was stuck for 1-2 days until I cleaned up the Terminating noobaa pods (due to a known bug, 1867762). After bringing the cluster back into good shape, it was seen that even though all MONs are on the same version, the UI shows the following alert:

"Aug 18, 6:28 pm  There are 2 different versions of Ceph Mon components running. View details"

On further troubleshooting with the help of Anmol, it was observed that the ceph version and endpoint fields were blank for "mon.e". It seems the MGR is not providing the correct mon information to Prometheus.

Logs - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bug-1786696-aug19-mon-version-alert/

Note:
1. No OCS upgrade was performed and the MON version has been 14.2.8-91.el8cp from the start.
2. Since nodes were affected during the OCP upgrade and were in NotReady state, there were multiple restarts of the MON pods, and ultimately the cluster ended up with mons b, d, and e.

From the ceph side:

```
sh-4.4# ceph -s
  cluster:
    id:     6da0f693-0893-4e2c-a004-06b5220b0632
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum b,d,e (age 20h)
    mgr: a(active, since 20h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 20h), 3 in (since 5d)
    rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   10 pools, 176 pgs
    objects: 16.00k objects, 60 GiB
    usage:   182 GiB used, 1.3 TiB / 1.5 TiB avail
    pgs:     176 active+clean

  io:
    client:   852 B/s rd, 265 KiB/s wr, 1 op/s rd, 1 op/s wr

sh-4.4# ceph mon versions
{
    "ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 3
}
sh-4.4#
```

The bug depends on https://bugzilla.redhat.com/show_bug.cgi?id=1811027. Moving to 4.7.0.

*** Bug 1893722 has been marked as a duplicate of this bug. ***

The OCP alert "There are 2 different versions of Ceph Mon components running." is not reliable and can't be interpreted alone. One should check the version values from the `ceph_mon_metadata` metric to see what triggered the alert, and then compare that with what Ceph reports via the `ceph versions` command from the OCS toolbox pod (a short cross-check sketch is included at the end of this report).

*** Bug 1953111 has been marked as a duplicate of this bug. ***

The fix for bug #1811027 is present in RHCS 5.1, https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c54. Moving the BZ to ON_QA; please verify with the latest 4.10 build.

Verification suggestion: start a cluster of 3 mons + 3 OSDs on an older 4.9 release. Bring down one mon (you can bring its replica count to zero) and then upgrade the cluster.
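A minimal sketch of that first step, assuming the default openshift-storage namespace, a hypothetical mon deployment named rook-ceph-mon-a, and the usual app=rook-ceph-mon label; this is an illustration of the idea, not the exact procedure used:

```
# Hypothetical mon deployment name; list the real ones with:
#   oc get deployments -n openshift-storage -l app=rook-ceph-mon
MON_DEPLOY=rook-ceph-mon-a

# Bring the mon down by scaling its deployment to zero replicas.
# Note: the rook-ceph operator may scale it back up unless its
# reconciliation is paused, so re-check the replica count afterwards.
oc scale deployment "$MON_DEPLOY" --replicas=0 -n openshift-storage

# ... perform the cluster upgrade while this mon stays down ...

# Bring the mon back up once the check is done.
oc scale deployment "$MON_DEPLOY" --replicas=1 -n openshift-storage
```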
After the upgrade is done, leave that mon down; since we are looking at the mon metadata in the query, we should see two different versions (PS: it takes about 10 minutes for the alert to fire). Check whether the alert fires. Then bring the mon back up and see whether it gets upgraded and whether the alert stays even after the mon is updated. This is a very thin possibility, but let's try it.

Not a 4.10 blocker, moving it out.

Arun, PTAL @ https://bugzilla.redhat.com/show_bug.cgi?id=1786696#c29

(In reply to Prasad Desala from comment #29)
> This one looks similar to
> https://bugzilla.redhat.com/show_bug.cgi?id=1773594#c29
> Can you please check and let us know if we need any additional fix from ODF
> side to verify this BZ?

Prasad, these BZs (this BZ#1786696 and BZ#1773594) look similar but have slight differences. As stated in https://bugzilla.redhat.com/show_bug.cgi?id=1786696#c16 by Anmol, this BZ is about the false positive case (the alert is raised when it is not needed), while BZ#1773594 is about the negative case (the alert is not triggered even when there IS an issue). At that point a Ceph BZ (https://bugzilla.redhat.com/show_bug.cgi?id=1811027) was considered the root cause. But since that is fixed and both of these BZs still reproduce, we need to take a look at the ODF queries. Still not cornered on the root cause; will try to address this in 4.11.

The mon queries rely on the 'ceph_mon_metadata' metric, and this metric is not populated correctly. Have filed BZ https://bugzilla.redhat.com/show_bug.cgi?id=2101497 for this. Will pick this up once the dependent BZ is addressed.

Fixed via OCS-Op query changes (fix for BZ https://bugzilla.redhat.com/show_bug.cgi?id=1773594), and the root cause is fixed in Ceph RADOS (fix for BZ https://bugzilla.redhat.com/show_bug.cgi?id=2101497), so this can be considered completed from the devel perspective. Moving it to QA for verification.

Tested on version:
-------------------
OCP - 4.13.0-0.nightly-2023-05-16-154455
ODF - 4.13.0-201

Initial Image:
----------------
quay.io/rhceph-dev/rhceph@sha256:8c93b131317f8de70b20ba87ce45fe7b3203a0e7fd9b9790dd5f6c64d4dfd1e3

Test Steps:
-----------
1. Set a different image on one mon and observed the mon version mismatch alert in the UI.
2. Set a different image on one osd and observed both mon and osd version mismatch alerts in the UI.
3. Reset the old image on the mon and osd one by one and noticed the alerts disappearing.

```
[sraghave@localhost ~]$ oc set image deployment/rook-ceph-osd-2 osd=quay.io/rhceph-dev/rhceph@sha256:fa6d01cdef17bc32d2b95b8121b02f4d41adccc5ba8a9b95f38c97797ff6621f -n openshift-storage
deployment.apps/rook-ceph-osd-2 image updated

[sraghave@localhost ~]$ oc set image deployment/rook-ceph-mon-d mon=quay.io/rhceph-dev/rhceph@sha256:fa6d01cdef17bc32d2b95b8121b02f4d41adccc5ba8a9b95f38c97797ff6621f -n openshift-storage
deployment.apps/rook-ceph-mon-d image updated
```

With all the above observations, moving the BZ to the verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742
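For reference, a minimal sketch of the cross-check described in the comments above (alert vs. `ceph_mon_metadata` vs. `ceph versions`). This assumes the default openshift-storage namespace and a toolbox deployment named rook-ceph-tools; the PromQL shown is an illustrative approximation, not necessarily the exact expression used by the ODF alert rule:

```
# 1) What Ceph itself reports, from the toolbox pod
#    (deploy/rook-ceph-tools is the usual toolbox name; adjust if different):
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph versions

# 2) What the monitoring stack sees. In the OpenShift console under
#    Observe -> Metrics (or via the Prometheus API), run a query along the
#    lines of:
#
#      count by (ceph_version) (ceph_mon_metadata)
#
#    If this returns more than one series while `ceph versions` shows a single
#    mon version, the mismatch alert is a false positive of the kind described
#    in this bug.
```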