Bug 1786696 - UI->Dashboards->Overview->Alerts shows MON components are at different versions, though they are NOT
Summary: UI->Dashboards->Overview->Alerts shows MON components are at different versions, though they are NOT
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.2
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: ODF 4.13.0
Assignee: arun kumar mohan
QA Contact: Shrivaibavi Raghaventhiran
URL:
Whiteboard: monitoring
Duplicates: 1893722 1953111 (view as bug list)
Depends On: 1811027 2101497
Blocks:
 
Reported: 2019-12-27 10:18 UTC by Neha Berry
Modified: 2023-08-09 16:37 UTC
CC List: 13 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-21 15:22:14 UTC
Embargoed:


Attachments
mon_metadata metrics (96.93 KB, image/png), attached 2020-08-19 11:14 UTC by Neha Berry


Links
Red Hat Bugzilla 1773594 (medium, CLOSED): [GSS] CephOSDVersionMismatch and CephMonVersionMismatch are not triggered and cleared reliably (last updated 2023-08-09 16:37:41 UTC)
Red Hat Product Errata RHBA-2023:3742 (last updated 2023-06-21 15:22:53 UTC)

Internal Links: 1811027

Description Neha Berry 2019-12-27 10:18:25 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
----------------------------------------------------------------------------
Dashboards->Overview page reflects following warning even though all 3 MONs are on same versions.
>> There are 2 different versions of Ceph Mon components running

The setup already has some issues (reported below), but its MONs are running and all are at the same version - 14.2.4-69.el8cp. Agreed that the OSDs are at different versions. These are the 3 alerts in the UI:

Storage cluster is in error state for more than 10m.   ----> expected

There are 2 different versions of Ceph OSD components running. ----> expected

There are 2 different versions of Ceph Mon components running. -----> not expected as MONs are not at different versions



>> Some background:
+++++++++++++++++++++++

In a VMware+VSAN+CoreOS+3 OCS node setup, upon upgrade from the RC1 to the RC9 build, the following two issues were observed:

>> 1. Only 2 of the OSDs got upgraded to 14.2.4-69.el8cp (recovery took >1hr after re-spin of the 1st set of OSDs) - similar to BZ#1786029

>> 2. After 2 MDS re-spins (upgrade), both MDS reported as Standby and ceph is in HEALTH_ERR state: BZ#1786542


>> Some outputs: --> Full output pasted in Additional Info
+++++++++++++++++++++
sh-4.4# ceph status
  cluster:
    id:     0f5a378f-36f7-40c9-ba37-5af8a2f514e8
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
 
 

sh-4.4# ceph versions 
{
    "mon": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 3   <--all 3 mons at same version
    },
    "mgr": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.4-68.el8cp (56dc8251bc571344bddddfabebcc93abd10c4f4a) nautilus (stable)": 4, <-- 4 OSDs still at lower version(upgrade didnt proceed)
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 2  <--- OSDs have a mismatch of versions
    },
    "mds": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.4-68.el8cp (56dc8251bc571344bddddfabebcc93abd10c4f4a) nautilus (stable)": 4,
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 9



Version of all relevant components (if applicable):
----------------------------------------------------------------------------

$ oc get catsrc ocs-catalogsource -n openshift-marketplace -o yaml|grep -i image
    mediatype: image/svg+xml
  image: quay.io/rhceph-dev/ocs-registry:4.2-rc9

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-12-22-150714   True        False         3d20h   Cluster version is 4.2.0-0.nightly-2019-12-22-150714

>> Before upgrade:  image: quay.io/rhceph-dev/ocs-registry:4.2-rc1


Mon Images:
====
    Image:         quay.io/rhceph-dev/rhceph:4-7
    Image ID:      quay.io/rhceph-dev/rhceph@sha256:4307e50b86f539b46f08fbb158cb7b6e33ba954aa1a1a15d86f1471d39ab60f4


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
----------------------------------------------------------------------------
No.

Is there any workaround available to the best of your knowledge?
----------------------------------------------------------------------------
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
----------------------------------------------------------------------------
3

Is this issue reproducible?
----------------------------------------------------------------------------
Not sure.

Can this issue be reproduced from the UI?
----------------------------------------------------------------------------
This issue is seen in the Dashboard page.

If this is a regression, please provide more details to justify this:
----------------------------------------------------------------------------
Not sure, as BZ#1786542 has so far been seen only once, during upgrade.

BZ#1786542 - ( [VMware+CoreOS+VSAN] - Both MDS marked as STANDBY & Cluster in ERR state on OCS upgrade from RC1 to RC9) 

Steps to Reproduce:
----------------------------------------------------------------------------
1. Create an OCS cluster with the RC1 build in a VMware VSAN CoreOS setup.
2. Run some CephFS, RBD and RGW workloads (FIO, kernel-untar, bucket & file creations, etc.).
3. Start a manual OCS upgrade from RC1 to RC9 using the official documentation.
4. If by any chance one faces the issues reported in BZ#1786029 (some OSDs not upgraded) or BZ#1786542 (MDS are offline and damaged), check the UI->Dashboards->Overview->Alerts page.
5. You may see alerts that the OSDs are at different versions, but if all 3 MONs are at the same version, we should not see something like this (a quick cross-check of the actual MON versions is sketched after the alert text below):

 ++++++++ There are 2 different versions of Ceph Mon components running.+++++++
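
For reference, a quick way to cross-check the actual MON versions against this alert (a minimal sketch; it assumes the rook-ceph toolbox is deployed and uses the standard rook labels, so adjust names to your cluster):

# versions as Ceph itself reports them (toolbox pod)
$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph versions
# image actually running in each mon pod
$ oc -n openshift-storage get pods -l app=rook-ceph-mon \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'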


Actual results:
----------------------------------------------------------------------------
UI->Dashboards->Overview->Alerts message says "There are 2 different versions of Ceph Mon components running." But the MONs are at the same version (as seen from ceph versions and oc describe pod output).


Expected results:
----------------------------------------------------------------------------
Only the first 2 alerts are expected, not the third one. Screenshot attached.


Additional info:
----------------------------------------------------------------------------
sh-4.4# ceph status
  cluster:
    id:     0f5a378f-36f7-40c9-ba37-5af8a2f514e8
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
 
  services:
    mon: 3 daemons, quorum a,b,c (age 22h)
    mgr: a(active, since 27h)
    mds: ocs-storagecluster-cephfilesystem:0/1 2 up:standby, 1 damaged
    osd: 6 osds: 6 up (since 22h), 6 in (since 3d)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)
 
  task status:
 
  data:
    pools:   10 pools, 104 pgs
    objects: 723.17k objects, 1.1 TiB
    usage:   3.0 TiB used, 9.0 TiB / 12 TiB avail
    pgs:     104 active+clean
 
  io:
    client:   5.0 MiB/s rd, 7.0 MiB/s wr, 508 op/s rd, 30 op/s wr



rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-55946cbbpm2xs
====
    Image:         quay.io/rhceph-dev/rhceph:4-7
    Image ID:      quay.io/rhceph-dev/rhceph@sha256:4307e50b86f539b46f08fbb158cb7b6e33ba954aa1a1a15d86f1471d39ab60f4

rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-759c9876qqdnw
====
    Image:         quay.io/rhceph-dev/rhceph:4-7
    Image ID:      quay.io/rhceph-dev/rhceph@sha256:4307e50b86f539b46f08fbb158cb7b6e33ba954aa1a1a15d86f1471d39ab60f4
rook-ceph-mgr-a-756df58f57-z86qx
====
    Image:         quay.io/rhceph-dev/rhceph:4-7
    Image ID:      quay.io/rhceph-dev/rhceph@sha256:4307e50b86f539b46f08fbb158cb7b6e33ba954aa1a1a15d86f1471d39ab60f4
rook-ceph-mon-a-7f97b67988-2zbrr
====
    Image:         quay.io/rhceph-dev/rhceph:4-7
    Image ID:      quay.io/rhceph-dev/rhceph@sha256:4307e50b86f539b46f08fbb158cb7b6e33ba954aa1a1a15d86f1471d39ab60f4
    Image:         quay.io/rhceph-dev/rhceph:4-7
    Image ID:      quay.io/rhceph-dev/rhceph@sha256:4307e50b86f539b46f08fbb158cb7b6e33ba954aa1a1a15d86f1471d39ab60f4
    Image:         quay.io/rhceph-dev/rhceph:4-7
    Image ID:      quay.io/rhceph-dev/rhceph@sha256:4307e50b86f539b46f08fbb158cb7b6e33ba954aa1a1a15d86f1471d39ab60f4
rook-ceph-mon-b-75db9ccc9c-xvqdc
====
    Image:         quay.io/rhceph-dev/rhceph:4-7
    Image ID:      quay.io/rhceph-dev/rhceph@sha256:4307e50b86f539b46f08fbb158cb7b6e33ba954aa1a1a15d86f1471d39ab60f4
    Image:         quay.io/rhceph-dev/rhceph:4-7
    Image ID:      quay.io/rhceph-dev/rhceph@sha256:4307e50b86f539b46f08fbb158cb7b6e33ba954aa1a1a15d86f1471d39ab60f4
    Image:         quay.io/rhceph-dev/rhceph:4-7
    Image ID:      quay.io/rhceph-dev/rhceph@sha256:4307e50b86f539b46f08fbb158cb7b6e33ba954aa1a1a15d86f1471d39ab60f4
rook-ceph-mon-c-75bb46c846-7sfxh
====
    Image:         quay.io/rhceph-dev/rhceph:4-7
    Image ID:      quay.io/rhceph-dev/rhceph@sha256:4307e50b86f539b46f08fbb158cb7b6e33ba954aa1a1a15d86f1471d39ab60f4
    Image:         quay.io/rhceph-dev/rhceph:4-7
    Image ID:      quay.io/rhceph-dev/rhceph@sha256:4307e50b86f539b46f08fbb158cb7b6e33ba954aa1a1a15d86f1471d39ab60f4
    Image:         quay.io/rhceph-dev/rhceph:4-7
    Image ID:      quay.io/rhceph-dev/rhceph@sha256:4307e50b86f539b46f08fbb158cb7b6e33ba954aa1a1a15d86f1471d39ab60f4

Comment 3 Yaniv Kaul 2020-03-03 07:36:32 UTC
What's the next step here?

Comment 4 Anmol Sachan 2020-03-06 13:20:18 UTC
After trying to reproduce this issue, I observed that it was happening because metrics were not being correctly updated by ceph-mgr. I have created https://bugzilla.redhat.com/show_bug.cgi?id=1811027 to track the issue, and have set it as a blocker for this bug.

Comment 5 Anmol Sachan 2020-06-23 14:44:10 UTC
Cannot be solved yet as the real issue https://bugzilla.redhat.com/show_bug.cgi?id=1811027 is still reproducible. Should be moved to OCS 4.6.

Comment 6 Neha Berry 2020-08-19 11:14:00 UTC
Created attachment 1711860 [details]
mon_metadata metrics

(In reply to Anmol Sachan from comment #5)
> Cannot be solved yet as the real issue
> https://bugzilla.redhat.com/show_bug.cgi?id=1811027 is still reproducible.
> Should be moved to OCS 4.6

+1 Anmol

In an OCS 4.5.0-rc1 (4.5.0-49.ci) setup, I had performed an OCP upgrade which was stuck for 1-2 days until I cleaned up the Terminating noobaa pods (due to a known bug - BZ#1867762).

After bringing the cluster back into good shape, it was seen that even though all MONs are on the same version, the UI was showing the following alert:

"Aug 18, 6:28 pm
There are 2 different versions of Ceph Mon components running.
View details
""

On further troubleshooting with the help of Anmol, it was observed that the ceph version and endpoint fields were blank for "mon.e". It seems the MGR is not providing the correct information about this mon to Prometheus.

Logs - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bug-1786696-aug19-mon-version-alert/


Note:
-------------
1. No OCS upgrade was performed and the MON version has been 14.2.8-91.el8cp from the start.
2. Since nodes were affected during the OCP upgrade and were in NotReady state, there were multiple restarts of the MON pods, and ultimately the cluster ended up with mons b, d and e.

from ceph side
====================

sh-4.4# ceph -s
  cluster:
    id:     6da0f693-0893-4e2c-a004-06b5220b0632
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum b,d,e (age 20h)
    mgr: a(active, since 20h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 20h), 3 in (since 5d)
    rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)
 
  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle
 
  data:
    pools:   10 pools, 176 pgs
    objects: 16.00k objects, 60 GiB
    usage:   182 GiB used, 1.3 TiB / 1.5 TiB avail
    pgs:     176 active+clean
 
  io:
    client:   852 B/s rd, 265 KiB/s wr, 1 op/s rd, 1 op/s wr


--------------------------------

sh-4.4# ceph mon versions
{
    "ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 3
}
sh-4.4#

Comment 7 Anmol Sachan 2020-08-23 16:27:51 UTC
The bug depends on https://bugzilla.redhat.com/show_bug.cgi?id=1811027 . Moving to 4.7.0.

Comment 8 Mudit Agarwal 2020-11-06 13:32:47 UTC
*** Bug 1893722 has been marked as a duplicate of this bug. ***

Comment 9 Martin Bukatovic 2020-12-15 11:43:04 UTC
OCP Alert "There are 2 different versions of Ceph Mon components running." is not reliable and can't be interpreted alone. One should check version values from `ceph_mon_metadata` metrics to see what triggered the alert, and then compare that with what ceph observes via `ceph versions` command from ocs toolbox pod.

Comment 14 Nishanth Thomas 2021-04-26 05:55:09 UTC
*** Bug 1953111 has been marked as a duplicate of this bug. ***

Comment 21 Mudit Agarwal 2022-02-14 12:13:45 UTC
Fix for bug #1811027 is present in RHCS 5.1, https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c54
Moving the BZ to ON_QA, please verify with the latest 4.10 build.

Comment 28 arun kumar mohan 2022-04-05 13:46:40 UTC
Start a cluster of 3 mons + 3 OSDs on an older release (4.9).
Bring down one mon (you can bring that mon's replica count to zero) and then upgrade the cluster.
After the upgrade is done, leave that mon down; since the query looks at the mon metadata, we should see two different versions.
PS: it takes 10 minutes for the alert to fire.

See if the alert is fired or not.
Then bring the mon back up and see whether the mon is upgraded and whether the alert stays even after the mon is updated. (A sketch of the scale commands is below.)

This is a very thin possibility, but let's try.
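
For reference, the "bring one mon down" step could look roughly like this (a sketch; the mon letter is just an example, and note that the rook-ceph operator may try to restore or fail over the mon while it is down, which can shorten the test window):

# take one mon down before/during the upgrade
$ oc -n openshift-storage scale deployment rook-ceph-mon-a --replicas=0
# ... run the upgrade, wait ~10 minutes, check the Alerts page ...
# bring the mon back and watch whether the alert clears once it is upgraded
$ oc -n openshift-storage scale deployment rook-ceph-mon-a --replicas=1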

Comment 32 Mudit Agarwal 2022-04-11 11:33:42 UTC
Not a 4.10 blocker, moving it out.

Arun, PTAL @ https://bugzilla.redhat.com/show_bug.cgi?id=1786696#c29

Comment 38 arun kumar mohan 2022-05-27 14:34:59 UTC
(In reply to Prasad Desala from comment #29)
> This one looks similar to
> https://bugzilla.redhat.com/show_bug.cgi?id=1773594#c29 
> Can you please check and let us know if we need any additional fix from ODF
> side to verify this BZ?

Prasad, these BZs (this one, BZ#1786696, and BZ#1773594) look similar but have slight differences. As stated in https://bugzilla.redhat.com/show_bug.cgi?id=1786696#c16 by Anmol, this BZ is about a false positive (the alert is raised when it is not needed), while BZ#1773594 is about the negative case (the alert is not triggered even when there IS an issue). At that point a Ceph BZ (https://bugzilla.redhat.com/show_bug.cgi?id=1811027) was considered the root cause. But since that is fixed and both of these BZs are still reproducing, we need to take a look at the ODF queries.

Comment 40 arun kumar mohan 2022-06-09 13:15:37 UTC
Still have not cornered the root cause; will try to address this in 4.11...

Comment 42 arun kumar mohan 2022-06-27 15:49:28 UTC
The mon alert queries rely on the 'ceph_mon_metadata' metric, and this metric is not populated correctly.
Have filed BZ https://bugzilla.redhat.com/show_bug.cgi?id=2101497 for this.
Will pick this up once the dependent BZ is addressed.
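
For context, a mon version-mismatch rule over this metric typically has roughly the following shape (an illustrative sketch, not necessarily the exact expression shipped in ODF; the job label matcher is an assumption):

count(count by (ceph_version) (ceph_mon_metadata{job="rook-ceph-mgr"})) > 1

If ceph_mon_metadata carries an empty or stale ceph_version label for one mon (as seen for mon.e in comment 6), that counts as an extra "version" and the alert fires even though all quorum members run the same build.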

Comment 59 arun kumar mohan 2023-04-12 08:13:31 UTC
The OCS operator query changes are in place (fix for BZ https://bugzilla.redhat.com/show_bug.cgi?id=1773594) and the root cause is fixed in Ceph RADOS (fix for BZ https://bugzilla.redhat.com/show_bug.cgi?id=2101497), so this can be considered complete from the development perspective.
Moving it to QA for verification.

Comment 61 Shrivaibavi Raghaventhiran 2023-05-19 06:59:45 UTC
Tested on version:
-------------------
OCP - 4.13.0-0.nightly-2023-05-16-154455
ODF - 4.13.0-201

Initial Image:
----------------
quay.io/rhceph-dev/rhceph@sha256:8c93b131317f8de70b20ba87ce45fe7b3203a0e7fd9b9790dd5f6c64d4dfd1e3

Test Steps:
-----------
1. Set a different image on one mon and observed the mon version mismatch alert in the UI.
2. Set a different image on one osd and observed both the mon and osd version mismatch alerts in the UI.
3. Reset the old image on the mon and osd one by one and noticed the alerts disappearing (see the sketch after the oc set image output below).

[sraghave@localhost ~]$ oc set image deployment/rook-ceph-osd-2 osd=quay.io/rhceph-dev/rhceph@sha256:fa6d01cdef17bc32d2b95b8121b02f4d41adccc5ba8a9b95f38c97797ff6621f -n openshift-storage
deployment.apps/rook-ceph-osd-2 image updated
[sraghave@localhost ~]$ oc set image deployment/rook-ceph-mon-d mon=quay.io/rhceph-dev/rhceph@sha256:fa6d01cdef17bc32d2b95b8121b02f4d41adccc5ba8a9b95f38c97797ff6621f -n openshift-storage
deployment.apps/rook-ceph-mon-d image updated
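
Presumably the reset in step 3 was done with commands along these lines, pointing the deployments back at the initial image digest listed above (shown here only as a sketch):

$ oc set image deployment/rook-ceph-mon-d mon=quay.io/rhceph-dev/rhceph@sha256:8c93b131317f8de70b20ba87ce45fe7b3203a0e7fd9b9790dd5f6c64d4dfd1e3 -n openshift-storage
$ oc set image deployment/rook-ceph-osd-2 osd=quay.io/rhceph-dev/rhceph@sha256:8c93b131317f8de70b20ba87ce45fe7b3203a0e7fd9b9790dd5f6c64d4dfd1e3 -n openshift-storage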


With all the above observations, moving the BZ to Verified state.

Comment 62 errata-xmlrpc 2023-06-21 15:22:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742

