Bug 1811027

Summary: Ceph-mgr is not updating ceph_version in ceph_*_metadata Prometheus metric reliably.
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Anmol Sachan <asachan>
Component: RADOS    Assignee: Prashant Dhange <pdhange>
Status: CLOSED CURRENTRELEASE QA Contact: skanta
Severity: urgent Docs Contact:
Priority: high    
Version: 4.0    CC: akupczyk, amohan, assingh, bhubbard, branto, ceph-eng-bugs, ceph-qe-bugs, ChetRHosey, epuertat, jdurgin, mbukatov, muagarwa, nberry, nojha, nthomas, pdhange, pdhiran, rzarzyns, sangadi, skanta, sseshasa, tserlin, vereddy
Target Milestone: ---    Keywords: Reopened
Target Release: 4.2z3    Flags: skanta: needinfo? (kchai)
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ceph-14.2.11-157.el8cp, ceph-14.2.11-157.el7cp Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-04-30 16:16:27 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1773594, 1786696    

Description Anmol Sachan 2020-03-06 13:04:29 UTC
Description of problem:

Ceph-mgr is not updating ceph_version in ceph_*_metadata metric without a restart.

Platform: Openshift 4.3
Product: OCS 4.2
Ceph Image used: registry.redhat.io/ocs4/rhceph-rhel8


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Install OCS 4.2 on Openshift 4.3.
2. Check the ceph_mon_metadata and ceph_osd_metadata Prometheus metrics.
3. Update one of the mon and osd deployment images to a different image. Image used: quay.io/ceph-ci/ceph:wip-sage3-testing-2020-03-05-2154.

It might also be possible to reproduce this on standard RHCS by upgrading one of the mons and OSDs.
4. Check the ceph_mon_metadata and ceph_osd_metadata Prometheus metrics again. They will be unchanged.
5. Restart the ceph-mgr and check the metrics again. Now the metrics will be updated.
6. Downgrade the mon and osd to the original versions and check the metrics again. The ceph-mgr will not update the metrics to report the currently running versions.
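The check in steps 2-6 can be sketched as a small parser for the mgr's exposition text that extracts the ceph_version label per daemon, so successive scrapes can be diffed against "ceph versions". This is a hypothetical helper (function name and the abbreviated sample lines are illustrative, taken from the outputs pasted below):

```python
import re

# Match ceph_mon_metadata / ceph_osd_metadata lines from the mgr's
# prometheus exposition output and capture the label block.
METRIC_RE = re.compile(r'^ceph_(?:mon|osd)_metadata\{([^}]*)\}')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def daemon_versions(exposition_text):
    """Map each ceph_daemon to the ceph_version label it currently reports."""
    versions = {}
    for line in exposition_text.splitlines():
        m = METRIC_RE.match(line)
        if not m:
            continue
        labels = dict(LABEL_RE.findall(m.group(1)))
        versions[labels.get("ceph_daemon")] = labels.get("ceph_version", "")
    return versions

# Abbreviated sample lines in the same shape as the bug's pasted metrics.
sample = '''ceph_mon_metadata{ceph_daemon="mon.a",ceph_version="ceph version 14.2.4-69.el8cp nautilus (stable)"} 1.0
ceph_osd_metadata{ceph_daemon="osd.0",ceph_version="ceph version 15.1.0-1799-gd0b45e4 octopus (rc)"} 1.0'''

print(daemon_versions(sample))
```

Running this before and after the image swap makes it easy to see which daemons the mgr still reports with the stale version.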

Actual results:

ceph_version in ceph_*_metadata metric did not reflect the correct version running in the system.

Expected results:

ceph_version in ceph_*_metadata metric should reflect the correct version running in the system.

Additional info:


"ceph versions" command output:
{
    "mon": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 2,
        "ceph version 15.1.0-1799-gd0b45e4 (d0b45e421291d4cfcb430fde01a232cb768f3e14) octopus (rc)": 1
    },
    "mgr": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 2,
        "ceph version 15.1.0-1799-gd0b45e4 (d0b45e421291d4cfcb430fde01a232cb768f3e14) octopus (rc)": 1
    },
    "mds": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 8,
        "ceph version 15.1.0-1799-gd0b45e4 (d0b45e421291d4cfcb430fde01a232cb768f3e14) octopus (rc)": 2
    }
}

Metrics after the version upgrade:

ceph_mon_metadata{ceph_daemon="mon.a",ceph_version="ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)",endpoint="http-metrics",hostname="compute-1",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="172.30.21.187",rank="0",service="rook-ceph-mgr"}	1
ceph_mon_metadata{ceph_daemon="mon.b",ceph_version="ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)",endpoint="http-metrics",hostname="compute-0",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="172.30.178.215",rank="1",service="rook-ceph-mgr"}	1
ceph_mon_metadata{ceph_daemon="mon.d",ceph_version="ceph version 15.1.0-1799-gd0b45e4 (d0b45e421291d4cfcb430fde01a232cb768f3e14) octopus (rc)",endpoint="http-metrics",hostname="compute-2",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="172.30.45.1",rank="2",service="rook-ceph-mgr"}



ceph_osd_metadata{back_iface="eth0",ceph_daemon="osd.0",ceph_version="ceph version 15.1.0-1799-gd0b45e4 (d0b45e421291d4cfcb430fde01a232cb768f3e14) octopus (rc)",cluster_addr="10.128.2.55",device_class="hdd",endpoint="http-metrics",front_iface="eth0",hostname="compute-2",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",objectstore="bluestore",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="10.128.2.55",service="rook-ceph-mgr"}	1
ceph_osd_metadata{back_iface="eth0",ceph_daemon="osd.1",ceph_version="ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)",cluster_addr="10.129.2.35",device_class="hdd",endpoint="http-metrics",front_iface="eth0",hostname="compute-0",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",objectstore="bluestore",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="10.129.2.35",service="rook-ceph-mgr"}	1
ceph_osd_metadata{back_iface="eth0",ceph_daemon="osd.2",ceph_version="ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)",cluster_addr="10.130.2.32",device_class="hdd",endpoint="http-metrics",front_iface="eth0",hostname="compute-1",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",objectstore="bluestore",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="10.130.2.32",service="rook-ceph-mgr"}



After restoring the versions to the original:

ceph versions
{
    "mon": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 2
    },
    "mds": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 9
    }
}


Metrics :

ceph_mon_metadata{ceph_daemon="mon.a",ceph_version="ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)",endpoint="http-metrics",hostname="compute-1",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="172.30.21.187",rank="0",service="rook-ceph-mgr"}	1
ceph_mon_metadata{ceph_daemon="mon.b",ceph_version="ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)",endpoint="http-metrics",hostname="compute-0",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="172.30.178.215",rank="1",service="rook-ceph-mgr"}	1
ceph_mon_metadata{ceph_daemon="mon.d",ceph_version="ceph version 15.1.0-1799-gd0b45e4 (d0b45e421291d4cfcb430fde01a232cb768f3e14) octopus (rc)",endpoint="http-metrics",hostname="compute-2",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="172.30.45.1",rank="2",service="rook-ceph-mgr"}



ceph_osd_metadata{back_iface="eth0",ceph_daemon="osd.0",ceph_version="ceph version 15.1.0-1799-gd0b45e4 (d0b45e421291d4cfcb430fde01a232cb768f3e14) octopus (rc)",cluster_addr="10.128.2.55",device_class="hdd",endpoint="http-metrics",front_iface="eth0",hostname="compute-2",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",objectstore="bluestore",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="10.128.2.55",service="rook-ceph-mgr"}	1
ceph_osd_metadata{back_iface="eth0",ceph_daemon="osd.1",ceph_version="ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)",cluster_addr="10.129.2.35",device_class="hdd",endpoint="http-metrics",front_iface="eth0",hostname="compute-0",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",objectstore="bluestore",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="10.129.2.35",service="rook-ceph-mgr"}	1
ceph_osd_metadata{back_iface="eth0",ceph_daemon="osd.2",ceph_version="ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)",cluster_addr="10.130.2.32",device_class="hdd",endpoint="http-metrics",front_iface="eth0",hostname="compute-1",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",objectstore="bluestore",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="10.130.2.32",service="rook-ceph-mgr"}

After this, the changed OSD failed to come up with the original version, but the metric did not change. It should have updated to 0.

Comment 1 Boris Ranto 2020-05-12 13:29:29 UTC
Hi Anmol,

what do you use to look at the metrics? Do you check the metrics exported by the ceph-mgr prometheus plugin directly, or do you look at the output of the Prometheus server that gathers them?

Did you wait for some time (at least a minute, maybe more depending on where you are looking and your configuration) for the metrics to update?

This should no longer be happening. We had a couple of patches for similar issues back in the luminous days, but nowadays all the metrics should be cleared before they are repopulated.

Regards,
Boris

Comment 2 Boris Ranto 2020-05-20 12:15:23 UTC
FYI: I couldn't reproduce this when I tested it manually with the ceph-mgr exporter. The version metric did update as expected when I upgraded/downgraded a node/daemon. I suspect this is a case of the Prometheus queries not limiting how far back they look for information like this. You can restrict the query to just the last couple of minutes with query modifiers; see the Prometheus documentation:

https://prometheus.io/docs/prometheus/latest/querying/basics/

Otherwise, Prometheus might look further into the past for the last known value of a metric that no longer exists (since the labels changed).
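The lookback behavior described above can be illustrated with a toy model of an instant query. This is a deliberately simplified sketch (real Prometheus staleness handling is more involved, and the 300 s value is its default lookback delta): a series whose labels changed, such as a ceph_version bump, stops receiving samples, but an instant query still returns its last sample until it falls out of the lookback window.

```python
LOOKBACK = 300  # seconds; Prometheus' default lookback delta

def instant_query(series, at_ts, lookback=LOOKBACK):
    """Return the latest sample per series within [at_ts - lookback, at_ts].

    series: list of (labels, [(timestamp, value), ...]) tuples.
    """
    results = {}
    for labels, samples in series:
        recent = [(ts, v) for ts, v in samples if at_ts - lookback <= ts <= at_ts]
        if recent:
            results[labels] = max(recent)[1]  # newest sample wins
    return results

# mon.d was re-scraped with a new ceph_version label at t=1000; the series
# with the old label stops getting samples but lingers in query results.
series = [
    (("mon.d", "v14.2.4"), [(940, 1.0)]),   # stale label set, last sample t=940
    (("mon.d", "v15.1.0"), [(1000, 1.0)]),  # current label set
]
print(instant_query(series, at_ts=1010))               # both series visible
print(instant_query(series, at_ts=1010, lookback=60))  # stale series drops out
```

With the default window both label sets appear, which looks like two versions for one daemon; a tighter window (or waiting out the lookback delta) hides the stale series.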

Comment 3 Anmol Sachan 2020-06-23 14:39:54 UTC
@boris I tried to reproduce the issue again.

I am still able to reproduce it with OCS 4.4.

ceph versions
{
    "mon": {
        "ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)": 2,
        "ceph version 16.0.0-2550-g7e0b165 (7e0b165c7bd27357d6b5b351f53c7a4b9c25ff08) pacific (dev)": 1
    },
    "mgr": {
        "ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)": 2,
        "ceph version 16.0.0-2550-g7e0b165 (7e0b165c7bd27357d6b5b351f53c7a4b9c25ff08) pacific (dev)": 1
    },
    "mds": {
        "ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)": 8,
        "ceph version 16.0.0-2550-g7e0b165 (7e0b165c7bd27357d6b5b351f53c7a4b9c25ff08) pacific (dev)": 2
    }
}


Output from the mgr exporter, not Prometheus:

# HELP ceph_mon_metadata MON Metadata
# TYPE ceph_mon_metadata untyped
ceph_mon_metadata{ceph_daemon="mon.a",hostname="compute-0",public_addr="172.30.195.126",rank="0",ceph_version="ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)"} 1.0
ceph_mon_metadata{ceph_daemon="mon.b",hostname="compute-2",public_addr="172.30.201.105",rank="1",ceph_version="ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)"} 1.0
ceph_mon_metadata{ceph_daemon="mon.c",hostname="compute-1",public_addr="172.30.121.36",rank="2",ceph_version="ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)"} 1.0


# HELP ceph_osd_metadata OSD Metadata
# TYPE ceph_osd_metadata untyped
ceph_osd_metadata{back_iface="eth0",ceph_daemon="osd.0",cluster_addr="10.130.2.41",device_class="hdd",front_iface="eth0",hostname="compute-2",objectstore="bluestore",public_addr="10.130.2.41",ceph_version="ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)"} 1.0
ceph_osd_metadata{back_iface="eth0",ceph_daemon="osd.1",cluster_addr="10.128.2.38",device_class="hdd",front_iface="eth0",hostname="compute-1",objectstore="bluestore",public_addr="10.128.2.38",ceph_version="ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)"} 1.0
ceph_osd_metadata{back_iface="eth0",ceph_daemon="osd.2",cluster_addr="10.128.4.39",device_class="hdd",front_iface="eth0",hostname="compute-0",objectstore="bluestore",public_addr="10.128.4.39",ceph_version="ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)"} 1.0

This time, even after mgr and waiting for 10 mins, the version did not update in the exporter metric.

Comment 4 Anmol Sachan 2020-06-23 14:41:01 UTC
> This time, even after mgr and waiting for 10 mins, the version did not
> update in the exporter metric.

After MGR RESTART, metrics did not update.

Comment 5 Boris Ranto 2020-06-23 18:56:07 UTC
This is starting to sound like it has more to do with OCS than with the prometheus exporter. Could it be that OCS gets auto-updated after the downgrade (or something similar)?

Basically, if the metrics do not update even after the mgr restart, we are not getting the correct data from the ceph-mgr server, and we rely on that data for the metrics.

Also, how do you deploy ceph v16.0.0 in OCS?

Comment 6 Anmol Sachan 2020-06-24 17:42:21 UTC
@boris 

I am using OCS 4.4 on OCP 4.4. It is not OCS with V16 ceph.

I just upgraded the images in the deployment for one mon and osd.

For ceph v16.0.0: pull quay.io/ceph-ci/ceph:master
For ceph v14: quay.io/ceph-ci/ceph:nautilus

Comment 7 Boris Ranto 2020-06-24 20:39:56 UTC
So how do you update the image? Do you create a new CSV/appregistry container?

Comment 8 Anmol Sachan 2020-07-06 14:03:35 UTC
@boris sorry, I missed the comment. I update the image in the Deployment of one of the OSDs and MONs after the normal deployment is complete.

Comment 9 Neha Berry 2020-08-19 11:14:53 UTC
Reproduced again - https://bugzilla.redhat.com/show_bug.cgi?id=1786696#c6

Comment 10 Boris Ranto 2020-08-19 13:02:04 UTC
If that is the actual issue here, then this is not an issue in the prometheus module but in the mgr itself, as that is where we get this data from; re-assigning to RADOS for better coverage.

Comment 11 Kefu Chai 2020-08-27 14:44:32 UTC
It looks like a dup of bug 1844206; I will take a closer look at it tomorrow.

Comment 12 Kefu Chai 2020-08-28 06:54:10 UTC
At this moment, I want to focus first on the discrepancy between the versions reported by "ceph versions" and those reported by the prometheus ceph-mgr module.

The mon versions reported by "ceph versions" come from the MMonMetadata messages sent by the peons, while the mon versions reported by the prometheus ceph-mgr module come from the output of the "ceph mon metadata" command, which ceph-mgr issues when it launches.

I admit that we should update the metadata of the monitors when the monmap is updated.

but per https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c4

> After MGR RESTART, metrics did not update.

This is weird, as "ceph mon metadata" is also served directly by the monitor, and the monitor looks up the same source when serving this command as it does when serving "ceph versions".

@Neha and @Anmol, could you paste the output of "ceph mon metadata" and "ceph versions" before and after upgrading the monitor?
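The comparison being requested here can be sketched as a small helper that tallies mon versions from both commands and checks whether they agree. This is a hypothetical sketch (the helper name is illustrative, and the sample JSON is abbreviated from the outputs pasted in this bug):

```python
import json
from collections import Counter

def mon_version_counts(ceph_versions_json, mon_metadata_json):
    """Tally mon versions from `ceph versions` and from `ceph mon metadata`.

    Returns (expected, observed) Counters; they should match on a healthy
    cluster, and differ when the mgr is serving stale metadata.
    """
    expected = Counter(json.loads(ceph_versions_json)["mon"])
    observed = Counter(m["ceph_version"] for m in json.loads(mon_metadata_json))
    return expected, observed

# Abbreviated samples: `ceph versions` already sees the upgraded mon,
# but `ceph mon metadata` (as cached by the mgr) still shows all-old.
versions = '{"mon": {"ceph version 14.2.11-121.el7cp nautilus (stable)": 2, "ceph version 14.2.11-176.el7cp nautilus (stable)": 1}}'
metadata = ('[{"name": "a", "ceph_version": "ceph version 14.2.11-121.el7cp nautilus (stable)"},'
            ' {"name": "b", "ceph_version": "ceph version 14.2.11-121.el7cp nautilus (stable)"},'
            ' {"name": "c", "ceph_version": "ceph version 14.2.11-121.el7cp nautilus (stable)"}]')

expected, observed = mon_version_counts(versions, metadata)
print(expected == observed)  # False: the metadata still reports the old version
```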

Comment 13 Josh Durgin 2020-10-09 22:02:32 UTC
@Anmol any update?

Comment 16 Martin Bukatovic 2020-12-15 11:18:22 UTC
Yesterday, I noticed the "There are 2 different versions of Ceph Mon components running." alert right after a timed-out installation of OCS 4.7 on GCP, which otherwise looked fine (all ceph components were running, ceph health was OK). But:

- ceph monitors running were "a", "b" and "d" (so something happened to "c", but "d" took over)
- OCP Console reported alert "There are 2 different versions of Ceph Mon components running."
- no version mismatch was visible in the ceph versions output
- querying OCP Prometheus for `ceph_mon_metadata{job="rook-ceph-mgr"}` reveals that for mon.b, there are no values of ceph_version and hostname fields, which triggers the alert

OCP 4.6.0-0.nightly-2020-12-13-230909
OCS v4.6.0-195.ci

Comment 17 Boris Ranto 2020-12-15 12:58:11 UTC
@Martin: That is (probably) a different issue. Early on, when ceph-mgr starts, the version (and/or hostname) field might not be populated yet. It takes some time for ceph-mgr to gather all the data. We should just filter out these results in the alerting rule to avoid the version mismatch alert.

Alternatively, we could stop showing incomplete metadata metrics in the prometheus module. This could hide some other issues and have some unintended consequences.

Where is the alerting rule for OCP defined? Updating it should be fairly straightforward.

@nthomas : IIRC, you worked on these rules? Where are they defined?
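The filtering suggested above can be sketched in a few lines: count only distinct non-empty ceph_version label values, mirroring a PromQL selector like ceph_mon_metadata{ceph_version!=""}. This is an illustrative Python model of the rule's logic, not the actual alerting rule (which would be PromQL in the rule file):

```python
def distinct_versions(metadata_samples):
    """Count distinct non-empty ceph_version label values, skipping
    daemons whose metadata the mgr has not populated yet."""
    return {s["ceph_version"] for s in metadata_samples if s.get("ceph_version")}

# One mon's metadata is not populated yet (empty ceph_version), as seen
# right after ceph-mgr starts.
samples = [
    {"ceph_daemon": "mon.a", "ceph_version": "v14.2.11-121"},
    {"ceph_daemon": "mon.b", "ceph_version": ""},  # not yet populated
    {"ceph_daemon": "mon.d", "ceph_version": "v14.2.11-121"},
]

# Without the filter, the empty string would count as a second "version"
# and fire the mismatch alert; with it, only one real version remains.
print(len(distinct_versions(samples)))  # 1
```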

Comment 24 Kefu Chai 2021-01-15 10:28:29 UTC
@Anmol in https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c3, you mentioned 

> This time, even after mgr and waiting for 10 mins, the version did not update in the exporter metric.

while in https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c14, 

> After that I restarted the MGR pod and then the versions got updated in the Prometheus mgr metrics.

If restarting the mgr helps to get the monitor versions right, then the root cause must be that we fail to update the mon metadata when the monmap changes, as I put at https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c12.

I will create a patch to address the issue in https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c14, but I am wondering whether this really addresses the issue, as it seems we have two different issues here.

Comment 25 Kefu Chai 2021-01-16 06:42:54 UTC
upstream PR pending on review: https://github.com/ceph/ceph/pull/38932

Comment 26 Yaniv Kaul 2021-02-11 08:57:16 UTC
(In reply to Kefu Chai from comment #25)
> upstream PR pending on review: https://github.com/ceph/ceph/pull/38932

Cool - this one is merged.
Pacific backport - https://github.com/ceph/ceph/pull/39218
Octopus backport - https://github.com/ceph/ceph/pull/39219

Are we going to have it in Nautilus as well?

Comment 27 Neha Ojha 2021-02-12 22:23:48 UTC
(In reply to Yaniv Kaul from comment #26)
> (In reply to Kefu Chai from comment #25)
> > upstream PR pending on review: https://github.com/ceph/ceph/pull/38932
> 
> Cool - this one is merged.
> Pacific backport - https://github.com/ceph/ceph/pull/39218
> Octopus backport - https://github.com/ceph/ceph/pull/39219
> 
> Are we going to have it in Nautilus as well?

yes, the PR has already merged https://github.com/ceph/ceph/pull/39075

Comment 32 skanta 2021-05-29 00:01:15 UTC
Below are the findings:

ceph-MGR (working as expected)
------------------------------

   After updating the ceph packages, the ceph-mgr version is updated in both "ceph versions" and "ceph mgr metadata".

   Before update:-
   ---------------
    
                [root@skymaster ~]# ceph versions
                 {
                           ..............
                            },
                        "mgr": {
                                  "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 3
                               },
    
                               ...............................
                  }
                 [root@skymaster ~]#
               [root@skymaster ~]# ceph mgr metadata
[
    {
        "name": "harvard",
        "addr": "10.70.39.14",
        "addrs": "10.70.39.14:0/25069",
        "arch": "x86_64",
        "ceph_release": "nautilus",
        "ceph_version": "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)",
        "ceph_version_short": "14.2.11-121.el7cp",
        "cpu": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
        "distro": "rhel",
        "distro_description": "Storage",
        "distro_version": "7.9",
        "hostname": "harvard.lab.eng.blr.redhat.com",
        "kernel_description": "#1 SMP Thu Jan 21 16:15:07 EST 2021",
        "kernel_version": "3.10.0-1160.15.2.el7.x86_64",
        "mem_swap_kb": "16318460",
        "mem_total_kb": "32455672",
        "os": "Linux"
    },
    {
        "name": "havoc",
        "addr": "10.70.39.15",
        "addrs": "10.70.39.15:0/18841",
        "arch": "x86_64",
        "ceph_release": "nautilus",
        "ceph_version": "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)",
        "ceph_version_short": "14.2.11-121.el7cp",
        "cpu": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
        "distro": "rhel",
        "distro_description": "Red Hat Enterprise Linux Server 7.9 (Maipo)",
        "distro_version": "7.9",
        "hostname": "havoc.lab.eng.blr.redhat.com",
        "kernel_description": "#1 SMP Thu Jan 21 16:15:07 EST 2021",
        "kernel_version": "3.10.0-1160.15.2.el7.x86_64",
        "mem_swap_kb": "16318460",
        "mem_total_kb": "32455676",
        "os": "Linux"
    },
    {
        "name": "skymaster",
        "addr": "10.70.39.19",
        "addrs": "10.70.39.19:0/15668",
        "arch": "x86_64",
        "ceph_release": "nautilus",
        "ceph_version": "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)",
        "ceph_version_short": "14.2.11-121.el7cp",
        "cpu": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
        "distro": "rhel",
        "distro_description": "Red Hat Enterprise Linux Server 7.9 (Maipo)",
        "distro_version": "7.9",
        "hostname": "skymaster.lab.eng.blr.redhat.com",
        "kernel_description": "#1 SMP Thu Jan 21 16:15:07 EST 2021",
        "kernel_version": "3.10.0-1160.15.2.el7.x86_64",
        "mem_swap_kb": "16318460",
        "mem_total_kb": "32455692",
        "os": "Linux"
    }
]
[root@skymaster ~]#
   
After updating the packages:
---------------------------
[root@skymaster ~]# ceph versions
{
    "mon": {
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 2,
        "ceph version 14.2.11-176.el7cp (3b8868b199b96182092d76f5da192852edea9bc8) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 77
    },
    "mds": {
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 4
    },
    "rbd-mirror": {
        "ceph version 14.2.11-119.el7cp (4afc970bb5fa3a23e3ea64bab2cc4d50d2e3f837) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 2
    },
    "rgw-nfs": {
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.11-119.el7cp (4afc970bb5fa3a23e3ea64bab2cc4d50d2e3f837) nautilus (stable)": 2,
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 89,
        "ceph version 14.2.11-176.el7cp (3b8868b199b96182092d76f5da192852edea9bc8) nautilus (stable)": 1
    }
}
[root@skymaster ~]# ceph mgr metadata
[
    {
        "name": "harvard",
        "addr": "10.70.39.14",
        "addrs": "10.70.39.14:0/25069",
        "arch": "x86_64",
        "ceph_release": "nautilus",
        "ceph_version": "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)",
        "ceph_version_short": "14.2.11-121.el7cp",
        "cpu": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
        "distro": "rhel",
        "distro_description": "Storage",
        "distro_version": "7.9",
        "hostname": "harvard.lab.eng.blr.redhat.com",
        "kernel_description": "#1 SMP Thu Jan 21 16:15:07 EST 2021",
        "kernel_version": "3.10.0-1160.15.2.el7.x86_64",
        "mem_swap_kb": "16318460",
        "mem_total_kb": "32455672",
        "os": "Linux"
    },
    {
        "name": "havoc",
        "addr": "10.70.39.15",
        "addrs": "10.70.39.15:0/18841",
        "arch": "x86_64",
        "ceph_release": "nautilus",
        "ceph_version": "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)",
        "ceph_version_short": "14.2.11-121.el7cp",
        "cpu": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
        "distro": "rhel",
        "distro_description": "Red Hat Enterprise Linux Server 7.9 (Maipo)",
        "distro_version": "7.9",
        "hostname": "havoc.lab.eng.blr.redhat.com",
        "kernel_description": "#1 SMP Thu Jan 21 16:15:07 EST 2021",
        "kernel_version": "3.10.0-1160.15.2.el7.x86_64",
        "mem_swap_kb": "16318460",
        "mem_total_kb": "32455676",
        "os": "Linux"
    },
    {
        "name": "skymaster",
        "addr": "10.70.39.19",
        "addrs": "10.70.39.19:0/14135",
        "arch": "x86_64",
        "ceph_release": "nautilus",
        "ceph_version": "ceph version 14.2.11-176.el7cp (3b8868b199b96182092d76f5da192852edea9bc8) nautilus (stable)",
        "ceph_version_short": "14.2.11-176.el7cp",
        "cpu": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
        "distro": "rhel",
        "distro_description": "Red Hat Enterprise Linux Server 7.9 (Maipo)",
        "distro_version": "7.9",
        "hostname": "skymaster.lab.eng.blr.redhat.com",
        "kernel_description": "#1 SMP Thu Jan 21 16:15:07 EST 2021",
        "kernel_version": "3.10.0-1160.15.2.el7.x86_64",
        "mem_swap_kb": "16318460",
        "mem_total_kb": "32455692",
        "os": "Linux"
    }
]
[root@skymaster ~]#

ceph-MGR versions are updated and working as expected.

ceph-mon (not working as expected):-
------------------------------------

After updating the ceph-mon packages, the versions are not updated in "ceph mon metadata", although the version update does show in the "ceph versions" output. I tried "systemctl reload <mon service>", but the metadata was not updated.
After "systemctl restart <mon service>", the version is updated in the metadata as well.

Below is the yum log snippet:-
------------------------------------

........................................
.......................................................
May 27 16:29:25 Updated: 2:ceph-selinux-14.2.11-176.el7cp.x86_64
May 27 16:29:26 Updated: 2:ceph-mgr-14.2.11-176.el7cp.x86_64
May 27 16:29:26 Updated: 2:ceph-grafana-dashboards-14.2.11-176.el7cp.noarch
May 27 16:29:28 Updated: 2:ceph-mgr-dashboard-14.2.11-176.el7cp.noarch
May 27 16:29:28 Updated: 2:ceph-mgr-diskprediction-local-14.2.11-176.el7cp.noarch
May 27 16:29:29 Updated: 2:ceph-mds-14.2.11-176.el7cp.x86_64
May 27 16:29:30 Updated: 2:ceph-mon-14.2.11-176.el7cp.x86_64
.................................................................................
....................................................................................

After update:-
--------------
[root@skymaster log]# ceph versions
{
    "mon": {
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 1,
        "ceph version 14.2.11-176.el7cp (3b8868b199b96182092d76f5da192852edea9bc8) nautilus (stable)": 1,
        "ceph version 14.2.11-177.el7cp (0486420967ea3327d3ba01d3184f3ab96ddaa616) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 77
    },
    "mds": {
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 4
    },
    "rbd-mirror": {
        "ceph version 14.2.11-119.el7cp (4afc970bb5fa3a23e3ea64bab2cc4d50d2e3f837) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 2
    },
    "rgw-nfs": {
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.11-119.el7cp (4afc970bb5fa3a23e3ea64bab2cc4d50d2e3f837) nautilus (stable)": 2,
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 88,
        "ceph version 14.2.11-176.el7cp (3b8868b199b96182092d76f5da192852edea9bc8) nautilus (stable)": 1,
        "ceph version 14.2.11-177.el7cp (0486420967ea3327d3ba01d3184f3ab96ddaa616) nautilus (stable)": 1
    }
}

The output is the same after a reload as well.

ceph-mon versions updated after service restart:
-------------------------------------------------

After restart:-

[root@skymaster log]# ceph versions
{
    "mon": {
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 2,
        "ceph version 14.2.11-176.el7cp (3b8868b199b96182092d76f5da192852edea9bc8) nautilus (stable)": 1-------------------------------------> version updated
    },
    "mgr": {
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 1,
        "ceph version 14.2.11-176.el7cp (3b8868b199b96182092d76f5da192852edea9bc8) nautilus (stable)": 1,
        "ceph version 14.2.11-177.el7cp (0486420967ea3327d3ba01d3184f3ab96ddaa616) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 77
    },
    "mds": {
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 4
    },
    "rbd-mirror": {
        "ceph version 14.2.11-119.el7cp (4afc970bb5fa3a23e3ea64bab2cc4d50d2e3f837) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 2
    },
    "rgw-nfs": {
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.11-119.el7cp (4afc970bb5fa3a23e3ea64bab2cc4d50d2e3f837) nautilus (stable)": 2,
        "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 87,
        "ceph version 14.2.11-176.el7cp (3b8868b199b96182092d76f5da192852edea9bc8) nautilus (stable)": 2,
        "ceph version 14.2.11-177.el7cp (0486420967ea3327d3ba01d3184f3ab96ddaa616) nautilus (stable)": 1
    }
}
[root@skymaster log]#


[root@skymaster log]# ceph mon metadata --------------------------------------------->>restarted the service in skymaster
[
    {
        "name": "harvard",
        "addrs": "[v2:10.70.39.14:3300/0,v1:10.70.39.14:6789/0]",
        "arch": "x86_64",
        "ceph_release": "nautilus",
        "ceph_version": "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)",
        "ceph_version_short": "14.2.11-121.el7cp",
        "compression_algorithms": "none, snappy, zlib, zstd, lz4",
        "cpu": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
        "device_ids": "",
        "device_paths": "",
        "devices": "",
        "distro": "rhel",
        "distro_description": "Employee SKU",
        "distro_version": "7.9",
        "hostname": "harvard.lab.eng.blr.redhat.com",
        "kernel_description": "#1 SMP Thu Jan 21 16:15:07 EST 2021",
        "kernel_version": "3.10.0-1160.15.2.el7.x86_64",
        "mem_swap_kb": "16318460",
        "mem_total_kb": "32455672",
        "os": "Linux"
    },
    {
        "name": "havoc",
        "addrs": "[v2:10.70.39.15:3300/0,v1:10.70.39.15:6789/0]",
        "arch": "x86_64",
        "ceph_release": "nautilus",
        "ceph_version": "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)",
        "ceph_version_short": "14.2.11-121.el7cp",
        "compression_algorithms": "none, snappy, zlib, zstd, lz4",
        "cpu": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
        "device_ids": "",
        "device_paths": "",
        "devices": "",
        "distro": "rhel",
        "distro_description": "Red Hat Enterprise Linux Server 7.9 (Maipo)",
        "distro_version": "7.9",
        "hostname": "havoc.lab.eng.blr.redhat.com",
        "kernel_description": "#1 SMP Thu Jan 21 16:15:07 EST 2021",
        "kernel_version": "3.10.0-1160.15.2.el7.x86_64",
        "mem_swap_kb": "16318460",
        "mem_total_kb": "32455676",
        "os": "Linux"
    },
    {
        "name": "skymaster",
        "addrs": "[v2:10.70.39.19:3300/0,v1:10.70.39.19:6789/0]",
        "arch": "x86_64",
        "ceph_release": "nautilus",
        "ceph_version": "ceph version 14.2.11-176.el7cp (3b8868b199b96182092d76f5da192852edea9bc8) nautilus (stable)",
        "ceph_version_short": "14.2.11-176.el7cp",------------------------------------------------------------------------------>>Version updated
        "compression_algorithms": "none, snappy, zlib, zstd, lz4",
        "cpu": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
        "device_ids": "",
        "device_paths": "",
        "devices": "",
        "distro": "rhel",
        "distro_description": "Red Hat Enterprise Linux Server 7.9 (Maipo)",
        "distro_version": "7.9",
        "hostname": "skymaster.lab.eng.blr.redhat.com",
        "kernel_description": "#1 SMP Thu Jan 21 16:15:07 EST 2021",
        "kernel_version": "3.10.0-1160.15.2.el7.x86_64",
        "mem_swap_kb": "16318460",
        "mem_total_kb": "32455692",
        "os": "Linux"
    }
]
[root@skymaster log]# 


The same happens with the other nodes.

In the ceph-mon RPM scriptlet there is a step to restart the service after a ceph-mon update.

ceph-mon Shell script:
----------------------

[root@skymaster log]# rpm -q --scripts ceph-mon
............................................................
.....................................................................
.....................................................................

if [ $1 -ge 1 ] ; then
  # Restart on upgrade, but only if "CEPH_AUTO_RESTART_ON_UPGRADE" is set to
  # "yes". In any case: if units are not running, do not touch them.
  SYSCONF_CEPH=/etc/sysconfig/ceph
  if [ -f $SYSCONF_CEPH -a -r $SYSCONF_CEPH ] ; then
    source $SYSCONF_CEPH
  fi
  if [ "X$CEPH_AUTO_RESTART_ON_UPGRADE" = "Xyes" ] ; then
    /usr/bin/systemctl try-restart ceph-mon@\*.service > /dev/null 2>&1 || :
  fi
fi
[root@skymaster log]#
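A quick way to spot the stale-metadata symptom shown above is to group the monitors by the `ceph_version` their metadata reports. A minimal sketch, using abbreviated sample data taken from the `ceph mon metadata` output in this report (on a live cluster the JSON would come from `ceph mon metadata -f json`; the `version_skew` helper is hypothetical, not a ceph API):

```python
import json

# Abbreviated sample of `ceph mon metadata -f json` output, as seen in
# this report. On a live cluster it could be fetched with subprocess.
mon_metadata = json.loads("""
[
  {"name": "harvard",   "ceph_version_short": "14.2.11-121.el7cp"},
  {"name": "havoc",     "ceph_version_short": "14.2.11-121.el7cp"},
  {"name": "skymaster", "ceph_version_short": "14.2.11-176.el7cp"}
]
""")

def version_skew(metadata):
    """Group daemon names by the version their metadata reports.

    More than one key in the result means the metadata is not uniform:
    either a rolling upgrade is in flight, or (as in this bug) the
    metadata was never refreshed after the daemon was upgraded.
    """
    by_version = {}
    for daemon in metadata:
        by_version.setdefault(daemon["ceph_version_short"], []).append(daemon["name"])
    return by_version

for version, names in sorted(version_skew(mon_metadata).items()):
    print(f"{version}: {', '.join(names)}")
```

With the sample data this prints two groups, matching the report: harvard and havoc still advertise 14.2.11-121.el7cp, while skymaster (restarted) advertises 14.2.11-176.el7cp.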

Comment 33 skanta 2021-06-04 05:28:20 UTC
Please let me know when the fix will be provided. The RC date is planned for 8-June.

Comment 34 skanta 2021-06-10 00:05:43 UTC
@Neha - Please let me know whether this issue is a blocker for the release.

        Restarting the service updates the Ceph version in the metadata.

Comment 44 Veera Raghava Reddy 2021-09-29 15:54:20 UTC
Hi Neha,
Looks like this bug is still open with no target date for a fix.
RHCS 4.3 dev freeze was on Sep 24. Is this planned for 4.3?

Comment 46 Prashant Dhange 2021-10-05 01:39:21 UTC
@skanta

(In reply to skanta from comment #32)
> Below are the findings-
> 
> ceph-MGR(Working as expected)
> ------------------------------
> 
>    After updating the ceph packages, ceph-mgr is updating both the "ceph versions"
> and "ceph mgr metadata" output
> 
..
> ceph-MGR versions are updated and working as expected.
This confirms that upstream PR#38932 has fixed the issue where mon metadata was not being updated when the monmap changed.

> 
> ceph-mon(Not working as expected):-
> ------------------------------------
> 
> After updating the ceph-mon packages, the versions are not updated in the
> "metadata", but the version update does show in the "ceph versions" output. I
> tried "systemctl reload <mon service>", but the metadata was not updated.
> Only after "systemctl restart <mon service>" is the version updated in the
> metadata as well.
Do you still have this cluster handy? "systemctl reload <mon service>" keeps the ceph-mon process running and only asks it to re-read the Ceph configuration (it does not restart the ceph-mon systemd unit), so it cannot refresh metadata that is collected at process start.
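The reload-versus-restart distinction matters here because a daemon snapshots its metadata (including ceph_version, which is baked into the running binary) once at process start. A toy sketch of that lifecycle (the `Daemon` class below is purely illustrative, not ceph code):

```python
# Hypothetical model of a daemon that, like ceph-mon, captures its
# version metadata once at process start. Not actual ceph code.
class Daemon:
    def __init__(self, installed_version):
        # Metadata is captured when the process starts ...
        self.metadata_version = installed_version

    def reload(self, installed_version):
        # ... so a reload (config re-read, same process keeps running)
        # cannot pick up a newly installed binary's version.
        pass

    def restart(self, installed_version):
        # Only a full restart re-executes the (upgraded) binary
        # and re-captures the metadata.
        self.metadata_version = installed_version

mon = Daemon("14.2.11-121.el7cp")   # daemon started on the old build
mon.reload("14.2.11-176.el7cp")     # package upgraded, then reload
print(mon.metadata_version)         # prints 14.2.11-121.el7cp (stale)
mon.restart("14.2.11-176.el7cp")    # full restart
print(mon.metadata_version)         # prints 14.2.11-176.el7cp
```

This matches the observed behavior: `systemctl reload` left the metadata stale, and only `systemctl restart` refreshed it.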

> 
> The same with the other nodes.
> 
> In the ceph-mon shell script there is a step to restart the service after
> ceph-mon update.
> 
> ceph-mon Shell script:
> ----------------------
> 
> [root@skymaster log]# rpm -q --scripts ceph-mon
> ............................................................
> .....................................................................
> .....................................................................
> 
> if [ $1 -ge 1 ] ; then
>   # Restart on upgrade, but only if "CEPH_AUTO_RESTART_ON_UPGRADE" is set to
>   # "yes". In any case: if units are not running, do not touch them.
>   SYSCONF_CEPH=/etc/sysconfig/ceph
>   if [ -f $SYSCONF_CEPH -a -r $SYSCONF_CEPH ] ; then
>     source $SYSCONF_CEPH
>   fi
>   if [ "X$CEPH_AUTO_RESTART_ON_UPGRADE" = "Xyes" ] ; then
>     /usr/bin/systemctl try-restart ceph-mon@\*.service > /dev/null 2>&1 || :
>   fi
> fi
> [root@skanta 

Restarting Ceph services (e.g. ceph-mon here) after a ceph upgrade is disabled by default; it is controlled by CEPH_AUTO_RESTART_ON_UPGRADE, which is set to "no" by default:
# grep -v '^#' /etc/sysconfig/ceph

TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728

CEPH_AUTO_RESTART_ON_UPGRADE=no
CLUSTER=ceph
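For the scriptlet's try-restart step to actually fire on upgrade, the flag would have to be flipped in /etc/sysconfig/ceph. A sketch of that config (note the caveat: this restarts every matching daemon on the node at package-upgrade time, so leaving it at "no" and restarting daemons one at a time is often the safer choice):

```shell
# /etc/sysconfig/ceph
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728
# Allow the RPM scriptlet to try-restart running ceph units on upgrade.
CEPH_AUTO_RESTART_ON_UPGRADE=yes
CLUSTER=ceph
```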

Comment 48 Prashant Dhange 2021-10-25 02:50:49 UTC
(In reply to skanta from comment #47)
> Cluster is not handy but I tried to reproduce the issue from 4.2 released
> version(ceph version 14.2.11-202.el8cp ) to the latest version(ceph version
> 14.2.11-203.el8cp).
> 
> Version updated successfully without restarting the service.
@skanta So we are no longer seeing the issue of the version not getting updated after a ceph upgrade and/or a ceph daemon restart? Note that we have to restart the ceph daemons on a node after the ceph packages are upgraded.

Are we good to move this BZ to the VERIFIED state?

Comment 50 arun kumar mohan 2021-10-28 12:58:01 UTC
Hi Skanta,
I was trying to recreate the BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1773594 (which was blocked by this BZ), but I'm unable to get an RHCS image which has the fix.
The max/latest version I could get is '14.2.11-199.el8cp.x86_64', which is a month old and didn't seem to have the fix.
Can you please confirm which RHCS version has the fix and (if possible) point me to the image link?

@skanta

Comment 51 skanta 2021-10-30 14:15:41 UTC
The fix was provided in ceph-14.2.11-157.el8cp. If you require more details, please contact the development team.

Comment 52 Prashant Dhange 2021-12-23 14:58:38 UTC
(In reply to arun kumar mohan from comment #50)
> Hi Skanta,
> I was trying to recreate the BZ:
> https://bugzilla.redhat.com/show_bug.cgi?id=1773594 (which was blocked by
> this BZ), but I'm unable to get an RHCS image which has the fix.
> The max/latest version I could get is '14.2.11-199.el8cp.x86_64', which is a
> month old and didn't seem to have the fix.
> Can you please confirm which RHCS version has the fix and (if possible)
> point me to the image link?
> 
> @skanta
@amohan Are you still blocked on testing? Were you able to get the right RHCS image?

Comment 55 arun kumar mohan 2022-03-11 06:33:40 UTC
Removing the needinfo as the BZ is verified