Bug 1811027
| Summary: | Ceph-mgr is not updating ceph_version in ceph_*_metadata Prometheus metric reliably. | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Anmol Sachan <asachan> |
| Component: | RADOS | Assignee: | Prashant Dhange <pdhange> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | skanta |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | ||
| Version: | 4.0 | CC: | akupczyk, amohan, assingh, bhubbard, branto, ceph-eng-bugs, ceph-qe-bugs, ChetRHosey, epuertat, jdurgin, mbukatov, muagarwa, nberry, nojha, nthomas, pdhange, pdhiran, rzarzyns, sangadi, skanta, sseshasa, tserlin, vereddy |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.2z3 | Flags: | skanta: needinfo? (kchai) |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | ceph-14.2.11-157.el8cp, ceph-14.2.11-157.el7cp | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-04-30 16:16:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1773594, 1786696 | ||
Hi Anmol,
what do you use to look at the metrics? Do you directly check the metrics exported by the ceph-mgr prometheus plugin, or do you look at the output of the prometheus server gathering the metrics? Did you wait for some time (at least a minute, maybe more depending on where you are looking and your configuration) for the metrics to update?
This should no longer be happening. We have had a couple of patches for similar issues back in the luminous days, but nowadays all the metrics should get cleared before they are repopulated.
Regards,
Boris

FYI: I couldn't reproduce this when I was testing this manually with the ceph-mgr exporter. The version metric did update as expected when I upgraded/downgraded a node/daemon.
I suspect this is a case of prometheus queries not setting time-outs for information like this. You can make the query look just a couple of minutes into the past with some modifiers, see the prometheus documentation: https://prometheus.io/docs/prometheus/latest/querying/basics/
Otherwise, prometheus might look further into the past for the last known value of a metric that no longer exists (as the labels changed).

@boris I tried replicating the issue.
I am still able to replicate this issue with OCS 4.4
ceph versions
{
"mon": {
"ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)": 2,
"ceph version 16.0.0-2550-g7e0b165 (7e0b165c7bd27357d6b5b351f53c7a4b9c25ff08) pacific (dev)": 1
},
"mgr": {
"ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)": 1
},
"osd": {
"ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)": 2,
"ceph version 16.0.0-2550-g7e0b165 (7e0b165c7bd27357d6b5b351f53c7a4b9c25ff08) pacific (dev)": 1
},
"mds": {
"ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)": 2
},
"rgw": {
"ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)": 1
},
"overall": {
"ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)": 8,
"ceph version 16.0.0-2550-g7e0b165 (7e0b165c7bd27357d6b5b351f53c7a4b9c25ff08) pacific (dev)": 2
}
}
Output from the mgr, not Prometheus:
# HELP ceph_mon_metadata MON Metadata
# TYPE ceph_mon_metadata untyped
ceph_mon_metadata{ceph_daemon="mon.a",hostname="compute-0",public_addr="172.30.195.126",rank="0",ceph_version="ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)"} 1.0
ceph_mon_metadata{ceph_daemon="mon.b",hostname="compute-2",public_addr="172.30.201.105",rank="1",ceph_version="ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)"} 1.0
ceph_mon_metadata{ceph_daemon="mon.c",hostname="compute-1",public_addr="172.30.121.36",rank="2",ceph_version="ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)"} 1.0
# HELP ceph_osd_metadata OSD Metadata
# TYPE ceph_osd_metadata untyped
ceph_osd_metadata{back_iface="eth0",ceph_daemon="osd.0",cluster_addr="10.130.2.41",device_class="hdd",front_iface="eth0",hostname="compute-2",objectstore="bluestore",public_addr="10.130.2.41",ceph_version="ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)"} 1.0
ceph_osd_metadata{back_iface="eth0",ceph_daemon="osd.1",cluster_addr="10.128.2.38",device_class="hdd",front_iface="eth0",hostname="compute-1",objectstore="bluestore",public_addr="10.128.2.38",ceph_version="ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)"} 1.0
ceph_osd_metadata{back_iface="eth0",ceph_daemon="osd.2",cluster_addr="10.128.4.39",device_class="hdd",front_iface="eth0",hostname="compute-0",objectstore="bluestore",public_addr="10.128.4.39",ceph_version="ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)"} 1.0
This time, even after mgr and waiting for 10 mins, the version did not update in the exporter metric.
> This time, even after mgr and waiting for 10 mins, the version did not
> update in the exporter metric.
After MGR RESTART, metrics did not update.
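The comparison done by hand above (the `ceph versions` JSON against the `ceph_version` label in the exporter output) could be scripted roughly as follows. This is only a minimal sketch: the function names are hypothetical, the label regex assumes simple double-quoted label values as in the output pasted above, and it is not part of ceph-mgr or the prometheus module.

```python
import json
import re
from collections import Counter

# Matches one label pair such as ceph_daemon="mon.a" inside the braces.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def versions_from_exposition(text, metric):
    """Count the ceph_version label values of one metadata metric."""
    counts = Counter()
    for line in text.splitlines():
        if not line.startswith(metric + "{"):
            continue  # skips "# HELP", "# TYPE" and unrelated metrics
        labels = dict(LABEL_RE.findall(line))
        counts[labels.get("ceph_version", "")] += 1
    return counts


def versions_from_cli(versions_json, daemon_type):
    """Per-version daemon counts that `ceph versions` reports for one type."""
    return Counter(json.loads(versions_json).get(daemon_type, {}))


def mismatch(exposition_text, versions_json, daemon_type):
    """True when the exporter and `ceph versions` disagree for daemon_type."""
    metric = "ceph_%s_metadata" % daemon_type  # e.g. ceph_mon_metadata
    exported = versions_from_exposition(exposition_text, metric)
    return exported != versions_from_cli(versions_json, daemon_type)
```

Fed with the two outputs pasted above, `mismatch(..., "mon")` would be expected to return True, since the exporter lists three nautilus mons while `ceph versions` lists two nautilus and one pacific.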
This is starting to sound like this has got more to do with OCS than the prometheus exporter. Could it be that OCS gets auto-updated after the downgrade (or something similar)? Basically, if the metrics do not update even after the mgr restart, we are not getting the correct data from the ceph-mgr server, and we are relying on that for the metrics. Also, how do you deploy ceph v16.0.0 in OCS?

@boris I am using OCS 4.4 on OCP 4.4. It is not OCS with V16 ceph. I just upgraded the images in the deployment for one mon and osd.
For ceph v16.0.0 : pull quay.io/ceph-ci/ceph:master
for ceph v14 : quay.io/ceph-ci/ceph:nautilus

So how do you update the image? Do you create a new CSV/appregistry container?

@boris sorry I missed the comment. I update the image in the Deployment of 1 of the OSD and MON after the normal deployment is complete.

Reproduced again - https://bugzilla.redhat.com/show_bug.cgi?id=1786696#c6

If that is the actual issue here, then this is not an issue in the prometheus module but in the mgr itself, as that is where we are getting these data from. Re-assigning to RADOS for better coverage.

It looks like a dup of #1844206, will take a closer look at it tomorrow.

At this moment, I want to focus on the discrepancies between the versions reported by "ceph versions" and those reported by the prometheus ceph-mgr module first. The versions of mon reported by "ceph versions" come from the MMonMetadata messages sent by peons, while the versions of mon reported by the prometheus ceph-mgr module come from the output of the "ceph mon metadata" command sent by ceph-mgr when it launches. I admit that we should update the metadata of monitors when the monmap is updated. But per https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c4
> After MGR RESTART, metrics did not update.
this is weird, as "ceph mon metadata" is also served by the monitor directly, and the monitor looks up the same source when serving this command as it does when serving "ceph versions".

@Neha and @Anmol could you paste the output of "ceph mon metadata" and "ceph versions" before and after upgrading the monitor?

@Anmol any update?

Yesterday, I noticed the "There are 2 different versions of Ceph Mon components running." alert right after a timed-out installation of OCS 4.7 on GCP, which otherwise looked fine (all ceph components were running, ceph health was ok). But:
- ceph monitors running were "a", "b" and "d" (so something happened to "c", but "d" took over)
- OCP Console reported alert "There are 2 different versions of Ceph Mon components running."
- no version mismatch noticed in ceph versions output
- querying OCP Prometheus for `ceph_mon_metadata{job="rook-ceph-mgr"}` reveals that for mon.b, there are no values of ceph_version and hostname fields, which triggers the alert
OCP 4.6.0-0.nightly-2020-12-13-230909
OCS v4.6.0-195.ci
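The alert above appears to be triggered by a series whose ceph_version and hostname labels are empty. As a rough illustration only, the following Python sketch counts distinct mon versions while ignoring such incomplete series. The function names and the input shape (a list of label dictionaries, one per `ceph_mon_metadata` series) are assumptions made for this example; this is not the actual OCP alerting rule.

```python
def distinct_versions(series):
    """Return the set of ceph_version values found on complete series.

    `series` is an iterable of label dictionaries. A series is treated as
    incomplete, and skipped, when its ceph_version or hostname label is
    missing or empty.
    """
    return {
        labels["ceph_version"]
        for labels in series
        if labels.get("ceph_version") and labels.get("hostname")
    }


def should_alert(series):
    """True when more than one distinct version remains after filtering."""
    return len(distinct_versions(series)) > 1
```

With this filter, a mon.b series with empty labels no longer counts as a second "version", while a real mix of two versions still triggers the check.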
@Martin: That is (probably) a different issue. Early on when ceph-mgr starts, the version (and/or hostname) field might not be populated yet. It takes some time for ceph-mgr to gather all the data. We should just filter out these results in the alerting rule to avoid the version mismatch alert. Alternatively, we could stop showing incomplete metadata metrics in the prometheus module, but this could hide some other issues and have some unintended consequences. Where is the alerting rule for OCP defined? Updating it should be fairly straightforward.
@nthomas : IIRC, you worked on these rules? Where are they defined?

@Anmol in https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c3, you mentioned
> This time, even after mgr and waiting for 10 mins, the version did not update in the exporter metric.
while in https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c14,
> After that I restarted the MGR pod and then the versions got updated in the Prometheus mgr metrics.
If restarting mgr helps to get the monitor versions right, then the root cause must be that we fail to update mon metadata with the monmap, as I put at https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c12. I will create a patch to address the issue in https://bugzilla.redhat.com/show_bug.cgi?id=1811027#c14, but I am wondering if this really addresses the issue, as it seems we have two different issues here.

upstream PR pending on review: https://github.com/ceph/ceph/pull/38932

(In reply to Kefu Chai from comment #25)
> upstream PR pending on review: https://github.com/ceph/ceph/pull/38932

Cool - this one is merged.
Pacific backport - https://github.com/ceph/ceph/pull/39218
Octopus backport - https://github.com/ceph/ceph/pull/39219

Are we going to have it in Nautilus as well?

(In reply to Yaniv Kaul from comment #26)
> (In reply to Kefu Chai from comment #25)
> > upstream PR pending on review: https://github.com/ceph/ceph/pull/38932
>
> Cool - this one is merged.
> Pacific backport - https://github.com/ceph/ceph/pull/39218
> Octopus backport - https://github.com/ceph/ceph/pull/39219
>
> Are we going to have it in Nautilus as well?

Yes, the PR has already merged: https://github.com/ceph/ceph/pull/39075

Below are the findings-
ceph-MGR(Working as expected)
------------------------------
After updating the ceph packages, the ceph-mgr version is updated in "ceph versions" and "ceph mgr metadata"
Before update:-
---------------
[root@skymaster ~]# ceph versions
{
..............
},
"mgr": {
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 3
},
...............................
}
[root@skymaster ~]#
[root@skymaster ~]# ceph mgr metadata
[
{
"name": "harvard",
"addr": "10.70.39.14",
"addrs": "10.70.39.14:0/25069",
"arch": "x86_64",
"ceph_release": "nautilus",
"ceph_version": "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)",
"ceph_version_short": "14.2.11-121.el7cp",
"cpu": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
"distro": "rhel",
"distro_description": "Storage",
"distro_version": "7.9",
"hostname": "harvard.lab.eng.blr.redhat.com",
"kernel_description": "#1 SMP Thu Jan 21 16:15:07 EST 2021",
"kernel_version": "3.10.0-1160.15.2.el7.x86_64",
"mem_swap_kb": "16318460",
"mem_total_kb": "32455672",
"os": "Linux"
},
{
"name": "havoc",
"addr": "10.70.39.15",
"addrs": "10.70.39.15:0/18841",
"arch": "x86_64",
"ceph_release": "nautilus",
"ceph_version": "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)",
"ceph_version_short": "14.2.11-121.el7cp",
"cpu": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
"distro": "rhel",
"distro_description": "Red Hat Enterprise Linux Server 7.9 (Maipo)",
"distro_version": "7.9",
"hostname": "havoc.lab.eng.blr.redhat.com",
"kernel_description": "#1 SMP Thu Jan 21 16:15:07 EST 2021",
"kernel_version": "3.10.0-1160.15.2.el7.x86_64",
"mem_swap_kb": "16318460",
"mem_total_kb": "32455676",
"os": "Linux"
},
{
"name": "skymaster",
"addr": "10.70.39.19",
"addrs": "10.70.39.19:0/15668",
"arch": "x86_64",
"ceph_release": "nautilus",
"ceph_version": "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)",
"ceph_version_short": "14.2.11-121.el7cp",
"cpu": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
"distro": "rhel",
"distro_description": "Red Hat Enterprise Linux Server 7.9 (Maipo)",
"distro_version": "7.9",
"hostname": "skymaster.lab.eng.blr.redhat.com",
"kernel_description": "#1 SMP Thu Jan 21 16:15:07 EST 2021",
"kernel_version": "3.10.0-1160.15.2.el7.x86_64",
"mem_swap_kb": "16318460",
"mem_total_kb": "32455692",
"os": "Linux"
}
]
[root@skymaster ~]#
After updating the packages:
---------------------------
[root@skymaster ~]# ceph versions
{
"mon": {
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 3
},
"mgr": {
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 2,
"ceph version 14.2.11-176.el7cp (3b8868b199b96182092d76f5da192852edea9bc8) nautilus (stable)": 1
},
"osd": {
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 77
},
"mds": {
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 4
},
"rbd-mirror": {
"ceph version 14.2.11-119.el7cp (4afc970bb5fa3a23e3ea64bab2cc4d50d2e3f837) nautilus (stable)": 2
},
"rgw": {
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 2
},
"rgw-nfs": {
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 1
},
"overall": {
"ceph version 14.2.11-119.el7cp (4afc970bb5fa3a23e3ea64bab2cc4d50d2e3f837) nautilus (stable)": 2,
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 89,
"ceph version 14.2.11-176.el7cp (3b8868b199b96182092d76f5da192852edea9bc8) nautilus (stable)": 1
}
}
[root@skymaster ~]# ceph mgr metadata
[
{
"name": "harvard",
"addr": "10.70.39.14",
"addrs": "10.70.39.14:0/25069",
"arch": "x86_64",
"ceph_release": "nautilus",
"ceph_version": "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)",
"ceph_version_short": "14.2.11-121.el7cp",
"cpu": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
"distro": "rhel",
"distro_description": "Storage",
"distro_version": "7.9",
"hostname": "harvard.lab.eng.blr.redhat.com",
"kernel_description": "#1 SMP Thu Jan 21 16:15:07 EST 2021",
"kernel_version": "3.10.0-1160.15.2.el7.x86_64",
"mem_swap_kb": "16318460",
"mem_total_kb": "32455672",
"os": "Linux"
},
{
"name": "havoc",
"addr": "10.70.39.15",
"addrs": "10.70.39.15:0/18841",
"arch": "x86_64",
"ceph_release": "nautilus",
"ceph_version": "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)",
"ceph_version_short": "14.2.11-121.el7cp",
"cpu": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
"distro": "rhel",
"distro_description": "Red Hat Enterprise Linux Server 7.9 (Maipo)",
"distro_version": "7.9",
"hostname": "havoc.lab.eng.blr.redhat.com",
"kernel_description": "#1 SMP Thu Jan 21 16:15:07 EST 2021",
"kernel_version": "3.10.0-1160.15.2.el7.x86_64",
"mem_swap_kb": "16318460",
"mem_total_kb": "32455676",
"os": "Linux"
},
{
"name": "skymaster",
"addr": "10.70.39.19",
"addrs": "10.70.39.19:0/14135",
"arch": "x86_64",
"ceph_release": "nautilus",
"ceph_version": "ceph version 14.2.11-176.el7cp (3b8868b199b96182092d76f5da192852edea9bc8) nautilus (stable)",
"ceph_version_short": "14.2.11-176.el7cp",
"cpu": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
"distro": "rhel",
"distro_description": "Red Hat Enterprise Linux Server 7.9 (Maipo)",
"distro_version": "7.9",
"hostname": "skymaster.lab.eng.blr.redhat.com",
"kernel_description": "#1 SMP Thu Jan 21 16:15:07 EST 2021",
"kernel_version": "3.10.0-1160.15.2.el7.x86_64",
"mem_swap_kb": "16318460",
"mem_total_kb": "32455692",
"os": "Linux"
}
]
[root@skymaster ~]#
ceph-MGR versions are updated and working as expected.
ceph-mon(Not working as expected):-
------------------------------------
After updating the ceph-mon packages, the versions are not updated in the "metadata" output, but the version update is showing in the "ceph versions" output. I tried "systemctl reload <mon service>", but the metadata was still not updated.
After "systemctl restart <mon service>", the version is updated in the metadata as well.
Below is the yum log snippet:-
------------------------------------
........................................
.......................................................
May 27 16:29:25 Updated: 2:ceph-selinux-14.2.11-176.el7cp.x86_64
May 27 16:29:26 Updated: 2:ceph-mgr-14.2.11-176.el7cp.x86_64
May 27 16:29:26 Updated: 2:ceph-grafana-dashboards-14.2.11-176.el7cp.noarch
May 27 16:29:28 Updated: 2:ceph-mgr-dashboard-14.2.11-176.el7cp.noarch
May 27 16:29:28 Updated: 2:ceph-mgr-diskprediction-local-14.2.11-176.el7cp.noarch
May 27 16:29:29 Updated: 2:ceph-mds-14.2.11-176.el7cp.x86_64
May 27 16:29:30 Updated: 2:ceph-mon-14.2.11-176.el7cp.x86_64
.................................................................................
....................................................................................
After update:-
--------------
[root@skymaster log]# ceph versions
{
"mon": {
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 3
},
"mgr": {
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 1,
"ceph version 14.2.11-176.el7cp (3b8868b199b96182092d76f5da192852edea9bc8) nautilus (stable)": 1,
"ceph version 14.2.11-177.el7cp (0486420967ea3327d3ba01d3184f3ab96ddaa616) nautilus (stable)": 1
},
"osd": {
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 77
},
"mds": {
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 4
},
"rbd-mirror": {
"ceph version 14.2.11-119.el7cp (4afc970bb5fa3a23e3ea64bab2cc4d50d2e3f837) nautilus (stable)": 2
},
"rgw": {
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 2
},
"rgw-nfs": {
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 1
},
"overall": {
"ceph version 14.2.11-119.el7cp (4afc970bb5fa3a23e3ea64bab2cc4d50d2e3f837) nautilus (stable)": 2,
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 88,
"ceph version 14.2.11-176.el7cp (3b8868b199b96182092d76f5da192852edea9bc8) nautilus (stable)": 1,
"ceph version 14.2.11-177.el7cp (0486420967ea3327d3ba01d3184f3ab96ddaa616) nautilus (stable)": 1
}
}
The same output after reload also.
ceph-mon versions updated after service restart:
-------------------------------------------------
After restart:-
[root@skymaster log]# ceph versions
{
"mon": {
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 2,
"ceph version 14.2.11-176.el7cp (3b8868b199b96182092d76f5da192852edea9bc8) nautilus (stable)": 1-------------------------------------> version updated
},
"mgr": {
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 1,
"ceph version 14.2.11-176.el7cp (3b8868b199b96182092d76f5da192852edea9bc8) nautilus (stable)": 1,
"ceph version 14.2.11-177.el7cp (0486420967ea3327d3ba01d3184f3ab96ddaa616) nautilus (stable)": 1
},
"osd": {
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 77
},
"mds": {
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 4
},
"rbd-mirror": {
"ceph version 14.2.11-119.el7cp (4afc970bb5fa3a23e3ea64bab2cc4d50d2e3f837) nautilus (stable)": 2
},
"rgw": {
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 2
},
"rgw-nfs": {
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 1
},
"overall": {
"ceph version 14.2.11-119.el7cp (4afc970bb5fa3a23e3ea64bab2cc4d50d2e3f837) nautilus (stable)": 2,
"ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)": 87,
"ceph version 14.2.11-176.el7cp (3b8868b199b96182092d76f5da192852edea9bc8) nautilus (stable)": 2,
"ceph version 14.2.11-177.el7cp (0486420967ea3327d3ba01d3184f3ab96ddaa616) nautilus (stable)": 1
}
}
[root@skymaster log]#
[root@skymaster log]# ceph mon metadata --------------------------------------------->>restarted the service in skymaster
[
{
"name": "harvard",
"addrs": "[v2:10.70.39.14:3300/0,v1:10.70.39.14:6789/0]",
"arch": "x86_64",
"ceph_release": "nautilus",
"ceph_version": "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)",
"ceph_version_short": "14.2.11-121.el7cp",
"compression_algorithms": "none, snappy, zlib, zstd, lz4",
"cpu": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
"device_ids": "",
"device_paths": "",
"devices": "",
"distro": "rhel",
"distro_description": "Employee SKU",
"distro_version": "7.9",
"hostname": "harvard.lab.eng.blr.redhat.com",
"kernel_description": "#1 SMP Thu Jan 21 16:15:07 EST 2021",
"kernel_version": "3.10.0-1160.15.2.el7.x86_64",
"mem_swap_kb": "16318460",
"mem_total_kb": "32455672",
"os": "Linux"
},
{
"name": "havoc",
"addrs": "[v2:10.70.39.15:3300/0,v1:10.70.39.15:6789/0]",
"arch": "x86_64",
"ceph_release": "nautilus",
"ceph_version": "ceph version 14.2.11-121.el7cp (103c47fc6f676ee9b20893c346b5380c0b4be459) nautilus (stable)",
"ceph_version_short": "14.2.11-121.el7cp",
"compression_algorithms": "none, snappy, zlib, zstd, lz4",
"cpu": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
"device_ids": "",
"device_paths": "",
"devices": "",
"distro": "rhel",
"distro_description": "Red Hat Enterprise Linux Server 7.9 (Maipo)",
"distro_version": "7.9",
"hostname": "havoc.lab.eng.blr.redhat.com",
"kernel_description": "#1 SMP Thu Jan 21 16:15:07 EST 2021",
"kernel_version": "3.10.0-1160.15.2.el7.x86_64",
"mem_swap_kb": "16318460",
"mem_total_kb": "32455676",
"os": "Linux"
},
{
"name": "skymaster",
"addrs": "[v2:10.70.39.19:3300/0,v1:10.70.39.19:6789/0]",
"arch": "x86_64",
"ceph_release": "nautilus",
"ceph_version": "ceph version 14.2.11-176.el7cp (3b8868b199b96182092d76f5da192852edea9bc8) nautilus (stable)",
"ceph_version_short": "14.2.11-176.el7cp",------------------------------------------------------------------------------>>Version updated
"compression_algorithms": "none, snappy, zlib, zstd, lz4",
"cpu": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
"device_ids": "",
"device_paths": "",
"devices": "",
"distro": "rhel",
"distro_description": "Red Hat Enterprise Linux Server 7.9 (Maipo)",
"distro_version": "7.9",
"hostname": "skymaster.lab.eng.blr.redhat.com",
"kernel_description": "#1 SMP Thu Jan 21 16:15:07 EST 2021",
"kernel_version": "3.10.0-1160.15.2.el7.x86_64",
"mem_swap_kb": "16318460",
"mem_total_kb": "32455692",
"os": "Linux"
}
]
[root@skymaster log]#
The same with the other nodes.
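To make the before/after comparison above less manual, the `ceph mon metadata` output can be grouped by version and checked against the "mon" section of `ceph versions`. The sketch below assumes both outputs are available as JSON strings (with the hand-written annotations removed); the function names are hypothetical.

```python
import json
from collections import defaultdict


def mons_by_version(metadata_json):
    """Group monitor names by the ceph_version they report in their metadata."""
    groups = defaultdict(list)
    for entry in json.loads(metadata_json):
        groups[entry.get("ceph_version", "")].append(entry.get("name", ""))
    return dict(groups)


def agrees_with_versions(metadata_json, versions_json):
    """True when per-version mon counts match the "mon" section of `ceph versions`."""
    expected = json.loads(versions_json).get("mon", {})
    actual = {
        version: len(names)
        for version, names in mons_by_version(metadata_json).items()
    }
    return actual == expected
```

A False result points at the kind of discrepancy between the two commands that is discussed in this bug.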
In the ceph-mon shell script there is a step to restart the service after ceph-mon update.
ceph-mon Shell script:
----------------------
[root@skymaster log]# rpm -q --scripts ceph-mon
............................................................
.....................................................................
.....................................................................
if [ $1 -ge 1 ] ; then
# Restart on upgrade, but only if "CEPH_AUTO_RESTART_ON_UPGRADE" is set to
# "yes". In any case: if units are not running, do not touch them.
SYSCONF_CEPH=/etc/sysconfig/ceph
if [ -f $SYSCONF_CEPH -a -r $SYSCONF_CEPH ] ; then
source $SYSCONF_CEPH
fi
if [ "X$CEPH_AUTO_RESTART_ON_UPGRADE" = "Xyes" ] ; then
/usr/bin/systemctl try-restart ceph-mon@\*.service > /dev/null 2>&1 || :
fi
fi
[root@skymaster log]#
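The scriptlet above only restarts ceph-mon when CEPH_AUTO_RESTART_ON_UPGRADE is exactly "yes". A small Python sketch of the same check is given below for illustration. It assumes the sysconfig file contains plain KEY=VALUE lines and does not emulate full shell `source` semantics; the function names are hypothetical.

```python
def parse_sysconfig(text):
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes, as in VAR="yes".
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def auto_restart_enabled(text):
    """Mirror the scriptlet's test: restart only when the value is exactly "yes"."""
    return parse_sysconfig(text).get("CEPH_AUTO_RESTART_ON_UPGRADE") == "yes"
```

For example, reading `/etc/sysconfig/ceph` with `open(...).read()` and passing the text to `auto_restart_enabled` indicates whether a package update would be expected to restart the running mons.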
Please let me know when the fix will be provided. The RC date is planned for 8-June.

@Neha- Please let me know whether this issue is a blocker for the release.
Restarting the service updates the Ceph version in the metadata.
Hi Neha,
Looks like this bug is in open state with no target date for a fix. RHCS 4.3 Dev freeze was on Sep 24. Is this planned for 4.3?

@skanta
(In reply to skanta from comment #32)
> Below are the findings-
>
> ceph-MGR(Working as expected)
> ------------------------------
>
> After updateing the ceph packges ceph-mgr is updateing at "ceph versions"
> and "ceph mgr metadata"
> ..
> ceph-MGR versions are updated and working as expected.

This confirms that the upstream PR#38932 has fixed this issue where mon metadata is not getting updated when monmap is updated.

>
> ceph-mon(Not working as expected):-
> ------------------------------------
>
> After updating the ceph-mon packages the versions are not updated in the
> "metadata" but version update is showing at "ceph versions" output. I tried
> with "systemctl reload <mon service>" but not updated at metadata.
> After "systemctl restart <mon service>" in the metadata also the version is
> updated.

Do you still have this cluster handy? The "systemctl reload <mon service>" keeps the ceph-mon process running and asks ceph-mon to reload the ceph config (not the ceph-mon systemd unit file).

>
> The same with the other nodes.
>
> In the ceph-mon shell script there is a step to restart the service after
> ceph-mon update.
>
> ceph-mon Shell script:
> ----------------------
>
> [root@skymaster log]# rpm -q --scripts ceph-mon
> ............................................................
> .....................................................................
> .....................................................................
>
> if [ $1 -ge 1 ] ; then
> # Restart on upgrade, but only if "CEPH_AUTO_RESTART_ON_UPGRADE" is set to
> # "yes". In any case: if units are not running, do not touch them.
> SYSCONF_CEPH=/etc/sysconfig/ceph
> if [ -f $SYSCONF_CEPH -a -r $SYSCONF_CEPH ] ; then
> source $SYSCONF_CEPH
> fi
> if [ "X$CEPH_AUTO_RESTART_ON_UPGRADE" = "Xyes" ] ; then
> /usr/bin/systemctl try-restart ceph-mon@\*.service > /dev/null 2>&1 || :
> fi
> fi
> [root@skymaster log]#

@skanta Restarting ceph services (e.g. ceph-mon here) after a ceph upgrade is disabled by default (it is controlled by CEPH_AUTO_RESTART_ON_UPGRADE, which is set to "no" by default).
# grep -v '^#' /etc/sysconfig/ceph
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728
CEPH_AUTO_RESTART_ON_UPGRADE=no
CLUSTER=ceph

(In reply to skanta from comment #47)
> Cluster is not handy but I tried to reproduce the issue from 4.2 released
> version(ceph version 14.2.11-202.el8cp ) to the latest version(ceph version
> 14.2.11-203.el8cp).
>
> Version updated successfully without restarting the service.

@skanta So we are no longer seeing the issue of the version not getting updated after ceph upgrade and/or ceph daemon restart? Btw, we have to restart ceph daemons on the node after ceph packages are upgraded. Are we good to move this BZ to verified state?

Hi Skanta,
I was trying to recreate the BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1773594 (which was blocked by this BZ), but I'm unable to get an RHCS image which has the fix.
The max/latest version I could get is '14.2.11-199.el8cp.x86_64', which is a month old and didn't seem to have the fix.
Can you please confirm which RHCS version has the fix and (if possible) point me to the image link?
@skanta

The fix was provided in ceph-14.2.11-157.el8cp. If you require more details, could you please contact the development team.

(In reply to arun kumar mohan from comment #50)
> Hi Skanta,
> I was trying to recreate the BZ:
> https://bugzilla.redhat.com/show_bug.cgi?id=1773594 (which was blocked by
> this BZ), but I'm unable to get an RHCS image which has the fix.
> The max/latest version I could get is '14.2.11-199.el8cp.x86_64', which is a
> month old and didn't seem to have the fix.
> Can you please confirm which RHCS version has the fix and (if possible)
> point me to the image link?
>
> @skanta

@amohan Are you still blocked on testing? Are you able to get the right RHCS image?

Removing the needinfo as the BZ is verified |
Description of problem:
Ceph-mgr is not updating ceph_version in the ceph_*_metadata metric without a restart.

Platform: Openshift 4.3
Product: OCS 4.2
Ceph Image used: registry.redhat.io/ocs4/rhceph-rhel8

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Install OCS 4.2 on Openshift 4.3.
2. Check the ceph_mon_metadata and ceph_osd_metadata Prometheus metrics.
3. Update one of the mon and osd deployment images to another image. Used image : quay.io/ceph-ci/ceph:wip-sage3-testing-2020-03-05-2154. It might also be possible to reproduce on standard RHCS by upgrading one of the mons and OSDs.
4. Check the ceph_mon_metadata and ceph_osd_metadata Prometheus metrics. They would be the same.
5. Restart the ceph-mgr and check the metrics again. The metrics would be updated.
6. Downgrade the mon and osd to the original versions. Check the metrics again. Ceph Mgr would not update the metrics to report the latest version.

Actual results:
ceph_version in the ceph_*_metadata metric did not reflect the correct version running in the system.

Expected results:
ceph_version in the ceph_*_metadata metric should reflect the correct version running in the system.
Additional info:

ceph versions command:

{
    "mon": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 2,
        "ceph version 15.1.0-1799-gd0b45e4 (d0b45e421291d4cfcb430fde01a232cb768f3e14) octopus (rc)": 1
    },
    "mgr": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 2,
        "ceph version 15.1.0-1799-gd0b45e4 (d0b45e421291d4cfcb430fde01a232cb768f3e14) octopus (rc)": 1
    },
    "mds": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 8,
        "ceph version 15.1.0-1799-gd0b45e4 (d0b45e421291d4cfcb430fde01a232cb768f3e14) octopus (rc)": 2
    }
}

Metrics after the version upgrade:

ceph_mon_metadata{ceph_daemon="mon.a",ceph_version="ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)",endpoint="http-metrics",hostname="compute-1",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="172.30.21.187",rank="0",service="rook-ceph-mgr"} 1
ceph_mon_metadata{ceph_daemon="mon.b",ceph_version="ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)",endpoint="http-metrics",hostname="compute-0",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="172.30.178.215",rank="1",service="rook-ceph-mgr"} 1
ceph_mon_metadata{ceph_daemon="mon.d",ceph_version="ceph version 15.1.0-1799-gd0b45e4 (d0b45e421291d4cfcb430fde01a232cb768f3e14) octopus (rc)",endpoint="http-metrics",hostname="compute-2",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="172.30.45.1",rank="2",service="rook-ceph-mgr"}
ceph_osd_metadata{back_iface="eth0",ceph_daemon="osd.0",ceph_version="ceph version 15.1.0-1799-gd0b45e4 (d0b45e421291d4cfcb430fde01a232cb768f3e14) octopus (rc)",cluster_addr="10.128.2.55",device_class="hdd",endpoint="http-metrics",front_iface="eth0",hostname="compute-2",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",objectstore="bluestore",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="10.128.2.55",service="rook-ceph-mgr"} 1
ceph_osd_metadata{back_iface="eth0",ceph_daemon="osd.1",ceph_version="ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)",cluster_addr="10.129.2.35",device_class="hdd",endpoint="http-metrics",front_iface="eth0",hostname="compute-0",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",objectstore="bluestore",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="10.129.2.35",service="rook-ceph-mgr"} 1
ceph_osd_metadata{back_iface="eth0",ceph_daemon="osd.2",ceph_version="ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)",cluster_addr="10.130.2.32",device_class="hdd",endpoint="http-metrics",front_iface="eth0",hostname="compute-1",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",objectstore="bluestore",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="10.130.2.32",service="rook-ceph-mgr"}

After restoring the version to the original:

ceph versions:

{
    "mon": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 2
    },
    "mds": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)": 9
    }
}

Metrics:

ceph_mon_metadata{ceph_daemon="mon.a",ceph_version="ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)",endpoint="http-metrics",hostname="compute-1",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="172.30.21.187",rank="0",service="rook-ceph-mgr"} 1
ceph_mon_metadata{ceph_daemon="mon.b",ceph_version="ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)",endpoint="http-metrics",hostname="compute-0",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="172.30.178.215",rank="1",service="rook-ceph-mgr"} 1
ceph_mon_metadata{ceph_daemon="mon.d",ceph_version="ceph version 15.1.0-1799-gd0b45e4 (d0b45e421291d4cfcb430fde01a232cb768f3e14) octopus (rc)",endpoint="http-metrics",hostname="compute-2",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="172.30.45.1",rank="2",service="rook-ceph-mgr"}
ceph_osd_metadata{back_iface="eth0",ceph_daemon="osd.0",ceph_version="ceph version 15.1.0-1799-gd0b45e4 (d0b45e421291d4cfcb430fde01a232cb768f3e14) octopus (rc)",cluster_addr="10.128.2.55",device_class="hdd",endpoint="http-metrics",front_iface="eth0",hostname="compute-2",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",objectstore="bluestore",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="10.128.2.55",service="rook-ceph-mgr"} 1
ceph_osd_metadata{back_iface="eth0",ceph_daemon="osd.1",ceph_version="ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)",cluster_addr="10.129.2.35",device_class="hdd",endpoint="http-metrics",front_iface="eth0",hostname="compute-0",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",objectstore="bluestore",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="10.129.2.35",service="rook-ceph-mgr"} 1
ceph_osd_metadata{back_iface="eth0",ceph_daemon="osd.2",ceph_version="ceph version 14.2.4-69.el8cp (8d72f97ca776c758a7ce0009959ca3044cd0b9c2) nautilus (stable)",cluster_addr="10.130.2.32",device_class="hdd",endpoint="http-metrics",front_iface="eth0",hostname="compute-1",instance="10.130.2.51:9283",job="rook-ceph-mgr",namespace="openshift-storage",objectstore="bluestore",pod="rook-ceph-mgr-a-688466b494-lwfn7",public_addr="10.130.2.32",service="rook-ceph-mgr"}

After this, the changed OSD failed to come up with the original version, but the metric did not change. It should have updated to 0.
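For anyone trying to reproduce this, the comparison above can be automated rather than done by eye. The following is only a rough sketch, not part of ceph-mgr or any Ceph tooling: the helper names `versions_from_metrics` and `find_mismatches` are made up for illustration, and it assumes the metric lines are available as plain text and the `ceph versions` output has already been parsed into a dict (for example with `json.loads`). It only checks daemon types that appear in the metrics, so `mgr`, `mds`, and `rgw` are skipped unless their `*_metadata` series are supplied as well.

```python
import re
from collections import Counter

# Matches one label pair such as ceph_daemon="osd.0".
# Assumption: label values contain no escaped double quotes.
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')
# Matches a ceph_<type>_metadata series; captures the daemon type and the label body.
# The trailing sample value is optional, since some pasted lines lack it.
SERIES_RE = re.compile(r'^ceph_(\w+)_metadata\{(.*)\}')


def versions_from_metrics(lines):
    """Count daemons per (type, ceph_version) as reported by *_metadata series."""
    counts = {}
    for line in lines:
        match = SERIES_RE.match(line.strip())
        if not match:
            continue  # not a metadata series
        daemon_type, body = match.groups()
        labels = dict(LABEL_RE.findall(body))
        version = labels.get("ceph_version")
        if version is None:
            continue
        counts.setdefault(daemon_type, Counter())[version] += 1
    return counts


def find_mismatches(metric_counts, ceph_versions):
    """Return (type, version, count_in_metrics, count_in_ceph_versions) per disagreement."""
    mismatches = []
    for daemon_type, from_metrics in metric_counts.items():
        from_cluster = ceph_versions.get(daemon_type, {})
        for version in sorted(set(from_metrics) | set(from_cluster)):
            in_metrics = from_metrics.get(version, 0)
            in_cluster = from_cluster.get(version, 0)
            if in_metrics != in_cluster:
                mismatches.append((daemon_type, version, in_metrics, in_cluster))
    return mismatches
```

Feeding it the metric lines and the `ceph versions` output captured after the downgrade should list every daemon type whose `ceph_version` label still shows a version that the cluster no longer reports.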