Bug 1779336
Summary: | OCS Monitoring is missing ceph_rbd_* metrics | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Martin Bukatovic <mbukatov>
Component: | ceph-monitoring | Assignee: | Anmol Sachan <asachan>
Status: | CLOSED WONTFIX | QA Contact: | Oded <oviner>
Severity: | low | Docs Contact: |
Priority: | unspecified | |
Version: | 4.2 | CC: | afrahman, asachan, bniver, etamir, madam, muagarwa, nthomas, ocs-bugs, odf-bz-bot, owasserm, pcuzner, ratamir, shan, smordech, uchapaga
Target Milestone: | --- | Keywords: | FutureFeature
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | 4.6.0 | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-06-08 06:31:55 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1850947 | |
Bug Blocks: | 1916331 | |
Attachments: | | |
Description
Martin Bukatovic
2019-12-03 19:06:53 UTC
Certainly needs to be fixed, but I doubt it's critical enough to block 4.2.0 GA. @Nishanth, proposing to move to 4.3 (or 4.2.z).

Moved out to 4.4

(In reply to Nishanth Thomas from comment #3)
> Moved out to 4.4

Why?

I think the pools need to be added to the prometheus module's rbd_stats_pools setting. Once this is done and the module is disabled/enabled, the data should be visible. Be aware though that if anything is relying on the prometheus port being there (a liveness check, for example), it could in theory look to k8s as if the mgr is down - so if this happens, you know why!

Just checked the process on a local machine. prometheus doesn't need to be restarted - but rbd_stats_pools needs to be updated/defined.

Once the pool is defined, e.g.

ceph config set mgr mgr/prometheus/rbd_stats_pools rbd

you'll see stats like this:

ceph_rbd_write_bytes{pool="rbd",namespace="",image="testdisk"} 0.0
# HELP ceph_rbd_read_bytes RBD image bytes read
# TYPE ceph_rbd_read_bytes counter
ceph_rbd_read_bytes{pool="rbd",namespace="",image="testdisk"} 0.0
# HELP ceph_rbd_write_latency_sum RBD image writes latency (msec) Total
# TYPE ceph_rbd_write_latency_sum counter
ceph_rbd_write_latency_sum{pool="rbd",namespace="",image="testdisk"} 0.0
# HELP ceph_rbd_write_latency_count RBD image writes latency (msec) Count
# TYPE ceph_rbd_write_latency_count counter
ceph_rbd_write_latency_count{pool="rbd",namespace="",image="testdisk"} 0.0
# HELP ceph_rbd_read_latency_sum RBD image reads latency (msec) Total
# TYPE ceph_rbd_read_latency_sum counter
ceph_rbd_read_latency_sum{pool="rbd",namespace="",image="testdisk"} 0.0
# HELP ceph_rbd_read_latency_count RBD image reads latency (msec) Count
# TYPE ceph_rbd_read_latency_count counter
ceph_rbd_read_latency_count{pool="rbd",namespace="",image="testdisk"} 0.0

In addition to rbd_stats_pools (a space- or comma-separated list of pools), there is also rbd_stats_pools_refresh_interval (which IIRC defaults to 5 mins).

Given the dependency on defining the pool, for OCS this would probably need to be tied into the rook-ceph workflow. For example, when a storageclass is created on the rook-block provider, rook would need to update the rbd_stats_pools list - and obviously the reverse is also true.

Alternatively, maybe we could change the code to report on pools that have the rbd application enabled, and backport? Might be simpler.

However, if we're expecting 1000's of RBDs, this could put load on the mgr and on the prometheus instance storage too.

Given the above, I don't think this is a 4.2 thing - more like an RFE for 4.3 or 4.4.

(In reply to Yaniv Kaul from comment #4)
> (In reply to Nishanth Thomas from comment #3)
> > Moved out to 4.4
>
> Why?

It's more of an RFE; hence I think it's better to handle this in 4.4, as the 4.3 window is short. Wanted to check if there is an urgency to get this done for 4.3, in which case we can prioritize it. None of the dashboard features are waiting on this.

(In reply to Paul Cuzner from comment #6)
> Just checked the process on a local machine. prometheus doesn't need to be
> restarted - but rbd_stats_pools needs to be updated/defined.
> [...]
> However, if we're expecting 1000's of RBDs, this could put load on the mgr
> and on the prometheus instance storage too.

Yes, we do expect that number eventually.

> Given the above, I don't think this is a 4.2 thing - more like an RFE for
> 4.3 or 4.4.

Is there an open upstream issue on Rook to get this done?

@nishanth, I think it's valuable in Prometheus, mainly for debugging. Based on the feedback we will get, we will consider adding it to the dashboards.

(In reply to Eran Tamir from comment #12)
> @nishanth, I think it's valuable in Prometheus, mainly for debugging.
> Based on the feedback we will get, we will consider adding it to the
> dashboards.

So what's the priority here? Obviously it's not 4.4 material.

(In reply to Eran Tamir from comment #12)
> @nishanth, I think it's valuable in Prometheus, mainly for debugging.
> Based on the feedback we will get, we will consider adding it to the
> dashboards.

Closing for the time being, till we get some feedback.

(In reply to Yaniv Kaul from comment #14)
> (In reply to Eran Tamir from comment #12)
> > @nishanth, I think it's valuable in Prometheus, mainly for debugging.
> > Based on the feedback we will get, we will consider adding it to the
> > dashboards.
>
> Closing for the time being, till we get some feedback.

As I noted in comment 10, this feature should be considered during a redesign of the current way the OCS PV dashboard reports storage utilization. I suggest keeping this open until that redesign is actually planned. If that has happened and I just missed it, I'm sorry for the trouble. In such a case, please reference it here in a comment.

From a Rook perspective, we can add the "ceph config set mgr mgr/prometheus/rbd_stats_pools rbd" as part of the pool creation (with the CephBlockPool CRD). Paul, is the CLI identical for the rbd_stats_pools_refresh_interval option? Is it also pool based? Thanks.
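[Editor's note] For reference, a minimal sketch of the two mgr/prometheus options discussed above, run from any shell with a working ceph CLI. The pool name "rbd" and the 300-second value are placeholders (300s mirrors the 5-minute default mentioned earlier); how Rook ends up wiring this is discussed in the following comments.

# enable per-image RBD stats for one or more pools (space- or comma-separated list)
ceph config set mgr mgr/prometheus/rbd_stats_pools rbd
# optionally adjust how often the image list for those pools is refreshed (seconds)
ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 300
# check the currently applied value
ceph config get mgr mgr/prometheus/rbd_stats_pools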
Seb, the setting is a comma- or space-separated string containing all the pools - so you'll need to get the current value and append. The refresh interval is the timer for when the per-pool scrape data is refreshed; the rbd stats are gathered at every collection interval.

However, although this is relatively easy to enable, I'd be wary of turning it on by default until we understand the impact it has on the mgr for 1000's of PVs. As I said earlier in this thread, we also don't expose it in the dashboard - so adding this overhead has limited benefit.

If the main goal here is debug, would the "rbd top" commands from the rbd_support module be an alternative? It's auto-enabled anyway.

(In reply to Paul Cuzner from comment #17)
> Seb, the setting is a comma- or space-separated string containing all the
> pools - so you'll need to get the current value and append. The refresh
> interval is the timer for when the per-pool scrape data is refreshed; the
> rbd stats are gathered at every collection interval.
>
> However, although this is relatively easy to enable, I'd be wary of turning
> it on by default until we understand the impact it has on the mgr for
> 1000's of PVs. As I said earlier in this thread, we also don't expose it in
> the dashboard - so adding this overhead has limited benefit.

That's the approach that has been taken: it's disabled by default on CephBlockPool creation.

> If the main goal here is debug, would the "rbd top" commands from the
> rbd_support module be an alternative? It's auto-enabled anyway.

Thanks!

There are no "ceph_rbd*" metrics.

SetUp:
Provider: VMware
OCP Version: 4.6.0-0.nightly-2020-09-29-170625
OCS Version: ocs-operator.v4.6.0-101.ci

sh-4.4# ceph versions
{
    "mon": {
        "ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)": 11
    }
}

Test Process:
1. Install an OCP+OCS cluster
2. Log in as kubeadmin to the OCP Console
3. Go to the Monitoring -> Metrics page
4. Type e.g. ceph_rbd into the query field and run it

There are no "ceph_rbd*" metrics. **Attached screenshot

Created attachment 1717817 [details]
ceph_rbd_metrics
Created attachment 1717825 [details]
ceph_rbd_read_bytes not found
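[Editor's note] As a complement to the console check in the test process above, a hedged sketch of the same verification from the CLI via the cluster monitoring query API. The route name thanos-querier and the namespace openshift-monitoring are the usual OCP 4.x defaults and are not taken from this bug; while the bug is present the query should return an empty result set.

TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
# expected to return "result":[] while ceph_rbd_* metrics are missing
curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query?query=ceph_rbd_write_bytes"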
Even though everything is configured as expected and was tested before, the RBD metrics are indeed missing. We will need someone with Ceph expertise to identify what went wrong on the ceph-mgr side.

Created attachment 1719558 [details]
example of the rbd perf tool
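[Editor's note] Related to the "rbd top" suggestion earlier in the thread and the rbd perf tool attachment above, a hedged sketch of the per-image commands exposed via the rbd_support mgr module, run from a pod with a working ceph/rbd CLI. The pool name is only an example (the usual OCS default block pool name), not confirmed in this bug.

# iostat-style view of per-image RBD I/O for a pool
rbd perf image iostat ocs-storagecluster-cephblockpool
# top-like view of the busiest images in the same pool
rbd perf image iotop ocs-storagecluster-cephblockpool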
Must-gather logs? Is the config correctly applied on the pool? Can someone query the mgr metrics directly?

@Shiri can you please add all dashboard BZs as a reference here?

(In reply to Eran Tamir from comment #32)
> @Shiri can you please add all dashboard BZs as a reference here?

https://bugzilla.redhat.com/show_bug.cgi?id=1866340
https://bugzilla.redhat.com/show_bug.cgi?id=1866331
https://bugzilla.redhat.com/show_bug.cgi?id=1866341
https://bugzilla.redhat.com/show_bug.cgi?id=1866338

Reopening based on comment https://bugzilla.redhat.com/show_bug.cgi?id=1916331#c3 from Travis.
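[Editor's note] On the question above about querying the mgr metrics directly, a hedged sketch using a port-forward to the Rook mgr metrics service. The service name rook-ceph-mgr, port 9283, and namespace openshift-storage are the usual Rook/OCS defaults and are not confirmed in this bug; an empty grep here would point at the ceph-mgr prometheus module rather than the OCP monitoring scrape configuration.

# in one terminal: forward the mgr metrics port locally
oc -n openshift-storage port-forward svc/rook-ceph-mgr 9283:9283
# in another terminal: dump whatever RBD metrics the mgr itself exposes
curl -s http://localhost:9283/metrics | grep '^ceph_rbd_'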