Bug 2267067 - rbd metrics are not available on Provider-Client cluster
Summary: rbd metrics are not available on Provider-Client cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: mrudraia
QA Contact: Filip Balák
URL:
Whiteboard: isf-provider
Depends On:
Blocks: 2260844
Reported: 2024-02-29 14:05 UTC by Filip Balák
Modified: 2024-07-17 13:14 UTC (History)

Fixed In Version: 4.16.0-94
Doc Type: Bug Fix
Doc Text:
.RBD metrics not available on Provider-Client cluster
Previously, RBD metrics were not populated in Fusion HCI based OpenShift Data Foundation provider client clusters as a CSI issue caused all the RWX CephFS storage class PVs to mount with root permission. This caused an authentication issue with `ceph-mgr` while checking the RBD pool stats. With this fix, the Ceph cluster misconfiguration that caused the issue was corrected and as a result all the RBD related metrics are available with the provider client clusters.
Clone Of:
Environment:
Last Closed: 2024-07-17 13:14:45 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 2227781 0 unspecified CLOSED ceph_rbd_* metrics are missing 2024-02-29 14:05:21 UTC
Red Hat Product Errata RHSA-2024:4591 0 None None None 2024-07-17 13:14:55 UTC

Description Filip Balák 2024-02-29 14:05:21 UTC
Description of problem (please be detailed as possible and provide log
snippets):
The following metrics are missing after deployment:
  +  'ceph_rbd_write_ops',
  +  'ceph_rbd_read_ops',
  +  'ceph_rbd_write_bytes',
  +  'ceph_rbd_read_bytes',
  +  'ceph_rbd_write_latency_sum',
  +  'ceph_rbd_write_latency_count',

Version of all relevant components (if applicable):
odf 4.14.5-8

Steps to Reproduce:
1. Deploy Fusion HCI Provider with ODF.
2. Enable monitoring of openshift-storage namespace by adding the label:
  openshift.io/cluster-monitoring: "true"
3. Check for metrics from the description.
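Step 3 can be scripted. The following is only a minimal sketch, assuming the cluster's Prometheus (or Thanos querier) endpoint is reachable over the standard Prometheus HTTP API (`/api/v1/query`) and that a bearer token with query access is available. The function names, `base_url` and `token` are hypothetical and not taken from the test suite that found this bug.

```python
import json
import urllib.parse
import urllib.request

# Metric names copied from the description of this bug.
EXPECTED_RBD_METRICS = [
    "ceph_rbd_write_ops",
    "ceph_rbd_read_ops",
    "ceph_rbd_write_bytes",
    "ceph_rbd_read_bytes",
    "ceph_rbd_write_latency_sum",
    "ceph_rbd_write_latency_count",
]


def query_prometheus(base_url, token, metric):
    """Run an instant query and return the list of result series.

    base_url and token are assumptions: pass the Prometheus (or Thanos
    querier) URL of the cluster and a bearer token allowed to query it.
    """
    query = urllib.parse.urlencode({"query": metric})
    url = base_url.rstrip("/") + "/api/v1/query?" + query
    request = urllib.request.Request(
        url, headers={"Authorization": "Bearer " + token}
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # An instant query returns its series under data.result.
    return payload["data"]["result"]


def missing_metrics(metrics, fetch):
    """Return the metrics for which fetch(metric) yields no series."""
    return [name for name in metrics if not fetch(name)]
```

Possible usage: `missing_metrics(EXPECTED_RBD_METRICS, lambda m: query_prometheus(url, token, m))`. An empty list would mean all six metrics returned data.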

Actual results:
Metrics are missing and need to be enabled with:
ceph config set mgr mgr/prometheus/rbd_stats_pools "*"

Expected results:
Metrics are available.

Additional info:

Comment 2 Rohan Gupta 2024-03-04 10:52:43 UTC
This will be brought to the PM, and then a decision will be made on whether to include these metrics in provider-client mode.

Comment 3 Nishanth Thomas 2024-03-07 08:15:27 UTC
@fbalak, are only the 6 metrics mentioned above missing? Are any other rbd metrics impacted?

Comment 4 Filip Balák 2024-03-07 08:50:42 UTC
It looks like only those 6.

Another test also failed with rgw metrics: 'ceph_rgw_qlen', 'ceph_rgw_req', 'ceph_rgw_cache_miss', 'ceph_rgw_put', 'ceph_rgw_put_initial_lat_sum', 'ceph_rgw_put_initial_lat_count', 'ceph_rgw_keystone_token_cache_hit', 'ceph_rgw_metadata', 'ceph_rgw_qactive', 'ceph_rgw_get_initial_lat_sum', 'ceph_rgw_get_initial_lat_count', 'ceph_rgw_get_b', 'ceph_rgw_failed_req', 'ceph_rgw_keystone_token_cache_miss', 'ceph_rgw_get', 'ceph_rgw_cache_hit', 'ceph_rgw_put_b', but when I queried them manually after the test, they returned data. When I queried the rbd metrics manually, they were missing, as seen in the test.

Comment 5 Filip Balák 2024-03-07 08:57:37 UTC
Those metrics were checked directly from Prometheus and from Metrics page in OpenShift. The workaround mentioned (ceph config set mgr mgr/prometheus/rbd_stats_pools "*") was not tried, only provided as info from https://bugzilla.redhat.com/show_bug.cgi?id=2237412#c4.

Comment 6 arun kumar mohan 2024-03-07 10:31:19 UTC
I had a similar conversation with Shay regarding the ceph rbd pool metrics.
He had to run the above-mentioned command to get the stats:

ceph config set mgr mgr/prometheus/rbd_stats_pools "*"

More info in the doc: https://docs.ceph.com/en/quincy/mgr/prometheus/#rbd-io-statistics
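According to the linked Ceph documentation, `rbd_stats_pools` takes a comma- or space-separated list of `pool[/namespace]` entries, and `"*"` selects all pools. A minimal sketch of building that command for a chosen set of pools is below. The helper names are hypothetical, and it assumes it runs somewhere the `ceph` CLI can reach the cluster (for example a toolbox pod). Note that the workaround itself was not tried in this bug.

```python
import subprocess


def rbd_stats_pools_command(pools):
    """Build the argv for the workaround command quoted in this bug.

    pools is a list of pool[/namespace] entries; ["*"] selects all pools.
    """
    if not pools:
        raise ValueError("at least one pool entry is required")
    return [
        "ceph", "config", "set", "mgr",
        "mgr/prometheus/rbd_stats_pools",
        ",".join(pools),
    ]


def enable_rbd_stats(pools, run=subprocess.run):
    """Execute the command; run is injectable so it can be dry-run."""
    return run(rbd_stats_pools_command(pools), check=True)
```

Calling `rbd_stats_pools_command(["*"])` reproduces the command from this comment. Passing explicit pool names instead limits stats collection to those pools.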

Comment 7 Filip Balák 2024-03-11 10:48:18 UTC
I will install a fresh cluster without CI and check whether those metrics are present, to make sure that we don't add them with automation in internal mode.

Comment 8 Filip Balák 2024-03-18 10:41:27 UTC
I have installed a fresh cluster on the AWS IPI platform and manually installed the operator odf-operator.v4.14.5-rhodf and a storage system. The metrics listed in the description are present without any other action from the user.

Comment 9 arun kumar mohan 2024-04-03 11:34:04 UTC
Closing the BZ as per the above comment#8. Please re-open if required.

Comment 10 Filip Balák 2024-04-03 14:38:27 UTC
Reopening because listed metrics are still not available on Provider client platform. This behaviour is different from internal mode platforms where they are present without additional user interaction. Tested again with odf 4.14.6-1 on Provider-Client cluster.

Comment 11 mrudraia 2024-04-29 08:07:52 UTC
Checked the issue in a Fusion HCI cluster. Some misconfigurations were observed in the Ceph cluster, and the Ceph cluster is not healthy: Ceph health shows a warning condition. Some of the mon clients have authentication errors. Detailed RCA in the attached doc: https://docs.google.com/document/d/134jDs3KEM6Q6h3u4kOwOVpEQ0HJ4S3kiVy_mtyen-WI/edit?usp=sharing .
Waiting for the Ceph team to check this issue.

Comment 12 mrudraia 2024-05-06 11:03:37 UTC
Checked the RBD metrics on IBM provider - client Bare metal cluster. The metrics are available.
  +  'ceph_rbd_write_ops',
  +  'ceph_rbd_read_ops',
  +  'ceph_rbd_write_bytes',
  +  'ceph_rbd_read_bytes',
  +  'ceph_rbd_write_latency_sum',
  +  'ceph_rbd_write_latency_count'

Comment 16 mrudraia 2024-05-13 06:59:10 UTC
The issue is resolved in the 4.16 DF version.
There was a CSI issue where RWX PVs created from the CephFS SC were mounted with root permission. I am not sure whether it is related to the metric problem, but both happened in a similar environment.
The issue needs to be tested in the next 4.16 build.

Comment 17 Filip Balák 2024-05-14 08:56:10 UTC
Metrics are available on Provider-Client setup. --> VERIFIED

Tested with ODF 4.16.0-94

Comment 19 arun kumar mohan 2024-05-30 13:35:17 UTC
Need to get the PR or fix-details related to this BZ.
Added the rest of the RDT details

Comment 20 arun kumar mohan 2024-05-30 15:33:36 UTC
Added the fix details as well to RDT, please take a look

Comment 22 errata-xmlrpc 2024-07-17 13:14:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

