Bug 2267067

Summary:	rbd metrics are not available on Provider-Client cluster
Product:	[Red Hat Storage] Red Hat OpenShift Data Foundation	Reporter:	Filip Balák <fbalak>
Component:	ceph-monitoring	Assignee:	mrudraia
Status:	CLOSED ERRATA	QA Contact:	Filip Balák <fbalak>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	4.14	CC:	amohan, asriram, kbg, mrudraia, muagarwa, nthomas, odf-bz-bot, rchikatw, resoni, rohgupta
Target Milestone:	---	Keywords:	Reopened
Target Release:	ODF 4.16.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	isf-provider
Fixed In Version:	4.16.0-94	Doc Type:	Bug Fix
Doc Text:	.RBD metrics not available on Provider-Client cluster Previously, RBD metrics were not populated in Fusion HCI based OpenShift Data Foundation provider client clusters as a CSI issue caused all the RWX CephFS storage class PVs to mount with root permission. This caused an authentication issue with `ceph-mgr` while checking the RBD pool stats. With this fix, the Ceph cluster misconfiguration that caused the issue was corrected and as a result all the RBD related metrics are available with the provider client clusters.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2024-07-17 13:14:45 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2260844

Description Filip Balák 2024-02-29 14:05:21 UTC

Description of problem (please be as detailed as possible and provide log
snippets):
The following metrics are missing after deployment:
  +  'ceph_rbd_write_ops',
  +  'ceph_rbd_read_ops',
  +  'ceph_rbd_write_bytes',
  +  'ceph_rbd_read_bytes',
  +  'ceph_rbd_write_latency_sum',
  +  'ceph_rbd_write_latency_count',

Version of all relevant components (if applicable):
odf 4.14.5-8

Steps to Reproduce:
1. Deploy Fusion HCI Provider with ODF.
2. Enable monitoring of openshift-storage namespace by adding the label:
  openshift.io/cluster-monitoring: "true"
3. Check for metrics from the description.
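
The check in step 3 can be scripted against the Prometheus HTTP API. The following is a minimal sketch, not the test used in this report. It assumes the caller has already fetched and decoded the JSON body of GET /api/v1/label/__name__/values (the Prometheus endpoint that lists known metric names) from the cluster's Prometheus. The helper name missing_rbd_metrics is hypothetical.

```python
# Metric names copied from the description of this bug.
EXPECTED_RBD_METRICS = [
    "ceph_rbd_write_ops",
    "ceph_rbd_read_ops",
    "ceph_rbd_write_bytes",
    "ceph_rbd_read_bytes",
    "ceph_rbd_write_latency_sum",
    "ceph_rbd_write_latency_count",
]


def missing_rbd_metrics(label_values_response):
    """Return the expected RBD metrics absent from a Prometheus response.

    `label_values_response` is the decoded JSON body of
    GET /api/v1/label/__name__/values, i.e. a dict shaped like
    {"status": "success", "data": ["metric_a", "metric_b", ...]}.
    """
    if label_values_response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    present = set(label_values_response.get("data", []))
    # Preserve the order of EXPECTED_RBD_METRICS in the result.
    return [name for name in EXPECTED_RBD_METRICS if name not in present]
```

An empty list means all six metrics are exposed; any names returned are the ones still missing.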

Actual results:
Metrics are missing and need to be enabled with:
ceph config set mgr mgr/prometheus/rbd_stats_pools "*"

Expected results:
Metrics are available.

Additional info:

Comment 2 Rohan Gupta 2024-03-04 10:52:43 UTC

This will be brought to the PM, and then a decision will be made on whether to include these metrics in provider-client mode.

Comment 3 Nishanth Thomas 2024-03-07 08:15:27 UTC

@fbalak, are only the 6 metrics mentioned above missing? Are any other RBD metrics impacted?

Comment 4 Filip Balák 2024-03-07 08:50:42 UTC

It looks like only those 6.

Another test also failed on RGW metrics: 'ceph_rgw_qlen', 'ceph_rgw_req', 'ceph_rgw_cache_miss', 'ceph_rgw_put', 'ceph_rgw_put_initial_lat_sum', 'ceph_rgw_put_initial_lat_count', 'ceph_rgw_keystone_token_cache_hit', 'ceph_rgw_metadata', 'ceph_rgw_qactive', 'ceph_rgw_get_initial_lat_sum', 'ceph_rgw_get_initial_lat_count', 'ceph_rgw_get_b', 'ceph_rgw_failed_req', 'ceph_rgw_keystone_token_cache_miss', 'ceph_rgw_get', 'ceph_rgw_cache_hit', 'ceph_rgw_put_b'. However, when I queried them manually after the test, they returned data. When I queried the RBD metrics manually, they were missing, as seen in the test.

Comment 5 Filip Balák 2024-03-07 08:57:37 UTC

The metrics were checked directly in Prometheus and on the Metrics page in OpenShift. The workaround mentioned (ceph config set mgr mgr/prometheus/rbd_stats_pools "*") was not tried; it was only provided as information from https://bugzilla.redhat.com/show_bug.cgi?id=2237412#c4.

Comment 6 arun kumar mohan 2024-03-07 10:31:19 UTC

I had a similar conversation with Shay regarding the Ceph RBD pool metrics.
He had to run the command mentioned above to get the stats:

ceph config set mgr mgr/prometheus/rbd_stats_pools "*"

More info in the doc: https://docs.ceph.com/en/quincy/mgr/prometheus/#rbd-io-statistics
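
For reference, the command above can be wrapped as follows. This is only a sketch: rbd_stats_pools_command and enable_rbd_stats are hypothetical helpers, and the code assumes it runs somewhere the ceph CLI is available with sufficient privileges. The value "*" is the one used in this report; pool names can be passed instead.

```python
import subprocess


def rbd_stats_pools_command(pools="*"):
    """Build the argv that sets mgr/prometheus/rbd_stats_pools.

    `pools` is passed through unchanged; "*" is the value used in this report.
    """
    return ["ceph", "config", "set", "mgr", "mgr/prometheus/rbd_stats_pools", pools]


def enable_rbd_stats(pools="*"):
    # Passing a list (no shell) means "*" is not expanded by shell globbing.
    subprocess.run(rbd_stats_pools_command(pools), check=True)
```

Building the argument list separately from running it keeps the command easy to log or review before it touches the cluster.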

Comment 7 Filip Balák 2024-03-11 10:48:18 UTC

I will install a fresh cluster without CI and check whether those metrics are present, to make sure that our automation is not what adds them in internal mode.

Comment 8 Filip Balák 2024-03-18 10:41:27 UTC

I have installed a fresh cluster on the AWS IPI platform and manually installed the operator odf-operator.v4.14.5-rhodf and a storage system. The metrics listed in the description are present without any other action from the user.

Comment 9 arun kumar mohan 2024-04-03 11:34:04 UTC

Closing the BZ as per comment 8 above. Please reopen if required.

Comment 10 Filip Balák 2024-04-03 14:38:27 UTC

Reopening because the listed metrics are still not available on the Provider-Client platform. This behaviour differs from internal mode platforms, where they are present without additional user interaction. Tested again with ODF 4.14.6-1 on a Provider-Client cluster.

Comment 11 mrudraia 2024-04-29 08:07:52 UTC

Checked the issue on a Fusion HCI cluster. Some misconfigurations were observed in the Ceph cluster, and the cluster is not healthy: Ceph health shows a warning condition. Some of the mon clients have authentication errors. Detailed RCA is in the attached doc: https://docs.google.com/document/d/134jDs3KEM6Q6h3u4kOwOVpEQ0HJ4S3kiVy_mtyen-WI/edit?usp=sharing .
Waiting for the Ceph team to check this issue.
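
A health state like the one described above can be summarized from the JSON output of the ceph CLI. The sketch below is illustrative only and was not used in this investigation. It assumes the output of a command such as `ceph health detail --format json` has been captured as a string, and that it has the layout {"status": ..., "checks": {CODE: {"summary": {"message": ...}}}}. The helper name summarize_ceph_health is hypothetical.

```python
import json


def summarize_ceph_health(raw_json):
    """Return (status, [(check_code, message), ...]) from ceph health JSON.

    Assumed layout (verify against your Ceph release):
    {"status": "HEALTH_WARN",
     "checks": {"SOME_CODE": {"summary": {"message": "..."}}}}
    """
    health = json.loads(raw_json)
    status = health.get("status", "UNKNOWN")
    # Sort by check code so the summary is stable between runs.
    checks = [
        (code, body.get("summary", {}).get("message", ""))
        for code, body in sorted(health.get("checks", {}).items())
    ]
    return status, checks
```

A status other than HEALTH_OK, together with the listed check messages, gives a quick starting point before reading the full RCA.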

Comment 12 mrudraia 2024-05-06 11:03:37 UTC

Checked the RBD metrics on an IBM Provider-Client bare metal cluster. The metrics are available:
  +  'ceph_rbd_write_ops',
  +  'ceph_rbd_read_ops',
  +  'ceph_rbd_write_bytes',
  +  'ceph_rbd_read_bytes',
  +  'ceph_rbd_write_latency_sum',
  +  'ceph_rbd_write_latency_count'

Comment 16 mrudraia 2024-05-13 06:59:10 UTC

The issue is resolved in the 4.16 DF version.
There is a CSI issue where RWX PVs created by the CephFS SC are mounted with root permission. It may be related to the metrics problem, since both occurred in a similar environment, but this is not confirmed.
The issue needs to be tested in the next 4.16 release.

Comment 17 Filip Balák 2024-05-14 08:56:10 UTC

Metrics are available on Provider-Client setup. --> VERIFIED

Tested with ODF 4.16.0-94

Comment 19 arun kumar mohan 2024-05-30 13:35:17 UTC

Need to get the PR or fix-details related to this BZ.
Added the rest of the RDT details

Comment 20 arun kumar mohan 2024-05-30 15:33:36 UTC

Added the fix details to the RDT as well; please take a look.

Comment 22 errata-xmlrpc 2024-07-17 13:14:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591