Description of problem (please be as detailed as possible and provide log snippets):
The following metrics are missing after deployment:
+ 'ceph_rbd_write_ops',
+ 'ceph_rbd_read_ops',
+ 'ceph_rbd_write_bytes',
+ 'ceph_rbd_read_bytes',
+ 'ceph_rbd_write_latency_sum',
+ 'ceph_rbd_write_latency_count',

Version of all relevant components (if applicable):
odf 4.14.5-8

Steps to Reproduce:
1. Deploy Fusion HCI Provider with ODF.
2. Enable monitoring of the openshift-storage namespace by adding the label: openshift.io/cluster-monitoring: "true"
3. Check for the metrics listed in the description.

Actual results:
Metrics are missing and need to be enabled by: ceph config set mgr mgr/prometheus/rbd_stats_pools "*"

Expected results:
Metrics are available.

Additional info:
This will be brought to the PM, and then a decision will be made on whether to include these metrics in provider-client mode.
@fbalak, are only the 6 metrics mentioned above missing? Are any other RBD metrics impacted?
It looks like only those 6. Another test also failed with rgw metrics: 'ceph_rgw_qlen', 'ceph_rgw_req', 'ceph_rgw_cache_miss', 'ceph_rgw_put', 'ceph_rgw_put_initial_lat_sum', 'ceph_rgw_put_initial_lat_count', 'ceph_rgw_keystone_token_cache_hit', 'ceph_rgw_metadata', 'ceph_rgw_qactive', 'ceph_rgw_get_initial_lat_sum', 'ceph_rgw_get_initial_lat_count', 'ceph_rgw_get_b', 'ceph_rgw_failed_req', 'ceph_rgw_keystone_token_cache_miss', 'ceph_rgw_get', 'ceph_rgw_cache_hit', 'ceph_rgw_put_b', but when I queried them manually after the test, they returned data. When I queried the RBD metrics manually, they were missing, as seen in the test.
Those metrics were checked directly in Prometheus and on the Metrics page in OpenShift. The workaround mentioned (ceph config set mgr mgr/prometheus/rbd_stats_pools "*") was not tried; it was only provided as information from https://bugzilla.redhat.com/show_bug.cgi?id=2237412#c4.
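For reference, a check like the one described above can be scripted against the Prometheus HTTP API. This is a minimal sketch, not the test that was actually run: the function names, the base URL and the bearer-token handling are assumptions for illustration; only the /api/v1/query endpoint and the metric names come from Prometheus and from this report.

```python
import json
import urllib.parse
import urllib.request

# The six RBD metrics listed in the bug description.
RBD_METRICS = [
    "ceph_rbd_write_ops",
    "ceph_rbd_read_ops",
    "ceph_rbd_write_bytes",
    "ceph_rbd_read_bytes",
    "ceph_rbd_write_latency_sum",
    "ceph_rbd_write_latency_count",
]


def query_prometheus(base_url, metric, token=None):
    """Run an instant query via /api/v1/query and return the parsed JSON.

    base_url and token are placeholders (assumptions); supply the values
    for your own cluster.
    """
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": metric})
    request = urllib.request.Request(url)
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def find_missing_metrics(metrics, fetch):
    """Return the metrics for which fetch(metric) yields no series."""
    missing = []
    for metric in metrics:
        payload = fetch(metric)
        result = payload.get("data", {}).get("result", [])
        if payload.get("status") != "success" or not result:
            missing.append(metric)
    return missing
```

Hypothetical usage: find_missing_metrics(RBD_METRICS, lambda m: query_prometheus(base_url, m, token)). An empty list means every metric returned at least one series.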
I had a similar conversation with Shay regarding the Ceph RBD pool metrics. He had to run the command mentioned above to get the stats: ceph config set mgr mgr/prometheus/rbd_stats_pools "*" More info in the docs: https://docs.ceph.com/en/quincy/mgr/prometheus/#rbd-io-statistics
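A minimal sketch of applying that setting and reading it back. The helper names are hypothetical, and it assumes the ceph CLI is reachable from wherever the script runs, which this report does not cover. Passing the arguments as a list avoids the shell expanding "*".

```python
import subprocess

# Config key from the Ceph mgr/prometheus docs linked above.
RBD_STATS_KEY = "mgr/prometheus/rbd_stats_pools"


def build_set_command(pools="*"):
    """Build the argv for: ceph config set mgr mgr/prometheus/rbd_stats_pools <pools>."""
    return ["ceph", "config", "set", "mgr", RBD_STATS_KEY, pools]


def build_get_command():
    """Build the argv for reading the current value back."""
    return ["ceph", "config", "get", "mgr", RBD_STATS_KEY]


def enable_rbd_stats(pools="*", run=subprocess.run):
    """Set rbd_stats_pools, then return the value Ceph reports for it.

    `run` is injectable so the function can be exercised without a cluster.
    """
    run(build_set_command(pools), check=True)
    completed = run(build_get_command(), check=True, capture_output=True, text=True)
    return completed.stdout.strip()
```

Per the Ceph docs, pools can also be a comma-separated list of pool names instead of "*".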
I will install a fresh cluster without CI and check whether those metrics are present, to make sure that we don't add them through automation in internal mode.
I have installed a fresh cluster on the AWS IPI platform and manually installed the operator odf-operator.v4.14.5-rhodf and a storage system. The metrics listed in the description are present without any other action from the user.
Closing the BZ as per comment #8 above. Please reopen if required.
Reopening because the listed metrics are still not available on the Provider-Client platform. This behaviour differs from internal-mode platforms, where they are present without additional user interaction. Tested again with odf 4.14.6-1 on a Provider-Client cluster.
Checked the issue on the Fusion HCI cluster. Some misconfigurations were observed in the Ceph cluster, and the Ceph cluster is not healthy: Ceph health shows a warning condition. Some of the mon clients have authentication errors. Detailed RCA in the attached doc: https://docs.google.com/document/d/134jDs3KEM6Q6h3u4kOwOVpEQ0HJ4S3kiVy_mtyen-WI/edit?usp=sharing . Waiting for the Ceph team to check this issue.
Checked the RBD metrics on an IBM Provider-Client bare-metal cluster. The metrics are available: + 'ceph_rbd_write_ops', + 'ceph_rbd_read_ops', + 'ceph_rbd_write_bytes', + 'ceph_rbd_read_bytes', + 'ceph_rbd_write_latency_sum', + 'ceph_rbd_write_latency_count'
The issue is resolved in the 4.16 DF version. This is a CSI issue in which RWX PVs created by the cephfs SC are mounted with root permissions. Not sure if it is related to the metrics problem, since both happened in a similar environment. The issue needs to be tested in the next 4.16 release.
Metrics are available on Provider-Client setup. --> VERIFIED Tested with ODF 4.16.0-94
Need to get the PR or fix details related to this BZ. Added the rest of the RDT details.
Added the fix details to the RDT as well; please take a look.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591