Bug 2267067
Summary: | rbd metrics are not available on Provider-Client cluster | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Filip Balák <fbalak> |
Component: | ceph-monitoring | Assignee: | mrudraia |
Status: | CLOSED ERRATA | QA Contact: | Filip Balák <fbalak> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.14 | CC: | amohan, asriram, kbg, mrudraia, muagarwa, nthomas, odf-bz-bot, rchikatw, resoni, rohgupta |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | ODF 4.16.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | isf-provider | ||
Fixed In Version: | 4.16.0-94 | Doc Type: | Bug Fix |
Doc Text: |
.RBD metrics not available on Provider-Client cluster
Previously, RBD metrics were not populated on Fusion HCI based OpenShift Data Foundation provider-client clusters because a CSI issue caused all RWX PVs created from the CephFS storage class to be mounted with root permission. This caused an authentication issue with `ceph-mgr` while checking the RBD pool stats.
With this fix, the Ceph cluster misconfiguration that caused the issue is corrected, and as a result all RBD-related metrics are available on provider-client clusters.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2024-07-17 13:14:45 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 2260844 |
Description
Filip Balák
2024-02-29 14:05:21 UTC
This will be brought to the PM, and then a decision will be made about including these metrics in provider-client mode.

@fbalak, are only the 6 metrics mentioned above missing? Are any other rbd metrics impacted?

It looks like only those 6. Another test also failed with rgw metrics: 'ceph_rgw_qlen', 'ceph_rgw_req', 'ceph_rgw_cache_miss', 'ceph_rgw_put', 'ceph_rgw_put_initial_lat_sum', 'ceph_rgw_put_initial_lat_count', 'ceph_rgw_keystone_token_cache_hit', 'ceph_rgw_metadata', 'ceph_rgw_qactive', 'ceph_rgw_get_initial_lat_sum', 'ceph_rgw_get_initial_lat_count', 'ceph_rgw_get_b', 'ceph_rgw_failed_req', 'ceph_rgw_keystone_token_cache_miss', 'ceph_rgw_get', 'ceph_rgw_cache_hit', 'ceph_rgw_put_b', but when I queried them manually after the test, they returned data. When I queried the rbd metrics manually, they were missing, as seen in the test. The metrics were checked directly in Prometheus and on the Metrics page in OpenShift. The workaround mentioned (`ceph config set mgr mgr/prometheus/rbd_stats_pools "*"`) was not tried; it was only provided as information from https://bugzilla.redhat.com/show_bug.cgi?id=2237412#c4.

I had a similar conversation with Shay regarding the ceph rbd pool metrics. He had to run the command above to get the stats: `ceph config set mgr mgr/prometheus/rbd_stats_pools "*"`. More info in the doc: https://docs.ceph.com/en/quincy/mgr/prometheus/#rbd-io-statistics

I will install a fresh cluster without CI and check whether those metrics are present, to make sure that we don't add them with automation in internal mode.

I have installed a fresh cluster on the AWS IPI platform and manually installed the operator odf-operator.v4.14.5-rhodf and a storage system. The metrics listed in the description are present without any other action from the user.

Closing the BZ as per the above comment#8. Please re-open if required.

Reopening because the listed metrics are still not available on the Provider-Client platform. This behaviour is different from internal mode platforms, where they are present without additional user interaction. Tested again with ODF 4.14.6-1 on a Provider-Client cluster.

Checked the issue in the Fusion HCI cluster. Some misconfigurations were observed in the Ceph cluster, and the Ceph cluster is not healthy: Ceph health shows a warning condition. Some of the mon clients have an authentication error. Detailed RCA in the attached doc: https://docs.google.com/document/d/134jDs3KEM6Q6h3u4kOwOVpEQ0HJ4S3kiVy_mtyen-WI/edit?usp=sharing . Waiting for the Ceph team to check this issue.

Checked the RBD metrics on an IBM provider-client bare metal cluster. The metrics are available.
+ 'ceph_rbd_write_ops',
+ 'ceph_rbd_read_ops',
+ 'ceph_rbd_write_bytes',
+ 'ceph_rbd_read_bytes',
+ 'ceph_rbd_write_latency_sum',
+ 'ceph_rbd_write_latency_count'

The issue is resolved in 4.16 DF version.

This is a CSI issue in which RWX PVs created by the cephfs SC are mounted with root permission. Not sure if it's related to the metric problem, since they happened in a similar environment.

The issue needs to be tested in the next release of the 4.16 version.

Metrics are available on Provider-Client setup. --> VERIFIED
Tested with ODF 4.16.0-94

Need to get the PR or fix details related to this BZ.

Added the rest of the RDT details.

Added the fix details to the RDT as well, please take a look.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2024:4591
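The manual check described in the comments (querying the six RBD metrics directly from Prometheus) could be scripted roughly as follows. This is a minimal sketch, not the test used by QA: it assumes the standard Prometheus HTTP API endpoint `/api/v1/query`, and the function names (`has_samples`, `query_prometheus`, `missing_metrics`), the base URL, and the bearer token handling are hypothetical and would need adapting to the cluster's actual Prometheus route and TLS setup.

```python
import json
import urllib.parse
import urllib.request

# The six RBD metrics named in this bug report.
RBD_METRICS = [
    "ceph_rbd_write_ops",
    "ceph_rbd_read_ops",
    "ceph_rbd_write_bytes",
    "ceph_rbd_read_bytes",
    "ceph_rbd_write_latency_sum",
    "ceph_rbd_write_latency_count",
]


def has_samples(response):
    """Return True if an instant-query response holds at least one sample."""
    return response.get("status") == "success" and bool(
        response.get("data", {}).get("result")
    )


def query_prometheus(base_url, metric, token=None):
    """Run an instant query for `metric` via the Prometheus HTTP API.

    `base_url` and `token` are assumptions: on OpenShift the route and
    authentication depend on how monitoring is exposed in the cluster.
    """
    url = base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode(
        {"query": metric}
    )
    request = urllib.request.Request(url)
    if token:
        request.add_header("Authorization", "Bearer " + token)
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


def missing_metrics(metrics, fetch):
    """Return the metrics for which `fetch(metric)` yields no samples."""
    return [m for m in metrics if not has_samples(fetch(m))]
```

A call such as `missing_metrics(RBD_METRICS, lambda m: query_prometheus(url, m, token))` would return an empty list on a cluster where all six metrics are populated, and the full list on a cluster showing the behaviour reported here. Passing the fetch function as a parameter keeps the check itself independent of how Prometheus is reached.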