Description of problem (please be as detailed as possible and provide log snippets):

When running the query `cluster:ceph_disk_latency:join_ceph_node_disk_irate1m` via curl or via the management console, the query fails to return data. The issue appears mostly on external mode clusters and less frequently on internal cloud-based clusters (AWS or IBM Cloud deployments). It appears on all tested OCS versions 4.10 - 4.14, and once a cluster exhibits the issue, it reproduces consistently.

Screenshot: https://drive.google.com/file/d/19sOIma_WeXgo0494ZNJxrLPfeDPquVaf/view?usp=sharing

Version of all relevant components (if applicable):

OC version:
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.14.0-0.nightly-2023-09-09-164123
Kubernetes Version: v1.27.4+6eeca63

OCS version:
ocs-operator.v4.14.0-129.stable   OpenShift Container Storage   4.14.0-129.stable   Succeeded

Cluster version:
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-09-09-164123   True        False         28h     Cluster version is 4.14.0-0.nightly-2023-09-09-164123

Rook version:
rook: v4.14.0-0.e185e93e09eaa5f6dfb81fa5383e30e137da7e0a
go: go1.20.5

Ceph version:
ceph version 17.2.6-120.el9cp (6fb9bb1d83813766a53a421c7bc80f7835bcaf6c) quincy (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
yes

Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
yes

Can this issue be reproduced from the UI?
yes
https://drive.google.com/file/d/19sOIma_WeXgo0494ZNJxrLPfeDPquVaf/view?usp=sharing

If this is a regression, please provide more details to justify this:
The issue is observed across all tested OCS versions. About 10% of all runs of test_monitoring_reporting_ok_when_idle, across multiple cluster configurations, fail with this problem.

Steps to Reproduce:
1. Log in to the management console
2. Navigate to Observe / Metrics
3. Run the 'cluster:ceph_disk_latency:join_ceph_node_disk_irate1m' query, or send the same query via curl to the Prometheus API endpoint (see the example command at the end of this description)

Actual results:
The latency query fails; the UI shows "No Datapoints found"

Expected results:
Latency metrics are available

Additional info:
must-gather logs: https://drive.google.com/file/d/1lcG4Dkrn9eHeAFuKY0VT6t_pTInsFfwM/view?usp=sharing
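For reference, this is roughly how the query can be sent with curl to the cluster monitoring (Thanos Querier) endpoint; the route name, namespace, and token handling below are assumptions based on a default OpenShift monitoring setup and are not taken from this report:

  # Placeholder token/route lookup; adjust to your environment
  TOKEN=$(oc whoami -t)
  HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
  # POST the recording-rule query to the Prometheus HTTP API
  curl -k -H "Authorization: Bearer $TOKEN" \
       --data-urlencode 'query=cluster:ceph_disk_latency:join_ceph_node_disk_irate1m' \
       "https://$HOST/api/v1/query"

On an affected cluster the query returns no data points, which matches the "No Datapoints found" message shown in the console.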
This appears to be a regression between 4.12 and 4.13
Correction: this started appearing in 4.14, not 4.13.
Thanks, Avan. I've made a PR to address the issue: https://github.com/red-hat-storage/ocs-operator/pull/2200
*** Bug 2242132 has been marked as a duplicate of this bug. ***
The test test_monitoring_reporting_ok_when_idle passes and the latency metrics are visible via the UI on quay.io/rhceph-dev/ocs-registry:4.14.0-147. Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6832