Bug 2238400 - latency metrics Prometheus request is failing
Summary: latency metrics Prometheus request is failing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.14.0
Assignee: arun kumar mohan
QA Contact: Daniel Osypenko
URL:
Whiteboard:
Duplicates: 2242132
Depends On:
Blocks: 2244421
 
Reported: 2023-09-11 17:08 UTC by Daniel Osypenko
Modified: 2023-11-08 18:54 UTC
CC: 7 users

Fixed In Version: 4.14.0-147
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2244421
Environment:
Last Closed: 2023-11-08 18:54:25 UTC
Embargoed:




Links
Github red-hat-storage ocs-operator pull 2200 (open): Correcting the job names in prometheus metrics (2023-09-28 12:35:22 UTC)
Github red-hat-storage ocs-operator pull 2210 (open): Bug 2238400: [release-4.14] Correcting the job names in prometheus metrics (2023-10-10 12:03:56 UTC)
Red Hat Product Errata RHSA-2023:6832 (2023-11-08 18:54:49 UTC)

Description Daniel Osypenko 2023-09-11 17:08:24 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

When the query `cluster:ceph_disk_latency:join_ceph_node_disk_irate1m` is made via curl or via the management console, it fails and returns no data points (see the link below).

The issue appears mostly on external mode clusters and less frequently on internal cloud-based deployments (AWS or IBM Cloud).
The issue appears on all tested OCS versions (4.10 - 4.14).
Once a cluster exhibits the issue, it reproduces consistently.

https://drive.google.com/file/d/19sOIma_WeXgo0494ZNJxrLPfeDPquVaf/view?usp=sharing

Version of all relevant components (if applicable):

OC version:
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.14.0-0.nightly-2023-09-09-164123
Kubernetes Version: v1.27.4+6eeca63

OCS version:
ocs-operator.v4.14.0-129.stable              OpenShift Container Storage   4.14.0-129.stable              Succeeded

Cluster version
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-09-09-164123   True        False         28h     Cluster version is 4.14.0-0.nightly-2023-09-09-164123

Rook version:
rook: v4.14.0-0.e185e93e09eaa5f6dfb81fa5383e30e137da7e0a
go: go1.20.5

Ceph version:
ceph version 17.2.6-120.el9cp (6fb9bb1d83813766a53a421c7bc80f7835bcaf6c) quincy (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

yes

Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
yes https://drive.google.com/file/d/19sOIma_WeXgo0494ZNJxrLPfeDPquVaf/view?usp=sharing

If this is a regression, please provide more details to justify this:
The issue is observed across all tested OCS versions. About 10% of all test_monitoring_reporting_ok_when_idle runs across multiple cluster configurations fail with this problem.


Steps to Reproduce:
1. Log in to the management console
2. Navigate to Observe -> Metrics
3. Send the 'cluster:ceph_disk_latency:join_ceph_node_disk_irate1m' query

or send the request via curl to the Prometheus API endpoint, as in the sketch below.
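
A minimal sketch of that curl flow, assuming a logged-in `oc` session and the default openshift-monitoring thanos-querier route (the route name, auth handling and response shapes are typical OpenShift conventions and are assumptions, not taken from this report):

TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')

curl -sk -H "Authorization: Bearer ${TOKEN}" \
  "https://${HOST}/api/v1/query" \
  --data-urlencode 'query=cluster:ceph_disk_latency:join_ceph_node_disk_irate1m'

# A healthy cluster returns a non-empty result vector, e.g.
#   {"status":"success","data":{"resultType":"vector","result":[ ... ]}}
# On affected clusters the request itself succeeds but "result" is empty,
# which the console renders as "No Datapoints found".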

Actual results:
The latency request fails; the UI shows "No Datapoints found".

Expected results:
Latency metrics are available.

Additional info:
must-gather logs: https://drive.google.com/file/d/1lcG4Dkrn9eHeAFuKY0VT6t_pTInsFfwM/view?usp=sharing

Comment 3 Elad 2023-09-26 08:47:53 UTC
This appears to be a regression between 4.12 and 4.13

Comment 5 Elad 2023-09-26 08:55:34 UTC
Correction: this started appearing in 4.14, not 4.13.

Comment 10 arun kumar mohan 2023-09-28 12:35:23 UTC
Thanks Avan.
Made a PR to address the issue: https://github.com/red-hat-storage/ocs-operator/pull/2200
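
For anyone hitting a similar symptom, a rough triage sketch of the job-name angle: the recording rule joins Ceph disk series with node disk irate series, so if its job label selectors do not match the jobs that actually expose those series, the join evaluates to an empty vector. The series names below are assumptions loosely based on the upstream Ceph mixin, not a quote of the ODF rule; TOKEN and HOST are as in the curl sketch in the description.

# Which jobs are actually scraped?
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  "https://${HOST}/api/v1/label/job/values"

# Which jobs expose the Ceph-side and node-side inputs the rule joins
# (series names here are illustrative assumptions)?
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  "https://${HOST}/api/v1/query" \
  --data-urlencode 'query=count by (job) (ceph_disk_occupation)'
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  "https://${HOST}/api/v1/query" \
  --data-urlencode 'query=count by (job) (irate(node_disk_read_time_seconds_total[1m]))'

# If the job values returned here do not match the job="..." selectors baked into
# the recording rule (e.g. an exporter registering under a different job name in
# external mode), the rule evaluates to an empty vector and the
# cluster:ceph_disk_latency:join_ceph_node_disk_irate1m query returns no data.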

Comment 11 Juan Miguel Olmo 2023-10-09 09:43:30 UTC
*** Bug 2242132 has been marked as a duplicate of this bug. ***

Comment 18 Daniel Osypenko 2023-10-11 09:16:45 UTC
Test test_monitoring_reporting_ok_when_idle passes;
latency metrics are visible via the UI.
quay.io/rhceph-dev/ocs-registry:4.14.0-147

Moving to VERIFIED

Comment 20 errata-xmlrpc 2023-11-08 18:54:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832

