1916331 – [RFE]: PV performance metrics and OCS Observability

Summary: [RFE]: PV performance metrics and OCS Observability

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	ceph-monitoring
Sub Component:
Version:	4.5
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Anmol Sachan
QA Contact:	Elad
Docs Contact:
URL:
Whiteboard:
Depends On:	1779336
Blocks:

Reported:	2021-01-14 14:50 UTC by Sheldon Mustard
Modified:	2023-08-09 16:37 UTC
CC List:	17 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-06-08 06:49:34 UTC
Embargoed:

Attachments
requirements from customer (4.71 KB, application/pdf) 2021-01-14 14:50 UTC, Sheldon Mustard	no flags

Description Sheldon Mustard 2021-01-14 14:50:22 UTC

Created attachment 1747446
requirements from customer

Description of problem (please be as detailed as possible and provide log
snippets):

Customer is requesting better observability related to OCS and PV performance.

The main objectives are:

1: To be able to identify the top talkers / noisy neighbors in the cluster from a single tool, without having to parse logs from individual systems

2: Ideally, to be able to set and protect an I/O SLA

Version of all relevant components (if applicable):

OCP and OCS 4.5

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

No

Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3

Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

Yes

If this is a regression, please provide more details to justify this:

No

Steps to Reproduce:
1. Set up OCP and OCS


Actual results:

No PV performance metrics are available.

Expected results:

PV performance metrics are visible.

Additional info:

Comment 3 Travis Nielsen 2021-01-15 20:42:49 UTC

Collecting the RBD image stats and exposing them in the dashboard should get us started.
The stats collection can be enabled at the pool level in Rook by setting a simple flag enableRBDStats [1].
This blog describes the metrics in more detail: https://ceph.io/rbd/new-in-nautilus-rbd-performance-monitoring/
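
For illustration, a minimal sketch of a pool definition with that flag set, built as a Python dict and printed as JSON (which `kubectl apply -f` accepts). The pool name `replicapool`, the namespace `rook-ceph`, and the replica count are assumptions made for the example, not values from this report; only `enableRBDStats` comes from the Rook example referenced in [1].

```python
import json


def block_pool_manifest(name: str, namespace: str, replicas: int = 3) -> dict:
    """Build a CephBlockPool manifest with RBD image stats enabled."""
    return {
        "apiVersion": "ceph.rook.io/v1",
        "kind": "CephBlockPool",
        # name and namespace are hypothetical; use the ones from your cluster
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "replicated": {"size": replicas},
            # the flag discussed above: turns on per-image RBD stats for this pool
            "enableRBDStats": True,
        },
    }


manifest = block_pool_manifest("replicapool", "rook-ceph")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied like any other manifest. Whether the extra stats collection is cheap enough to leave on everywhere is the open question raised further down in this comment.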

Then the metrics need to be collected by Prometheus and displayed in the dashboard.

Depending on the overhead of gathering the metrics, it seems like we should enable them by default for new pools in OCS. Pools created through the UI could also expose this option.

[1] https://github.com/rook/rook/blob/master/cluster/examples/kubernetes/ceph/pool.yaml#L31-L33
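
As a rough sketch of how the "top talkers" objective could be served once per-image counters are being collected: take two readings of a cumulative per-image counter, turn them into a per-second rate, and rank the images. The sample data and image names below are made up, and the counter is assumed to be something like a per-image write-ops counter (for example `ceph_rbd_write_ops`, keyed by pool and image); in practice a Prometheus `rate()` query would do this work server-side.

```python
from typing import Dict, List, Tuple

# (pool, image) -> cumulative counter value at one point in time
Sample = Dict[Tuple[str, str], float]


def top_talkers(
    earlier: Sample, later: Sample, interval_s: float, n: int = 3
) -> List[Tuple[Tuple[str, str], float]]:
    """Rank images by per-second rate between two counter readings."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    rates = {}
    for key, value in later.items():
        prev = earlier.get(key)
        # skip images with no earlier reading and counters that were reset
        if prev is None or value < prev:
            continue
        rates[key] = (value - prev) / interval_s
    # highest rate first, keep the top n
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)[:n]


# hypothetical readings taken 60 seconds apart
earlier = {("pool", "csi-vol-a"): 1000.0, ("pool", "csi-vol-b"): 5000.0, ("pool", "csi-vol-c"): 200.0}
later = {("pool", "csi-vol-a"): 1600.0, ("pool", "csi-vol-b"): 5300.0, ("pool", "csi-vol-c"): 8200.0}

for (pool, image), ops in top_talkers(earlier, later, interval_s=60, n=2):
    print(f"{pool}/{image}: {ops:.1f} ops/s")
```

Mapping an RBD image name back to its PV/PVC is a separate step that this sketch does not cover.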

Comment 4 Jose A. Rivera 2021-01-20 14:56:18 UTC

I believe this is targeted for OCS 4.8. Moving the component to monitoring.

Comment 5 Martin Bukatovic 2021-01-20 22:39:24 UTC

(In reply to Travis Nielsen from comment #3)
> Collecting the RBD image stats and exposing them in the dashboard should get
> us started.
> The stats collection can be enabled at the pool level in Rook by setting a
> simple flag enableRBDStats [1].
> This blog describes the metrics in more detail:
> https://ceph.io/rbd/new-in-nautilus-rbd-performance-monitoring/

Based on the above, I'm reopening BZ 1779336 and marking it as a blocker for
this bug.
