Bug 1916331

Summary: [RFE]: PV performance metrics and OCS Observability
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Sheldon Mustard <smustard>
Component: ceph-monitoring
Assignee: Anmol Sachan <asachan>
Status: CLOSED WONTFIX
QA Contact: Elad <ebenahar>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.5
CC: afrahman, asachan, bniver, etamir, gmeno, jarrpa, jbasquil, jhopper, mbukatov, muagarwa, nthomas, ocs-bugs, odf-bz-bot, owasserm, sostapov, tnielsen, ykaul
Target Milestone: ---
Keywords: FutureFeature
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-06-08 06:49:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1779336
Bug Blocks:
Attachments:
Description: requirements from customer
Flags: none

Description Sheldon Mustard 2021-01-14 14:50:22 UTC
Created attachment 1747446
requirements from customer

Description of problem (please be as detailed as possible and provide log
snippets):

Customer is requesting better observability related to OCS and PV performance.

The main objectives are:

1: Identify the top talkers / noisy neighbors in the cluster from a single tool, without having to parse logs from individual systems

2: Ideally, be able to set and protect an I/O SLA

Version of all relevant components (if applicable):

OCP and OCS 4.5

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

No

Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3

Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

Yes

If this is a regression, please provide more details to justify this:

No

Steps to Reproduce:
1. Set up OCP and OCS


Actual results:

No PV performance metrics are available

Expected results:

PV performance metrics are available

Additional info:

Comment 3 Travis Nielsen 2021-01-15 20:42:49 UTC
Collecting the RBD image stats and exposing them in the dashboard should get us started.
The stats collection can be enabled at the pool level in Rook by setting a simple flag enableRBDStats [1].
This blog describes the metrics in more detail: https://ceph.io/rbd/new-in-nautilus-rbd-performance-monitoring/

The metrics then need to be collected by Prometheus and displayed in the dashboard.

Depending on the overhead of gathering the metrics, it seems we should enable them by default for new pools in OCS. Pools created through the UI could also expose this option.

[1] https://github.com/rook/rook/blob/master/cluster/examples/kubernetes/ceph/pool.yaml#L31-L33
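
For illustration only, a minimal sketch of the enableRBDStats flag described above, written as a Python helper that builds a CephBlockPool manifest and prints it as JSON (which kubectl/oc also accept). The helper name block_pool_manifest, the pool name "replicapool", the namespace "rook-ceph", and the failureDomain/replicated values are assumptions modeled on typical Rook examples, not values taken from this bug; an OCS cluster would likely use different names.

```python
import json


def block_pool_manifest(name="replicapool", namespace="rook-ceph",
                        enable_rbd_stats=True):
    """Build a CephBlockPool manifest (hypothetical helper, assumed defaults)."""
    return {
        "apiVersion": "ceph.rook.io/v1",
        "kind": "CephBlockPool",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Assumed pool layout; adjust to the actual cluster.
            "failureDomain": "host",
            "replicated": {"size": 3},
            # The pool-level flag from comment 3: turns on per-image RBD stats.
            "enableRBDStats": enable_rbd_stats,
        },
    }


if __name__ == "__main__":
    # Emit JSON that could be piped to `kubectl apply -f -`.
    print(json.dumps(block_pool_manifest(), indent=2))
```

Whether enabling the flag by default is acceptable still depends on the metrics-gathering overhead mentioned above; this sketch only shows where the flag sits in the pool spec.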

Comment 4 Jose A. Rivera 2021-01-20 14:56:18 UTC
I believe this is targeted for OCS 4.8. Moving the component to monitoring.

Comment 5 Martin Bukatovic 2021-01-20 22:39:24 UTC
(In reply to Travis Nielsen from comment #3)
> Collecting the RBD image stats and exposing them in the dashboard should get
> us started.
> The stats collection can be enabled at the pool level in Rook by setting a
> simple flag enableRBDStats [1].
> This blog describes the metrics in more detail:
> https://ceph.io/rbd/new-in-nautilus-rbd-performance-monitoring/

Based on the above, I'm reopening BZ 1779336 and marking it as a blocker for
this bug.