Bug 1916331

Summary:

[RFE]: PV performance metrics and OCS Observability

Product:

[Red Hat Storage] Red Hat OpenShift Data Foundation

Reporter:

Sheldon Mustard <smustard>

Component:

ceph-monitoring

Assignee:

Anmol Sachan <asachan>

Status:

CLOSED WONTFIX

QA Contact:

Elad <ebenahar>

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

4.5

CC:

afrahman, asachan, bniver, etamir, gmeno, jarrpa, jbasquil, jhopper, mbukatov, muagarwa, nthomas, ocs-bugs, odf-bz-bot, owasserm, sostapov, tnielsen, ykaul

Target Milestone:

---

Keywords:

FutureFeature

Target Release:

---

Hardware:

x86_64

OS:

Linux

Last Closed:

2021-06-08 06:49:34 UTC

Type:

Bug

Bug Depends On:

1779336

Bug Blocks:

Attachments:

Description	Flags
requirements from customer	none

Description Sheldon Mustard 2021-01-14 14:50:22 UTC

Created attachment 1747446
requirements from customer

Description of problem (please be as detailed as possible and provide log
snippets):

Customer is requesting better observability related to OCS and PV performance.

The main objectives are:

1: To be able to identify the top talkers / noisy neighbors in the cluster from a single tool, without having to parse logs from individual systems

2: Ideally, to be able to set and protect an IO SLA

Version of all relevant components (if applicable):

OCP and OCS 4.5

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

No

Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3

Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

Yes

If this is a regression, please provide more details to justify this:

No

Steps to Reproduce:
1. Set up OCP and OCS

Actual results:

No PV performance metrics

Expected results:

See PV performance metrics

Additional info:

Comment 3 Travis Nielsen 2021-01-15 20:42:49 UTC

Collecting the RBD image stats and exposing them in the dashboard should get us started.
The stats collection can be enabled at the pool level in Rook by setting a simple flag enableRBDStats [1].
This blog describes the metrics in more detail: https://ceph.io/rbd/new-in-nautilus-rbd-performance-monitoring/

Then they need to be collected from Prometheus and displayed in the dashboard.

Depending on the overhead of gathering the metrics, it seems we should enable them for new pools by default in OCS. Pools created through the UI could also expose this option.

[1] https://github.com/rook/rook/blob/master/cluster/examples/kubernetes/ceph/pool.yaml#L31-L33
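
As a rough illustration of the "top talkers" objective, here is a minimal Python sketch that ranks RBD images by throughput from two readings of a cumulative byte counter. It assumes the per-image stats arrive as counters keyed by (pool, image), for example values like ceph_rbd_write_bytes scraped by Prometheus; the function name, the sample layout and the image names are hypothetical and not part of Rook or OCS.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (pool, image) -> cumulative bytes counter,
# e.g. one scrape of a per-image RBD write-bytes metric.
ImageKey = Tuple[str, str]
Sample = Dict[ImageKey, float]


def top_talkers(
    earlier: Sample, later: Sample, interval_s: float, n: int = 3
) -> List[Tuple[ImageKey, float]]:
    """Return the n images with the highest bytes/second between two scrapes."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")

    rates: Dict[ImageKey, float] = {}
    for key, value in later.items():
        previous = earlier.get(key)
        # Skip images seen for the first time and counters that went
        # backwards (a reset), since no meaningful rate can be derived.
        if previous is None or value < previous:
            continue
        rates[key] = (value - previous) / interval_s

    # Highest rate first, truncated to the n busiest images.
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)[:n]


if __name__ == "__main__":
    # Made-up counter values, 60 seconds apart.
    scrape_t0 = {("pool", "image-a"): 0.0, ("pool", "image-b"): 100.0}
    scrape_t1 = {
        ("pool", "image-a"): 600.0,
        ("pool", "image-b"): 400.0,
        ("pool", "image-c"): 50.0,
    }
    for (pool, image), rate in top_talkers(scrape_t0, scrape_t1, 60.0):
        print(f"{pool}/{image}: {rate:.1f} B/s")
```

In a real deployment the same ranking would more likely be expressed as a Prometheus query over the per-image counters and shown in the dashboard; the sketch only shows the arithmetic behind it.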

Comment 4 Jose A. Rivera 2021-01-20 14:56:18 UTC

I believe this is targeted for OCS 4.8. Moving the component to monitoring.

Comment 5 Martin Bukatovic 2021-01-20 22:39:24 UTC

(In reply to Travis Nielsen from comment #3)
> Collecting the RBD image stats and exposing them in the dashboard should get
> us started.
> The stats collection can be enabled at the pool level in Rook by setting a
> simple flag enableRBDStats [1].
> This blog describes the metrics in more detail:
> https://ceph.io/rbd/new-in-nautilus-rbd-performance-monitoring/

Based on the above, I'm reopening BZ 1779336 and marking it as a blocker for
this bug.