Created attachment 1747446
requirements from customer

Description of problem (please be as detailed as possible and provide log snippets):
The customer is requesting better observability of OCS and PV performance. The main objectives are:
1. To be able to identify the top talkers / noisy neighbors in the cluster from a single tool, without having to parse logs from individual systems.
2. Ideally, to be able to set and protect an I/O SLA.

Version of all relevant components (if applicable):
OCP and OCS 4.5

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
No

Steps to Reproduce:
1. Set up OCP and OCS
2.
3.

Actual results:
No PV performance metrics

Expected results:
See PV performance metrics

Additional info:
Collecting the RBD image stats and exposing them in the dashboard should get us started.

The stats collection can be enabled at the pool level in Rook by setting a simple flag, enableRBDStats [1]. This blog describes the metrics in more detail:
https://ceph.io/rbd/new-in-nautilus-rbd-performance-monitoring/

The metrics then need to be collected by Prometheus and displayed in the dashboard.

Depending on the overhead of gathering the metrics, it seems we should enable them by default for new pools in OCS. Pools created through the UI could also expose this option.

[1] https://github.com/rook/rook/blob/master/cluster/examples/kubernetes/ceph/pool.yaml#L31-L33
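To make the "top talkers" objective concrete, here is a minimal Python sketch of how per-image RBD counters could be ranked once they reach Prometheus. It is only an illustration under assumptions: the metric names ceph_rbd_read_ops and ceph_rbd_write_ops are assumed to be what the Ceph mgr prometheus module exports when RBD stats are enabled, and the helper names top_talkers_query and rank_by_iops are hypothetical. The second helper works on two plain counter snapshots so it can be tried without a cluster.

```python
from typing import Dict, List, Tuple

# (pool, image) identifies one RBD image; the label names are an assumption.
ImageKey = Tuple[str, str]


def top_talkers_query(k: int = 5, window: str = "5m") -> str:
    """Build a PromQL expression ranking RBD images by combined read+write IOPS.

    Assumes the counters are exported as ceph_rbd_read_ops / ceph_rbd_write_ops.
    """
    return (
        f"topk({k}, "
        f"rate(ceph_rbd_read_ops[{window}]) + rate(ceph_rbd_write_ops[{window}]))"
    )


def rank_by_iops(
    before: Dict[ImageKey, float],
    after: Dict[ImageKey, float],
    interval_s: float,
    k: int = 5,
) -> List[Tuple[ImageKey, float]]:
    """Rank images by ops/second between two snapshots of a cumulative op counter."""
    rates = []
    for key, end in after.items():
        start = before.get(key)
        # Skip images with no earlier sample and counters that were reset.
        if start is None or end < start:
            continue
        rates.append((key, (end - start) / interval_s))
    # Highest rate first, so the noisiest image leads the list.
    rates.sort(key=lambda item: item[1], reverse=True)
    return rates[:k]
```

The query string could be pasted into the Prometheus UI or used in a dashboard panel, while rank_by_iops shows the same idea computed client-side from two scrapes taken interval_s seconds apart.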
I believe this is targeted for OCS 4.8. Moving the component to monitoring.
(In reply to Travis Nielsen from comment #3)
> Collecting the RBD image stats and exposing them in the dashboard should get
> us started.
> The stats collection can be enabled at the pool level in Rook by setting a
> simple flag enableRBDStats [1].
> This blog describes the metrics in more detail:
> https://ceph.io/rbd/new-in-nautilus-rbd-performance-monitoring/

Based on the above, I'm reopening BZ 1779336 and marking it as a blocker for this bug.