Description of problem (please be as detailed as possible and provide log snippets):

After upgrading to 4.11.41, 16 OSDs hosted on 2 nodes started firing slow-IOPS alerts. The root cause (RC) is the Ceph backfilling operation being hindered by a hardware or network issue[1].

- The RC is supported by the observed slow-IOPS situation: the number of PGs to be backfilled per OSD is very low (approx. 30 out of 4300+), the backfill bandwidth throttling is set to the maximum, and only 4 OSDs out of 224 are reporting slow IOPS.
- Adding to the RCA, the cluster logs do not contain any suspicious messages.
- Comments #48 and #50 in the case show the customer declaring any slow I/O operation on any OSD a disaster-level incident that impacts all running workloads. The customer came to this determination from past experience and has not provided any supporting evidence for the claim.
- To study the impact described by the customer, the case owner restored the Prometheus DB on a stand-alone Grafana instance[2] and examined the data for the impacted OSDs during the last slow-IOPS period, on June 12th from 11:54 UTC to 14:52 UTC, which was reported on the following OSDs:
  - osd.25  host: iabl22s02
  - osd.168 host: iabl20s05
  - osd.150 host: iabq20s04
  - osd.11  host: iabl21s05
- Inspecting the Prometheus DB dump, the case owner did not find any resource starvation or any break in the node-exporter scraping (see the Prometheus query sketch at the end of this report).
- Found that the CephBlockPool PG count is 1024, which is too low for a pool hosting 120 TB of data (i.e. 90% of the customer's usage).
- The customer is advised to increase the PG count on the pool (see the PG sizing sketch at the end of this report).
- In November 2022, 7 months before the case was opened, a tuning procedure outlined in ODF case 03276122[1] was shared with the customer. The tuning would have helped lower the impact of the slow IOPS on the customer's workloads, but it was not implemented by the customer.

1. https://access.redhat.com/support/cases/03276122
2. http://<omitted- see case for details>

Version of all relevant components (if applicable):

OCP: v4.11.41
ODF: v4.11.8
Ceph:
{
    "mon": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 224
    },
    "mds": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 2
    },
    "rgw": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 231
    }
}

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Production impact: intermittent slow IOPS and erratic OSD behavior.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

4
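
Prometheus query sketch referenced above. This is a minimal, hedged illustration of the scrape-continuity check the case owner performed: it queries the restored Prometheus DB for the node-exporter "up" series during the June 12th slow-IOPS window and flags missing or down scrapes. The endpoint URL, the job="node-exporter" label, and the year (inferred from the case timeline) are assumptions; substitute the values from the stand-alone instance in the case.

import requests
from datetime import datetime, timezone

PROM_URL = "http://prometheus.example.local:9090"   # hypothetical endpoint for the restored DB
STEP = 30                                            # assumed scrape interval in seconds

# Slow-IOPS window from the case notes; year assumed from the case timeline.
start = datetime(2023, 6, 12, 11, 54, tzinfo=timezone.utc).timestamp()
end = datetime(2023, 6, 12, 14, 52, tzinfo=timezone.utc).timestamp()

resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={
        "query": 'up{job="node-exporter"}',   # job label is an assumption
        "start": start,
        "end": end,
        "step": STEP,
    },
    timeout=30,
)
resp.raise_for_status()

expected = int((end - start) // STEP) + 1
for series in resp.json()["data"]["result"]:
    instance = series["metric"].get("instance", "unknown")
    values = series["values"]
    zeros = sum(1 for _, v in values if float(v) == 0)   # scrapes that reported the target as down
    missing = expected - len(values)                     # samples absent from the range entirely
    print(f"{instance}: {missing} missing samples, {zeros} scrapes reporting down")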
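
PG sizing sketch referenced above. This is a rough back-of-the-envelope check of the advice to raise the PG count, using the commonly cited rule of thumb (OSDs x 100 / replica size, weighted by the pool's share of the data and rounded up to a power of two). The replica size of 3 is an assumption (typical ODF replicated pool); the OSD count and ~90% data share come from the case notes. It is not the pg_autoscaler's exact calculation.

def next_power_of_two(n: int) -> int:
    # Smallest power of two >= n.
    p = 1
    while p < n:
        p *= 2
    return p

osd_count = 224          # from the "ceph versions" output above
replica_size = 3         # assumption: default replicated pool size
pool_data_share = 0.90   # CephBlockPool holds ~90% of the data per the case notes

raw_target = osd_count * 100 / replica_size                      # ~7467 PGs cluster-wide
pool_target = next_power_of_two(round(raw_target * pool_data_share))

print(f"cluster-wide PG budget : {raw_target:.0f}")
print(f"suggested pool pg_num  : {pool_target}")                 # 8192 with these inputs, vs. the current 1024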