Bug 2223778

Summary: Slow IOPS alerts with Ceph backfill operations hindered by hardware or network, affecting all running workloads
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Craig Wayman <crwayman>
Component: rook
Assignee: Travis Nielsen <tnielsen>
Status: NEW
QA Contact: Neha Berry <nberry>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.11
CC: hklein, linuxkidd, odf-bz-bot, rzarzyns, tnielsen, vumrao
Target Milestone: ---
Flags: tnielsen: needinfo? (crwayman)
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
Embargoed:

Description Craig Wayman 2023-07-18 21:27:55 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

After upgrading to 4.11.41:
- 16 OSDs hosted on 2 nodes started firing Slow IOPS alerts; the root cause (RC) is a Ceph backfill operation hindered by a hardware or network issue [1].
- The RC is supported by the shape of the Slow IOPS situation: the number of PGs to be backfilled per OSD is very low (approx. 30 out of 4300+), the bandwidth throttling is already set to the maximum, and only 4 OSDs out of 224 are reporting slow IOPS (a sketch of the checks behind this assessment follows this list).
- Adding to the RCA, the cluster logs do not contain any suspicious messages.
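
For reference, a minimal sketch of the kind of checks behind this assessment, assuming the usual rook-ceph toolbox pod in the openshift-storage namespace (option names and output can differ slightly between releases):

  # Enter the toolbox
  oc -n openshift-storage rsh $(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)

  # Which OSDs are reporting slow ops, and how many PGs are still backfilling
  ceph health detail
  ceph status
  ceph pg dump pgs_brief | grep -c backfill

  # Current backfill/recovery throttling for one of the affected OSDs
  ceph config show osd.25 osd_max_backfills
  ceph config show osd.25 osd_recovery_max_active

  # Per-OSD commit/apply latency to spot the slow devices
  ceph osd perf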

-- Comments #48 and #50 in the case show that the customer declares any slow I/O operation on any OSD a disaster-level incident that impacts all running workloads. The customer reached this determination based on past experience and has not provided any supporting evidence for the claim.

-- To study the impact described by the customer, the case owner restored the Prometheus DB on a stand-alone Grafana instance [2] and examined the data for the affected OSDs during the last slow IOPS period, on June 12th from 11:54 UTC to 14:52 UTC, which was reported on the following OSDs (an example of the kind of query used for this inspection is sketched after the list):

osd.25  host: iabl22s02
osd.168 host: iabl20s05
osd.150 host: iabq20s04
osd.11  host: iabl21s05
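
A hedged example of such a range query against the restored Prometheus, assuming the metric and label names exposed by the Ceph mgr Prometheus module as shipped with ODF (the endpoint is omitted, as noted in [2]):

  # Average write latency per affected OSD over the slow window
  curl -sG 'http://<prometheus>/api/v1/query_range' \
    --data-urlencode 'query=rate(ceph_osd_op_w_latency_sum{ceph_daemon=~"osd\\.(11|25|150|168)"}[5m]) / rate(ceph_osd_op_w_latency_count{ceph_daemon=~"osd\\.(11|25|150|168)"}[5m])' \
    --data-urlencode 'start=2023-06-12T11:54:00Z' \
    --data-urlencode 'end=2023-06-12T14:52:00Z' \
    --data-urlencode 'step=60s'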

-- By inspecting the Prometheus DB dump, the case owner did not find any resource starvation or any break in node-exporter scraping.
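
Two quick checks of this kind against the restored Prometheus, using standard node-exporter metrics (the job label value and the endpoint are assumptions about this environment):

  # Scrape gaps would show up as 'up' dropping to 0 for the node-exporter targets
  curl -sG 'http://<prometheus>/api/v1/query_range' \
    --data-urlencode 'query=up{job="node-exporter"}' \
    --data-urlencode 'start=2023-06-12T11:54:00Z' --data-urlencode 'end=2023-06-12T14:52:00Z' \
    --data-urlencode 'step=60s'

  # CPU saturation per node over the same window
  curl -sG 'http://<prometheus>/api/v1/query_range' \
    --data-urlencode 'query=1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))' \
    --data-urlencode 'start=2023-06-12T11:54:00Z' --data-urlencode 'end=2023-06-12T14:52:00Z' \
    --data-urlencode 'step=60s'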

-- Found that the CephBlockPool PG count is 1024, which is too low for a pool hosting 120 TB of data (i.e., roughly 90% of the customer's usage).

-- The customer was advised to increase the PG count on that pool (a rough sizing sketch follows).
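
As a rough illustration of why 1024 is low, using the common PGs-per-OSD rule of thumb (replica size 3 and the default ODF block pool name are assumptions; the actual target should come from the PG calculator and the guidance given in the case):

  #   224 OSDs x 100 target PGs per OSD / 3 replicas ~= 7,466 PGs cluster-wide
  #   ~90% of the data in the block pool -> ~6,700 PGs -> nearest power of two = 8192
  # Verify the pool name and current value first, then raise pg_num in steps:
  ceph osd pool ls detail
  ceph osd pool get ocs-storagecluster-cephblockpool pg_num
  ceph osd pool set ocs-storagecluster-cephblockpool pg_num 2048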

-- In November 2022, seven months before this case was opened, a tuning procedure outlined in ODF case 03276122 [1] was shared with the customer. The tuning would have helped lower the impact of the slow IOPS on the customer's workloads, but it was not implemented by the customer.
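
The exact procedure from case 03276122 is not reproduced here; purely as an illustration, tuning of this kind usually means de-prioritizing backfill/recovery relative to client I/O with settings such as the following (values shown are examples only, not the case's recommendation):

  ceph config set osd osd_max_backfills 1
  ceph config set osd osd_recovery_max_active 1
  ceph config set osd osd_recovery_sleep_hdd 0.1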


1. https://access.redhat.com/support/cases/03276122
2. http://<omitted- see case for details>



Version of all relevant components (if applicable):

OCP: v4.11.41

ODF: v4.11.8

Ceph:

{
    "mon": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 224
    },
    "mds": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 2
    },
    "rgw": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 231
    }
}
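
For reference, the version summary above is the JSON produced by the 'ceph versions' command, e.g. run through the toolbox pod (namespace and pod label assumed to be the ODF defaults):

  oc -n openshift-storage rsh $(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name) ceph versions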



Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

  Intermittent slow IOPS and erratic OSD behavior in production.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

4