Bug 2223778 - Slow IOPs Alerts with Ceph back-filling Operations Hindered by Hardware or Network Affecting All Running Workloads [NEEDINFO]
Summary: Slow IOPs Alerts with Ceph back-filling Operations Hindered by Hardware or Network Affecting All Running Workloads
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Travis Nielsen
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-07-18 21:27 UTC by Craig Wayman
Modified: 2023-08-15 15:22 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
tnielsen: needinfo? (crwayman)



Description Craig Wayman 2023-07-18 21:27:55 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

After upgrading to 4.11.41:
- 16 OSDs hosted on 2 nodes started firing Slow IOPs alerts. The root cause (RC) is a Ceph back-filling operation hindered by a hardware or network issue[1].
- The RC is supported by the shape of the Slow IOPs situation: the number of PGs to be backfilled per OSD is very low (approx. 30 out of 4300+), backfill bandwidth throttling is set to the maximum, and only 4 OSDs out of 224 are reporting slow IOPs.
- Adding to the RCA, the cluster logs do not contain any suspicious messages.
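
For reference, a minimal sketch of commands that can confirm the picture above (slow-ops reporters, backfill state, throttle settings), assuming the rook-ceph-tools deployment exists in the openshift-storage namespace (the ODF default); the exact commands run during the RCA are not recorded in this report:

# Enter the Ceph toolbox (deployment name/namespace are assumed ODF defaults)
oc -n openshift-storage rsh deploy/rook-ceph-tools

# Which OSDs are reporting slow ops, and what is backfilling
ceph health detail
ceph status

# Backfill/recovery throttle settings currently in effect
ceph config get osd osd_max_backfills
ceph config get osd osd_recovery_max_active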

-- Comments #48 and #50 in the case show that the customer declares any slow I/O operation on any OSD a disaster-level incident that impacts all running workloads. The customer reached this conclusion based on past experience and has not provided any supporting evidence for that claim.

-- In an effort to study the impact described by the customer, the case owner restored the Prometheus DB on a stand-alone Grafana instance[2] and examined the data for the impacted OSDs during the last slow IOPs period, June 12th from 11:54 UTC to 14:52 UTC, which was reported on the following OSDs:

osd.25  host: iabl22s02
osd.168 host: iabl20s05
osd.150 host: iabq20s04
osd.11  host: iabl21s05

-- By inspecting the Prometheus DB dump, the case owner did not find any resource starvation or any break in the node-exporter scraping.
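
For reference, a minimal sketch of how such checks can be run against the HTTP API of the restored Prometheus instance. The <prometheus-host> placeholder and the job="node-exporter" label are assumptions; the actual queries used for the RCA are not recorded here:

# Scrape continuity: up == 1 for every sample in the June 12th window means
# node-exporter was scraped without gaps.
curl -sG 'http://<prometheus-host>:9090/api/v1/query_range' \
  --data-urlencode 'query=up{job="node-exporter"}' \
  --data-urlencode 'start=2023-06-12T11:54:00Z' \
  --data-urlencode 'end=2023-06-12T14:52:00Z' \
  --data-urlencode 'step=30s'

# CPU saturation per node (non-idle CPU fraction) over the same window
curl -sG 'http://<prometheus-host>:9090/api/v1/query_range' \
  --data-urlencode 'query=1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))' \
  --data-urlencode 'start=2023-06-12T11:54:00Z' \
  --data-urlencode 'end=2023-06-12T14:52:00Z' \
  --data-urlencode 'step=60s'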

-- Found that the CephBlockPool PG count is 1024, which is too low for a pool hosting 120 TB of data (i.e. ~90% of the customer's usage).

-- The customer is advised to increase the PG count on their pool 
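
For reference, a minimal sketch of inspecting and raising the PG count from the toolbox. The pool name below is the assumed ODF default block pool name and the target pg_num is illustrative only; note that Rook/the PG autoscaler may manage this value, so the change should be coordinated accordingly:

# Confirm the current PG count and what the autoscaler would recommend
ceph osd pool ls
ceph osd pool get ocs-storagecluster-cephblockpool pg_num
ceph osd pool autoscale-status

# Raise pg_num (pgp_num follows automatically on Pacific); value is illustrative
ceph osd pool set ocs-storagecluster-cephblockpool pg_num 4096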

-- In November 2022, 7 months before this case was opened, a tuning procedure outlined in ODF case 03276122[1] was shared with the customer. The tuning would have helped lower the impact of the Slow IOPs on the customer's workloads, but it was not implemented by the customer.
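
The exact procedure from case 03276122 is not reproduced here; the following is only a generic sketch of the kind of recovery/backfill throttling typically used to limit the client impact of backfill, with illustrative values:

# Generic backfill/recovery throttling (illustrative values only)
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_sleep_hdd 0.1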


1. https://access.redhat.com/support/cases/03276122
2. http://<omitted- see case for details>



Version of all relevant components (if applicable):

OCP: v4.11.41

ODF: v4.11.8

Ceph (ceph versions output):

{
    "mon": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 224
    },
    "mds": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 2
    },
    "rgw": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 231
    }
}



Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

  Intermittent slow IOPs and erratic OSD behavior in production.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

4

