Bug 1989866 - Storage performance degradation after network failure
Summary: Storage performance degradation after network failure
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Neha Ojha
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-04 08:47 UTC by guy chen
Modified: 2023-08-09 16:37 UTC
CC List: 18 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-26 03:12:07 UTC
Embargoed:


Attachments

Description guy chen 2021-08-04 08:47:17 UTC
Description of problem (please be as detailed as possible and provide log snippets):

One of the nodes had a network issue: the link went up and down (an image will be attached).
This resulted in the following degradation of storage health:
1. Continual restarts (between 30 and 60) of the OSDs on this server, and long heartbeat ping times (a rough polling sketch to quantify the flapping is included after this list)
2. Continual "Rebuilding data resiliency"
3. Multiple errors of "ocs-operator-7dcf7b7c77-79jc8 Readiness probe error: HTTP probe failed with statuscode: 500 body: [-]readyz failed: reason withheld healthz check failed"
4. Multiple errors of "Disk device /dev/sdb not responding, on host {hostname}"
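
A minimal polling sketch (Python, not from the original report) that logs which OSDs go down and come back up, to quantify the flapping above. It assumes the ceph CLI is reachable through the ODF tools pod; the namespace and deployment name below are the usual ODF defaults and may need adjusting:

#!/usr/bin/env python3
# Sketch only: record OSD up/down transitions every 30 seconds.
# Assumes the usual ODF tools-pod deployment; adjust CEPH if your access differs.
import json
import subprocess
import time

CEPH = ["oc", "-n", "openshift-storage", "rsh", "deploy/rook-ceph-tools", "ceph"]

def up_osds():
    """Return the set of OSD ids currently reported as up."""
    dump = json.loads(subprocess.check_output(CEPH + ["osd", "dump", "--format", "json"]))
    return {o["osd"] for o in dump["osds"] if o["up"] == 1}

prev = up_osds()
while True:
    time.sleep(30)
    cur = up_osds()
    if cur != prev:
        print(f"went down: {sorted(prev - cur)}  came back: {sorted(cur - prev)}")
    prev = cur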


Version of all relevant components (if applicable):
OCP 4.8, CNV 4.8

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, mass migrations fail with timeouts

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
4

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
No


Steps to Reproduce:
1. Deploy OCP 4.8 RC3 with CNV 4.8
2. Deploy local storage on 12 nodes with 3 disks each
3. Deploy OCS on top of it
4. Continually restart the network on one node (a rough flap sketch is included below)
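
A minimal sketch of step 4 (Python; not necessarily the exact method used in this report): flap the node's link by toggling the interface from the affected node. The interface name and flap timings are placeholders.

#!/usr/bin/env python3
# Sketch only: toggle a (placeholder) network interface down and up repeatedly.
# Run as root on the affected node; IFACE and the sleep ranges are assumptions.
import random
import subprocess
import time

IFACE = "ens3"   # hypothetical interface name -- replace with the real one
FLAPS = 30       # number of down/up cycles

for i in range(FLAPS):
    subprocess.run(["ip", "link", "set", IFACE, "down"], check=True)
    time.sleep(random.uniform(5, 30))     # keep the link down for a while
    subprocess.run(["ip", "link", "set", IFACE, "up"], check=True)
    time.sleep(random.uniform(30, 120))   # let it recover before the next flap
    print(f"flap {i + 1}/{FLAPS} complete")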


Actual results:
Storage health is degraded

Expected results:
All OSDs should stop in a failed state after several heartbeat failures
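
For context (not stated in the report), in stock Ceph the down/out handling of an unresponsive OSD is governed by options such as osd_heartbeat_grace, mon_osd_min_down_reporters and mon_osd_down_out_interval. A quick sketch to print the effective values, assuming the same tools-pod access as in the sketch above:

#!/usr/bin/env python3
# Sketch only: show the Ceph options that decide when a flapping OSD is
# marked down and then out.
import subprocess

CEPH = ["oc", "-n", "openshift-storage", "rsh", "deploy/rook-ceph-tools", "ceph"]

for who, opt in [
    ("osd", "osd_heartbeat_grace"),        # missed-heartbeat grace period (seconds)
    ("mon", "mon_osd_min_down_reporters"), # peers that must report an OSD down
    ("mon", "mon_osd_down_out_interval"),  # delay before a down OSD is marked out (seconds)
]:
    val = subprocess.check_output(CEPH + ["config", "get", who, opt], text=True).strip()
    print(f"{who} {opt} = {val}")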

Additional info:
Will be attached

Comment 21 Mudit Agarwal 2022-10-26 03:12:07 UTC
This work is being tracked by a Jira epic; closing the BZ.

