Bug 1989866

Summary: Storage performance degradation after network failure
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: guy chen <guchen>
Component: ceph
Sub Component: RADOS
Assignee: Neha Ojha <nojha>
QA Contact: Elad <ebenahar>
Status: CLOSED DEFERRED
Severity: high
Priority: unspecified
CC: alitke, bniver, fdeutsch, gmeno, madam, muagarwa, nberry, nojha, ocs-bugs, odf-bz-bot, owasserm, pdhange, pdhiran, pnataraj, sostapov, ssonigra, vumrao, ycui
Version: 4.8
Keywords: AutomationBackLog
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-10-26 03:12:07 UTC
Type: Bug

Description guy chen 2021-08-04 08:47:17 UTC
Description of problem (please be as detailed as possible and provide log snippets):

One of the nodes had a network issue: the link went up and down repeatedly (an image will be attached).
This resulted in the following degradation of storage health:
1. Continual restarts (between 30 and 60) of the OSDs on this server, and long heartbeat ping times
2. Continual "Rebuilding data resiliency" activity
3. Multiple errors of "ocs-operator-7dcf7b7c77-79jc8 Readiness probe error: HTTP probe failed with statuscode: 500 body: [-]readyz failed: reason withheld healthz check failed"
4. Multiple errors of "Disk device /dev/sdb not responding, on host {hostname}"
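
For reference, a minimal sketch of how these symptoms can be watched while the link is flapping, assuming the rook-ceph-tools deployment is enabled in the openshift-storage namespace and `oc` is already logged in to the affected cluster (pod and namespace names are typical ODF defaults, not values taken from this bug; the JSON field layout can differ between Ceph releases):

#!/usr/bin/env python3
# Minimal sketch: poll cluster health and the up/total OSD counts while the
# network link flaps. Assumes the rook-ceph-tools deployment exists in
# openshift-storage (not enabled by default) and `oc` is logged in.
import json
import subprocess
import time

TOOLS = ["oc", "-n", "openshift-storage", "exec", "deploy/rook-ceph-tools", "--"]

def ceph_json(*args):
    """Run a ceph command inside the tools pod and return its JSON output."""
    out = subprocess.check_output(TOOLS + ["ceph", *args, "--format", "json"])
    return json.loads(out)

while True:
    status = ceph_json("status")
    osdmap = status["osdmap"]  # field layout can vary across Ceph releases
    print(time.strftime("%H:%M:%S"),
          "health:", status["health"]["status"],
          "osds up:", osdmap.get("num_up_osds"), "/", osdmap.get("num_osds"))
    time.sleep(30)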


Version of all relevant components (if applicable):
OCP 4.8, CNV 4.8

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, mass migrations fail with timeouts

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
4

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
No


Steps to Reproduce:
1. Deploy OCP 4.8 RC3 with CNV 4.8
2. Deploy local storage on 12 nodes with 3 disks each
3. Deploy OCS on top of it
4. Continually restart the network on one node (a sketch of this step follows the list)
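
A minimal sketch of step 4, assuming the flap is driven directly from the affected node (for example from a root shell opened with `oc debug node/<name>`); the interface name and timings below are hypothetical placeholders, not values recorded in this bug:

#!/usr/bin/env python3
# Minimal sketch of step 4: repeatedly take the storage network link down and
# bring it back up on one node. Run as root on the target node; the interface
# name and sleep intervals are hypothetical placeholders.
import subprocess
import time

IFACE = "ens3f0"     # hypothetical interface carrying the storage network
DOWN_SECONDS = 20    # long enough for peers to miss several OSD heartbeats
UP_SECONDS = 60      # let peering start to recover before the next flap

while True:
    subprocess.run(["ip", "link", "set", IFACE, "down"], check=True)
    time.sleep(DOWN_SECONDS)
    subprocess.run(["ip", "link", "set", IFACE, "up"], check=True)
    time.sleep(UP_SECONDS)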


Actual results:
Storage health is degraded

Expected results:
All OSDs on the affected node should stop in a failed state after several heartbeat failures, instead of flapping continually
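
Whether OSDs on the flapping node are reported down (and eventually marked out) is governed by the Ceph heartbeat and down-out settings; a minimal sketch for inspecting them, under the same rook-ceph-tools assumptions as above:

#!/usr/bin/env python3
# Minimal sketch: print the Ceph settings that control when a flapping OSD is
# reported down and later marked out. Same assumption as earlier: the
# rook-ceph-tools deployment is available in openshift-storage.
import subprocess

TOOLS = ["oc", "-n", "openshift-storage", "exec", "deploy/rook-ceph-tools", "--"]

SETTINGS = [
    ("osd", "osd_heartbeat_grace"),        # seconds of missed pings before peers report an OSD down
    ("mon", "mon_osd_down_out_interval"),  # seconds an OSD may stay down before it is marked out
    ("mon", "mon_osd_min_down_reporters"), # number of peers that must report the failure
]

for who, option in SETTINGS:
    value = subprocess.check_output(TOOLS + ["ceph", "config", "get", who, option])
    print(f"{option} = {value.decode().strip()}")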

Additional info:
Will be attached

Comment 21 Mudit Agarwal 2022-10-26 03:12:07 UTC
This work is being tracked by a Jira epic, closing the BZ.