Description of problem (please be as detailed as possible and provide log snippets):
One of the nodes had a network issue - the link kept going up and down (an image will be attached).
This resulted in the following degradation of storage health:
1. Continual restarts - between 30 and 60 - of the OSDs on this server, and long heartbeat ping times
2. Continual "Rebuilding data resiliency" status
3. Multiple errors of "ocs-operator-7dcf7b7c77-79jc8 Readiness probe error: HTTP probe failed with statuscode: 500 body: [-]readyz failed: reason withheld healthz check failed"
4. Multiple errors of "Disk device /dev/sdb not responding, on host {hostname}"
Version of all relevant components (if applicable):
OCP 4.8, CNV 4.8
Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes, mass migrations fail with timeouts.
Is there any workaround available to the best of your knowledge?
No
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
4
Is this issue reproducible?
yes
Can this issue be reproduced from the UI?
no
Steps to Reproduce:
1. Deploy OCP 4.8 RC3 with CNV 4.8
2. Deploy local storage on 12 nodes with 3 disks each
3. Deploy OCS on top of it
4. Continually restart the network on one node (see the sketch after this list)
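One possible way to drive step 4 - a minimal sketch only, assuming root access on the affected node; the interface name, flap count, and timings are placeholders, not the values used in this run:

#!/usr/bin/env python3
"""Flap a NIC to simulate an unstable link (sketch only).
Assumptions: run as root on the target node; IFACE is a placeholder."""
import subprocess
import time

IFACE = "ens3"        # placeholder interface name - replace with the node's NIC
FLAP_COUNT = 20       # arbitrary number of down/up cycles
DOWN_SECONDS = 10     # arbitrary time to keep the link down
UP_SECONDS = 30       # arbitrary time to keep the link up between flaps

for _ in range(FLAP_COUNT):
    # Take the link down, wait, then bring it back up.
    subprocess.run(["ip", "link", "set", IFACE, "down"], check=True)
    time.sleep(DOWN_SECONDS)
    subprocess.run(["ip", "link", "set", IFACE, "up"], check=True)
    time.sleep(UP_SECONDS)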
Actual results:
Storage health is degraded
Expected results:
All OSDs on the affected node should stop in a failed state after several heartbeat failures, rather than restarting continually.
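For reference, the Ceph heartbeat/grace settings that should cause unresponsive OSDs to be marked down can be inspected from the Ceph toolbox. A minimal sketch, assuming the rook-ceph-tools toolbox deployment is enabled in the openshift-storage namespace (names may vary per cluster):

#!/usr/bin/env python3
"""Check the Ceph settings that govern marking unresponsive OSDs down (sketch).
Assumption: a rook-ceph-tools toolbox deployment exists in openshift-storage."""
import subprocess

def ceph(*args):
    # Run a ceph command inside the toolbox deployment and return its stdout.
    cmd = ["oc", "-n", "openshift-storage", "rsh", "deploy/rook-ceph-tools", "ceph", *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()

print("osd_heartbeat_grace:", ceph("config", "get", "osd", "osd_heartbeat_grace"))
print("mon_osd_down_out_interval:", ceph("config", "get", "mon", "mon_osd_down_out_interval"))
print(ceph("osd", "tree"))  # shows whether any OSDs were actually marked down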
Additional info:
Will be attached