Bug 2246185

Summary: [Tracker Volsync] [RDR] Request to enable TCP keepalive timeout and lower its value in order to detect broken connection within 15mins
Product: Red Hat OpenShift Data Foundation (Red Hat Storage)
Reporter: Aman Agrawal <amagrawa>
Component: odf-dr
Sub component: ramen
Assignee: Benamar Mekhissi <bmekhiss>
QA Contact: krishnaram Karthick <kramdoss>
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
CC: bmekhiss, muagarwa, prsurve
Version: 4.14
Target Release: ODF 4.14.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2023-11-08 18:55:29 UTC
Bug Blocks: 2239587

Description Aman Agrawal 2023-10-25 17:45:38 UTC
Description of problem (please be as detailed as possible and provide log
snippets): As a temporary fix for https://bugzilla.redhat.com/show_bug.cgi?id=2239587#c19, we should lower the default stunnel TIMEOUTidle from 12 hours to 30 minutes. As of now, a hung PVC connection is only reset after 12 hours, which halts data sync for the affected CephFS workload for that entire interval, which is far too long. The actual issue should be root-caused and fixed as part of the original BZ 2239587.
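For reference, stunnel's idle timeout is a per-service setting in stunnel.conf. The fragment below is a hypothetical illustration of the proposed change (the [rsync] section name, ports, and file layout are assumptions, not the actual VolSync mover configuration; TIMEOUTidle and the socket option syntax are stunnel's own):

```ini
; illustrative stunnel service section; names and ports are assumptions
[rsync]
accept = 8000
connect = 127.0.0.1:873
; reset idle connections after 30 min (stunnel's default is 43200 s = 12 h)
TIMEOUTidle = 1800
; additionally enable TCP keepalive on accepted/connected sockets
socket = a:SO_KEEPALIVE=1
```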


Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-10-18-004928
advanced-cluster-management.v2.9.0-188 
ODF 4.14.0-156
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
Submariner image: brew.registry.redhat.io/rh-osbs/iib:599799
ACM 2.9.0-DOWNSTREAM-2023-10-18-17-59-25
Latency 50ms RTT



Comment 2 Aman Agrawal 2023-10-25 17:48:18 UTC
Proposing this as an ODF 4.14 GA blocker, given the importance of the temporary fix and its impact on the VolSync solution for CephFS-backed workloads.

Comment 3 Mudit Agarwal 2023-10-26 13:48:31 UTC
Is this fix in ramen or volsync? Can we increase this timeout manually (as a workaround) and wait for the fix in 4.14.1?

Comment 4 Aman Agrawal 2023-10-26 16:19:02 UTC
(In reply to Mudit Agarwal from comment #3)
> Is this fix in ramen or volsync? Can we increase this timeout manually (as a
> workaround) and wait for the fix in 4.14.1?

Benamar can help us answer this, but we need this fix in 4.14 for cephfs GA.

Comment 5 Benamar Mekhissi 2023-10-30 13:18:09 UTC
We have asked the VolSync team to include the timeout in their final 0.8 release.

Comment 8 Benamar Mekhissi 2023-10-31 14:51:48 UTC
PR here: https://github.com/backube/volsync/pull/967
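The linked PR is about enabling TCP keepalive on the mover's connections. Conceptually, keepalive declares a silent peer dead after keepidle + keepintvl × keepcnt seconds of unanswered probes, so the three tunables must sum to the desired detection budget (here, 15 minutes). A minimal Python sketch, with illustrative values that are assumptions and not the ones VolSync actually ships:

```python
import socket

# Worst-case detection time for a dead peer:
#   TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT seconds.
# Hypothetical values chosen so detection fits in 15 minutes (900 s).
KEEPIDLE = 600    # seconds of idle before the first probe
KEEPINTVL = 30    # seconds between probes
KEEPCNT = 10      # unanswered probes before the connection is reset

def enable_keepalive(sock: socket.socket) -> int:
    """Enable TCP keepalive on sock; return the worst-case detection time in seconds."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tunables are Linux-specific; guard for portability.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, KEEPIDLE)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, KEEPINTVL)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, KEEPCNT)
    return KEEPIDLE + KEEPINTVL * KEEPCNT

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
detection_time = enable_keepalive(s)
s.close()
```

With these values a broken connection is detected after at most 600 + 30 × 10 = 900 seconds, i.e. exactly the 15-minute window requested in the summary.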

Comment 12 errata-xmlrpc 2023-11-08 18:55:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832