Bug 2246185

Summary: [Tracker Volsync] [RDR] Request to enable TCP keepalive timeout and lower its value in order to detect broken connection within 15mins
Product: Red Hat OpenShift Data Foundation (Red Hat Storage)
Reporter: Aman Agrawal <amagrawa>
Component: odf-dr
Sub component: ramen
Assignee: Benamar Mekhissi <bmekhiss>
QA Contact: krishnaram Karthick <kramdoss>
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
CC: bmekhiss, muagarwa, prsurve
Version: 4.14
Target Release: ODF 4.14.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2023-11-08 18:55:29 UTC
Bug Blocks: 2239587

Description Aman Agrawal 2023-10-25 17:45:38 UTC
Description of problem (please be as detailed as possible and provide log
snippets): As a temporary fix for https://bugzilla.redhat.com/show_bug.cgi?id=2239587#c19, we should lower the default stunnel TIMEOUTidle from 12 hours to 30 minutes. As of now, a hung PVC connection is only reset after 12 hours, which halts data sync for the affected CephFS workload for that entire interval, which is far too long. The actual issue should be root-caused and fixed as part of the original BZ 2239587.
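For reference, stunnel's idle timeout is a per-service setting in stunnel.conf. The fragment below is a hypothetical illustration of the proposed change (the [rsync] section name, ports, and file layout are assumptions, not the actual VolSync mover configuration; TIMEOUTidle and the socket option syntax are stunnel's own):

```ini
; illustrative stunnel service section; names and ports are assumptions
[rsync]
accept = 8000
connect = 127.0.0.1:873
; reset idle connections after 30 min (stunnel's default is 43200 s = 12 h)
TIMEOUTidle = 1800
; additionally enable TCP keepalive on accepted/connected sockets
socket = a:SO_KEEPALIVE=1
```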


Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-10-18-004928
advanced-cluster-management.v2.9.0-188 
ODF 4.14.0-156
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
Submariner image: brew.registry.redhat.io/rh-osbs/iib:599799
ACM 2.9.0-DOWNSTREAM-2023-10-18-17-59-25
Latency 50ms RTT



Comment 2 Aman Agrawal 2023-10-25 17:48:18 UTC
Proposing this as an ODF 4.14 GA blocker, given the importance of the temporary fix and its impact on the VolSync solution for CephFS-backed workloads.

Comment 3 Mudit Agarwal 2023-10-26 13:48:31 UTC
Is this fix in ramen or volsync? Can we increase this timeout manually (as a workaround) and wait for the fix in 4.14.1?

Comment 4 Aman Agrawal 2023-10-26 16:19:02 UTC
(In reply to Mudit Agarwal from comment #3)
> Is this fix in ramen or volsync? Can we increase this timeout manually (as a
> workaround) and wait for the fix in 4.14.1?

Benamar can help us answer this, but we need this fix in 4.14 for cephfs GA.

Comment 5 Benamar Mekhissi 2023-10-30 13:18:09 UTC
We have asked the VolSync team to include the timeout in their final 0.8 release.

Comment 8 Benamar Mekhissi 2023-10-31 14:51:48 UTC
PR here: https://github.com/backube/volsync/pull/967
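The linked PR is about enabling TCP keepalive on the mover's connections. Conceptually, keepalive declares a silent peer dead after keepidle + keepintvl × keepcnt seconds of unanswered probes, so the three tunables must sum to the desired detection budget (here, 15 minutes). A minimal Python sketch, with illustrative values that are assumptions and not the ones VolSync actually ships:

```python
import socket

# Worst-case detection time for a dead peer:
#   TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT seconds.
# Hypothetical values chosen so detection fits in 15 minutes (900 s).
KEEPIDLE = 600    # seconds of idle before the first probe
KEEPINTVL = 30    # seconds between probes
KEEPCNT = 10      # unanswered probes before the connection is reset

def enable_keepalive(sock: socket.socket) -> int:
    """Enable TCP keepalive on sock; return the worst-case detection time in seconds."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tunables are Linux-specific; guard for portability.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, KEEPIDLE)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, KEEPINTVL)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, KEEPCNT)
    return KEEPIDLE + KEEPINTVL * KEEPCNT

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
detection_time = enable_keepalive(s)
s.close()
```

With these values a broken connection is detected after at most 600 + 30 × 10 = 900 seconds, i.e. exactly the 15-minute window requested in the summary.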

Comment 12 errata-xmlrpc 2023-11-08 18:55:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832