Bug 2166354 - [RDR][CEPHFS][Tracker] sync/replication is getting stopped for some pvc rsync: connection unexpectedly closed
Summary: [RDR][CEPHFS][Tracker] sync/replication is getting stopped for some pvc rsync...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.14.0
Assignee: Vishal Thapar
QA Contact: Pratik Surve
URL:
Whiteboard:
Depends On:
Blocks: 2244409
 
Reported: 2023-02-01 14:50 UTC by Pratik Surve
Modified: 2024-03-08 04:25 UTC
CC List: 11 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-11-08 18:49:53 UTC
Embargoed:
vthapar: needinfo-
vthapar: needinfo-


Attachments: None


Links
Red Hat Product Errata RHSA-2023:6832 (last updated 2023-11-08 18:50:51 UTC)

Description Pratik Surve 2023-02-01 14:50:04 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
[RDR][CEPHFS] sync/replication is getting stopped for some pvc rsync: connection unexpectedly closed

Version of all relevant components (if applicable):

OCP version:- 4.12.0-0.nightly-2023-01-19-110743
ODF version:- 4.12.0-167
CEPH version:- ceph version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)
ACM version:- v2.7.0
SUBMARINER version:- v0.14.1
VOLSYNC version:- volsync-product.v0.6.0

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an RDR cluster
2. Create a CephFS workload
3. Check ReplicationSource status after 4-5 days (a sketch for spotting stale syncs follows below)
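
A minimal sketch for spotting stale syncs from the CLI, assuming jq and GNU date are available and that each ReplicationSource reports status.lastSyncTime (as the LAST SYNC column in the output below suggests); the 30-minute threshold is arbitrary:

$ CUTOFF=$(date -u -d '-30 minutes' +%Y-%m-%dT%H:%M:%SZ)
$ oc get replicationsources -A -o json | \
    jq -r --arg cutoff "$CUTOFF" \
      '.items[] | select(.status.lastSyncTime != null and .status.lastSyncTime < $cutoff)
       | "\(.metadata.namespace)/\(.metadata.name) last synced \(.status.lastSyncTime)"'

RFC 3339 timestamps in UTC sort lexicographically, so the plain string comparison is enough here.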


Actual results:
oc get replicationsources --all-namespaces

NAMESPACE             NAME             SOURCE           LAST SYNC              DURATION          NEXT SYNC
busybox-workloads-1   busybox-pvc-1    busybox-pvc-1    2023-01-31T18:29:42Z   4m42.43785306s    2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-10   busybox-pvc-10   2023-01-31T18:29:44Z   4m44.270959207s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-11   busybox-pvc-11   2023-01-31T18:29:39Z   4m39.561215075s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-12   busybox-pvc-12   2023-01-31T18:29:39Z   4m39.505161496s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-13   busybox-pvc-13   2023-01-31T18:29:39Z   4m39.392102602s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-14   busybox-pvc-14   2023-01-29T09:23:46Z   3m46.2123443s     2023-01-29T09:25:00Z
busybox-workloads-1   busybox-pvc-15   busybox-pvc-15   2023-01-31T18:29:38Z   4m38.684680219s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-16   busybox-pvc-16   2023-01-31T18:29:44Z   4m44.940480011s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-17   busybox-pvc-17   2023-01-31T18:29:46Z   4m46.939628275s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-18   busybox-pvc-18   2023-01-31T18:29:49Z   4m49.386471097s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-19   busybox-pvc-19   2023-01-31T18:29:46Z   4m46.9576977s     2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-2    busybox-pvc-2    2023-01-31T18:29:44Z   4m44.210365128s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-20   busybox-pvc-20   2023-01-31T18:29:20Z   4m20.176716233s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-3    busybox-pvc-3    2023-01-31T18:29:44Z   4m44.252856471s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-4    busybox-pvc-4    2023-01-31T18:29:42Z   4m42.463537101s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-5    busybox-pvc-5    2023-01-31T18:29:39Z   4m39.544973999s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-6    busybox-pvc-6    2023-01-31T18:29:41Z   4m41.734123311s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-7    busybox-pvc-7    2023-01-31T18:29:24Z   4m24.746735678s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-8    busybox-pvc-8    2023-01-31T18:29:23Z   4m23.763610806s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-9    busybox-pvc-9    2023-01-31T18:29:24Z   4m24.778197239s   2023-01-31T18:30:00Z
busybox-workloads-1   mysql-pv-claim   mysql-pv-claim   2023-01-31T18:25:49Z   5m49.531404022s   2023-01-31T18:30:00Z



$oc logs volsync-rsync-src-busybox-pvc-14-xjxj6                                                    
VolSync rsync container version: ACM-0.6.0-ce9a280
Syncing data to volsync-rsync-dst-busybox-pvc-14.busybox-workloads-1.svc.clusterset.local:22 ...
ssh: Could not resolve hostname volsync-rsync-dst-busybox-pvc-14.busybox-workloads-1.svc.clusterset.local: Name or service not known
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.3]
Syncronization failed. Retrying in 2 seconds. Retry 1/5.
ssh: Could not resolve hostname volsync-rsync-dst-busybox-pvc-14.busybox-workloads-1.svc.clusterset.local: Name or service not known
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.3]
Syncronization failed. Retrying in 4 seconds. Retry 2/5.
ssh: Could not resolve hostname volsync-rsync-dst-busybox-pvc-14.busybox-workloads-1.svc.clusterset.local: Name or service not known
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.3]
Syncronization failed. Retrying in 8 seconds. Retry 3/5.
ssh: Could not resolve hostname volsync-rsync-dst-busybox-pvc-14.busybox-workloads-1.svc.clusterset.local: Name or service not known
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.3]
Syncronization failed. Retrying in 16 seconds. Retry 4/5.


Replication has stopped for busybox-pvc-14 even though its destination pod is still running:

volsync-rsync-dst-busybox-pvc-14-pz5jk   1/1       Running   0          2d9h
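
One way to confirm the DNS failure independently of the mover pod is to run the same lookup from a throwaway pod in the workload namespace (a sketch; the pod name dns-check is arbitrary and any image that ships nslookup, e.g. busybox, will do):

$ oc run dns-check -n busybox-workloads-1 --rm -it --restart=Never --image=busybox:1.36 -- \
    nslookup volsync-rsync-dst-busybox-pvc-14.busybox-workloads-1.svc.clusterset.local

If the lookup fails here too, the problem sits in Lighthouse/CoreDNS resolution of the clusterset.local name rather than in the VolSync mover itself.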



Expected results:


Additional info:

Comment 11 Vishal Thapar 2023-02-08 06:16:05 UTC
The issue was root-caused to be the same as https://github.com/submariner-io/lighthouse/pull/964

This is the list of ServiceImports on the broker:

kubectl get serviceimports -n submariner-broker |grep vmware-dccp-one
volsync-rsync-dst-dd-io-pvc-1-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.223.46"]    15d
volsync-rsync-dst-dd-io-pvc-1-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.159.0"]     6d16h
volsync-rsync-dst-dd-io-pvc-2-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.193.46"]    15d
volsync-rsync-dst-dd-io-pvc-2-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.221.179"]   6d16h
volsync-rsync-dst-dd-io-pvc-3-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.234.101"]   15d
volsync-rsync-dst-dd-io-pvc-3-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.214.203"]   6d16h
volsync-rsync-dst-dd-io-pvc-4-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.129.69"]    15d
volsync-rsync-dst-dd-io-pvc-4-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.117.108"]   6d16h
volsync-rsync-dst-dd-io-pvc-5-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.169.148"]   15d
volsync-rsync-dst-dd-io-pvc-5-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.18.31"]     6d16h
volsync-rsync-dst-dd-io-pvc-6-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.199.117"]   15d
volsync-rsync-dst-dd-io-pvc-6-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.81.47"]     6d16h
volsync-rsync-dst-dd-io-pvc-7-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.58.224"]    15d
volsync-rsync-dst-dd-io-pvc-7-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.244.154"]   6d16h


This is the list of EndpointSlices:

kubectl get endpointslices -n submariner-broker |grep vmware-dccp-one
volsync-rsync-dst-dd-io-pvc-1-vmware-dccp-one   IPv4          8022      10.131.1.64    15d
volsync-rsync-dst-dd-io-pvc-2-vmware-dccp-one   IPv4          8022      10.131.1.45    15d
volsync-rsync-dst-dd-io-pvc-3-vmware-dccp-one   IPv4          8022      10.131.1.70    15d
volsync-rsync-dst-dd-io-pvc-4-vmware-dccp-one   IPv4          8022      10.128.3.194   15d
volsync-rsync-dst-dd-io-pvc-5-vmware-dccp-one   IPv4          8022      10.131.1.48    15d
volsync-rsync-dst-dd-io-pvc-6-vmware-dccp-one   IPv4          8022      10.129.2.138   15d
volsync-rsync-dst-dd-io-pvc-7-vmware-dccp-one   IPv4          8022      10.131.1.49    15d

This causes the EndpointSlice information on the dst cluster to flip. In Lighthouse CoreDNS we also use the namespace information when replying to queries, so depending on which EndpointSlice is currently synced from the broker, queries can fail.
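
One way to observe the flip directly (a sketch, reusing one of the broker objects listed above) is to watch the colliding EndpointSlice and see whether its endpoint address alternates between the pods backing busybox-workloads-2 and busybox-workloads-3:

$ kubectl get endpointslice volsync-rsync-dst-dd-io-pvc-1-vmware-dccp-one \
    -n submariner-broker -o yaml -w

Each update printed by the watch shows which source namespace currently "owns" the slice on the broker.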

I am not familiar enough with the sync/replication solution to offer a hypothesis for why the failure is not more frequent.

A workaround to try for now would be to avoid using the same service name across namespaces. If this workaround works, it will also confirm the issue, and we can work on getting the fix into ACM 2.7. Currently the fix is only in Submariner 0.15.0 and won't land until ACM 2.8.
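
A quick check (sketch) for the condition the workaround avoids, i.e. the same destination Service name exported from more than one namespace; it assumes the Services follow the volsync-rsync-dst-* naming visible in the listings above:

$ kubectl get services -A --no-headers | \
    awk '/volsync-rsync-dst/ {count[$2]++}
         END {for (n in count) if (count[n] > 1) print n, "appears in", count[n], "namespaces"}'

Any name this prints is a candidate for the broker-side EndpointSlice collision described above.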

Comment 13 krishnaram Karthick 2023-04-05 03:44:29 UTC
Requesting "requires_doc_text" as the fix won't land in the 4.13 timeframe.

Comment 17 Vishal Thapar 2023-09-07 08:18:09 UTC
The fix is available in Submariner 0.16.0, which will be bundled with ACM 2.9.
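
A hedged way to verify which Submariner build is actually running once the bundle lands (assumes subctl is installed on the workstation and the kubeconfig points at the managed cluster):

$ subctl show versions

This lists the versions of the deployed Submariner components, so it can confirm whether the cluster is on 0.16.0 or later.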

Comment 23 errata-xmlrpc 2023-11-08 18:49:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832

Comment 24 Red Hat Bugzilla 2024-03-08 04:25:19 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

