Bug 2166354

Summary: [RDR][CEPHFS][Tracker] sync/replication is getting stopped for some pvc rsync: connection unexpectedly closed
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Pratik Surve <prsurve>
Component: odf-dr    Assignee: Vishal Thapar <vthapar>
odf-dr sub component: ramen QA Contact: krishnaram Karthick <kramdoss>
Status: ASSIGNED --- Docs Contact:
Severity: unspecified    
Priority: unspecified CC: bmekhiss, kseeger, muagarwa, nyechiel, odf-bz-bot, rtalur, vthapar
Version: 4.12    Flags: vthapar: needinfo-
vthapar: needinfo-
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: Known Issue
Doc Text:
Cause: EndpointSlices synced to the broker do not carry the source namespace and are stored in the broker namespace. When two different clusters export the same service name from different namespaces, only one EndpointSlice can exist in the broker namespace.
Consequence: Only one EndpointSlice is synced to the remote cluster. As the different clusters keep re-syncing their EndpointSlices, the one in the broker keeps flipping between the services from the different clusters. Depending on which EndpointSlice is currently synced, DNS queries for the one not synced fail.
Workaround (if any): Do not export the same service name from different namespaces on different clusters. If it is essentially the same service, use the same namespace; if not, use a different service name.
Result: Queries to one of the services can fail intermittently.
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Pratik Surve 2023-02-01 14:50:04 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
[RDR][CEPHFS] sync/replication is getting stopped for some pvc rsync: connection unexpectedly closed

Version of all relevant components (if applicable):

OCP version:- 4.12.0-0.nightly-2023-01-19-110743
ODF version:- 4.12.0-167
CEPH version:- ceph version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)
ACM version:- v2.7.0
SUBMARINER version:- v0.14.1
VOLSYNC version:- volsync-product.v0.6.0

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an RDR cluster
2. Create a CephFS workload
3. Check the ReplicationSource status after 4-5 days


Actual results:
oc get replicationsources --all-namespaces

NAMESPACE             NAME             SOURCE           LAST SYNC              DURATION          NEXT SYNC
busybox-workloads-1   busybox-pvc-1    busybox-pvc-1    2023-01-31T18:29:42Z   4m42.43785306s    2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-10   busybox-pvc-10   2023-01-31T18:29:44Z   4m44.270959207s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-11   busybox-pvc-11   2023-01-31T18:29:39Z   4m39.561215075s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-12   busybox-pvc-12   2023-01-31T18:29:39Z   4m39.505161496s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-13   busybox-pvc-13   2023-01-31T18:29:39Z   4m39.392102602s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-14   busybox-pvc-14   2023-01-29T09:23:46Z   3m46.2123443s     2023-01-29T09:25:00Z
busybox-workloads-1   busybox-pvc-15   busybox-pvc-15   2023-01-31T18:29:38Z   4m38.684680219s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-16   busybox-pvc-16   2023-01-31T18:29:44Z   4m44.940480011s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-17   busybox-pvc-17   2023-01-31T18:29:46Z   4m46.939628275s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-18   busybox-pvc-18   2023-01-31T18:29:49Z   4m49.386471097s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-19   busybox-pvc-19   2023-01-31T18:29:46Z   4m46.9576977s     2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-2    busybox-pvc-2    2023-01-31T18:29:44Z   4m44.210365128s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-20   busybox-pvc-20   2023-01-31T18:29:20Z   4m20.176716233s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-3    busybox-pvc-3    2023-01-31T18:29:44Z   4m44.252856471s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-4    busybox-pvc-4    2023-01-31T18:29:42Z   4m42.463537101s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-5    busybox-pvc-5    2023-01-31T18:29:39Z   4m39.544973999s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-6    busybox-pvc-6    2023-01-31T18:29:41Z   4m41.734123311s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-7    busybox-pvc-7    2023-01-31T18:29:24Z   4m24.746735678s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-8    busybox-pvc-8    2023-01-31T18:29:23Z   4m23.763610806s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-9    busybox-pvc-9    2023-01-31T18:29:24Z   4m24.778197239s   2023-01-31T18:30:00Z
busybox-workloads-1   mysql-pv-claim   mysql-pv-claim   2023-01-31T18:25:49Z   5m49.531404022s   2023-01-31T18:30:00Z



$ oc logs volsync-rsync-src-busybox-pvc-14-xjxj6
VolSync rsync container version: ACM-0.6.0-ce9a280
Syncing data to volsync-rsync-dst-busybox-pvc-14.busybox-workloads-1.svc.clusterset.local:22 ...
ssh: Could not resolve hostname volsync-rsync-dst-busybox-pvc-14.busybox-workloads-1.svc.clusterset.local: Name or service not known
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.3]
Syncronization failed. Retrying in 2 seconds. Retry 1/5.
ssh: Could not resolve hostname volsync-rsync-dst-busybox-pvc-14.busybox-workloads-1.svc.clusterset.local: Name or service not known
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.3]
Syncronization failed. Retrying in 4 seconds. Retry 2/5.
ssh: Could not resolve hostname volsync-rsync-dst-busybox-pvc-14.busybox-workloads-1.svc.clusterset.local: Name or service not known
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.3]
Syncronization failed. Retrying in 8 seconds. Retry 3/5.
ssh: Could not resolve hostname volsync-rsync-dst-busybox-pvc-14.busybox-workloads-1.svc.clusterset.local: Name or service not known
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.3]
Syncronization failed. Retrying in 16 seconds. Retry 4/5.


Replication has stopped for volsync-rsync-dst-busybox-pvc-14-pz5jk, although the pod is still 1/1 Running (age 2d9h).
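
The failures above are DNS resolution failures for the exported clusterset.local name rather than an rsync/ssh transport problem. A minimal check (a sketch; the debug pod name and UBI image are illustrative, not from this report) is to resolve the same name from a throwaway pod in the source namespace:

# Sketch: pod name "dns-check" and the UBI image are assumptions for illustration.
$ oc -n busybox-workloads-1 run dns-check --rm -it --restart=Never \
    --image=registry.access.redhat.com/ubi9/ubi -- \
    getent hosts volsync-rsync-dst-busybox-pvc-14.busybox-workloads-1.svc.clusterset.local

If the lookup fails while the destination service exists and is exported on the peer cluster, the problem is on the Lighthouse/DNS side rather than in VolSync itself.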



Expected results:


Additional info:

Comment 11 Vishal Thapar 2023-02-08 06:16:05 UTC
Issue root-caused to be the same as https://github.com/submariner-io/lighthouse/pull/964

This is the list of ServiceImports on the broker:

kubectl get serviceimports -n submariner-broker |grep vmware-dccp-one
volsync-rsync-dst-dd-io-pvc-1-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.223.46"]    15d
volsync-rsync-dst-dd-io-pvc-1-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.159.0"]     6d16h
volsync-rsync-dst-dd-io-pvc-2-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.193.46"]    15d
volsync-rsync-dst-dd-io-pvc-2-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.221.179"]   6d16h
volsync-rsync-dst-dd-io-pvc-3-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.234.101"]   15d
volsync-rsync-dst-dd-io-pvc-3-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.214.203"]   6d16h
volsync-rsync-dst-dd-io-pvc-4-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.129.69"]    15d
volsync-rsync-dst-dd-io-pvc-4-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.117.108"]   6d16h
volsync-rsync-dst-dd-io-pvc-5-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.169.148"]   15d
volsync-rsync-dst-dd-io-pvc-5-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.18.31"]     6d16h
volsync-rsync-dst-dd-io-pvc-6-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.199.117"]   15d
volsync-rsync-dst-dd-io-pvc-6-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.81.47"]     6d16h
volsync-rsync-dst-dd-io-pvc-7-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.58.224"]    15d
volsync-rsync-dst-dd-io-pvc-7-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.244.154"]   6d16h


This is the list of EndpointSlices:

kubectl get endpointslices -n submariner-broker |grep vmware-dccp-one
volsync-rsync-dst-dd-io-pvc-1-vmware-dccp-one   IPv4          8022      10.131.1.64    15d
volsync-rsync-dst-dd-io-pvc-2-vmware-dccp-one   IPv4          8022      10.131.1.45    15d
volsync-rsync-dst-dd-io-pvc-3-vmware-dccp-one   IPv4          8022      10.131.1.70    15d
volsync-rsync-dst-dd-io-pvc-4-vmware-dccp-one   IPv4          8022      10.128.3.194   15d
volsync-rsync-dst-dd-io-pvc-5-vmware-dccp-one   IPv4          8022      10.131.1.48    15d
volsync-rsync-dst-dd-io-pvc-6-vmware-dccp-one   IPv4          8022      10.129.2.138   15d
volsync-rsync-dst-dd-io-pvc-7-vmware-dccp-one   IPv4          8022      10.131.1.49    15d

This causes the EndpointSlice information on the destination cluster to flip. Lighthouse CoreDNS also uses the namespace information when replying to queries, so depending on which EndpointSlice is currently synced from the broker, queries can fail.
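
A quick way to see the collision on the broker (a sketch, assuming access to the broker cluster and the submariner-broker namespace used above) is to compare ServiceImport and EndpointSlice counts for the same cluster suffix; the listings above show 14 ServiceImports but only 7 EndpointSlices for vmware-dccp-one, so at any given time half of the exports have no slice present:

# Sketch: the two counts should match if no exported names collide in the broker namespace.
$ kubectl get serviceimports -n submariner-broker --no-headers | grep -c vmware-dccp-one
$ kubectl get endpointslices -n submariner-broker --no-headers | grep -c vmware-dccp-one

# Watching one of the colliding slices shows its endpoint IP flipping between the
# pods from the two namespaces.
$ kubectl get endpointslice volsync-rsync-dst-dd-io-pvc-1-vmware-dccp-one -n submariner-broker -o wide -w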

I am not familiar enough with the sync/replication solution to offer a hypothesis for why the failure is not more frequent.

A workaround to try for now would be to avoid using the same service name across namespaces. If this workaround works, it will also confirm the issue, and we can work on getting the fix into ACM 2.7. Currently the fix is only in 0.15.0 and won't land until ACM 2.8.
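
To apply the workaround, one way to spot conflicts is to check each managed cluster for an exported service name that appears in more than one namespace (a sketch, assuming the Submariner ServiceExport CRD is installed; the awk filter is illustrative):

# Sketch: print exported service names that exist in more than one namespace on this cluster.
$ kubectl get serviceexports --all-namespaces --no-headers \
    -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name \
  | awk '{count[$2]++; ns[$2]=ns[$2]" "$1} END {for (n in count) if (count[n] > 1) print n":"ns[n]}'

Any name printed is a candidate for the collision described above; renaming one of the services, or keeping the conflicting workloads in the same namespace, should keep the broker EndpointSlice stable.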

Comment 13 krishnaram Karthick 2023-04-05 03:44:29 UTC
Requesting "requires_doc_text" as the fix won't land in the 4.13 timeframe.