Description of problem (please be as detailed as possible and provide log snippets):

[RDR][CephFS] Sync/replication is getting stopped for some PVCs with "rsync: connection unexpectedly closed".

Version of all relevant components (if applicable):
OCP version: 4.12.0-0.nightly-2023-01-19-110743
ODF version: 4.12.0-167
Ceph version: ceph version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)
ACM version: v2.7.0
Submariner version: v0.14.1
VolSync version: volsync-product.v0.6.0

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy an RDR cluster.
2. Create a CephFS workload.
3. Check the ReplicationSource status after 4-5 days.

Actual results:
busybox-pvc-14 last synced on 2023-01-29, while every other ReplicationSource in the namespace kept syncing through 2023-01-31:

$ oc get replicationsources --all-namespaces
NAMESPACE             NAME             SOURCE           LAST SYNC              DURATION          NEXT SYNC
busybox-workloads-1   busybox-pvc-1    busybox-pvc-1    2023-01-31T18:29:42Z   4m42.43785306s    2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-10   busybox-pvc-10   2023-01-31T18:29:44Z   4m44.270959207s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-11   busybox-pvc-11   2023-01-31T18:29:39Z   4m39.561215075s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-12   busybox-pvc-12   2023-01-31T18:29:39Z   4m39.505161496s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-13   busybox-pvc-13   2023-01-31T18:29:39Z   4m39.392102602s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-14   busybox-pvc-14   2023-01-29T09:23:46Z   3m46.2123443s     2023-01-29T09:25:00Z
busybox-workloads-1   busybox-pvc-15   busybox-pvc-15   2023-01-31T18:29:38Z   4m38.684680219s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-16   busybox-pvc-16   2023-01-31T18:29:44Z   4m44.940480011s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-17   busybox-pvc-17   2023-01-31T18:29:46Z   4m46.939628275s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-18   busybox-pvc-18   2023-01-31T18:29:49Z   4m49.386471097s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-19   busybox-pvc-19   2023-01-31T18:29:46Z   4m46.9576977s     2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-2    busybox-pvc-2    2023-01-31T18:29:44Z   4m44.210365128s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-20   busybox-pvc-20   2023-01-31T18:29:20Z   4m20.176716233s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-3    busybox-pvc-3    2023-01-31T18:29:44Z   4m44.252856471s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-4    busybox-pvc-4    2023-01-31T18:29:42Z   4m42.463537101s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-5    busybox-pvc-5    2023-01-31T18:29:39Z   4m39.544973999s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-6    busybox-pvc-6    2023-01-31T18:29:41Z   4m41.734123311s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-7    busybox-pvc-7    2023-01-31T18:29:24Z   4m24.746735678s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-8    busybox-pvc-8    2023-01-31T18:29:23Z   4m23.763610806s   2023-01-31T18:30:00Z
busybox-workloads-1   busybox-pvc-9    busybox-pvc-9    2023-01-31T18:29:24Z   4m24.778197239s   2023-01-31T18:30:00Z
busybox-workloads-1   mysql-pv-claim   mysql-pv-claim   2023-01-31T18:25:49Z   5m49.531404022s   2023-01-31T18:30:00Z

$ oc logs volsync-rsync-src-busybox-pvc-14-xjxj6
VolSync rsync container version: ACM-0.6.0-ce9a280
Syncing data to volsync-rsync-dst-busybox-pvc-14.busybox-workloads-1.svc.clusterset.local:22 ...
ssh: Could not resolve hostname volsync-rsync-dst-busybox-pvc-14.busybox-workloads-1.svc.clusterset.local: Name or service not known
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.3]
Syncronization failed. Retrying in 2 seconds. Retry 1/5.
ssh: Could not resolve hostname volsync-rsync-dst-busybox-pvc-14.busybox-workloads-1.svc.clusterset.local: Name or service not known
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.3]
Syncronization failed. Retrying in 4 seconds. Retry 2/5.
ssh: Could not resolve hostname volsync-rsync-dst-busybox-pvc-14.busybox-workloads-1.svc.clusterset.local: Name or service not known
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.3]
Syncronization failed. Retrying in 8 seconds. Retry 3/5.
ssh: Could not resolve hostname volsync-rsync-dst-busybox-pvc-14.busybox-workloads-1.svc.clusterset.local: Name or service not known
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.3]
Syncronization failed. Retrying in 16 seconds. Retry 4/5.

Replication is stopped for busybox-pvc-14, while its destination pod keeps running:

volsync-rsync-dst-busybox-pvc-14-pz5jk   1/1   Running   0   2d9h

Expected results:

Additional info:
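Two checks that may help when triaging this state. First, a minimal sketch for flagging stale ReplicationSources; the one-hour cutoff is an arbitrary assumption on my part, it requires GNU date and jq, and it relies on ISO 8601 timestamps comparing correctly as strings:

# Hedged sketch: list ReplicationSources whose lastSyncTime is older than an
# hour (never-synced sources are also flagged).
oc get replicationsources --all-namespaces -o json \
  | jq -r --arg cutoff "$(date -u -d '-1 hour' +%Y-%m-%dT%H:%M:%SZ)" \
      '.items[]
       | select((.status.lastSyncTime // "") < $cutoff)
       | "\(.metadata.namespace)/\(.metadata.name) last synced \(.status.lastSyncTime // "never")"'

Second, a sketch to confirm the DNS symptom from the log by resolving the clusterset.local name from a throwaway pod on the source cluster (the image here is just an example; getent is used because it ships in most base images):

# Hedged sketch: check whether the exported service name resolves in-cluster.
oc run dns-test -n busybox-workloads-1 --rm -it --restart=Never \
  --image=registry.access.redhat.com/ubi8/ubi -- \
  getent hosts volsync-rsync-dst-busybox-pvc-14.busybox-workloads-1.svc.clusterset.local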
Issue root caused to be the same as https://github.com/submariner-io/lighthouse/pull/964

This is the list of ServiceImports on the broker:

kubectl get serviceimports -n submariner-broker | grep vmware-dccp-one
volsync-rsync-dst-dd-io-pvc-1-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.223.46"]    15d
volsync-rsync-dst-dd-io-pvc-1-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.159.0"]     6d16h
volsync-rsync-dst-dd-io-pvc-2-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.193.46"]    15d
volsync-rsync-dst-dd-io-pvc-2-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.221.179"]   6d16h
volsync-rsync-dst-dd-io-pvc-3-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.234.101"]   15d
volsync-rsync-dst-dd-io-pvc-3-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.214.203"]   6d16h
volsync-rsync-dst-dd-io-pvc-4-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.129.69"]    15d
volsync-rsync-dst-dd-io-pvc-4-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.117.108"]   6d16h
volsync-rsync-dst-dd-io-pvc-5-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.169.148"]   15d
volsync-rsync-dst-dd-io-pvc-5-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.18.31"]     6d16h
volsync-rsync-dst-dd-io-pvc-6-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.199.117"]   15d
volsync-rsync-dst-dd-io-pvc-6-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.81.47"]     6d16h
volsync-rsync-dst-dd-io-pvc-7-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.58.224"]    15d
volsync-rsync-dst-dd-io-pvc-7-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.244.154"]   6d16h

This is the list of EndpointSlices:

kubectl get endpointslices -n submariner-broker | grep vmware-dccp-one
volsync-rsync-dst-dd-io-pvc-1-vmware-dccp-one   IPv4   8022   10.131.1.64    15d
volsync-rsync-dst-dd-io-pvc-2-vmware-dccp-one   IPv4   8022   10.131.1.45    15d
volsync-rsync-dst-dd-io-pvc-3-vmware-dccp-one   IPv4   8022   10.131.1.70    15d
volsync-rsync-dst-dd-io-pvc-4-vmware-dccp-one   IPv4   8022   10.128.3.194   15d
volsync-rsync-dst-dd-io-pvc-5-vmware-dccp-one   IPv4   8022   10.131.1.48    15d
volsync-rsync-dst-dd-io-pvc-6-vmware-dccp-one   IPv4   8022   10.129.2.138   15d
volsync-rsync-dst-dd-io-pvc-7-vmware-dccp-one   IPv4   8022   10.131.1.49    15d

Note that the ServiceImport names above include the namespace (busybox-workloads-2 vs. busybox-workloads-3), while the EndpointSlice names do not, so the same service name in two namespaces maps onto a single broker EndpointSlice. This causes the EndpointSlice information on the dst cluster to flip. In Lighthouse CoreDNS we also use the namespace information when replying to queries, so depending on which EndpointSlice is currently synced from the broker, queries can fail. I'm not familiar enough with the sync/replication solution to offer a hypothesis for why the failure isn't more frequent.

A workaround to try for now would be to avoid using the same service name across namespaces (a quick way to spot the colliding names is sketched below). If this workaround works, it will also confirm the issue, and we can work on getting the fix into ACM 2.7. Currently the fix is only in 0.15.0 and won't land until ACM 2.8.
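To apply the workaround, it helps to first list which VolSync destination Service names are duplicated across namespaces. A minimal sketch, assuming jq is available and run against each managed cluster:

# Hedged sketch: print volsync-rsync-dst Service names that appear in more
# than one namespace -- these are the names that collide on the broker.
kubectl get svc --all-namespaces -o json \
  | jq -r '.items[]
           | select(.metadata.name | startswith("volsync-rsync-dst"))
           | .metadata.name' \
  | sort | uniq -d

Since VolSync derives these Service names from the PVC name (e.g. dd-io-pvc-1 -> volsync-rsync-dst-dd-io-pvc-1), applying the workaround in practice means giving PVCs distinct names across namespaces.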
Requesting "requires_doc_text" as the fix won't land in the 4.13 timeframe.
Fix available in Submariner 0.16.0, which will be bundled with ACM 2.9.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6832