Bug 2166354
| Summary: | [RDR][CEPHFS][Tracker] sync/replication is getting stopped for some pvc rsync: connection unexpectedly closed | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Pratik Surve <prsurve> |
| Component: | odf-dr | Assignee: | Vishal Thapar <vthapar> |
| odf-dr sub component: | ramen | QA Contact: | Pratik Surve <prsurve> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | ||
| Priority: | unspecified | CC: | amagrawa, bmekhiss, kramdoss, kseeger, muagarwa, nyechiel, odf-bz-bot, rtalur, sheggodu, srangana, vthapar |
| Version: | 4.12 | Flags: | vthapar: needinfo-, vthapar: needinfo- |
| Target Milestone: | --- | ||
| Target Release: | ODF 4.14.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | No Doc Update | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-11-08 18:49:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 2244409 | ||
Description
Pratik Surve
2023-02-01 14:50:04 UTC
The issue was root-caused to be the same as https://github.com/submariner-io/lighthouse/pull/964

This is the list of serviceimports on the broker:

    kubectl get serviceimports -n submariner-broker |grep vmware-dccp-one
    volsync-rsync-dst-dd-io-pvc-1-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.223.46"]    15d
    volsync-rsync-dst-dd-io-pvc-1-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.159.0"]     6d16h
    volsync-rsync-dst-dd-io-pvc-2-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.193.46"]    15d
    volsync-rsync-dst-dd-io-pvc-2-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.221.179"]   6d16h
    volsync-rsync-dst-dd-io-pvc-3-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.234.101"]   15d
    volsync-rsync-dst-dd-io-pvc-3-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.214.203"]   6d16h
    volsync-rsync-dst-dd-io-pvc-4-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.129.69"]    15d
    volsync-rsync-dst-dd-io-pvc-4-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.117.108"]   6d16h
    volsync-rsync-dst-dd-io-pvc-5-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.169.148"]   15d
    volsync-rsync-dst-dd-io-pvc-5-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.18.31"]     6d16h
    volsync-rsync-dst-dd-io-pvc-6-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.199.117"]   15d
    volsync-rsync-dst-dd-io-pvc-6-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.81.47"]     6d16h
    volsync-rsync-dst-dd-io-pvc-7-busybox-workloads-2-vmware-dccp-one   ClusterSetIP   ["172.30.58.224"]    15d
    volsync-rsync-dst-dd-io-pvc-7-busybox-workloads-3-vmware-dccp-one   ClusterSetIP   ["172.30.244.154"]   6d16h

This is the list of endpointslices:

    kubectl get endpointslices -n submariner-broker |grep vmware-dccp-one
    volsync-rsync-dst-dd-io-pvc-1-vmware-dccp-one   IPv4   8022   10.131.1.64    15d
    volsync-rsync-dst-dd-io-pvc-2-vmware-dccp-one   IPv4   8022   10.131.1.45    15d
    volsync-rsync-dst-dd-io-pvc-3-vmware-dccp-one   IPv4   8022   10.131.1.70    15d
    volsync-rsync-dst-dd-io-pvc-4-vmware-dccp-one   IPv4   8022   10.128.3.194   15d
    volsync-rsync-dst-dd-io-pvc-5-vmware-dccp-one   IPv4   8022   10.131.1.48    15d
    volsync-rsync-dst-dd-io-pvc-6-vmware-dccp-one   IPv4   8022   10.129.2.138   15d
    volsync-rsync-dst-dd-io-pvc-7-vmware-dccp-one   IPv4   8022   10.131.1.49    15d

There are 14 serviceimports (each service name appears in both busybox-workloads-2 and busybox-workloads-3) but only 7 endpointslices, whose names carry no namespace. This causes the endpointslice information on the dst cluster to flip. In Lighthouse CoreDNS we also use the namespace information when replying to queries. Depending on which endpointslice is currently synced from the broker, queries can fail. I am not familiar enough with the sync/replication solution to offer a hypothesis for why the failure is not more frequent.

A workaround to try for now would be to avoid using the same service name across namespaces. If this workaround works, it will also confirm the issue and we can work on getting the fix into ACM 2.7. Currently the fix is only in 0.15.0 and won't land until 2.8.

Requesting "requires_doc_text" as the fix won't land in the 4.13 timeframe.

Fix available in Submariner 0.16.0, which will be bundled with ACM 2.9.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
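The name collision behind the root cause can be illustrated with a small sketch. It assumes the naming patterns visible in the listings in this report (serviceimports as `<service>-<namespace>-<cluster>`, endpointslices as `<service>-<cluster>`). These patterns are inferred from the output, not taken from the Submariner source, and the helper names are hypothetical. The sketch may also help when checking whether the workaround (unique service names across namespaces) is in effect.

```python
from collections import defaultdict

# Subset of the ServiceImport names listed on the broker in this report.
# Observed pattern (an inference from the listing): <service>-<namespace>-<cluster>.
SERVICE_IMPORTS = [
    "volsync-rsync-dst-dd-io-pvc-1-busybox-workloads-2-vmware-dccp-one",
    "volsync-rsync-dst-dd-io-pvc-1-busybox-workloads-3-vmware-dccp-one",
    "volsync-rsync-dst-dd-io-pvc-2-busybox-workloads-2-vmware-dccp-one",
    "volsync-rsync-dst-dd-io-pvc-2-busybox-workloads-3-vmware-dccp-one",
]

CLUSTER = "vmware-dccp-one"
NAMESPACES = ["busybox-workloads-2", "busybox-workloads-3"]


def split_import_name(name, namespaces, cluster):
    """Hypothetical helper: split '<service>-<namespace>-<cluster>' into (service, namespace)."""
    for ns in namespaces:
        suffix = f"-{ns}-{cluster}"
        if name.endswith(suffix):
            return name[: -len(suffix)], ns
    raise ValueError(f"unrecognized ServiceImport name: {name}")


def endpointslice_name(service, cluster):
    """Assumed pre-fix naming seen in the listing: no namespace component."""
    return f"{service}-{cluster}"


def find_collisions(import_names, namespaces, cluster):
    """Map each endpointslice name to the namespaces that would share it."""
    by_slice = defaultdict(list)
    for name in import_names:
        service, ns = split_import_name(name, namespaces, cluster)
        by_slice[endpointslice_name(service, cluster)].append(ns)
    # Only names claimed by more than one namespace are collisions.
    return {s: sorted(nss) for s, nss in by_slice.items() if len(nss) > 1}


collisions = find_collisions(SERVICE_IMPORTS, NAMESPACES, CLUSTER)
for slice_name, nss in sorted(collisions.items()):
    print(f"{slice_name}: shared by namespaces {', '.join(nss)}")
```

An empty result would suggest that no service name is reused across the given namespaces, which is the condition the proposed workaround aims for.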