Description of problem (please be as detailed as possible and provide log snippets):

Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-11-30-174049
ACM 2.9.0 GA'ed (from OperatorHub)
ODF 4.14.1-15
ceph version 17.2.6-161.el9cp (7037a43d8f7fa86659a0575b566ec10080df0d71) quincy (stable)
Submariner 0.16.2
VolSync 0.8.0

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
*** Active hub cluster is at the neutral site ***
1. Deploy multiple RBD- and CephFS-backed workloads of both appset and subscription types.
2. Fail over and relocate them such that they finally end up running on the primary managed cluster (which is expected to host all the workloads and can go down in a disaster).
3. Ensure the workloads are in distinct states: deployed, failed over, relocated, etc.
4. Let at least 1 or 2 of the latest backups be taken (one each hour) covering all the different workload states (when progression is completed and no action is in progress on any workload). Also ensure that sync for all the workloads is working fine while on the active hub and that the cluster is healthy. Note drpc -o wide, lastGroupSyncTime, etc.
5. Bring the active hub completely down and move to the passive hub. Restore backups and ensure the Velero backup reports a successful restoration. Make sure both managed clusters are successfully imported and the drpolicy gets validated.
6. Wait for the drpc resources to be restored, and check whether all the workloads are in their last backed-up state. They seem to have retained their last backed-up state, so everything is fine so far.
Label cluster-monitoring on the hub cluster so that the VolumeSync.DelayAlert alert fires if data sync is affected for any workload.
7. Let IOs continue and check lastGroupSyncTime and the VolumeSync.DelayAlert alert. Sync for RBD-based workloads was progressing just fine, along with the other CephFS-backed workloads, except appset-cephfs-busybox9-placement-drpc in namespace busybox-workloads-9, which is being tracked by BZ 2252756.
8. Then brought the C1 managed cluster down after a few hours and performed failover on RBD-backed workloads busybox4,5,6,1,17,2. After failover completed, brought the C1 managed cluster back up (it was down for 2-3 hours) and waited for the cleanup to complete.
9. After cleanup succeeded, it was found that data sync stopped for other CephFS-backed workloads such as busybox14,10,20,12,16,19. However, the subctl verify connectivity check passed in the context of both managed clusters. This issue is being discussed with the Submariner team in https://redhat-internal.slack.com/archives/C0134E73VH6/p1698327322786069 (the older hub remains down forever and is completely unreachable).
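As a rough sketch of how the sync lag observed in step 7 can be quantified: the timestamps below are example values taken from the output in this report (on a live hub you would read lastGroupSyncTime from the VRG/drpc status); the helper itself is hypothetical, not part of Ramen.

```shell
# Hypothetical helper: compute how stale a workload's lastGroupSyncTime is.
# Example values are copied from this report's output for busybox-workloads-10.
last_sync="2023-12-05T10:43:47Z"   # lastGroupSyncTime from the VRG
now="2023-12-05T18:28:42Z"         # `date -u` at the time of the check
lag_min=$(( ( $(date -u -d "$now" +%s) - $(date -u -d "$last_sync" +%s) ) / 60 ))
echo "sync lag: ${lag_min} minutes"   # far beyond the 1h sync schedule
```

With the values above this reports a lag of over 7 hours, which is what indicates that sync has stopped for the workload despite its drpc showing Completed.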
Actual results:

amagrawa:acm$ drpc | grep cephfs
busybox-workloads-10   sub-cephfs-busybox10-placement-1-drpc    2d5h   amagrawa-2-1d                   Relocate   Relocated    Completed   2023-12-03T13:37:18Z   1.339293394s   True
busybox-workloads-11   sub-cephfs-busybox11-placement-1-drpc    2d5h   amagrawa-1-1d   amagrawa-2-1d   Failover   FailedOver   Completed   True
busybox-workloads-12   sub-cephfs-busybox12-placement-1-drpc    2d5h   amagrawa-1-1d                   Deployed   Completed    True
busybox-workloads-16   sub-cephfs-busybox16-placement-1-drpc    2d5h   amagrawa-2-1d                   Deployed   Completed    True
busybox-workloads-20   sub-cephfs-busybox20-placement-1-drpc    30h    amagrawa-2-1d                   Deployed   Completed    True
openshift-gitops       appset-cephfs-busybox14-placement-drpc   2d5h   amagrawa-2-1d                   Deployed   Completed    True
openshift-gitops       appset-cephfs-busybox19-placement-drpc   30h    amagrawa-1-1d                   Deployed   Completed    True
openshift-gitops       appset-cephfs-busybox7-placement-drpc    2d5h   amagrawa-1-1d   amagrawa-2-1d   Failover   FailedOver   Completed   True
openshift-gitops       appset-cephfs-busybox8-placement-drpc    2d5h   amagrawa-2-1d                   Relocate   Relocated    Completed   True
openshift-gitops       appset-cephfs-busybox9-placement-drpc    2d5h   amagrawa-1-1d                   Deployed   Completed    True

amagrawa:acm$ group
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-10
namespace: busybox-workloads-10
namespace: busybox-workloads-10
lastGroupSyncTime: "2023-12-05T10:43:47Z"
namespace: busybox-workloads-10

drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-11
namespace: busybox-workloads-11
namespace: busybox-workloads-11
lastGroupSyncTime: "2023-12-05T18:22:46Z"
namespace: busybox-workloads-11

drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-12
namespace: busybox-workloads-12
namespace: busybox-workloads-12
lastGroupSyncTime: "2023-12-05T10:52:36Z"
namespace: busybox-workloads-12

drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-15
namespace: busybox-workloads-15
namespace: busybox-workloads-15
lastGroupSyncTime: "2023-12-05T18:20:02Z"
namespace: busybox-workloads-15

drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-16
namespace: busybox-workloads-16
namespace: busybox-workloads-16
lastGroupSyncTime: "2023-12-05T10:52:39Z"
namespace: busybox-workloads-16

drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-18
namespace: busybox-workloads-18
namespace: busybox-workloads-18
lastGroupSyncTime: "2023-12-05T18:20:03Z"
namespace: busybox-workloads-18

drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-20
namespace: busybox-workloads-20
namespace: busybox-workloads-20
lastGroupSyncTime: "2023-12-05T10:41:23Z"
namespace: busybox-workloads-20

drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-4
namespace: busybox-workloads-4
namespace: busybox-workloads-4
lastGroupSyncTime: "2023-12-05T18:20:03Z"
namespace: busybox-workloads-4

drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-5
namespace: busybox-workloads-5
namespace: busybox-workloads-5
lastGroupSyncTime: "2023-12-05T18:20:00Z"
namespace: busybox-workloads-5

drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-6
namespace: busybox-workloads-6
namespace: busybox-workloads-6
lastGroupSyncTime: "2023-12-05T18:00:00Z"
namespace: busybox-workloads-6

drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-14
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2023-12-05T10:43:52Z"
namespace: busybox-workloads-14

drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-19
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2023-12-05T10:50:49Z"
namespace: busybox-workloads-19

drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-7
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2023-12-05T18:22:40Z"
namespace: busybox-workloads-7

drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-8
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2023-12-05T18:23:20Z"
namespace: busybox-workloads-8

drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-9
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2023-12-03T14:21:08Z"
namespace: busybox-workloads-9

drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-1
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2023-12-05T18:20:04Z"
namespace: busybox-workloads-1

drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-13
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2023-12-05T18:20:05Z"
namespace: busybox-workloads-13

drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-17
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2023-12-05T18:00:00Z"
namespace: busybox-workloads-17

drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-2
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2023-12-05T18:20:00Z"
namespace: busybox-workloads-2

drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-3
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2023-12-05T18:20:00Z"
namespace: busybox-workloads-3

amagrawa:acm$ date -u
Tuesday 05 December 2023 06:28:42 PM UTC

From C1-

amagrawa$ oc get pods -A | grep volsync-rsync-tls-src
found list of running pods and then

amagrawa$ oc logs pod/volsync-rsync-tls-src-busybox-pvc-2-6zds7 -n busybox-workloads-9
2023.12.05 18:42:18 LOG3[8]: No remote host resolved
2023.12.05 18:42:18 LOG5[8]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket
2023.12.05 18:42:18 LOG7[8]: Local descriptor (FD=3) closed
2023.12.05 18:42:18 LOG7[8]: Service [rsync] finished (0 left)
rsync: safe_read failed to read 1 bytes [sender]: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) at io.c(276) [sender=3.1.3]
2023.12.05 18:42:18 LOG7[main]: Found 1 ready file descriptor(s)
2023.12.05 18:42:18 LOG7[main]: FD=4 events=0x2001 revents=0x0
2023.12.05 18:42:18 LOG7[main]: FD=8 events=0x2001 revents=0x1
2023.12.05 18:42:18 LOG7[main]: Service [rsync] accepted (FD=3) from 127.0.0.1:51248
2023.12.05 18:42:18 LOG7[9]: Service [rsync] started
2023.12.05 18:42:18 LOG7[9]: Setting local socket options (FD=3)
2023.12.05 18:42:18 LOG7[9]: Option SO_KEEPALIVE set on local socket
2023.12.05 18:42:18 LOG7[9]: Option TCP_KEEPIDLE set on local socket
2023.12.05 18:42:18 LOG7[9]: Option TCP_NODELAY set on local socket
2023.12.05 18:42:18 LOG5[9]: Service [rsync] accepted connection from 127.0.0.1:51248
2023.12.05 18:42:18 LOG3[9]: Error resolving "volsync-rsync-tls-dst-busybox-pvc-2.busybox-workloads-9.svc.clusterset.local": Neither nodename nor servname known (EAI_NONAME)
2023.12.05 18:42:18 LOG3[9]: No remote host resolved
2023.12.05 18:42:18 LOG5[9]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket
2023.12.05 18:42:18 LOG7[9]: Local descriptor (FD=3) closed
2023.12.05 18:42:18 LOG7[9]: Service [rsync] finished (0 left)
rsync: safe_read failed to read 1 bytes [sender]: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) at io.c(276) [sender=3.1.3]
Syncronization failed. Retrying in 32 seconds. Retry 5/5.

amagrawa$ oc logs pod/volsync-rsync-tls-src-busybox-pvc-4-2z7st -n busybox-workloads-9
2023.12.05 18:44:10 LOG3[8]: No remote host resolved
2023.12.05 18:44:10 LOG5[8]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket
2023.12.05 18:44:10 LOG7[8]: Local descriptor (FD=3) closed
2023.12.05 18:44:10 LOG7[8]: Service [rsync] finished (0 left)
2023.12.05 18:44:10 LOG7[main]: Found 1 ready file descriptor(s)
2023.12.05 18:44:10 LOG7[main]: FD=4 events=0x2001 revents=0x0
2023.12.05 18:44:10 LOG7[main]: FD=8 events=0x2001 revents=0x1
2023.12.05 18:44:10 LOG7[main]: Service [rsync] accepted (FD=3) from 127.0.0.1:60122
rsync: safe_read failed to read 1 bytes [sender]: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) at io.c(276) [sender=3.1.3]
2023.12.05 18:44:10 LOG7[9]: Service [rsync] started
2023.12.05 18:44:10 LOG7[9]: Setting local socket options (FD=3)
2023.12.05 18:44:10 LOG7[9]: Option SO_KEEPALIVE set on local socket
2023.12.05 18:44:10 LOG7[9]: Option TCP_KEEPIDLE set on local socket
2023.12.05 18:44:10 LOG7[9]: Option TCP_NODELAY set on local socket
2023.12.05 18:44:10 LOG5[9]: Service [rsync] accepted connection from 127.0.0.1:60122
2023.12.05 18:44:10 LOG3[9]: Error resolving "volsync-rsync-tls-dst-busybox-pvc-4.busybox-workloads-9.svc.clusterset.local": Neither nodename nor servname known (EAI_NONAME)
2023.12.05 18:44:10 LOG3[9]: No remote host resolved
2023.12.05 18:44:10 LOG5[9]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket
2023.12.05 18:44:10 LOG7[9]: Local descriptor (FD=3) closed
2023.12.05 18:44:10 LOG7[9]: Service [rsync] finished (0 left)
rsync: safe_read failed to read 1 bytes [sender]: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) at io.c(276) [sender=3.1.3]
Syncronization failed. Retrying in 32 seconds. Retry 5/5.
rsync completed in 68s
Synchronization failed.
rsync returned: 1212
2023.12.05 18:44:42 LOG7[main]: Found 1 ready file descriptor(s)
2023.12.05 18:44:42 LOG7[main]: FD=4 events=0x2001 revents=0x1
2023.12.05 18:44:42 LOG7[main]: FD=8 events=0x2001 revents=0x0
2023.12.05 18:44:42 LOG7[main]: Dispatching a signal from the signal pipe
2023.12.05 18:44:42 LOG7[main]: Processing SIGNAL_TERMINATE
2023.12.05 18:44:42 LOG5[main]: Terminated
2023.12.05 18:44:42 LOG7[main]: Leak detection table utilization: 13/997, 1.30%
2023.12.05 18:44:42 LOG7[main]: Removed pid file /tmp/stunnel-client.pid
2023.12.05 18:44:42 LOG7[main]: Terminating the cron thread
2023.12.05 18:44:42 LOG5[main]: Terminating 1 service thread(s)

From C2-

amagrawa:~$ oc logs pod/volsync-rsync-tls-src-busybox-pvc-3-wb2sv -n busybox-workloads-10
2023.12.05 18:45:25 LOG3[8]: No remote host resolved
2023.12.05 18:45:25 LOG5[8]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket
2023.12.05 18:45:25 LOG7[8]: Local descriptor (FD=3) closed
2023.12.05 18:45:25 LOG7[8]: Service [rsync] finished (0 left)
rsync: safe_read failed to read 1 bytes [sender]: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) at io.c(276) [sender=3.1.3]
2023.12.05 18:45:25 LOG7[main]: Found 1 ready file descriptor(s)
2023.12.05 18:45:25 LOG7[main]: FD=4 events=0x2001 revents=0x0
2023.12.05 18:45:25 LOG7[main]: FD=8 events=0x2001 revents=0x1
2023.12.05 18:45:25 LOG7[main]: Service [rsync] accepted (FD=3) from 127.0.0.1:35110
2023.12.05 18:45:25 LOG7[9]: Service [rsync] started
2023.12.05 18:45:25 LOG7[9]: Setting local socket options (FD=3)
2023.12.05 18:45:25 LOG7[9]: Option SO_KEEPALIVE set on local socket
2023.12.05 18:45:25 LOG7[9]: Option TCP_KEEPIDLE set on local socket
2023.12.05 18:45:25 LOG7[9]: Option TCP_NODELAY set on local socket
2023.12.05 18:45:25 LOG5[9]: Service [rsync] accepted connection from 127.0.0.1:35110
2023.12.05 18:45:25 LOG3[9]: Error resolving "volsync-rsync-tls-dst-busybox-pvc-3.busybox-workloads-10.svc.clusterset.local": Neither nodename nor servname known (EAI_NONAME)
2023.12.05 18:45:25 LOG3[9]: No remote host resolved
2023.12.05 18:45:25 LOG5[9]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket
2023.12.05 18:45:25 LOG7[9]: Local descriptor (FD=3) closed
2023.12.05 18:45:25 LOG7[9]: Service [rsync] finished (0 left)
rsync: safe_read failed to read 1 bytes [sender]: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) at io.c(276) [sender=3.1.3]
Syncronization failed. Retrying in 32 seconds. Retry 5/5.
rsync completed in 67s
Synchronization failed.
rsync returned: 1212
2023.12.05 18:45:57 LOG7[main]: Found 1 ready file descriptor(s)
2023.12.05 18:45:57 LOG7[main]: FD=4 events=0x2001 revents=0x1
2023.12.05 18:45:57 LOG7[main]: FD=8 events=0x2001 revents=0x0
2023.12.05 18:45:57 LOG7[main]: Dispatching a signal from the signal pipe
2023.12.05 18:45:57 LOG7[main]: Processing SIGNAL_TERMINATE
2023.12.05 18:45:57 LOG5[main]: Terminated
2023.12.05 18:45:57 LOG7[main]: Leak detection table utilization: 13/997, 1.30%
2023.12.05 18:45:57 LOG7[main]: Removed pid file /tmp/stunnel-client.pid
2023.12.05 18:45:57 LOG7[main]: Terminating the cron thread
2023.12.05 18:45:57 LOG5[main]: Terminating 1 service thread(s)

Expected results:
Data sync for CephFS-backed workloads should not stop after the managed cluster that was down is successfully restored.

Additional info:
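The stunnel errors above show the VolSync source pods failing to resolve the Submariner clusterset DNS name of the destination service. A minimal way to confirm where resolution breaks from the source cluster is sketched below; the namespace and service names are taken from the failing workload in this report, the debug image choice is an assumption, and these commands require a live cluster with subctl installed.

```shell
# Check that the destination service is exported and imported across the
# clusterset (ServiceExport/ServiceImport are the multicluster-services API
# objects Submariner Lighthouse uses for cross-cluster DNS).
oc get serviceexport -n busybox-workloads-9
oc get serviceimport -n busybox-workloads-9

# Try resolving the clusterset name from inside the source cluster
# (image is an arbitrary choice for a throwaway debug pod).
oc run dns-test --rm -it --image=registry.access.redhat.com/ubi9/ubi -- \
  getent hosts volsync-rsync-tls-dst-busybox-pvc-2.busybox-workloads-9.svc.clusterset.local

# Broader Submariner health check, including service discovery.
subctl diagnose all
```

If getent fails here while subctl verify's connectivity checks pass, that points at Lighthouse service discovery rather than the datapath, consistent with the EAI_NONAME errors in the logs.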
Karolin, please add trackers whenever available.
Tested with the following versions:
ceph version 18.2.1-188.el9cp (b1ae9c989e2f41dcfec0e680c11d1d9465b1db0e) reef (stable)
OCP 4.16.0-0.nightly-2024-05-23-173505
ACM 2.11.0-DOWNSTREAM-2024-05-23-15-16-26
MCE 2.6.0-104
ODF 4.16.0-108.stable
GitOps v1.12.3
Submariner 0.18.0
Platform: VMware

When the down managed cluster is recovered, data sync for CephFS workloads that were running on it resumes successfully, which matches the expected results after performing the steps to reproduce. Cleanup and data sync also resume successfully for all the failed-over workloads.

Fix provided in ACM-9159 LGTM. Marking this bug as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591