Bug 2253043 - [Tracker ACM-9159] [RDR] [Hub recovery] [CephFS] Data sync stopped for multiple workloads after one of the managed clusters which went down was restored
Summary: [Tracker ACM-9159] [RDR] [Hub recovery] [CephFS] Data sync stopped for multiple workloads after one of the managed clusters which went down was restored
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: Benamar Mekhissi
QA Contact: Aman Agrawal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-12-05 18:48 UTC by Aman Agrawal
Modified: 2024-07-17 13:11 UTC
CC List: 5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-07-17 13:11:45 UTC
Embargoed:




Links:
Red Hat Issue Tracker ACM-9159 (last updated 2024-02-01 14:06:12 UTC)
Red Hat Product Errata RHSA-2024:4591 (last updated 2024-07-17 13:11:46 UTC)

Description Aman Agrawal 2023-12-05 18:48:12 UTC
Description of problem (please be as detailed as possible and provide log
snippets):


Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-11-30-174049
ACM 2.9.0 GA'ed (from OperatorHub)
ODF 4.14.1-15
ceph version 17.2.6-161.el9cp (7037a43d8f7fa86659a0575b566ec10080df0d71) quincy (stable)
Submariner 0.16.2
VolSync 0.8.0


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:

*** Active hub cluster at the neutral site ***

1. Deploy multiple RBD- and CephFS-backed workloads of both AppSet and Subscription types.
2. Fail over and relocate them in such a way that they end up running on the primary managed cluster (which is expected to host all the workloads and may go down in a disaster).
3. Ensure the workloads are in distinct states such as Deployed, FailedOver, Relocated, etc.
4. Let at least 1 or 2 of the latest backups be taken (one each hour) for all the different workload states (when progression is completed and no action is running on any of the workloads). Also ensure that sync for all the workloads is working fine while on the active hub and that the cluster is healthy. Note drpc -o wide, lastGroupSyncTime, etc. (a sketch of these checks follows after this section).
5. Bring the active hub completely down and move to the passive hub. Restore the backups and ensure the Velero backup reports a successful restoration. Make sure both managed clusters are successfully imported and the DRPolicy gets validated.
6. Wait for the DRPCs to be restored, and check whether all the workloads are in their last backed-up state.
They appear to have retained the last state that was backed up, so everything is fine so far. Label the hub cluster for cluster-monitoring so that VolumeSync.DelayAlert fires if data sync is affected for any workload.
7. Let IOs continue and check lastGroupSyncTime and the VolumeSync.DelayAlert alert. Sync for the RBD-backed workloads was progressing just fine, as it was for the other CephFS-backed workloads, except for appset-cephfs-busybox9-placement-drpc in namespace busybox-workloads-9, which is being tracked by BZ 2252756.
8. Then brought the C1 managed cluster down after a few hours and failed over the RBD-backed workloads busybox4,5,6,1,17,2. After failover completed, brought the C1 managed cluster back up (it was down for 2-3 hours) and waited for the cleanup to complete.
9. After the cleanup succeeded, it was found that data sync had stopped for other CephFS-backed workloads such as busybox14,10,20,12,16,19.

However, the subctl verify connectivity check passed against both managed clusters.

This issue is being discussed with the Submariner team in https://redhat-internal.slack.com/archives/C0134E73VH6/p1698327322786069

(The older hub remains down forever and is completely unreachable.)
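
For reference, the drpc/lastGroupSyncTime checks mentioned in steps 4 and 7 can be approximated as below (a minimal sketch; the "drpc" and "group" shell aliases used in this report are assumed to wrap something similar):

$ oc get drpc -A -o wide
# Per-DRPC lastGroupSyncTime, assuming the field sits at .status.lastGroupSyncTime
# as the YAML grep output further below suggests:
$ oc get drpc -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.lastGroupSyncTime}{"\n"}{end}'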


Actual results: 

amagrawa:acm$ drpc|grep cephfs
busybox-workloads-10   sub-cephfs-busybox10-placement-1-drpc    2d5h   amagrawa-2-1d                        Relocate       Relocated      Completed     2023-12-03T13:37:18Z   1.339293394s         True
busybox-workloads-11   sub-cephfs-busybox11-placement-1-drpc    2d5h   amagrawa-1-1d      amagrawa-2-1d     Failover       FailedOver     Completed                                                 True
busybox-workloads-12   sub-cephfs-busybox12-placement-1-drpc    2d5h   amagrawa-1-1d                                       Deployed       Completed                                                 True
busybox-workloads-16   sub-cephfs-busybox16-placement-1-drpc    2d5h   amagrawa-2-1d                                       Deployed       Completed                                                 True
busybox-workloads-20   sub-cephfs-busybox20-placement-1-drpc    30h    amagrawa-2-1d                                       Deployed       Completed                                                 True
openshift-gitops       appset-cephfs-busybox14-placement-drpc   2d5h   amagrawa-2-1d                                       Deployed       Completed                                                 True
openshift-gitops       appset-cephfs-busybox19-placement-drpc   30h    amagrawa-1-1d                                       Deployed       Completed                                                 True
openshift-gitops       appset-cephfs-busybox7-placement-drpc    2d5h   amagrawa-1-1d      amagrawa-2-1d     Failover       FailedOver     Completed                                                 True
openshift-gitops       appset-cephfs-busybox8-placement-drpc    2d5h   amagrawa-2-1d                        Relocate       Relocated      Completed                                                 True
openshift-gitops       appset-cephfs-busybox9-placement-drpc    2d5h   amagrawa-1-1d                                       Deployed       Completed                                                 True


amagrawa:acm$ group
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-10
    namespace: busybox-workloads-10
      namespace: busybox-workloads-10
    lastGroupSyncTime: "2023-12-05T10:43:47Z"
        namespace: busybox-workloads-10
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-11
    namespace: busybox-workloads-11
      namespace: busybox-workloads-11
    lastGroupSyncTime: "2023-12-05T18:22:46Z"
        namespace: busybox-workloads-11
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-12
    namespace: busybox-workloads-12
      namespace: busybox-workloads-12
    lastGroupSyncTime: "2023-12-05T10:52:36Z"
        namespace: busybox-workloads-12
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-15
    namespace: busybox-workloads-15
      namespace: busybox-workloads-15
    lastGroupSyncTime: "2023-12-05T18:20:02Z"
        namespace: busybox-workloads-15
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-16
    namespace: busybox-workloads-16
      namespace: busybox-workloads-16
    lastGroupSyncTime: "2023-12-05T10:52:39Z"
        namespace: busybox-workloads-16
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-18
    namespace: busybox-workloads-18
      namespace: busybox-workloads-18
    lastGroupSyncTime: "2023-12-05T18:20:03Z"
        namespace: busybox-workloads-18
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-20
    namespace: busybox-workloads-20
      namespace: busybox-workloads-20
    lastGroupSyncTime: "2023-12-05T10:41:23Z"
        namespace: busybox-workloads-20
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-4
    namespace: busybox-workloads-4
      namespace: busybox-workloads-4
    lastGroupSyncTime: "2023-12-05T18:20:03Z"
        namespace: busybox-workloads-4
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-5
    namespace: busybox-workloads-5
      namespace: busybox-workloads-5
    lastGroupSyncTime: "2023-12-05T18:20:00Z"
        namespace: busybox-workloads-5
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-6
    namespace: busybox-workloads-6
      namespace: busybox-workloads-6
    lastGroupSyncTime: "2023-12-05T18:00:00Z"
        namespace: busybox-workloads-6
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-14
    namespace: openshift-gitops
      namespace: openshift-gitops
    lastGroupSyncTime: "2023-12-05T10:43:52Z"
        namespace: busybox-workloads-14
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-19
    namespace: openshift-gitops
      namespace: openshift-gitops
    lastGroupSyncTime: "2023-12-05T10:50:49Z"
        namespace: busybox-workloads-19
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-7
    namespace: openshift-gitops
      namespace: openshift-gitops
    lastGroupSyncTime: "2023-12-05T18:22:40Z"
        namespace: busybox-workloads-7
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-8
    namespace: openshift-gitops
      namespace: openshift-gitops
    lastGroupSyncTime: "2023-12-05T18:23:20Z"
        namespace: busybox-workloads-8
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-9
    namespace: openshift-gitops
      namespace: openshift-gitops
    lastGroupSyncTime: "2023-12-03T14:21:08Z"
        namespace: busybox-workloads-9
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-1
    namespace: openshift-gitops
      namespace: openshift-gitops
    lastGroupSyncTime: "2023-12-05T18:20:04Z"
        namespace: busybox-workloads-1
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-13
    namespace: openshift-gitops
      namespace: openshift-gitops
    lastGroupSyncTime: "2023-12-05T18:20:05Z"
        namespace: busybox-workloads-13
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-17
    namespace: openshift-gitops
      namespace: openshift-gitops
    lastGroupSyncTime: "2023-12-05T18:00:00Z"
        namespace: busybox-workloads-17
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-2
    namespace: openshift-gitops
      namespace: openshift-gitops
    lastGroupSyncTime: "2023-12-05T18:20:00Z"
        namespace: busybox-workloads-2
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-3
    namespace: openshift-gitops
      namespace: openshift-gitops
    lastGroupSyncTime: "2023-12-05T18:20:00Z"
        namespace: busybox-workloads-3
amagrawa:acm$ date -u
Tuesday 05 December 2023 06:28:42 PM UTC
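
A stalled sync can be made explicit by comparing each lastGroupSyncTime against the current time; a rough sketch using GNU date (the 30-minute threshold is an arbitrary illustration, not a product default):

$ now=$(date -u +%s)
$ oc get drpc -A -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.lastGroupSyncTime}{"\n"}{end}' |
  while read name ts; do
    [ -z "$ts" ] && continue              # skip DRPCs that have never synced
    age=$(( (now - $(date -u -d "$ts" +%s)) / 60 ))
    [ "$age" -gt 30 ] && echo "$name: last sync ${age}m ago (stale?)"
  done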

From C1-
amagrawa$ oc get pods -A | grep volsync-rsync-tls-src

This listed the running pods; then:
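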
amagrawa$ oc logs pod/volsync-rsync-tls-src-busybox-pvc-2-6zds7 -n busybox-workloads-9
2023.12.05 18:42:18 LOG3[8]: No remote host resolved
2023.12.05 18:42:18 LOG5[8]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket
2023.12.05 18:42:18 LOG7[8]: Local descriptor (FD=3) closed
2023.12.05 18:42:18 LOG7[8]: Service [rsync] finished (0 left)
rsync: safe_read failed to read 1 bytes [sender]: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) at io.c(276) [sender=3.1.3]
2023.12.05 18:42:18 LOG7[main]: Found 1 ready file descriptor(s)
2023.12.05 18:42:18 LOG7[main]: FD=4 events=0x2001 revents=0x0
2023.12.05 18:42:18 LOG7[main]: FD=8 events=0x2001 revents=0x1
2023.12.05 18:42:18 LOG7[main]: Service [rsync] accepted (FD=3) from 127.0.0.1:51248
2023.12.05 18:42:18 LOG7[9]: Service [rsync] started
2023.12.05 18:42:18 LOG7[9]: Setting local socket options (FD=3)
2023.12.05 18:42:18 LOG7[9]: Option SO_KEEPALIVE set on local socket
2023.12.05 18:42:18 LOG7[9]: Option TCP_KEEPIDLE set on local socket
2023.12.05 18:42:18 LOG7[9]: Option TCP_NODELAY set on local socket
2023.12.05 18:42:18 LOG5[9]: Service [rsync] accepted connection from 127.0.0.1:51248
2023.12.05 18:42:18 LOG3[9]: Error resolving "volsync-rsync-tls-dst-busybox-pvc-2.busybox-workloads-9.svc.clusterset.local": Neither nodename nor servname known (EAI_NONAME)
2023.12.05 18:42:18 LOG3[9]: No remote host resolved
2023.12.05 18:42:18 LOG5[9]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket
2023.12.05 18:42:18 LOG7[9]: Local descriptor (FD=3) closed
2023.12.05 18:42:18 LOG7[9]: Service [rsync] finished (0 left)
rsync: safe_read failed to read 1 bytes [sender]: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) at io.c(276) [sender=3.1.3]
Syncronization failed. Retrying in 32 seconds. Retry 5/5.



amagrawa$ oc logs pod/volsync-rsync-tls-src-busybox-pvc-4-2z7st -n busybox-workloads-9
2023.12.05 18:44:10 LOG3[8]: No remote host resolved
2023.12.05 18:44:10 LOG5[8]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket
2023.12.05 18:44:10 LOG7[8]: Local descriptor (FD=3) closed
2023.12.05 18:44:10 LOG7[8]: Service [rsync] finished (0 left)
2023.12.05 18:44:10 LOG7[main]: Found 1 ready file descriptor(s)
2023.12.05 18:44:10 LOG7[main]: FD=4 events=0x2001 revents=0x0
2023.12.05 18:44:10 LOG7[main]: FD=8 events=0x2001 revents=0x1
2023.12.05 18:44:10 LOG7[main]: Service [rsync] accepted (FD=3) from 127.0.0.1:60122
rsync: safe_read failed to read 1 bytes [sender]: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) at io.c(276) [sender=3.1.3]
2023.12.05 18:44:10 LOG7[9]: Service [rsync] started
2023.12.05 18:44:10 LOG7[9]: Setting local socket options (FD=3)
2023.12.05 18:44:10 LOG7[9]: Option SO_KEEPALIVE set on local socket
2023.12.05 18:44:10 LOG7[9]: Option TCP_KEEPIDLE set on local socket
2023.12.05 18:44:10 LOG7[9]: Option TCP_NODELAY set on local socket
2023.12.05 18:44:10 LOG5[9]: Service [rsync] accepted connection from 127.0.0.1:60122
2023.12.05 18:44:10 LOG3[9]: Error resolving "volsync-rsync-tls-dst-busybox-pvc-4.busybox-workloads-9.svc.clusterset.local": Neither nodename nor servname known (EAI_NONAME)
2023.12.05 18:44:10 LOG3[9]: No remote host resolved
2023.12.05 18:44:10 LOG5[9]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket
2023.12.05 18:44:10 LOG7[9]: Local descriptor (FD=3) closed
2023.12.05 18:44:10 LOG7[9]: Service [rsync] finished (0 left)
rsync: safe_read failed to read 1 bytes [sender]: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) at io.c(276) [sender=3.1.3]
Syncronization failed. Retrying in 32 seconds. Retry 5/5.
rsync completed in 68s
Synchronization failed. rsync returned: 1212
2023.12.05 18:44:42 LOG7[main]: Found 1 ready file descriptor(s)
2023.12.05 18:44:42 LOG7[main]: FD=4 events=0x2001 revents=0x1
2023.12.05 18:44:42 LOG7[main]: FD=8 events=0x2001 revents=0x0
2023.12.05 18:44:42 LOG7[main]: Dispatching a signal from the signal pipe
2023.12.05 18:44:42 LOG7[main]: Processing SIGNAL_TERMINATE
2023.12.05 18:44:42 LOG5[main]: Terminated
2023.12.05 18:44:42 LOG7[main]: Leak detection table utilization: 13/997, 1.30%
2023.12.05 18:44:42 LOG7[main]: Removed pid file /tmp/stunnel-client.pid
2023.12.05 18:44:42 LOG7[main]: Terminating the cron thread
2023.12.05 18:44:42 LOG5[main]: Terminating 1 service thread(s)
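
The EAI_NONAME failures above mean the clusterset.local names for the destination services are no longer resolving. One way to check whether the Submariner side still exports them (ServiceExport is the Submariner/Lighthouse CRD behind clusterset.local names; the pod name and image below are illustrative):

$ oc get serviceexport -n busybox-workloads-9
# Try resolving the destination service name from inside the cluster:
$ oc -n busybox-workloads-9 run dns-check --rm -it --restart=Never \
    --image=registry.access.redhat.com/ubi9/ubi -- \
    getent hosts volsync-rsync-tls-dst-busybox-pvc-2.busybox-workloads-9.svc.clusterset.local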



From C2-

amagrawa:~$ oc logs  pod/volsync-rsync-tls-src-busybox-pvc-3-wb2sv -n busybox-workloads-10
2023.12.05 18:45:25 LOG3[8]: No remote host resolved
2023.12.05 18:45:25 LOG5[8]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket
2023.12.05 18:45:25 LOG7[8]: Local descriptor (FD=3) closed
2023.12.05 18:45:25 LOG7[8]: Service [rsync] finished (0 left)
rsync: safe_read failed to read 1 bytes [sender]: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) at io.c(276) [sender=3.1.3]
2023.12.05 18:45:25 LOG7[main]: Found 1 ready file descriptor(s)
2023.12.05 18:45:25 LOG7[main]: FD=4 events=0x2001 revents=0x0
2023.12.05 18:45:25 LOG7[main]: FD=8 events=0x2001 revents=0x1
2023.12.05 18:45:25 LOG7[main]: Service [rsync] accepted (FD=3) from 127.0.0.1:35110
2023.12.05 18:45:25 LOG7[9]: Service [rsync] started
2023.12.05 18:45:25 LOG7[9]: Setting local socket options (FD=3)
2023.12.05 18:45:25 LOG7[9]: Option SO_KEEPALIVE set on local socket
2023.12.05 18:45:25 LOG7[9]: Option TCP_KEEPIDLE set on local socket
2023.12.05 18:45:25 LOG7[9]: Option TCP_NODELAY set on local socket
2023.12.05 18:45:25 LOG5[9]: Service [rsync] accepted connection from 127.0.0.1:35110
2023.12.05 18:45:25 LOG3[9]: Error resolving "volsync-rsync-tls-dst-busybox-pvc-3.busybox-workloads-10.svc.clusterset.local": Neither nodename nor servname known (EAI_NONAME)
2023.12.05 18:45:25 LOG3[9]: No remote host resolved
2023.12.05 18:45:25 LOG5[9]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket
2023.12.05 18:45:25 LOG7[9]: Local descriptor (FD=3) closed
2023.12.05 18:45:25 LOG7[9]: Service [rsync] finished (0 left)
rsync: safe_read failed to read 1 bytes [sender]: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) at io.c(276) [sender=3.1.3]
Syncronization failed. Retrying in 32 seconds. Retry 5/5.
rsync completed in 67s
Synchronization failed. rsync returned: 1212
2023.12.05 18:45:57 LOG7[main]: Found 1 ready file descriptor(s)
2023.12.05 18:45:57 LOG7[main]: FD=4 events=0x2001 revents=0x1
2023.12.05 18:45:57 LOG7[main]: FD=8 events=0x2001 revents=0x0
2023.12.05 18:45:57 LOG7[main]: Dispatching a signal from the signal pipe
2023.12.05 18:45:57 LOG7[main]: Processing SIGNAL_TERMINATE
2023.12.05 18:45:57 LOG5[main]: Terminated
2023.12.05 18:45:57 LOG7[main]: Leak detection table utilization: 13/997, 1.30%
2023.12.05 18:45:57 LOG7[main]: Removed pid file /tmp/stunnel-client.pid
2023.12.05 18:45:57 LOG7[main]: Terminating the cron thread
2023.12.05 18:45:57 LOG5[main]: Terminating 1 service thread(s)
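
The same stall should also be visible on the VolSync side, where status.lastSyncTime on the ReplicationSource stops advancing for the affected namespaces (field paths per the volsync.backube/v1alpha1 API; latestMoverStatus is assumed to be populated on failure):

$ oc get replicationsource -A
# Name, last sync time, and last mover result per source:
$ oc get replicationsource -n busybox-workloads-10 -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.lastSyncTime}{"\t"}{.status.latestMoverStatus.result}{"\n"}{end}'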


Expected results: Data sync for CephFS-backed workloads should not stop after the managed cluster that was down is successfully restored.


Additional info:

Comment 9 Mudit Agarwal 2024-01-30 09:52:45 UTC
Karolin, please add trackers whenever available.

Comment 21 Aman Agrawal 2024-05-30 19:38:44 UTC
Tested with following versions:

ceph version 18.2.1-188.el9cp (b1ae9c989e2f41dcfec0e680c11d1d9465b1db0e) reef (stable)
OCP 4.16.0-0.nightly-2024-05-23-173505
ACM 2.11.0-DOWNSTREAM-2024-05-23-15-16-26
MCE 2.6.0-104 
ODF 4.16.0-108.stable
Gitops v1.12.3 
Submariner 0.18.0

Platform- VMware

When the down managed cluster is recovered, data sync for the CephFS workloads running on it resumes successfully, which matches the expected results after performing the steps to reproduce.

Cleanup and data sync also resume successfully for all the failed-over workloads.

The fix provided in ACM-9159 LGTM. Marking this bug as verified.

Comment 24 errata-xmlrpc 2024-07-17 13:11:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

