Bug 2266154

Summary: [RDR] Data replication stopped for most of the workloads
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Aman Agrawal <amagrawa>
Component: ceph Assignee: Brad Hubbard <bhubbard>
ceph sub component: RADOS QA Contact: Elad <ebenahar>
Status: ASSIGNED --- Docs Contact:
Severity: high    
Priority: unspecified CC: bhubbard, bmekhiss, bniver, ebenahar, idryomov, khiremat, kramdoss, kseeger, muagarwa, neesingh, nojha, paarora, rzarzyns, sheggodu, sostapov, vshankar
Version: 4.15 Flags: khiremat: needinfo-
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Aman Agrawal 2024-02-26 20:02:32 UTC
Description of problem (please be as detailed as possible and provide log
snippets):


Version of all relevant components (if applicable):
OCP 4.15.0-0.nightly-2024-02-16-235514
ODF v4.15.0-149.stable
ACM 2.10.0-DOWNSTREAM-2024-02-15-05-34-13
Submariner 0.17.0
ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Configured a Regional DR (RDR) setup with DR-protected workloads of both subscription and appset types, backed by RBD and CephFS in all combinations. Ran IOs for 3-4 days, after which replication was found to have stopped for most of the workloads.

No failover/relocate action was performed on them.

The subctl verify connectivity check passed for both managed clusters.



Actual results: Data replication stopped for most of the workloads

amagrawa:~$ drpc
NAMESPACE              NAME                                     AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
busybox-workloads-13   cephfs-sub-busybox13-placement-1-drpc    4d23h   amagrawa-m1                                         Deployed       Completed     2024-02-21T20:07:19Z   48.129206831s     True
busybox-workloads-14   cephfs-sub-busybox14-placement-1-drpc    4d23h   amagrawa-m1                                         Deployed       Completed     2024-02-21T20:08:50Z   32.138146869s     True
busybox-workloads-15   cephfs-sub-busybox15-placement-1-drpc    4d23h   amagrawa-m1                                         Deployed       Completed     2024-02-21T20:09:59Z   49.128696954s     True
busybox-workloads-16   cephfs-sub-busybox16-placement-1-drpc    4d23h   amagrawa-m2                                         Deployed       Completed     2024-02-21T20:11:02Z   45.122431672s     True
busybox-workloads-5    rbd-sub-busybox5-placement-1-drpc        5d      amagrawa-m1                                         Deployed       Completed     2024-02-21T19:55:17Z   15.08242303s      True
busybox-workloads-6    rbd-sub-busybox6-placement-1-drpc        5d      amagrawa-m1                                         Deployed       Completed     2024-02-21T19:56:46Z   2.073870577s      True
busybox-workloads-7    rbd-sub-busybox7-placement-1-drpc        5d      amagrawa-m1                                         Deployed       Completed     2024-02-21T19:57:44Z   15.036975914s     True
busybox-workloads-8    rbd-sub-busybox8-placement-1-drpc        4d23h   amagrawa-m2                                         Deployed       Completed     2024-02-21T19:58:37Z   23.038867468s     True
openshift-gitops       cephfs-appset-busybox10-placement-drpc   4d23h   amagrawa-m1                                         Deployed       Completed     2024-02-21T20:03:32Z   46.149574197s     True
openshift-gitops       cephfs-appset-busybox11-placement-drpc   4d23h   amagrawa-m1                                         Deployed       Completed     2024-02-21T20:04:58Z   32.153456796s     True
openshift-gitops       cephfs-appset-busybox12-placement-drpc   4d23h   amagrawa-m2                                         Deployed       Completed     2024-02-21T20:05:59Z   35.134274567s     True
openshift-gitops       cephfs-appset-busybox9-placement-drpc    4d23h   amagrawa-m1                                         Deployed       Completed     2024-02-21T20:02:33Z   31.288294384s     True
openshift-gitops       rbd-appset-busybox1-placement-drpc       5d      amagrawa-m1                                         Deployed       Completed     2024-02-21T19:38:58Z   8m55.725779771s   True
openshift-gitops       rbd-appset-busybox2-placement-drpc       5d      amagrawa-m1                                         Deployed       Completed     2024-02-21T19:43:44Z   4m22.363511628s   True
openshift-gitops       rbd-appset-busybox3-placement-drpc       4d23h   amagrawa-m1                                         Deployed       Completed     2024-02-21T19:59:44Z   21.04601057s      True
openshift-gitops       rbd-appset-busybox4-placement-drpc       5d      amagrawa-m2                                         Deployed       Completed     2024-02-21T19:53:53Z   16.039184856s     True


amagrawa:~$ group
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-13
    namespace: busybox-workloads-13
      namespace: busybox-workloads-13
    lastGroupSyncTime: "2024-02-24T14:32:45Z"
        namespace: busybox-workloads-13
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-14
    namespace: busybox-workloads-14
      namespace: busybox-workloads-14
    lastGroupSyncTime: "2024-02-24T14:32:07Z"
        namespace: busybox-workloads-14
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-15
    namespace: busybox-workloads-15
      namespace: busybox-workloads-15
    lastGroupSyncTime: "2024-02-24T14:32:20Z"
        namespace: busybox-workloads-15
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-16
    namespace: busybox-workloads-16
      namespace: busybox-workloads-16
    lastGroupSyncTime: "2024-02-26T14:02:07Z"
        namespace: busybox-workloads-16
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-5
    namespace: busybox-workloads-5
      namespace: busybox-workloads-5
    lastGroupSyncTime: "2024-02-25T16:15:05Z"
        namespace: busybox-workloads-5
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-6
    namespace: busybox-workloads-6
      namespace: busybox-workloads-6
    lastGroupSyncTime: "2024-02-25T16:15:01Z"
        namespace: busybox-workloads-6
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-7
    namespace: busybox-workloads-7
      namespace: busybox-workloads-7
    lastGroupSyncTime: "2024-02-25T16:15:01Z"
        namespace: busybox-workloads-7
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-8
    namespace: busybox-workloads-8
      namespace: busybox-workloads-8
    lastGroupSyncTime: "2024-02-26T19:55:00Z"
        namespace: busybox-workloads-8
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-10
    namespace: openshift-gitops
      namespace: openshift-gitops
    lastGroupSyncTime: "2024-02-24T14:32:57Z"
        namespace: busybox-workloads-10
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-11
    namespace: openshift-gitops
      namespace: openshift-gitops
    lastGroupSyncTime: "2024-02-24T14:32:37Z"
        namespace: busybox-workloads-11
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-12
    namespace: openshift-gitops
      namespace: openshift-gitops
    lastGroupSyncTime: "2024-02-24T14:25:53Z"
        namespace: busybox-workloads-12
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-9
    namespace: openshift-gitops
      namespace: openshift-gitops
    lastGroupSyncTime: "2024-02-24T14:32:43Z"
        namespace: busybox-workloads-9
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-1
    namespace: openshift-gitops
      namespace: openshift-gitops
    lastGroupSyncTime: "2024-02-25T16:15:03Z"
        namespace: busybox-workloads-1
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-2
    namespace: openshift-gitops
      namespace: openshift-gitops
    lastGroupSyncTime: "2024-02-25T16:15:01Z"
        namespace: busybox-workloads-2
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-3
    namespace: openshift-gitops
      namespace: openshift-gitops
    lastGroupSyncTime: "2024-02-25T16:15:03Z"
        namespace: busybox-workloads-3
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-4
    namespace: openshift-gitops
      namespace: openshift-gitops
    lastGroupSyncTime: "2024-02-26T19:55:00Z"
        namespace: busybox-workloads-4


amagrawa:~$ date -u
Monday 26 February 2024 07:58:45 PM UTC

Compared with the current time shown above, lastGroupSyncTime is lagging by 1 to 2 days for most of the workloads, but not all of them.
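As a rough way to quantify that lag, the sketch below compares a few of the lastGroupSyncTime values listed above against the `date -u` timestamp from this report. It is only an illustration: the helper names and the one-day threshold are assumptions made for this example, not values taken from the DR policy or from any Ramen tooling.

```python
from datetime import datetime, timedelta, timezone

# Reference time copied from the `date -u` output in this report.
NOW = datetime(2024, 2, 26, 19, 58, 45, tzinfo=timezone.utc)

# A subset of lastGroupSyncTime values copied from the output above.
SAMPLES = {
    "busybox-workloads-13": "2024-02-24T14:32:45Z",
    "busybox-workloads-16": "2024-02-26T14:02:07Z",
    "busybox-workloads-5": "2024-02-25T16:15:05Z",
    "busybox-workloads-8": "2024-02-26T19:55:00Z",
}

# Assumption for illustration only: treat anything older than one day as lagging.
THRESHOLD = timedelta(days=1)


def sync_lag(last_sync: str, now: datetime) -> timedelta:
    """Return how far a 'YYYY-MM-DDTHH:MM:SSZ' timestamp is behind `now`."""
    synced = datetime.strptime(last_sync, "%Y-%m-%dT%H:%M:%SZ")
    return now - synced.replace(tzinfo=timezone.utc)


def lagging(samples: dict, now: datetime, threshold: timedelta) -> dict:
    """Return {namespace: lag} for entries whose lag exceeds `threshold`."""
    result = {}
    for namespace, last_sync in samples.items():
        lag = sync_lag(last_sync, now)
        if lag > threshold:
            result[namespace] = lag
    return result


if __name__ == "__main__":
    for namespace, lag in sorted(lagging(SAMPLES, NOW, THRESHOLD).items()):
        print(f"{namespace}: last group sync was {lag} ago")
```

With a stricter threshold (for example, a small multiple of the configured sync interval, which is not stated in this report), more of the workloads above would be flagged.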

Logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/26feb24/



Expected results: Data replication should continue to work while IOs are being run continuously on an RDR setup.


Additional info: