Bug 2251205 - [4.14 clone] [RDR] [Hub recovery] Sync for all cephfs workloads stopped post hub recovery
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.14.1
Assignee: Benamar Mekhissi
QA Contact: Aman Agrawal
URL:
Whiteboard:
Depends On: 2250152
Blocks:
 
Reported: 2023-11-23 13:10 UTC by Karolin Seeger
Modified: 2023-12-07 13:21 UTC (History)

Fixed In Version: 4.14.1-12
Doc Type: No Doc Update
Doc Text:
Clone Of: 2250152
Environment:
Last Closed: 2023-12-07 13:21:41 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github RamenDR ramen pull 1144 0 None Merged Add label to volsync secret for inclusion in hub recovery backup 2023-11-23 13:26:03 UTC
Github red-hat-storage ramen pull 158 0 None open Bug 2251205: Add label to volsync secret for inclusion in hub recovery backup 2023-11-23 14:10:49 UTC
Red Hat Product Errata RHBA-2023:7696 0 None None None 2023-12-07 13:21:42 UTC

Description Karolin Seeger 2023-11-23 13:10:40 UTC
+++ This bug was initially created as a clone of Bug #2250152 +++

Description of problem (please be as detailed as possible and provide log
snippets):


Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-11-09-204811
Volsync 0.8.0
Submariner 0.16.2
ACM quay.io:443/acm-d/acm-custom-registry:v2.9.0-RC2        
odf-multicluster-orchestrator.v4.14.1-rhodf  
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
Latency 50ms RTT


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:

**Active hub at neutral site**

1. Deployed multiple rbd- and cephfs-backed workloads of both appset and subscription types.
2. Failed over and relocated them so that they all end up running on the primary managed cluster (which is expected to host all the workloads and may be hit by a disaster). A few of them are exceptions; check the drpc -o wide status in Step 3.
3. Ensure that the workloads are in distinct states such as Deployed, FailedOver, and Relocated.

Here, amagrawa-10n-1 is C1, the primary managed cluster.

From the active hub:

amagrawa:hub$ drpc
NAMESPACE              NAME                                                AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
busybox-workloads-12   cephfs-sub-busybox-workloads-12-placement-1-drpc    7h18m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:54:29Z   5m59.196575462s   True
busybox-workloads-13   cephfs-sub-busybox-workloads-13-placement-1-drpc    7h17m   amagrawa-10n-1                       Relocate       Relocated      Completed     2023-11-16T12:12:36Z   5m58.842880173s   True
busybox-workloads-14   cephfs-sub-busybox-workloads-14-placement-1-drpc    7h16m   amagrawa-10n-1     amagrawa-10n-2    Failover       FailedOver     Completed     2023-11-16T08:29:07Z   3m19.098202668s   True
busybox-workloads-6    rbd-sub-busybox-workloads-6-placement-1-drpc        7h35m   amagrawa-10n-1                                      Deployed       Completed                                              True
busybox-workloads-7    rbd-sub-busybox-workloads-7-placement-1-drpc        7h34m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:53:38Z   9m59.85663627s    True
busybox-workloads-8    rbd-sub-busybox-workloads-8-placement-1-drpc        7h32m   amagrawa-10n-1                       Relocate       Relocated      Completed     2023-11-16T08:21:05Z   4m13.272955733s   True
openshift-gitops       cephfs-appset-busybox-workloads-10-placement-drpc   7h22m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:15:50Z   3m22.540081438s   True
openshift-gitops       cephfs-appset-busybox-workloads-11-placement-drpc   7h20m   amagrawa-10n-1                       Relocate       Relocated      Completed     2023-11-16T08:00:32Z   5m38.794985745s   True
openshift-gitops       cephfs-appset-busybox-workloads-9-placement-drpc    7h24m   amagrawa-10n-2                       Relocate       Relocated      Completed     2023-11-16T08:28:59Z   8m47.541429779s   True
openshift-gitops       rbd-appset-busybox-workloads-1-placement-drpc       7h43m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:16:14Z   8m31.330049487s   True
openshift-gitops       rbd-appset-busybox-workloads-2-placement-drpc       7h42m   amagrawa-10n-1                       Relocate       Relocated      Completed     2023-11-16T08:16:28Z   7m59.477897296s   True
openshift-gitops       rbd-appset-busybox-workloads-3-placement-drpc       7h41m   amagrawa-10n-2     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:27:18Z   7m4.760183798s    True
openshift-gitops       rbd-appset-busybox-workloads-4-placement-drpc       7h39m   amagrawa-10n-1                                      Deployed       Completed                                              True

4. Let at least one or two new backups be taken (one every hour) covering all the different workload states, while progression is Completed and no action is running on any workload. Also ensure that sync is working for all the workloads on the active hub and that the cluster is healthy. Note the drpc -o wide output and lastGroupSyncTime, download the backups from S3, etc.

amagrawa:hub$ group|grep SyncTime
    lastGroupSyncTime: "2023-11-16T14:01:32Z"
    lastGroupSyncTime: "2023-11-16T14:06:09Z"
    lastGroupSyncTime: "2023-11-16T14:01:03Z"
    lastGroupSyncTime: "2023-11-16T13:45:09Z"
    lastGroupSyncTime: "2023-11-16T13:50:51Z"
    lastGroupSyncTime: "2023-11-16T13:50:40Z"
    lastGroupSyncTime: "2023-11-16T14:00:51Z"
    lastGroupSyncTime: "2023-11-16T14:06:12Z"
    lastGroupSyncTime: "2023-11-16T13:01:45Z"
    lastGroupSyncTime: "2023-11-16T13:50:36Z"
    lastGroupSyncTime: "2023-11-16T13:45:16Z"
    lastGroupSyncTime: "2023-11-16T13:56:22Z"
    lastGroupSyncTime: "2023-11-16T13:45:11Z"


amagrawa:hub$ date -u
Thursday 16 November 2023 02:12:11 PM UTC

5. Bring the active hub completely down and move to the passive hub. Restore the backups and ensure that the velero backup reports a successful restoration. Make sure both managed clusters are successfully reported and the drpolicy gets validated.
6. Wait for the drpc resources to be restored, then check whether all the workloads are in their last backed-up state.

They seem to have retained the last state that was backed up, so everything is fine so far.

amagrawa:~$ drpc
NAMESPACE              NAME                                                AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME   DURATION   PEER READY
busybox-workloads-12   cephfs-sub-busybox-workloads-12-placement-1-drpc    4h16m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed                             True
busybox-workloads-13   cephfs-sub-busybox-workloads-13-placement-1-drpc    4h16m   amagrawa-10n-1                       Relocate       Relocated      Completed                             True
busybox-workloads-14   cephfs-sub-busybox-workloads-14-placement-1-drpc    4h16m   amagrawa-10n-1     amagrawa-10n-2    Failover       FailedOver     Completed                             True
busybox-workloads-6    rbd-sub-busybox-workloads-6-placement-1-drpc        4h16m   amagrawa-10n-1                                      Deployed       Completed                             True
busybox-workloads-7    rbd-sub-busybox-workloads-7-placement-1-drpc        4h16m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed                             True
busybox-workloads-8    rbd-sub-busybox-workloads-8-placement-1-drpc        4h16m   amagrawa-10n-1                       Relocate       Relocated      Completed                             True
openshift-gitops       cephfs-appset-busybox-workloads-10-placement-drpc   4h16m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed                             True
openshift-gitops       cephfs-appset-busybox-workloads-11-placement-drpc   4h16m   amagrawa-10n-1                       Relocate       Relocated      Completed                             True
openshift-gitops       cephfs-appset-busybox-workloads-9-placement-drpc    4h16m   amagrawa-10n-2                       Relocate       Relocated      Completed                             True
openshift-gitops       rbd-appset-busybox-workloads-1-placement-drpc       4h16m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed                             True
openshift-gitops       rbd-appset-busybox-workloads-2-placement-drpc       4h16m   amagrawa-10n-1                       Relocate       Relocated      Completed                             True
openshift-gitops       rbd-appset-busybox-workloads-3-placement-drpc       4h16m   amagrawa-10n-2     amagrawa-10n-1    Failover       FailedOver     Completed                             True
openshift-gitops       rbd-appset-busybox-workloads-4-placement-drpc       4h16m   amagrawa-10n-1                                      Deployed       Completed                             True

7. Let IOs continue for a few hours. We observed that data sync for the rbd-based workloads was progressing fine, but sync stopped for all the cephfs-based workloads, of both subscription and appset types.
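
One way to spot the stalled sync from step 7 is to compare each lastGroupSyncTime against the current UTC time. The following is a minimal sketch, assuming a `date` that supports `-d` (GNU coreutils) and input lines shaped like the lastGroupSyncTime output in step 4. The function names and the 1800-second default threshold are illustrative assumptions, not part of ramen or ODF.

```shell
# iso_to_epoch TIMESTAMP
# Convert an ISO 8601 UTC timestamp such as 2023-11-16T14:01:32Z to epoch seconds.
iso_to_epoch() {
  # "2023-11-16T14:01:32Z" -> "2023-11-16 14:01:32", parsed as UTC because of -u.
  date -u -d "$(printf '%s' "$1" | tr 'T' ' ' | tr -d 'Z')" +%s
}

# sync_lag NOW_ISO [THRESHOLD_SECONDS]
# Reads lastGroupSyncTime lines on stdin and prints, per timestamp:
#   <timestamp> <lag in seconds> <OK|STALE>
# THRESHOLD_SECONDS defaults to 1800 (an arbitrary value chosen for this sketch).
sync_lag() {
  now=$(iso_to_epoch "$1") || return 1
  threshold="${2:-1800}"
  # Keep only the quoted timestamp that follows "lastGroupSyncTime:".
  sed -n 's/.*lastGroupSyncTime: *"\([^"]*\)".*/\1/p' |
  while read -r ts; do
    epoch=$(iso_to_epoch "$ts") || continue
    lag=$(( now - epoch ))
    if [ "$lag" -gt "$threshold" ]; then state=STALE; else state=OK; fi
    printf '%s %d %s\n' "$ts" "$lag" "$state"
  done
}
```

For example, `oc get drpc -A -o yaml | sync_lag "$(date -u +%Y-%m-%dT%H:%M:%SZ)"` would flag entries older than the threshold, assuming the `group` alias used above wraps a similar `oc get drpc` call. The threshold should be chosen relative to the scheduling interval of the DRPolicy in use.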


Actual results: Sync for all cephfs workloads stopped post hub recovery.

Logs- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/16nov23/logs/

The VolumeSynchronizationDelay alert fires on the passive hub for all cephfs workloads when the monitoring label is applied.

Expected results: Sync for all cephfs workloads should continue without any issues post hub recovery.


Additional info:

--- Additional comment from RHEL Program Management on 2023-11-16 19:05:30 UTC ---

This bug having no release flag set previously, is now set with release flag 'odf‑4.15.0' to '?', and so is being proposed to be fixed at the ODF 4.15.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any previously set while release flag was missing, have now been reset since the Acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2023-11-16 19:05:30 UTC ---

Since this bug has severity set to 'urgent', it is being proposed as a blocker for the currently set release flag. Please resolve ASAP.

--- Additional comment from Benamar Mekhissi on 2023-11-17 02:42:37 UTC ---

The Submariner Lighthouse agent log shows port conflicts every couple of seconds. The same message can be seen in the ReplicationDestination ServiceExport:

```
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  creationTimestamp: "2023-11-16T08:20:32Z"
  generation: 1
  name: volsync-rsync-tls-dst-busybox-pvc-4
  namespace: busybox-workloads-10
  ownerReferences:
  - apiVersion: volsync.backube/v1alpha1
    kind: ReplicationDestination
    name: busybox-pvc-4
    uid: 6c48fffa-4375-4845-9272-6801f94b5ac3
  resourceVersion: "4909434"
  uid: 72bfc6a0-1fed-434b-bfb5-7ed9674eba9e
status:
  conditions:
  - lastTransitionTime: "2023-11-16T08:20:42Z"
    message: ""
    reason: ""
    status: "True"
    type: Valid
  - lastTransitionTime: "2023-11-16T08:20:52Z"
    message: Service was successfully exported to the broker
    reason: ""
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-11-16T14:35:02Z"
    message: 'The service ports conflict between the constituent clusters [amagrawa-10n-2,
      amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-1, amagrawa-10n-1, amagrawa-10n-2,
      amagrawa-10n-2, amagrawa-10n-1, amagrawa-10n-1, amagrawa-10n-2, amagrawa-10n-2,
      amagrawa-10n-2, amagrawa-10n-1, amagrawa-10n-1, amagrawa-10n-1, amagrawa-10n-2,
      amagrawa-10n-2, amagrawa-10n-1, amagrawa-10n-1, amagrawa-10n-1, amagrawa-10n-2,
      amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2,
      amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-1, amagrawa-10n-1,
      amagrawa-10n-1, amagrawa-10n-1, amagrawa-10n-2, amagrawa-10n-1, amagrawa-10n-1,
      amagrawa-10n-2, amagrawa-10n-1, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2,
      amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2,
      amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2,
      amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2,
      amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-1, amagrawa-10n-2, amagrawa-10n-2,
      amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2,
      amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-1, amagrawa-10n-2,
      amagrawa-10n-1, amagrawa-10n-1, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-1,
      amagrawa-10n-1, amagrawa-10n-1, amagrawa-10n-1, amagrawa-10n-1, amagrawa-10n-1,
      amagrawa-10n-2, amagrawa-10n-1, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-1,
      amagrawa-10n-2, amagrawa-10n-1, amagrawa-10n-1, amagrawa-10n-2, amagrawa-10n-2,
      amagrawa-10n-1, amagrawa-10n-1, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2,
      amagrawa-10n-1, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-1,
      amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2,
      amagrawa-10n-2, amagrawa-10n-1, amagrawa-10n-2, amagrawa-10n-1, amagrawa-10n-1,
      amagrawa-10n-1, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-1, amagrawa-10n-1,
      amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-1, amagrawa-10n-2, amagrawa-10n-2,
      amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-1, amagrawa-10n-1,
      amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-2, amagrawa-10n-1,
      amagrawa-10n-1]. The service will expose the intersection of all the ports: '
    reason: ConflictingPorts
    status: "True"
    type: Conflict
```
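
The Conflict message above repeats the same two cluster names many times, which makes it hard to read. A small sketch for condensing it follows, assuming the ServiceExport YAML is piped in on stdin and contains a single bracketed cluster list as in the output above. The function name is hypothetical.

```shell
# conflict_cluster_counts
# Reads a ServiceExport (YAML) on stdin and prints "<cluster> <count>" for each
# cluster named inside the [...] list of the Conflict condition message.
conflict_cluster_counts() {
  tr '\n' ' ' |                          # join the wrapped message lines
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' |  # keep only the text between [ and ]
    tr ',' '\n' |                        # one cluster name per line
    sed 's/^ *//; s/ *$//' |             # trim padding left by the YAML wrapping
    grep -v '^$' |                       # drop empty entries
    sort | uniq -c |                     # count occurrences per cluster
    awk '{ print $2, $1 }'               # reorder to "<cluster> <count>"
}
```

It could be fed with something like `oc -n busybox-workloads-10 get serviceexport volsync-rsync-tls-dst-busybox-pvc-4 -o yaml | conflict_cluster_counts` on the managed cluster that owns the ReplicationDestination.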

The Submariner team is engaged. We've been asked to run `subctl verify`. @amagrawa, can you please run the "subctl verify" test and capture the output?

I ran the verification manually (https://submariner.io/getting-started/quickstart/kind/#verify-manually), and it seems to time out.

--- Additional comment from Aman Agrawal on 2023-11-17 08:00:30 UTC ---

The subctl verify connectivity check, run with contexts for both managed clusters, passed.

--- Additional comment from Benamar Mekhissi on 2023-11-17 19:58:21 UTC ---

There are two issues contributing to this problem. The first, mentioned earlier, involves a Submariner issue that doesn't impact functionality but can be misleading and challenging to diagnose. The Submariner team is addressing it in subsequent releases.

The second issue relates to ACM/ODF, specifically the automatic backup of `policies.policy.open-cluster-management.io`. These ACM policies are responsible for delivering the volsync mover secret from the hub to the managed clusters. The volsync mover secrets aren't backed up. As a result, the new active hub has a policy that doesn't point to the secret, causing the mover pod to fail to establish a TLS connection.
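
To check which volsync secrets on a hub would be skipped by the backup, one option is to list the secrets that lack a backup label. This is a rough sketch only: it assumes the input is the output of `oc get secret -A --show-labels`, that the volsync secret names contain `vs-secret` (as the policy names in this bug suggest), and that the label key is passed in by the caller. The key `cluster.open-cluster-management.io/backup` used in the comment is an assumption about the ACM backup label and is not stated in this report.

```shell
# secrets_missing_label LABEL_KEY
# Reads `oc get secret -A --show-labels` output on stdin and prints
# "<namespace>/<name>" for every secret whose name contains "vs-secret"
# and whose LABELS column does not include LABEL_KEY.
#
# Hypothetical usage (label key is an assumption, see lead-in):
#   oc get secret -A --show-labels | \
#     secrets_missing_label cluster.open-cluster-management.io/backup
secrets_missing_label() {
  awk -v key="$1" '
    NR == 1 && $1 == "NAMESPACE" { next }   # skip the header row
    $2 ~ /vs-secret/ {
      # The last column holds "k1=v1,k2=v2" or "<none>".
      n = split($NF, labels, ",")
      found = 0
      for (i = 1; i <= n; i++) {
        split(labels[i], kv, "=")
        if (kv[1] == key) found = 1
      }
      if (!found) print $1 "/" $2
    }'
}
```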

We will find a suitable solution for version 4.15. Meanwhile, the workaround involves deleting these policies from the hub, prompting their regeneration with the correct secrets. Additionally, the mover pods on the destination cluster have to be deleted as well.

For example, on the new hub, I have the following policies:
```
oc get policy -A | grep vs-secret | grep amagrawa-10n
amagrawa-10n-1                   busybox-workloads-12.vs-secret-db29abb9d806bf0ea8690fda8bc7d9cb                        Compliant          28h
amagrawa-10n-1                   busybox-workloads-13.vs-secret-8f116547aba3d58497d1c6440090c6c4                        Compliant          14m
amagrawa-10n-1                   busybox-workloads-14.vs-secret-3afb3c0dc0b9646dceef4fe1bd3da4a6                        Compliant          28h
amagrawa-10n-1                   openshift-gitops.vs-secret-6492ecc11b403505c9f30eb84106ac1b                            Compliant          28h
amagrawa-10n-1                   openshift-gitops.vs-secret-9dbea625099ac8d1b144c293ee183452                            Compliant          28h
amagrawa-10n-1                   openshift-gitops.vs-secret-eadf8882df7cd2a1a52db6e4631e5e98                            Compliant          28h
amagrawa-10n-2                   busybox-workloads-12.vs-secret-db29abb9d806bf0ea8690fda8bc7d9cb                        NonCompliant       28m
amagrawa-10n-2                   busybox-workloads-13.vs-secret-8f116547aba3d58497d1c6440090c6c4                        Compliant          37m
amagrawa-10n-2                   busybox-workloads-14.vs-secret-3afb3c0dc0b9646dceef4fe1bd3da4a6                        NonCompliant       28h
amagrawa-10n-2                   openshift-gitops.vs-secret-6492ecc11b403505c9f30eb84106ac1b                            NonCompliant       28h
amagrawa-10n-2                   openshift-gitops.vs-secret-9dbea625099ac8d1b144c293ee183452                            NonCompliant       28h
amagrawa-10n-2                   openshift-gitops.vs-secret-eadf8882df7cd2a1a52db6e4631e5e98                            NonCompliant       28h
```

Any policy that shows "NonCompliant" for the Compliance state should be deleted. New ones will be regenerated with the right secrets.
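
Picking out the NonCompliant entries by hand is error-prone when there are many workloads. A minimal sketch that turns the listing above into delete commands follows, assuming the input is the output of `oc get policy -A` on the hub. The function name is hypothetical, and the commands are only printed so they can be reviewed before being run.

```shell
# noncompliant_vs_policies
# Reads `oc get policy -A` output on stdin and prints one delete command for
# each vs-secret policy whose compliance state is NonCompliant.
noncompliant_vs_policies() {
  awk '
    $2 ~ /vs-secret/ {
      # The compliance state is not at a fixed position when the
      # remediation-action column is empty, so scan the remaining fields.
      for (i = 3; i <= NF; i++)
        if ($i == "NonCompliant") {
          printf "oc -n %s delete policies.policy.open-cluster-management.io %s\n", $1, $2
          break
        }
    }'
}
```

Running `oc get policy -A | noncompliant_vs_policies` prints the commands; piping that output to `sh` would execute them, which should only be done after checking the list.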

Also note the cluster name in the first column, and delete all the jobs that are pending on that cluster for the workload namespace.

Example:
```
oc -n busybox-workloads-12 delete jobs --all
job.batch "volsync-rsync-tls-dst-busybox-pvc-1" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-10" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-11" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-12" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-13" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-14" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-15" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-16" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-17" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-18" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-19" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-2" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-20" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-3" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-4" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-5" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-6" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-7" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-8" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-9" deleted
```

--- Additional comment from Aman Agrawal on 2023-11-18 07:13:39 UTC ---

Benamar had already applied the workaround for workloads-12, 13, and 14 and had also deleted all the NonCompliant secrets.

When I checked (on the passive hub):
amagrawa:~$ oc get policy -A | grep vs-secret | grep amagrawa-10n
amagrawa-10n-1                   benamar-ns.busybox-drpc-vs-secret                                                      Compliant          10h
amagrawa-10n-1                   busybox-workloads-12.vs-secret-db29abb9d806bf0ea8690fda8bc7d9cb                        Compliant          10h
amagrawa-10n-1                   busybox-workloads-13.vs-secret-8f116547aba3d58497d1c6440090c6c4                        Compliant          10h
amagrawa-10n-1                   busybox-workloads-14.vs-secret-3afb3c0dc0b9646dceef4fe1bd3da4a6                        Compliant          10h
amagrawa-10n-1                   openshift-gitops.vs-secret-6492ecc11b403505c9f30eb84106ac1b                            Compliant          10h
amagrawa-10n-1                   openshift-gitops.vs-secret-9dbea625099ac8d1b144c293ee183452                            Compliant          10h
amagrawa-10n-1                   openshift-gitops.vs-secret-eadf8882df7cd2a1a52db6e4631e5e98                            Compliant          10h
amagrawa-10n-2                   benamar-ns.busybox-drpc-vs-secret                                                      Compliant          10h
amagrawa-10n-2                   busybox-workloads-12.vs-secret-db29abb9d806bf0ea8690fda8bc7d9cb                        Compliant          10h
amagrawa-10n-2                   busybox-workloads-13.vs-secret-8f116547aba3d58497d1c6440090c6c4                        Compliant          10h
amagrawa-10n-2                   busybox-workloads-14.vs-secret-3afb3c0dc0b9646dceef4fe1bd3da4a6                        Compliant          10h
amagrawa-10n-2                   openshift-gitops.vs-secret-6492ecc11b403505c9f30eb84106ac1b                            Compliant          10h
amagrawa-10n-2                   openshift-gitops.vs-secret-9dbea625099ac8d1b144c293ee183452                            Compliant          10h
amagrawa-10n-2                   openshift-gitops.vs-secret-eadf8882df7cd2a1a52db6e4631e5e98                            Compliant          10h


All the secrets were compliant.

So I moved to the secondary cluster, where the destination pods had been running but stuck for 40+ hours, and ran the commands below:

amagrawa$ oc -n busybox-workloads-10 delete jobs --all
job.batch "volsync-rsync-tls-dst-busybox-pvc-1" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-10" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-11" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-12" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-13" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-14" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-15" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-16" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-17" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-18" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-19" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-2" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-20" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-3" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-4" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-5" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-6" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-7" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-8" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-9" deleted


amagrawa$ oc -n busybox-workloads-11 delete jobs --all
job.batch "volsync-rsync-tls-dst-busybox-pvc-1" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-10" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-11" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-12" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-13" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-14" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-15" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-16" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-17" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-18" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-19" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-2" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-20" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-3" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-4" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-5" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-6" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-7" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-8" deleted
job.batch "volsync-rsync-tls-dst-busybox-pvc-9" deleted


This re-created the dst pods on the secondary cluster, and data sync resumes as soon as the next sync interval is reached.
I can confirm that the workaround works well and data sync is working fine for all the impacted cephfs workloads.

@bmekhiss thank you so much for putting so much time and effort into issues like this, you are a real saviour. :)

Comment 11 errata-xmlrpc 2023-12-07 13:21:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.14.1 Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7696

