Description of problem (please be as detailed as possible and provide log snippets):

After hub recovery (active hub brought down, backups restored on the passive hub), data sync stopped for all CephFS-backed workloads while RBD-backed workloads continued to sync. Details under Steps to Reproduce and Actual results.

Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-11-09-204811
VolSync 0.8.0
Submariner 0.16.2
ACM quay.io:443/acm-d/acm-custom-registry:v2.9.0-RC2
odf-multicluster-orchestrator.v4.14.1-rhodf
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
Latency 50ms RTT

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:

**Active hub at neutral site**

1. Deploy multiple RBD- and CephFS-backed workloads of both appset and subscription types.

2. Fail over and relocate them in such a way that they finally run on the primary managed cluster (which is expected to host all the workloads and can go under disaster). A few of them are exceptions; check the drpc -o wide status in Step 3.

3. Ensure that the workloads are in distinct states such as Deployed, FailedOver, Relocated, etc. Here amagrawa-10n-1 is the C1 primary managed cluster for me.

From the active hub:

amagrawa:hub$ drpc
NAMESPACE             NAME                                                AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
busybox-workloads-12  cephfs-sub-busybox-workloads-12-placement-1-drpc    7h18m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:54:29Z   5m59.196575462s   True
busybox-workloads-13  cephfs-sub-busybox-workloads-13-placement-1-drpc    7h17m   amagrawa-10n-1                      Relocate       Relocated      Completed     2023-11-16T12:12:36Z   5m58.842880173s   True
busybox-workloads-14  cephfs-sub-busybox-workloads-14-placement-1-drpc    7h16m   amagrawa-10n-1     amagrawa-10n-2    Failover       FailedOver     Completed     2023-11-16T08:29:07Z   3m19.098202668s   True
busybox-workloads-6   rbd-sub-busybox-workloads-6-placement-1-drpc        7h35m   amagrawa-10n-1                                     Deployed       Completed                                              True
busybox-workloads-7   rbd-sub-busybox-workloads-7-placement-1-drpc        7h34m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:53:38Z   9m59.85663627s    True
busybox-workloads-8   rbd-sub-busybox-workloads-8-placement-1-drpc        7h32m   amagrawa-10n-1                      Relocate       Relocated      Completed     2023-11-16T08:21:05Z   4m13.272955733s   True
openshift-gitops      cephfs-appset-busybox-workloads-10-placement-drpc   7h22m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:15:50Z   3m22.540081438s   True
openshift-gitops      cephfs-appset-busybox-workloads-11-placement-drpc   7h20m   amagrawa-10n-1                      Relocate       Relocated      Completed     2023-11-16T08:00:32Z   5m38.794985745s   True
openshift-gitops      cephfs-appset-busybox-workloads-9-placement-drpc    7h24m   amagrawa-10n-2                      Relocate       Relocated      Completed     2023-11-16T08:28:59Z   8m47.541429779s   True
openshift-gitops      rbd-appset-busybox-workloads-1-placement-drpc       7h43m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:16:14Z   8m31.330049487s   True
openshift-gitops      rbd-appset-busybox-workloads-2-placement-drpc       7h42m   amagrawa-10n-1                      Relocate       Relocated      Completed     2023-11-16T08:16:28Z   7m59.477897296s   True
openshift-gitops      rbd-appset-busybox-workloads-3-placement-drpc       7h41m   amagrawa-10n-2     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:27:18Z   7m4.760183798s    True
openshift-gitops      rbd-appset-busybox-workloads-4-placement-drpc       7h39m   amagrawa-10n-1                                     Deployed       Completed                                              True
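A note on the drpc and group commands used throughout this report: they are shell aliases on the hub. Their exact definitions are not part of the original report; a minimal sketch of what they presumably expand to, assuming standard oc tooling:

  # Assumed alias: list DRPlacementControls in all namespaces with wide output
  alias drpc='oc get drpc -A -o wide'
  # Assumed alias (used in Step 4 below): dump DRPC YAML so lastGroupSyncTime can be grepped
  alias group='oc get drpc -A -o yaml'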
4. Let at least 1 or 2 of the latest backups be taken (one every 1 hr) covering all the different states of the workloads (when progression is Completed and no action is running on any of the workloads). Also ensure that sync for all the workloads is working fine while on the active hub and that the cluster is healthy. Note drpc -o wide, lastGroupSyncTime, download backups from S3, etc.

amagrawa:hub$ group|grep SyncTime
  lastGroupSyncTime: "2023-11-16T14:01:32Z"
  lastGroupSyncTime: "2023-11-16T14:06:09Z"
  lastGroupSyncTime: "2023-11-16T14:01:03Z"
  lastGroupSyncTime: "2023-11-16T13:45:09Z"
  lastGroupSyncTime: "2023-11-16T13:50:51Z"
  lastGroupSyncTime: "2023-11-16T13:50:40Z"
  lastGroupSyncTime: "2023-11-16T14:00:51Z"
  lastGroupSyncTime: "2023-11-16T14:06:12Z"
  lastGroupSyncTime: "2023-11-16T13:01:45Z"
  lastGroupSyncTime: "2023-11-16T13:50:36Z"
  lastGroupSyncTime: "2023-11-16T13:45:16Z"
  lastGroupSyncTime: "2023-11-16T13:56:22Z"
  lastGroupSyncTime: "2023-11-16T13:45:11Z"

amagrawa:hub$ date -u
Thursday 16 November 2023 02:12:11 PM UTC

5. Bring the active hub completely down and move to the passive hub. Restore the backups and ensure velero reports successful restoration. Make sure both managed clusters are successfully reported and the drpolicy gets validated.

6. Wait for the drpc resources to be restored, and check whether all the workloads are in their last backed-up state. They retained the last state that was backed up, so everything is fine so far.

amagrawa:~$ drpc
NAMESPACE             NAME                                                AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME   DURATION   PEER READY
busybox-workloads-12  cephfs-sub-busybox-workloads-12-placement-1-drpc    4h16m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed                             True
busybox-workloads-13  cephfs-sub-busybox-workloads-13-placement-1-drpc    4h16m   amagrawa-10n-1                      Relocate       Relocated      Completed                             True
busybox-workloads-14  cephfs-sub-busybox-workloads-14-placement-1-drpc    4h16m   amagrawa-10n-1     amagrawa-10n-2    Failover       FailedOver     Completed                             True
busybox-workloads-6   rbd-sub-busybox-workloads-6-placement-1-drpc        4h16m   amagrawa-10n-1                                     Deployed       Completed                             True
busybox-workloads-7   rbd-sub-busybox-workloads-7-placement-1-drpc        4h16m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed                             True
busybox-workloads-8   rbd-sub-busybox-workloads-8-placement-1-drpc        4h16m   amagrawa-10n-1                      Relocate       Relocated      Completed                             True
openshift-gitops      cephfs-appset-busybox-workloads-10-placement-drpc   4h16m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed                             True
openshift-gitops      cephfs-appset-busybox-workloads-11-placement-drpc   4h16m   amagrawa-10n-1                      Relocate       Relocated      Completed                             True
openshift-gitops      cephfs-appset-busybox-workloads-9-placement-drpc    4h16m   amagrawa-10n-2                      Relocate       Relocated      Completed                             True
openshift-gitops      rbd-appset-busybox-workloads-1-placement-drpc       4h16m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed                             True
openshift-gitops      rbd-appset-busybox-workloads-2-placement-drpc       4h16m   amagrawa-10n-1                      Relocate       Relocated      Completed                             True
openshift-gitops      rbd-appset-busybox-workloads-3-placement-drpc       4h16m   amagrawa-10n-2     amagrawa-10n-1    Failover       FailedOver     Completed                             True
openshift-gitops      rbd-appset-busybox-workloads-4-placement-drpc       4h16m   amagrawa-10n-1                                     Deployed       Completed                             True

7. Let IOs continue for a few hours. Data sync for RBD-backed workloads kept progressing just fine, but sync stopped for all the CephFS-backed workloads, whether of subscription or appset type (see the ReplicationSource check sketched below).
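One way to confirm the stall from the managed-cluster side (a sketch, not part of the original report): CephFS PVCs in this setup are replicated by VolSync ReplicationSource objects (VolSync 0.8.0 is listed above), whose .status.lastSyncTime should keep advancing while syncs are healthy. The namespace below is one of the workload namespaces from this setup:

  # On the cluster currently running the workload (amagrawa-10n-1 here), list the
  # VolSync ReplicationSources for a cephfs workload namespace and print when
  # each one last completed a sync:
  oc get replicationsource -n busybox-workloads-12 \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.lastSyncTime}{"\n"}{end}'

If lastSyncTime stops advancing for the cephfs namespaces while the rbd workloads (which replicate via RBD mirroring rather than VolSync) stay current, that matches the behaviour reported here.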
Actual results:

Sync for all CephFS workloads stopped post hub recovery.

Logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/16nov23/logs/

The VolumeSynchronizationDelay alert fires on the passive hub for all CephFS workloads when the monitoring label is applied (see the lastGroupSyncTime check sketched under Additional info).

Expected results:

Sync for all CephFS workloads should continue without any issues post hub recovery.

Additional info:
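To watch the stall from the passive hub, a sketch mirroring the group|grep SyncTime check from Step 4 (lastGroupSyncTime is the DRPC status field shown there):

  # Print per-DRPC lastGroupSyncTime; for the cephfs-backed workloads this
  # timestamp stops advancing after hub recovery, which is what drives the
  # VolumeSynchronizationDelay alert.
  oc get drpc -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.lastGroupSyncTime}{"\n"}{end}'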
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383