Description of problem (please be as detailed as possible and provide log snippets):

Version of all relevant components (if applicable):
ACM 2.10.2 GA'ed
MCE 2.5.2
ODF 4.15.2-1 GA'ed
ceph version 17.2.6-209.el9cp (e9529323dd7ab3b0e8cdf84e17a1b58c2b42948c) quincy (stable)
OCP 4.15.0-0.nightly-2024-04-30-234425
Submariner 0.17.1 GA'ed
VolSync 0.9.1

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:

***** Active hub co-situated with primary managed cluster *****

1. With multiple workloads (RBD and CephFS) of both Subscription and ApplicationSet (pull model) types in Deployed state running on the primary managed cluster (C1), bring down C1 together with the active hub during a site failure at site-1, then perform hub recovery and move to the passive hub at site-2 (which is co-situated with the secondary managed cluster C2).
2. Ensure the available managed cluster C2 is successfully imported into the RHACM console of the passive hub and that the DRPolicy gets validated.
3. After the DRPCs are restored, recover the down managed cluster C1 and ensure it is successfully imported into the RHACM console.
4. Let IOs continue for some time (30 min - 1 hr) and ensure data sync is progressing well.
5. Now, with both managed clusters up and running, failover some of the workloads and relocate the remaining ones to the C2 managed cluster during the eviction period timeout (currently set to 24 hrs). A hedged CLI sketch of how these actions can be triggered is included after the Actual results output below.

Actual results:
[RDR] [Hub recovery] [Co-situated] Relocate operation, and cleanup after failover, remain stuck during the eviction period timeout.

Hub:

oc get drpc -o wide -A
NAMESPACE               NAME                                      AGE   PREFERREDCLUSTER    FAILOVERCLUSTER     DESIREDSTATE   CURRENTSTATE   PROGRESSION                   START TIME             DURATION   PEER READY
busybox-workloads-101   rbd-sub-busybox101-placement-1-drpc       14h   amagrawa-c2-13apr   amagrawa-c1-13apr   Relocate       Relocating     EnsuringVolumesAreSecondary   2024-05-05T17:32:04Z              False
busybox-workloads-103   cephfs-sub-busybox103-placement-1-drpc    14h   amagrawa-c2-13apr   amagrawa-c1-13apr   Relocate       Relocating     RunningFinalSync              2024-05-05T17:31:53Z              True
openshift-gitops        cephfs-appset-busybox102-placement-drpc   14h   amagrawa-c2-13apr   amagrawa-c1-13apr   Relocate       Relocating     RunningFinalSync              2024-05-05T17:31:45Z              True
openshift-gitops        rbd-appset-busybox100-placement-drpc      14h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Cleaning Up                   2024-05-05T17:31:57Z              False

Failover of rbd-appset-busybox100-placement-drpc worked, but its cleanup is stuck; relocate of all the other workloads is also stuck.
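For reference, the failover and relocate actions in step 5 were driven from the RHACM console, but the equivalent can be done from the hub CLI by setting the DRPC action. A minimal sketch, using the DRPC names, namespaces and cluster names shown in the output above; spec.action, spec.failoverCluster and spec.preferredCluster are assumed from the Ramen DRPlacementControl API and should be verified against the DRPC CRD in this build:

# Failover an ApplicationSet workload to C2 (names taken from this setup)
oc patch drpc rbd-appset-busybox100-placement-drpc -n openshift-gitops \
  --type merge -p '{"spec":{"action":"Failover","failoverCluster":"amagrawa-c2-13apr"}}'

# Relocate a Subscription workload to its preferred cluster
# (preferredCluster is normally already set in the DRPC spec; shown here for clarity)
oc patch drpc rbd-sub-busybox101-placement-1-drpc -n busybox-workloads-101 \
  --type merge -p '{"spec":{"action":"Relocate","preferredCluster":"amagrawa-c2-13apr"}}'

# Watch progression until CURRENTSTATE reaches FailedOver/Relocated and PEER READY is True
oc get drpc -o wide -A -w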
Resources on the managed clusters for the workloads failed over/relocated from C1 to C2:

C2:

oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-100
NAME                                                                 AGE   VOLUMEREPLICATIONCLASS                 PVCNAME          DESIREDSTATE   CURRENTSTATE
volumereplication.replication.storage.openshift.io/busybox-pvc-41    14h   rbd-volumereplicationclass-473128587   busybox-pvc-41   primary        Primary

NAME                                    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE   VOLUMEMODE
persistentvolumeclaim/busybox-pvc-41    Bound    pvc-7c5e424d-b75a-495d-8745-4d3220fc48e6   42Gi       RWO            ocs-storagecluster-ceph-rbd   14h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-appset-busybox100-placement-drpc   primary        Primary

NAME                              READY   STATUS    RESTARTS   AGE   IP            NODE        NOMINATED NODE   READINESS GATES
pod/busybox-41-5c55b45d49-qh7v8   1/1     Running   0          13h   10.129.2.51   compute-2   <none>           <none>

oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-101
No resources found in busybox-workloads-101 namespace.

oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-102
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE     VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-542ab575-db38-4187-bbb0-70697ea232f3   94Gi       RWX            ocs-storagecluster-cephfs   3d16h   Filesystem

NAME                                                                                   DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset-busybox102-placement-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-mw285   1/1     Running   0          2m41s   10.129.2.156   compute-2   <none>           <none>

oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-103
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE     VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-3256f1e5-79e3-43ff-96cb-e0b727ffcc74   94Gi       RWX            ocs-storagecluster-cephfs   3d16h   Filesystem

NAME                                                                                  DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox103-placement-1-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-2q7zp   1/1     Running   0          2m46s   10.129.2.155   compute-2   <none>           <none>

C1:

oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-100
NAME                                                                 AGE     VOLUMEREPLICATIONCLASS                 PVCNAME          DESIREDSTATE   CURRENTSTATE
volumereplication.replication.storage.openshift.io/busybox-pvc-41    3d16h   rbd-volumereplicationclass-473128587   busybox-pvc-41   primary        Primary

NAME                                    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE     VOLUMEMODE
persistentvolumeclaim/busybox-pvc-41    Bound    pvc-7c5e424d-b75a-495d-8745-4d3220fc48e6   42Gi       RWO            ocs-storagecluster-ceph-rbd   3d16h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-appset-busybox100-placement-drpc   secondary      Primary

NAME                              READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/busybox-41-5c55b45d49-fngg2   1/1     Running   2          3d16h   10.128.3.196   compute-0   <none>           <none>

oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-101
NAME                                                                 AGE     VOLUMEREPLICATIONCLASS                 PVCNAME          DESIREDSTATE   CURRENTSTATE
volumereplication.replication.storage.openshift.io/busybox-pvc-41    3d16h   rbd-volumereplicationclass-473128587   busybox-pvc-41   primary        Primary

NAME                                    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE     VOLUMEMODE
persistentvolumeclaim/busybox-pvc-41    Bound    pvc-ea52541b-acb4-4ecb-afd6-a00925bf3583   42Gi       RWO            ocs-storagecluster-ceph-rbd   3d16h   Filesystem

NAME                                                                              DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-sub-busybox101-placement-1-drpc   secondary      Primary

NAME                              READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/busybox-41-5c55b45d49-59tbp   1/1     Running   2          3d16h   10.128.3.198   compute-0   <none>           <none>
oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-102
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE     VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-46f34593-ba2f-435c-801d-66b7371fd359   94Gi       RWX            ocs-storagecluster-cephfs   3d16h   Filesystem

NAME                                                                                   DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset-busybox102-placement-drpc   primary        Primary

NAME                             READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/busybox-1-7f9b67dc95-wq4tr   1/1     Running   2          3d16h   10.128.3.208   compute-0   <none>           <none>

oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-103
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE     VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-40f27ae0-e3f5-4e74-822e-0eab289f3232   94Gi       RWX            ocs-storagecluster-cephfs   3d16h   Filesystem

NAME                                                                                  DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox103-placement-1-drpc   primary        Primary

NAME                             READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/busybox-1-7f9b67dc95-v6h4s   1/1     Running   2          3d16h   10.128.3.209   compute-0   <none>           <none>

Logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/05may24/

Expected results:
The admin should be able to successfully failover/relocate the workloads after hub recovery, independent of the eviction period timeout.

Additional info:
For ease of reference, the DRPolicy got validated on the passive hub at around Sun May 5 17:01:00 UTC 2024, and failover/relocate was performed close to Sun May 5 17:32:24 UTC 2024.
Relocate, and cleanup of the failed-over workload, completed successfully only after the eviction period expired.
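For completeness, this is roughly how data-sync progress (step 4) and the stuck/cleared replication states were tracked. A minimal sketch, assuming the status fields exposed by the Ramen and VolSync APIs in this build (in particular .status.lastGroupSyncTime on the DRPC, which should be verified against this version):

# On the hub: last successful group sync time per DRPC
oc get drpc -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,LASTSYNC:.status.lastGroupSyncTime

# On each managed cluster: desired vs. current replication state for the RBD-backed workloads
oc get vrg,vr -A -o wide

# On each managed cluster: VolSync progress for the CephFS-backed workloads
oc get replicationsource,replicationdestination -A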