Description of problem (please be as detailed as possible and provide log snippets):

Version of all relevant components (if applicable):
ACM 2.10.2 GA'ed
MCE 2.5.2
ODF 4.15.2-1 GA'ed
ceph version 17.2.6-209.el9cp (e9529323dd7ab3b0e8cdf84e17a1b58c2b42948c) quincy (stable)
OCP 4.15.0-0.nightly-2024-04-30-234425
Submariner 0.17.1 GA'ed
VolSync 0.9.1

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:

***** Active hub co-situated with primary managed cluster *****

1. Run multiple workloads (RBD and CephFS) of both Subscription and AppSet (pull model) types, in different states (Deployed, FailedOver, Relocated), on the primary managed cluster C1. C1 goes down along with the active hub during a site failure at site-1; perform hub recovery and move to the passive hub at site-2 (which is co-situated with the secondary managed cluster C2).
2. Ensure the surviving managed cluster C2 is successfully imported on the RHACM console of the passive hub and that the DRPolicy gets validated.
3. After the DRPC is restored, fail over all the workloads to the available managed cluster C2.
4. When failover is successful, recover the down managed cluster C1 and ensure it is successfully cleaned up.
5. Let IOs continue for some time, then configure another hub cluster at site-1 in order to perform hub recovery one more time.
6. Deploy 1 RBD appset (pull)/sub workload and 1 CephFS appset (pull)/sub workload on C1 and fail them over to C2 (with both managed clusters up and running).
7. Now relocate some of the older workloads to the managed cluster C1 (the cluster that was recovered post disaster) and leave the remaining workloads as they are on C2, in the FailedOver state.
8. After successful relocation and cleanup, ensure new backups are taken, then perform hub recovery by bringing down the current active hub at site-2 and the C1 cluster at site-1. After moving to the new hub at site-1, ensure the available managed cluster C2 is successfully imported on the RHACM console of the passive hub and that the DRPolicy gets validated.
9. When the DRPC is restored, check the Pods/PVCs/VRs/VRGs for the workloads that were running on the available cluster C2. Check their last action status on the RHACM console and try to fail them over. Up to this point the steps to reproduce are the same as BZ 2276222. Here the primary workloads on C2 had become secondary, so the workaround mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2276222#c9 was applied.
10. After failover to C2 completes (which it does once the workaround is applied), recover the down managed cluster C1 and ensure it is successfully cleaned up and that data sync resumes as expected.
11. Now configure another hub cluster for hub recovery, and perform hub recovery by bringing down the current active hub and the C1 cluster.
12. After moving to the new hub, ensure the C2 managed cluster is successfully imported, the DRPolicy is validated, and the VolumeSync.Delay alert is fired (since C1 is down and sync isn't progressing).
13. Now recover the down C1 managed cluster, let IOs continue for some time, and then delete all the workloads that are in the FailedOver state on C2.

Actual results:
Workload deletion remained stuck forever (see the command sketch and the DRPC/VRG output below).
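The exact delete commands are not captured in this report; the following is a minimal sketch of how the stuck deletion shows up, assuming the applications were deleted from the RHACM console (or by deleting their Subscription/ApplicationSet), and reusing one DRPC name and namespace from the output below. The kubeconfig context name is illustrative.

# On the hub: the DRPC stays in PROGRESSION=Deleting and is never removed
$ oc get drpc rbd-sub-busybox12-placement-1-drpc -n busybox-workloads-12 -o wide

# Check whether the deletion was issued and which finalizers are still holding the DRPC
$ oc get drpc rbd-sub-busybox12-placement-1-drpc -n busybox-workloads-12 \
    -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'

# On the surviving managed cluster: the VRG/VRs/PVCs/pods that should be cleaned up
# are still Primary/Running (context name is illustrative)
$ oc --context amagrawa-c2-13apr get vrg,vr,pvc,pods -o wide -n busybox-workloads-12

The actual hub and C2 output follows.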
Hub:

amanagrawal@Amans-MacBook-Pro acm % drpc

NAMESPACE NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY
busybox-workloads-10 rbd-sub-busybox10-placement-1-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
busybox-workloads-101 rbd-sub-busybox101-placement-1-drpc 17h amagrawa-c1-13apr Deployed Completed 2024-05-02T17:45:02Z 1.033378153s True
busybox-workloads-103 cephfs-sub-busybox103-placement-1-drpc 17h amagrawa-c1-13apr Deployed Completed 2024-05-02T17:45:05Z 578.362118ms True
busybox-workloads-11 rbd-sub-busybox11-placement-1-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
busybox-workloads-12 rbd-sub-busybox12-placement-1-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
busybox-workloads-13 cephfs-sub-busybox13-placement-1-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
busybox-workloads-14 cephfs-sub-busybox14-placement-1-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
busybox-workloads-15 cephfs-sub-busybox15-placement-1-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
busybox-workloads-16 cephfs-sub-busybox16-placement-1-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
busybox-workloads-23 cephfs-sub-busybox23-placement-1-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
busybox-workloads-24 rbd-sub-busybox24-placement-1-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
busybox-workloads-27 cephfs-sub-busybox27-placement-1-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
busybox-workloads-28 rbd-sub-busybox28-placement-1-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
busybox-workloads-9 rbd-sub-busybox9-placement-1-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
openshift-gitops cephfs-appset-busybox102-placement-drpc 17h amagrawa-c1-13apr Deployed Completed 2024-05-02T17:44:59Z 2.133922784s True
openshift-gitops cephfs-appset-busybox21-placement-drpc 17h amagrawa-c2-13apr amagrawa-c1-13apr Failover FailedOver Deleting True
openshift-gitops cephfs-appset-busybox25-placement-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
openshift-gitops cephfs-appset-busybox5-placement-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
openshift-gitops cephfs-appset-busybox6-placement-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
openshift-gitops cephfs-appset-busybox8-placement-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
openshift-gitops rbd-appset-busybox1-placement-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
openshift-gitops rbd-appset-busybox100-placement-drpc 17h amagrawa-c1-13apr Deployed Completed 2024-05-02T17:45:04Z 710.593214ms True
openshift-gitops rbd-appset-busybox2-placement-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
openshift-gitops rbd-appset-busybox22-placement-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
openshift-gitops rbd-appset-busybox26-placement-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
openshift-gitops rbd-appset-busybox3-placement-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True
openshift-gitops rbd-appset-busybox4-placement-drpc 17h amagrawa-c1-13apr amagrawa-c2-13apr Failover FailedOver Deleting True

No change of state on the workloads was observed, meaning none of the resources went to the Terminating state. Even the workload pods are up and running.

C2:

amanagrawal@Amans-MacBook-Pro c2 % busybox-3
Now using project "busybox-workloads-3" on server "https://api.amagrawa-c2-13apr.qe.rh-ocs.com:6443".

NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE VOLUMEMODE
persistentvolumeclaim/busybox-pvc-41 Bound pvc-6cc700d3-bc86-4f40-b217-262121d40589 42Gi RWO ocs-storagecluster-ceph-rbd 12d Filesystem
persistentvolumeclaim/busybox-pvc-42 Bound pvc-eb26f91a-2e22-43ed-b238-cad3e5eeb0a2 81Gi RWO ocs-storagecluster-ceph-rbd 12d Filesystem
persistentvolumeclaim/busybox-pvc-43 Bound pvc-3519f05a-e3bd-4772-9bb2-b1d3b5a231f0 28Gi RWO ocs-storagecluster-ceph-rbd 12d Filesystem
persistentvolumeclaim/busybox-pvc-44 Bound pvc-adde9d60-03ea-43c2-b59d-c15cbd3bfd6c 118Gi RWO ocs-storagecluster-ceph-rbd 12d Filesystem

NAME AGE VOLUMEREPLICATIONCLASS PVCNAME DESIREDSTATE CURRENTSTATE
volumereplication.replication.storage.openshift.io/busybox-pvc-41 12d rbd-volumereplicationclass-539797778 busybox-pvc-41 primary Primary
volumereplication.replication.storage.openshift.io/busybox-pvc-42 12d rbd-volumereplicationclass-539797778 busybox-pvc-42 primary Primary
volumereplication.replication.storage.openshift.io/busybox-pvc-43 12d rbd-volumereplicationclass-539797778 busybox-pvc-43 primary Primary
volumereplication.replication.storage.openshift.io/busybox-pvc-44 12d rbd-volumereplicationclass-539797778 busybox-pvc-44 primary Primary

NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-appset-busybox3-placement-drpc primary Primary

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/busybox-41-5c55b45d49-vgh2x 1/1 Running 0 12d 10.128.3.142 compute-1 <none> <none>
pod/busybox-42-6c6c94c475-pqq52 1/1 Running 0 12d 10.129.2.78 compute-2 <none> <none>
pod/busybox-43-5b56997c7b-5hgld 1/1 Running 0 12d 10.129.2.77 compute-2 <none> <none>
pod/busybox-44-57856dfdb-4v9tc 1/1 Running 0 12d 10.128.3.143 compute-1 <none> <none>

amanagrawal@Amans-MacBook-Pro c2 % busybox-12
Now using project "busybox-workloads-12" on server "https://api.amagrawa-c2-13apr.qe.rh-ocs.com:6443".
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE VOLUMEMODE
persistentvolumeclaim/busybox-pvc-41 Bound pvc-bffe79d1-e524-467d-9797-a48346a3a535 42Gi RWO ocs-storagecluster-ceph-rbd 12d Filesystem
persistentvolumeclaim/busybox-pvc-42 Bound pvc-16145311-ad20-4ec7-b3a4-2a5635eefad6 81Gi RWO ocs-storagecluster-ceph-rbd 12d Filesystem
persistentvolumeclaim/busybox-pvc-43 Bound pvc-c8fb5fcd-0e69-45ce-9f35-7b22d2b09767 28Gi RWO ocs-storagecluster-ceph-rbd 12d Filesystem
persistentvolumeclaim/busybox-pvc-44 Bound pvc-29ea5d4b-746b-4200-a7e7-cbb2e83984b4 118Gi RWO ocs-storagecluster-ceph-rbd 12d Filesystem

NAME AGE VOLUMEREPLICATIONCLASS PVCNAME DESIREDSTATE CURRENTSTATE
volumereplication.replication.storage.openshift.io/busybox-pvc-41 12d rbd-volumereplicationclass-1625360775 busybox-pvc-41 primary Primary
volumereplication.replication.storage.openshift.io/busybox-pvc-42 12d rbd-volumereplicationclass-1625360775 busybox-pvc-42 primary Primary
volumereplication.replication.storage.openshift.io/busybox-pvc-43 12d rbd-volumereplicationclass-1625360775 busybox-pvc-43 primary Primary
volumereplication.replication.storage.openshift.io/busybox-pvc-44 12d rbd-volumereplicationclass-1625360775 busybox-pvc-44 primary Primary

NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-sub-busybox12-placement-1-drpc primary Primary

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/busybox-41-5c55b45d49-vq9tl 1/1 Running 0 12d 10.129.2.64 compute-2 <none> <none>
pod/busybox-42-6c6c94c475-ftvww 1/1 Running 0 12d 10.129.2.66 compute-2 <none> <none>
pod/busybox-43-5b56997c7b-gn6bt 1/1 Running 0 12d 10.131.0.179 compute-0 <none> <none>
pod/busybox-44-57856dfdb-nkwn4 1/1 Running 0 12d 10.131.0.180 compute-0 <none> <none>

amanagrawal@Amans-MacBook-Pro c2 % busybox-8
Now using project "busybox-workloads-8" on server "https://api.amagrawa-c2-13apr.qe.rh-ocs.com:6443".

NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1 Bound pvc-128441d5-0eb2-457e-aa0b-5f5052e96939 94Gi RWX ocs-storagecluster-cephfs 19d Filesystem

NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset-busybox8-placement-drpc primary Primary

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/busybox-1-7f9b67dc95-ps9ss 1/1 Running 0 5d1h 10.128.3.173 compute-1 <none> <none>

amanagrawal@Amans-MacBook-Pro c2 % busybox-27
zsh: command not found: busybox-27
amanagrawal@Amans-MacBook-Pro c2 % oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-27

NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1 Bound pvc-b57ddefe-b1f8-4985-b2d4-58d814579c80 94Gi RWX ocs-storagecluster-cephfs 15d Filesystem

NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox27-placement-1-drpc primary Primary

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/busybox-1-7f9b67dc95-4ddh8 1/1 Running 0 12d 10.128.3.137 compute-1 <none> <none>

Expected results:
Workload deletion should be successful. Pods/PVCs/VRs/VRGs/PVs and their images should be cleaned up.

Additional info:
I couldn't try the workaround, but yes, cleanup completed after the 24-hour eviction period, which starts right after the managed cluster is able to connect successfully to the passive hub during/after hub recovery.
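For reference, a minimal sketch of how the stale state on the surviving managed cluster can be inspected while waiting for that eviction period to expire. This assumes the leftovers are still owned by AppliedManifestWorks delivered by the old (now down) hub, which is my assumption rather than something confirmed in this bug; the context name is illustrative and the status field name follows the open-cluster-management work API, so it may differ by version.

# AppliedManifestWorks on the managed cluster; the ones created by the old hub
# should be garbage-collected once the eviction period expires, taking the
# workload resources they own with them
$ oc --context amagrawa-c2-13apr get appliedmanifestwork

# Check whether eviction has already started for the stale entries
$ oc --context amagrawa-c2-13apr get appliedmanifestwork -o yaml | grep -i evictionStartTime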