Description of problem (please be as detailed as possible and provide log snippets):

Version of all relevant components (if applicable):
OCP 4.15.0-0.nightly-2024-02-27-181650
ACM 2.10.0-DOWNSTREAM-2024-02-28-06-06-55
ODF 4.15.0-150
ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:

**Active hub co-situated with the primary managed cluster at site1**

1. Deployed multiple RBD- and CephFS-backed workloads of both appset and subscription types.
2. Failed over and relocated them in such a way that they all end up running on the primary managed cluster (which is expected to host all the workloads and can go down in the disaster): the apps that were failed over from C1 to C2 were relocated back to C1, and the apps that were relocated to C2 were failed over to C1 (with all nodes up and running).
3. Ensured that the workload combinations cover all the distinct states (Deployed, FailedOver, Relocated) on C1, with a few workloads remaining in the Deployed state on C2 as well.
4. Let the latest backups be taken, at least one for each of the different workload states (when progression is completed and no action is in progress on any of the workloads). Also ensured that sync for all the workloads on the active hub is working fine and the cluster is healthy. Noted `drpc -o wide` output, lastGroupSyncTime, downloaded backups from S3, etc.
5. Performed a site failure (brought the active hub and the primary managed cluster down) and moved to the passive hub at site2, which is co-situated with the secondary managed cluster, by performing hub recovery. Restored backups and ensured Velero reports successful restoration. Made sure the secondary managed cluster is successfully imported and the DRPolicy gets validated.
6. Waited for DRPC progression to be restored.
7. Failed over all the RBD and CephFS workloads that were running on the (now down) primary managed cluster to the secondary cluster and observed the status. The primary managed cluster remains down.

Actual results:

[RDR] [Hub recovery] [Co-situated] Missing VRCs block the failover operation for RBD workloads.

Failover was successful for all CephFS workloads, but all the RBD workloads remained stuck:

amagrawa:~$ drpc|grep rbd
busybox-workloads-10   rbd-sub-busybox10-placement-1-drpc   26h   amagrawa-new-c1   amagrawa-new-m2   Failover   FailedOver   WaitForReadiness       2024-03-03T20:40:08Z                  False
busybox-workloads-11   rbd-sub-busybox11-placement-1-drpc   26h   amagrawa-new-c1   amagrawa-new-m2   Failover   FailedOver   WaitForReadiness       2024-03-03T20:40:15Z                  False
busybox-workloads-12   rbd-sub-busybox12-placement-1-drpc   26h   amagrawa-new-m2                                Deployed     EnsuringVolSyncSetup   2024-03-03T10:28:54Z   461.285449ms   True
busybox-workloads-9    rbd-sub-busybox9-placement-1-drpc    26h   amagrawa-new-c1   amagrawa-new-m2   Failover   FailedOver   WaitForReadiness       2024-03-03T20:40:01Z                  False
openshift-gitops       rbd-appset-busybox1-placement-drpc   26h   amagrawa-new-c1   amagrawa-new-m2   Failover   FailedOver   WaitForReadiness       2024-03-03T20:40:22Z                  False
openshift-gitops       rbd-appset-busybox2-placement-drpc   26h   amagrawa-new-c1   amagrawa-new-m2   Failover   FailedOver   WaitForReadiness       2024-03-03T20:40:27Z                  False
openshift-gitops       rbd-appset-busybox3-placement-drpc   26h   amagrawa-new-c1   amagrawa-new-m2   Failover   FailedOver   WaitForReadiness       2024-03-03T20:40:34Z                  False
openshift-gitops       rbd-appset-busybox4-placement-drpc   26h   amagrawa-new-m2                                Deployed     EnsuringVolSyncSetup   2024-03-03T10:28:55Z   365.204907ms   True

Logs collected from the passive hub and the secondary managed cluster after observing that failover isn't progressing:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/04march24/

Expected results:

Failover should complete for all the workloads.

Additional info:
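The stuck RBD workloads can be picked out of a captured `drpc -o wide` listing mechanically. A minimal sketch, assuming the column layout shown in the output above (on a live hub the listing would come from `oc get drpc -A -o wide --no-headers` instead of the captured sample rows used here):

```shell
# Sample rows taken from the report above. For rows with a failover in
# progress the columns are: NAMESPACE NAME AGE PREFERREDCLUSTER
# FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION STARTTIME PEER-READY.
cat <<'EOF' > /tmp/drpc.txt
busybox-workloads-10 rbd-sub-busybox10-placement-1-drpc 26h amagrawa-new-c1 amagrawa-new-m2 Failover FailedOver WaitForReadiness 2024-03-03T20:40:08Z False
busybox-workloads-12 rbd-sub-busybox12-placement-1-drpc 26h amagrawa-new-m2 Deployed EnsuringVolSyncSetup 2024-03-03T10:28:54Z 461.285449ms True
busybox-workloads-9 rbd-sub-busybox9-placement-1-drpc 26h amagrawa-new-c1 amagrawa-new-m2 Failover FailedOver WaitForReadiness 2024-03-03T20:40:01Z False
EOF

# Print namespace/name of every requested failover that is stuck in
# WaitForReadiness with its peer not ready.
awk '$6 == "Failover" && $8 == "WaitForReadiness" && $NF == "False" {print $1 "/" $2}' /tmp/drpc.txt
```

Since the failure mode reported here is missing VolumeReplicationClasses, a quick follow-up check (assuming access to the surviving managed cluster) is `oc get volumereplicationclass` there, to confirm whether the VRCs are present after hub recovery.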
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383