Description of problem (please be as detailed as possible and provide log snippets):

Version of all relevant components (if applicable):
OCP 4.15.0-0.nightly-2024-01-03-015912
ACM GA'ed 2.9.1
ODF 4.15.0-104
ceph version 17.2.6-167.el9cp (5ef1496ea3e9daaa9788809a172bd5a1c3192cf7) quincy (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Setup: active hub co-situated with the primary managed cluster.

1. On a hub recovery RDR setup, ensure backups are being created on the active and passive hub clusters. Failover and relocate different workloads so that they end up running on the primary managed cluster after the failover and relocate operations complete. Ensure the latest backups are taken and that no action on any of the workloads (CephFS and RBD; AppSet and Subscription type; each in a distinct state such as Deployed, FailedOver and Relocated) is in progress. Also have a few workloads running on the secondary managed cluster.

2. Collect the drpc status.
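The `drpc` command shown in the outputs below is not a stock CLI; it is presumably a local shell shortcut on the hub. A minimal sketch of such a wrapper (the function name and flags are assumptions, not part of the product):

```shell
# Hypothetical convenience wrapper: list DRPlacementControl resources in
# every namespace with the wide columns shown in the outputs below.
# Assumes an `oc` client already logged in to the hub cluster.
drpc() {
    oc get drpc --all-namespaces -o wide "$@"
}
```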
From the active hub:

amagrawa:hub$ drpc
NAMESPACE              NAME                                        AGE     PREFERREDCLUSTER   FAILOVERCLUSTER    DESIREDSTATE   CURRENTSTATE   PROGRESSION            START TIME             DURATION          PEER READY
busybox-workloads-1    cephfs-sub-busybox1-placement-1-drpc        8d      amagrawa-c1-3jan   amagrawa-c1-3jan   Failover       FailedOver     Completed              2024-01-12T12:18:52Z   2m55.286783085s   True
busybox-workloads-10   rbd-sub-busybox10-placement-1-drpc          7h11m   amagrawa-c1-3jan                                     Deployed       Completed              2024-01-12T09:17:04Z   2.056054748s      True
busybox-workloads-11   rbd-sub-busybox11-placement-1-drpc          7h10m   amagrawa-c1-3jan                                     Deployed       Completed              2024-01-12T09:17:49Z   21.043985378s     True
busybox-workloads-12   rbd-sub-busybox12-placement-1-drpc          7h8m    amagrawa-c2-3jan                                     Deployed       Completed              2024-01-12T09:19:19Z   88.077165ms       True
busybox-workloads-15   cephfs-sub-busybox15-placement-1-drpc       7h4m    amagrawa-c2-3jan                                     Deployed       Completed              2024-01-12T09:23:37Z   52.119470621s     True
busybox-workloads-2    rbd-sub-busybox2-placement-1-drpc           8d      amagrawa-c1-3jan   amagrawa-c1-3jan   Failover       FailedOver     Completed              2024-01-12T12:19:42Z   6m5.284279162s    True
busybox-workloads-5    rbd-sub-busybox5-placement-1-drpc           4d5h    amagrawa-c1-3jan                      Relocate       Relocated      Completed              2024-01-12T12:19:49Z   2m58.152248084s   True
busybox-workloads-6    cephfs-sub-busybox6-placement-1-drpc        7h18m   amagrawa-c1-3jan                      Relocate       Relocated      Completed              2024-01-12T12:18:59Z   2m53.478551336s   True
busybox-workloads-7    cephfs-sub-busybox7-placement-1-drpc        7h16m   amagrawa-c1-3jan                                     Deployed       Completed              2024-01-12T09:11:18Z   32.120615573s     True
openshift-gitops       cephfs-appset-busybox16-placement-drpc      7h3m    amagrawa-c2-3jan   amagrawa-c1-3jan   Failover       FailedOver     Completed              2024-01-12T12:19:33Z   2m37.50568738s    True
openshift-gitops       cephfs-appset-busybox3-placement-drpc       4d6h    amagrawa-c2-3jan   amagrawa-c2-3jan   Failover       FailedOver     Completed              2024-01-12T12:22:54Z   2m36.257186541s   True
openshift-gitops       cephfs-appset-busybox8-placement-drpc       7h14m   amagrawa-c1-3jan                      Relocate       Relocated      Completed              2024-01-12T12:19:20Z   4m10.339668753s   True
openshift-gitops       cephfs-appset-busybox9-placement-drpc       7h13m   amagrawa-c1-3jan                                     Deployed       Completed              2024-01-12T09:15:06Z   32.175780774s     True
openshift-gitops       rbd-appset-busybox13-placement-drpc         7h6m    amagrawa-c1-3jan                      Relocate       Relocated      Completed              2024-01-12T12:20:02Z   6m7.188328151s    True
openshift-gitops       rbd-appset-busybox14-placement-drpc         7h5m    amagrawa-c2-3jan   amagrawa-c1-3jan   Failover       FailedOver     Completed              2024-01-12T12:20:12Z   5m29.864938194s   True
openshift-gitops       rbd-appset-busybox17-placement-drpc         4h3m    amagrawa-c2-3jan                                     Deployed       Completed              2024-01-12T12:24:43Z   15.046600381s     True
openshift-gitops       rbd-appset-busybox4-placement-drpc          4d6h    amagrawa-c2-3jan   amagrawa-c1-3jan   Failover       FailedOver     Completed              2024-01-12T12:01:19Z   8m27.272353624s   True

Ensure data sync is progressing well. Then perform a site failure, i.e. bring the primary managed cluster down along with the active hub.

3. Ensure the secondary managed cluster is properly imported on the passive hub and wait for the DRPolicy to be validated.

4. Check the drpc status from the passive hub. The Progression state differs per workload, based on the last known state of each individual workload.
amagrawa:acm$ drpc
NAMESPACE              NAME                                        AGE   PREFERREDCLUSTER   FAILOVERCLUSTER    DESIREDSTATE   CURRENTSTATE   PROGRESSION            START TIME             DURATION       PEER READY
busybox-workloads-1    cephfs-sub-busybox1-placement-1-drpc        39m   amagrawa-c1-3jan   amagrawa-c1-3jan   Failover                      Paused                                                       True
busybox-workloads-10   rbd-sub-busybox10-placement-1-drpc          39m   amagrawa-c1-3jan                                                    Paused                                                       True
busybox-workloads-11   rbd-sub-busybox11-placement-1-drpc          39m   amagrawa-c1-3jan                                                    Paused                                                       True
busybox-workloads-12   rbd-sub-busybox12-placement-1-drpc          39m   amagrawa-c2-3jan                                     Deployed       Completed              2024-01-12T16:52:33Z   963.058163ms   True
busybox-workloads-15   cephfs-sub-busybox15-placement-1-drpc       39m   amagrawa-c2-3jan                                     Deployed       EnsuringVolSyncSetup   2024-01-12T16:53:31Z                  True
busybox-workloads-2    rbd-sub-busybox2-placement-1-drpc           39m   amagrawa-c1-3jan   amagrawa-c1-3jan   Failover                      Paused                                                       True
busybox-workloads-5    rbd-sub-busybox5-placement-1-drpc           39m   amagrawa-c1-3jan                      Relocate                      Paused                                                       True
busybox-workloads-6    cephfs-sub-busybox6-placement-1-drpc        39m   amagrawa-c1-3jan                      Relocate                      Paused                                                       True
busybox-workloads-7    cephfs-sub-busybox7-placement-1-drpc        39m   amagrawa-c1-3jan                                                    Paused                                                       True
openshift-gitops       cephfs-appset-busybox16-placement-drpc      39m   amagrawa-c2-3jan   amagrawa-c1-3jan   Failover                      Paused                                                       True
openshift-gitops       cephfs-appset-busybox3-placement-drpc       39m   amagrawa-c2-3jan   amagrawa-c2-3jan   Failover       FailedOver     Cleaning Up                                                  True
openshift-gitops       cephfs-appset-busybox8-placement-drpc       39m   amagrawa-c1-3jan                      Relocate                      Paused                                                       True
openshift-gitops       cephfs-appset-busybox9-placement-drpc       39m   amagrawa-c1-3jan                                                    Paused                                                       True
openshift-gitops       rbd-appset-busybox13-placement-drpc         39m   amagrawa-c1-3jan                      Relocate                      Paused                                                       True
openshift-gitops       rbd-appset-busybox14-placement-drpc         39m   amagrawa-c2-3jan   amagrawa-c1-3jan   Failover                      Paused                                                       True
openshift-gitops       rbd-appset-busybox17-placement-drpc         39m   amagrawa-c2-3jan                                     Deployed       Completed              2024-01-12T16:52:32Z   1.263215882s   True
openshift-gitops       rbd-appset-busybox4-placement-drpc          39m   amagrawa-c2-3jan   amagrawa-c1-3jan   Failover                      Paused                                                       True

5.
Since the primary managed cluster is still down, data sync cannot progress. Now perform failover of all the workloads that were running on the down cluster to the secondary (failover) cluster and track their progress.

After failover from C1 to C2:

amagrawa:acm$ drpc
NAMESPACE              NAME                                        AGE    PREFERREDCLUSTER   FAILOVERCLUSTER    DESIREDSTATE   CURRENTSTATE   PROGRESSION            START TIME             DURATION       PEER READY
busybox-workloads-1    cephfs-sub-busybox1-placement-1-drpc        146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover       FailedOver     Cleaning Up            2024-01-12T18:31:43Z                  False
busybox-workloads-10   rbd-sub-busybox10-placement-1-drpc          146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       True
busybox-workloads-11   rbd-sub-busybox11-placement-1-drpc          146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       True
busybox-workloads-12   rbd-sub-busybox12-placement-1-drpc          146m   amagrawa-c2-3jan                                     Deployed       Completed              2024-01-12T16:52:33Z   963.058163ms   True
busybox-workloads-15   cephfs-sub-busybox15-placement-1-drpc       146m   amagrawa-c2-3jan                                     Deployed       EnsuringVolSyncSetup   2024-01-12T16:53:31Z                  True
busybox-workloads-2    rbd-sub-busybox2-placement-1-drpc           146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       True
busybox-workloads-5    rbd-sub-busybox5-placement-1-drpc           146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       True
busybox-workloads-6    cephfs-sub-busybox6-placement-1-drpc        146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       False
busybox-workloads-7    cephfs-sub-busybox7-placement-1-drpc        146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       False
openshift-gitops       cephfs-appset-busybox16-placement-drpc      146m   amagrawa-c2-3jan   amagrawa-c2-3jan   Failover       FailedOver     Cleaning Up            2024-01-12T18:32:17Z                  False
openshift-gitops       cephfs-appset-busybox3-placement-drpc       146m   amagrawa-c2-3jan   amagrawa-c2-3jan   Failover       FailedOver     Cleaning Up                                                  True
openshift-gitops       cephfs-appset-busybox8-placement-drpc       146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       False
openshift-gitops       cephfs-appset-busybox9-placement-drpc       146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       False
openshift-gitops       rbd-appset-busybox13-placement-drpc         146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       True
openshift-gitops       rbd-appset-busybox14-placement-drpc         146m   amagrawa-c2-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       True
openshift-gitops       rbd-appset-busybox17-placement-drpc         146m   amagrawa-c2-3jan                                     Deployed       Completed              2024-01-12T16:52:32Z   1.263215882s   True
openshift-gitops       rbd-appset-busybox4-placement-drpc          146m   amagrawa-c2-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       True

Cluster C2 (amagrawa-c2-3jan) is the secondary/failover cluster and is up and running, while C1 is down.

Actual results:
From the drpc output above, failover did not even start for any of the RBD workloads. It did start for a few CephFS workloads, but not for the CephFS workloads under namespaces busybox-workloads-7, 8 and 9.

Logs from before performing hub recovery are kept here:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/12jan24-active-415/

Logs from the passive hub after performing failover:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/13jan24-after-failover/

Expected results:
Failover should progress, all the workload pods should be up and running on the failover cluster, and both VRG states should be marked as Primary.

Additional info:
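For triage, the stuck workloads can be pulled out of a saved drpc listing mechanically. A small sketch, assuming the listing was saved to a file; the file name and the trimmed three-column sample rows below are illustrative, not taken from the cluster:

```shell
# Sample rows trimmed to NAMESPACE, NAME and PROGRESSION; a real capture
# (e.g. `drpc > drpc.txt` on the passive hub) has more columns, so the
# field positions would need adjusting.
cat > drpc.txt <<'EOF'
busybox-workloads-10   rbd-sub-busybox10-placement-1-drpc      Paused
busybox-workloads-12   rbd-sub-busybox12-placement-1-drpc      Completed
openshift-gitops       cephfs-appset-busybox3-placement-drpc   CleaningUp
EOF

# Print namespace/name of every workload whose progression is stuck at Paused.
awk '$NF == "Paused" {print $1 "/" $2}' drpc.txt
```

On the sample data this flags only busybox-workloads-10; run against the real listing above, it would flag most of the RBD workloads.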
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:1383