Description of problem (please be as detailed as possible and provide log snippets):

Version of all relevant components (if applicable):
ODF 4.15.0-132.stable
OCP 4.15.0-0.nightly-2024-02-13-231030
ACM 2.9.2 GA'ed
Submariner 0.16.3
ceph version 17.2.6-194.el9cp (d9f4aedda0fc0d99e7e0e06892a69523d2eb06dc) quincy (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
**Active hub at neutral site**
1. Deploy multiple RBD- and CephFS-backed workloads of both appset and subscription types.
2. Fail over and relocate them so that they all end up running on the primary managed cluster (which is expected to host all the workloads and can go under disaster): apps that were failed over from C1 to C2 are relocated back to C1, and apps that were relocated to C2 are failed over to C1 (with all nodes up and running).
3. Ensure that all workload combinations are present in distinct states (Deployed, FailedOver, Relocated) on C1, with a few workloads in the Deployed state on C2 as well.
4. Let at least one backup be taken for each of the different workload states (when progression is completed and no action is in progress on any of the workloads). Also ensure that sync for all the workloads is working fine while on the active hub and that the cluster is healthy. Note drpc -o wide, lastGroupSyncTime, download backups from S3, etc. (a command sketch follows these steps).
5. Bring the active hub completely down and move to the passive hub. Restore backups and ensure the Velero backup reports a successful restoration. Make sure both managed clusters are successfully reported and the DRPolicy gets validated.
6. Wait for the drpc resources to be restored and check whether all the workloads are in their last backed-up state. They appear to have retained the last state that was backed up, so everything is fine so far.
7. Let IOs continue for a few hours (20-30 hrs). Fail over the CephFS workloads running on C2 to C1 with all nodes of C2 up and running.
8. After successful failover and cleanup, wait for sync to resume, and after some time bring the primary cluster down (all nodes). Bring it up after a few hours.
9. Check that the drpc state is still the same and that data sync for all workloads resumes as expected.
10. After a few hours, bring the master nodes of the primary cluster down, fail over all the workloads running on the primary after the cluster is marked offline on the RHACM console, and observe the failover status.
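For steps 4-6, a minimal sketch of the checks used to capture the pre-failover state and validate the restore, using standard oc commands. The DRPC name and namespace are taken from the output below as examples; the Validated condition lookup on the DRPolicy is an assumption, not output from this run.

# On the active hub: record DRPC state and the last successful group sync time
oc get drpc -A -o wide
oc get drpc sub-rbd-busybox13-placement-1-drpc -n busybox-workloads-13 -o jsonpath='{.status.lastGroupSyncTime}'

# On the passive hub after restore: confirm the restore completed and the DRPolicy is validated
oc get restore -A          # assumes the Velero/ACM backup Restore CRDs are installed on the hub
oc get drpolicy -o yaml    # check the Validated condition in status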
Output collected around Sunday 18 February 2024 07:55:50 PM UTC (long after failover was triggered):

amagrawa:hub$ drpc
NAMESPACE NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY
busybox-workloads-13 sub-rbd-busybox13-placement-1-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver WaitForReadiness 2024-02-18T19:20:17Z False
busybox-workloads-14 sub-rbd-busybox14-placement-1-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver WaitForReadiness 2024-02-18T19:20:25Z False
busybox-workloads-15 sub-rbd-busybox15-placement-1-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver WaitForReadiness 2024-02-18T19:20:33Z False
busybox-workloads-16 sub-rbd-busybox16-placement-1-drpc 2d9h amagrawa-odf2 Deployed Completed 2024-02-16T10:12:51Z 660.371688ms True
busybox-workloads-5 sub-cephfs-busybox5-placement-1-drpc 2d9h amagrawa-prim amagrawa-odf2 Relocate Relocating 2024-02-18T19:16:07Z False
busybox-workloads-6 sub-cephfs-busybox6-placement-1-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver Cleaning Up 2024-02-18T19:19:49Z False
busybox-workloads-7 sub-cephfs-busybox7-placement-1-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver Cleaning Up 2024-02-18T19:19:59Z False
busybox-workloads-8 sub-cephfs-busybox8-placement-1-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver Cleaning Up 2024-02-18T19:20:06Z False
openshift-gitops appset-cephfs-busybox1-placement-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver Cleaning Up 2024-02-18T19:18:33Z False
openshift-gitops appset-cephfs-busybox2-placement-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver Cleaning Up 2024-02-18T19:18:38Z False
openshift-gitops appset-cephfs-busybox3-placement-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver Cleaning Up 2024-02-18T19:18:43Z False
openshift-gitops appset-cephfs-busybox4-placement-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver Cleaning Up 2024-02-18T19:18:48Z False
openshift-gitops appset-rbd-busybox10-placement-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver WaitForReadiness 2024-02-18T19:18:52Z False
openshift-gitops appset-rbd-busybox11-placement-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver WaitForReadiness 2024-02-18T19:18:58Z False
openshift-gitops appset-rbd-busybox12-placement-drpc 2d9h amagrawa-odf2 Deployed Completed 2024-02-16T10:13:47Z 571.259493ms True
openshift-gitops appset-rbd-busybox9-placement-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver WaitForReadiness 2024-02-18T19:19:17Z False

Failover remains stuck at WaitForReadiness for multiple apps, which leads to application downtime and inaccessibility.

Actual results:
[RDR] [Hub recovery] [Neutral] Failover remains stuck with WaitForReadiness

Expected results:
Failover should complete while maintaining the permissible RPO/RTO of 2x the sync interval.

Additional info:
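For triage, a minimal sketch of what can be inspected on the failover (surviving) managed cluster when progression sits at WaitForReadiness. The namespace below is one of the stuck workloads from this report; the commands are generic checks and not output from this run, and the relationship between VRG status and DRPC readiness is stated to the best of my understanding.

# On the failover managed cluster
oc get volumereplicationgroup -n busybox-workloads-13 -o yaml   # VRG state/conditions that DRPC readiness waits on
oc get volumereplicationclass -A                                # needed for RBD replication to be (re)established after hub recovery
oc get volumereplication -n busybox-workloads-13                # per-PVC replication state for RBD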
Aman/Benamar, does this issue affect only the neutral-site configuration, or the co-situated one as well?
Verification of this bug will be done as part of hub recovery testing post 4.15. No regression seen so far. Moving the bug to 4.16.
Based upon Comment 27, the fix was tested in ODF 4.16.

Tested with the following versions:
ceph version 18.2.1-188.el9cp (b1ae9c989e2f41dcfec0e680c11d1d9465b1db0e) reef (stable)
OCP 4.16.0-0.nightly-2024-05-23-173505
ACM 2.11.0-DOWNSTREAM-2024-05-23-15-16-26
MCE 2.6.0-104
ODF 4.16.0-108.stable
Gitops v1.12.3
Platform: VMware

When the steps to reproduce were repeated, failover was successful for all RBD and CephFS workloads, and the VolumeReplicationClass was successfully restored on the surviving managed cluster (which is needed for RBD).

oc get volumereplicationclass -A
NAME PROVISIONER
rbd-volumereplicationclass-1625360775 openshift-storage.rbd.csi.ceph.com
rbd-volumereplicationclass-473128587 openshift-storage.rbd.csi.ceph.com

DRPC from the new hub:
busybox-workloads-101 rbd-sub-busybox101-placement-1-drpc 4h51m amagrawa-c1-28my amagrawa-c2-my28 Failover FailedOver Cleaning Up 2024-05-30T15:26:02Z False
busybox-workloads-13 cephfs-sub-busybox13-placement-1-drpc 4h51m amagrawa-c1-28my amagrawa-c2-my28 Failover FailedOver Cleaning Up 2024-05-30T15:27:33Z False
busybox-workloads-16 cephfs-sub-busybox16-placement-1-drpc 4h51m amagrawa-c1-28my amagrawa-c2-my28 Failover FailedOver Cleaning Up 2024-05-30T15:27:26Z False
busybox-workloads-18 cnv-sub-busybox18-placement-1-drpc 4h51m amagrawa-c1-28my amagrawa-c2-my28 Failover FailedOver Cleaning Up 2024-05-30T16:52:14Z False
busybox-workloads-5 rbd-sub-busybox5-placement-1-drpc 4h51m amagrawa-c1-28my amagrawa-c2-my28 Failover FailedOver Cleaning Up 2024-05-30T15:25:50Z False
busybox-workloads-6 rbd-sub-busybox6-placement-1-drpc 4h51m amagrawa-c1-28my amagrawa-c2-my28 Failover FailedOver Cleaning Up 2024-05-30T15:25:56Z False
busybox-workloads-7 rbd-sub-busybox7-placement-1-drpc 4h51m amagrawa-c1-28my amagrawa-c2-my28 Failover FailedOver Cleaning Up 2024-05-30T15:25:34Z False
openshift-gitops cephfs-appset-busybox12-placement-drpc 4h51m amagrawa-c1-28my amagrawa-c2-my28 Failover FailedOver Cleaning Up 2024-05-30T15:28:14Z False
openshift-gitops cephfs-appset-busybox9-placement-drpc 4h51m amagrawa-c1-28my amagrawa-c2-my28 Failover FailedOver Cleaning Up 2024-05-30T15:28:19Z False
openshift-gitops cnv-appset-busybox17-placement-drpc 4h51m amagrawa-c1-28my amagrawa-c2-my28 Failover FailedOver Cleaning Up 2024-05-30T16:52:23Z False
openshift-gitops rbd-appset-busybox1-placement-drpc 4h51m amagrawa-c1-28my amagrawa-c2-my28 Failover FailedOver Cleaning Up 2024-05-30T15:26:08Z False
openshift-gitops rbd-appset-busybox100-placement-drpc 4h51m amagrawa-c1-28my amagrawa-c2-my28 Failover FailedOver Cleaning Up 2024-05-30T15:26:14Z False
openshift-gitops rbd-appset-busybox2-placement-drpc 4h51m amagrawa-c1-28my amagrawa-c2-my28 Failover FailedOver Cleaning Up 2024-05-30T15:26:20Z False
openshift-gitops rbd-appset-busybox3-placement-drpc 4h51m amagrawa-c1-28my amagrawa-c2-my28 Failover FailedOver Cleaning Up 2024-05-30T15:26:49Z False

Since the primary managed cluster is still down, PROGRESSION reports Cleaning Up, which is expected.

Failover was also successful for the 2 CNV (RBD) workloads cnv-sub-busybox18-placement-1-drpc and cnv-appset-busybox17-placement-drpc, of subscription and appset (pull model) types respectively, and the data written into the VM was successfully restored after failover completion.

The fix for this BZ LGTM, therefore I am marking this bug as verified.
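For completeness, once the old primary managed cluster is brought back up, Cleaning Up is expected to finish and PEER READY to flip to True. A sketch of how that can be watched from the hub; the DRPC name/namespace are from the output above, and PeerReady as the condition type is an assumption based on the PEER READY column, not verified output from this run.

oc get drpc -A -o wide -w
oc get drpc rbd-sub-busybox5-placement-1-drpc -n busybox-workloads-5 -o jsonpath='{.status.conditions[?(@.type=="PeerReady")].status}'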
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days