Description of problem (please be detailed as possible and provide log snippets):
[RDR] Failover of workload does not happen when primary cluster is DOWN

Version of all relevant components (if applicable):
OCP version:- 4.10
ACM version:- 2.5
ODF version:- 4.10.3-2
Ceph version:- ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
yes

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Can this issue reproducible?
Yes

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:
Yes

Steps to Reproduce:
1. Deploy RDR cluster
2. Run workload in multiple namespaces
3. Power off the node for 4-5 hr
4. Perform failover of workload after 4-5 hr

Actual results:
Workload is not failed over to the failover cluster

Expected results:
Failover should work

Additional info:
In 4.10, the DRPolicy reconciler validates that the s3store is reachable and accessible in every reconciliation (in 4.11 this validation has been moved to the DRCluster reconciler). In this setup, every managed cluster has an s3store that should be accessible by all managed clusters. When one cluster is unreachable, the validation fails and no forward progress is made.
We will fix it, since failover is often triggered by a failure of the primary cluster, in which case that cluster might not be reachable.
PR posted, awaiting required acks to merge: https://github.com/red-hat-storage/ramen/pull/39
If I am not wrong, this is a must fix for 4.11 (even for TP). Can we have some ETA for the fix?
Yes, needed for 4.11. @benamar we need to forward port https://github.com/red-hat-storage/ramen/pull/39 possibly in a better way to future proof it in 4.11. Assigning this to you.
This BZ is fixed by the split of the DRPolicy resource into DRPolicy and DRCluster. Currently, the DRPolicy validated condition does not depend on any of the DRClusters being valid (other than their existence). Thus, on failover, when a DRCluster reports s3 connectivity loss, DRPC does not consider it a blocking condition for failing the workload over. Marking this ON_QA, for testing the behavior as required.
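The decoupling described in this comment can be sketched roughly as follows. This is a simplified model under stated assumptions, not the actual Ramen types or reconciler logic: `DRCluster`, `policyValidated`, and the `S3Reachable` field are hypothetical names for illustration.

```go
package main

import "fmt"

// DRCluster is a hypothetical, simplified model of the per-cluster resource.
type DRCluster struct {
	Name        string
	S3Reachable bool // reported by the cluster, not consulted by the policy check
}

// policyValidated requires only that every cluster named by the policy
// exists as a DRCluster; it deliberately ignores S3Reachable.
func policyValidated(policyClusters []string, known map[string]DRCluster) bool {
	for _, name := range policyClusters {
		if _, ok := known[name]; !ok {
			return false
		}
	}
	return true
}

func main() {
	known := map[string]DRCluster{
		"primary":   {Name: "primary", S3Reachable: false}, // cluster is down
		"secondary": {Name: "secondary", S3Reachable: true},
	}
	policy := []string{"primary", "secondary"}

	// The policy stays validated although the primary reports s3 loss,
	// so a failover request is not blocked at this step.
	fmt.Println("policy validated:", policyValidated(policy, known))
}
```

The design point is that connectivity is tracked per cluster, while the policy-level condition checks existence only, so one unreachable cluster no longer stalls the policy.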
Please test with any of the latest 4.11 builds.