Bug 2090080 - [RDR] Failover of workload does not happen when primary cluster is DOWN [NEEDINFO]
Summary: [RDR] Failover of workload does not happen when primary cluster is DOWN
Keywords:
Status: VERIFIED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.11.0
Assignee: Benamar Mekhissi
QA Contact: Pratik Surve
URL:
Whiteboard:
Depends On:
Blocks: 2090568
 
Reported: 2022-05-25 05:30 UTC by Pratik Surve
Modified: 2023-08-09 17:00 UTC
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2090568
Environment:
Last Closed:
Embargoed:
prsurve: needinfo? (bmekhiss)


Attachments

Description Pratik Surve 2022-05-25 05:30:46 UTC
Description of problem (please be as detailed as possible and provide log snippets):

[RDR] Failover of workload does not happen when primary cluster is DOWN


Version of all relevant components (if applicable):

OCP version:- 4.10
ACM version:- 2.5
ODF version:- 4.10.3-2
Ceph version:- ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes. DR-protected workloads cannot be failed over while the primary cluster is down, which is exactly the scenario failover is meant to handle.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
Yes

Steps to Reproduce:
1. Deploy an RDR cluster
2. Run workloads in multiple namespaces
3. Power off the primary cluster's nodes for 4-5 hours
4. Attempt failover of the workloads after 4-5 hours (see the illustrative sketch below for how a failover is typically requested)
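
For reference, a failover is normally requested by updating the workload's DRPlacementControl on the hub cluster. The following minimal Go sketch shows that step via the dynamic client; the API group/version/resource, namespace, object name, target cluster name, and kubeconfig path are assumptions for illustration, not values taken from this bug.

// Sketch: request a failover by patching the workload's DRPlacementControl (DRPC).
// All names below are illustrative assumptions.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Client for the ACM hub cluster, where the DRPlacementControl lives.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/hub-kubeconfig")
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	drpcGVR := schema.GroupVersionResource{
		Group:    "ramendr.openshift.io", // assumed Ramen API group
		Version:  "v1alpha1",
		Resource: "drplacementcontrols",
	}

	// Merge-patch the DRPC: ask for failover of the workload to the surviving cluster.
	patch := []byte(`{"spec":{"action":"Failover","failoverCluster":"cluster2"}}`)
	_, err = dyn.Resource(drpcGVR).Namespace("busybox-workloads").Patch(
		context.TODO(), "busybox-drpc", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("failover requested via DRPlacementControl")
}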



Actual results:
Workloads do not fail over to the failover (secondary) cluster


Expected results:
Failover should complete even when the primary cluster is down


Additional info:

Comment 4 Benamar Mekhissi 2022-05-25 14:13:00 UTC
In 4.10, the DRPolicy reconciler validates that the s3 store is reachable and accessible on every reconciliation (in 4.11 this validation has moved to the DRCluster reconciler). In this setup, every managed cluster has an s3 store that should be accessible by all managed clusters. When one cluster is unreachable, the validation fails and no forward progress is made.
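
To make the failure mode concrete, here is a simplified, hypothetical Go illustration of the pattern described above (function and profile names are invented, not taken from the Ramen sources): one unreachable s3 store fails the whole validation, so the reconcile that would drive the failover makes no progress.

package main

import (
	"errors"
	"fmt"
)

// reachable simulates the per-store connectivity probe; cluster1's store is
// "down" to mimic the powered-off primary cluster.
func reachable(profile string) error {
	if profile == "s3profile-cluster1" {
		return errors.New("connection timed out")
	}
	return nil
}

// validateAllS3Stores mirrors the pre-fix pattern: every reconciliation
// re-validates every s3 store, and any single failure stops forward progress.
func validateAllS3Stores(profiles []string) error {
	for _, p := range profiles {
		if err := reachable(p); err != nil {
			return fmt.Errorf("s3 profile %q unreachable: %w", p, err)
		}
	}
	return nil
}

func main() {
	if err := validateAllS3Stores([]string{"s3profile-cluster1", "s3profile-cluster2"}); err != nil {
		fmt.Println("DRPolicy validation failed; reconcile makes no progress:", err)
	}
}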

Comment 5 Benamar Mekhissi 2022-05-25 14:15:36 UTC
We will fix it, as failover is typically triggered precisely because the primary cluster has failed and may no longer be reachable.

Comment 6 Shyamsundar 2022-05-25 19:41:55 UTC
PR posted, awaiting required acks to merge: https://github.com/red-hat-storage/ramen/pull/39

Comment 10 Mudit Agarwal 2022-06-29 13:31:13 UTC
If I am not wrong, this is a must-fix for 4.11 (even for TP). Can we have an ETA for the fix?

Comment 11 Shyamsundar 2022-06-29 13:51:57 UTC
Yes, needed for 4.11.

@benamar we need to forward-port https://github.com/red-hat-storage/ramen/pull/39 to 4.11, possibly in a better way to future-proof it. Assigning this to you.

Comment 12 Shyamsundar 2022-07-18 13:19:53 UTC
This BZ is fixed as a result of the split of the DRPolicy resource into DRPolicy and DRCluster. The DRPolicy validated condition no longer depends on any of the DRClusters being valid (other than their existence), so on failover, a DRCluster reporting s3 connectivity loss is not treated by DRPC as a condition blocking the workload from failing over.
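
As a hypothetical illustration of the relaxed check (types and field names are invented, not the actual Ramen code): only the failover target's DRCluster needs to be healthy, so the unreachable primary no longer gates the decision.

package main

import (
	"errors"
	"fmt"
)

type drCluster struct {
	name        string
	s3Reachable bool
}

// canFailover checks only the prerequisite that matters for failing over:
// the target cluster's s3 store is reachable. The primary's state is
// deliberately not consulted, matching the fixed behavior described above.
func canFailover(target drCluster) error {
	if !target.s3Reachable {
		return errors.New("failover target's s3 store unreachable")
	}
	return nil
}

func main() {
	primary := drCluster{name: "cluster1", s3Reachable: false} // powered off
	target := drCluster{name: "cluster2", s3Reachable: true}

	_ = primary // no longer gates the failover decision
	if err := canFailover(target); err != nil {
		fmt.Println("failover blocked:", err)
		return
	}
	fmt.Println("failover allowed to", target.name)
}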

Marking this ON_QA for testing the behavior as required.

Comment 13 Mudit Agarwal 2022-07-19 07:58:49 UTC
Please test with any of the latest 4.11 builds.

