Bug 2247714 - [RDR][Hub Recovery] Fixing DRPC after hub recovery for failover/relocate can lead to data loss
Summary: [RDR][Hub Recovery] Fixing DRPC after hub recovery for failover/relocate can lead to data loss
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Shyamsundar
QA Contact: Aman Agrawal
URL:
Whiteboard:
Depends On:
Blocks: 2248821
 
Reported: 2023-11-02 22:39 UTC by Benamar Mekhissi
Modified: 2024-03-19 15:28 UTC
CC List: 3 users

Fixed In Version: 4.15.0-102
Doc Type: No Doc Update
Doc Text:
Clone Of:
Cloned To: 2248821
Environment:
Last Closed: 2024-03-19 15:28:17 UTC
Embargoed:




Links
  Github RamenDR ramen pull 1165 (Status: Merged): Enhancing Hub Recovery: Reworking DRPC State Rebuilding Algorithm. Last updated: 2023-12-25 13:25:05 UTC
  Github red-hat-storage ramen pull 166 (Status: Merged): Bug 2247714: Enhancing Hub Recovery: Reworking DRPC State Rebuilding Algorithm. Last updated: 2024-01-02 11:17:38 UTC
  Red Hat Product Errata RHSA-2024:1383. Last updated: 2024-03-19 15:28:20 UTC

Description Benamar Mekhissi 2023-11-02 22:39:23 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
When a workload has been failed over or relocated before hub recovery, the DRPC is restored from the hub backup without its previously known status. The DRPC then attempts to rebuild its status, which may involve generating the PlacementDecision before the PV/PVC restore on the managed cluster has finished. This creates a race condition: if the application deploys before the PV/PVC restore completes, a new PV is created instead of the restored one being used, which can lead to data loss.
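
For illustration only, here is a minimal Go sketch of the ordering guard the fix implies. The type and function names (drpcReconciler, pvRestoreDone, generatePlacementDecision) are hypothetical, not the actual Ramen API; the point is that the rebuilt DRPC status must not publish a PlacementDecision until the target cluster confirms the PV/PVC restore is complete.

// Hedged sketch; names are hypothetical, not Ramen's real API. It shows
// the ordering the fix enforces: no PlacementDecision until the PV/PVC
// restore on the target cluster is done.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// drpcReconciler stands in for the DRPC controller after hub recovery,
// when the DRPC status must be rebuilt from scratch.
type drpcReconciler struct {
	pvRestoreDone func(ctx context.Context, cluster string) (bool, error)
}

var errRequeue = errors.New("requeue: PV/PVC restore not yet complete")

// rebuildStatus rebuilds DRPC state after hub recovery. It refuses to
// generate the PlacementDecision (which lets the application deploy)
// until the target cluster reports that the PV/PVC restore finished.
func (r *drpcReconciler) rebuildStatus(ctx context.Context, targetCluster string) error {
	done, err := r.pvRestoreDone(ctx, targetCluster)
	if err != nil {
		return fmt.Errorf("checking PV/PVC restore on %s: %w", targetCluster, err)
	}
	if !done {
		// Without this guard the application could deploy first and
		// bind a brand-new PV instead of the restored one (data loss).
		return errRequeue
	}
	return r.generatePlacementDecision(targetCluster)
}

func (r *drpcReconciler) generatePlacementDecision(cluster string) error {
	fmt.Printf("PlacementDecision: schedule workload on %s\n", cluster)
	return nil
}

func main() {
	// Simulate a restore that completes two seconds from now.
	restoreFinished := time.Now().Add(2 * time.Second)
	r := &drpcReconciler{
		pvRestoreDone: func(ctx context.Context, cluster string) (bool, error) {
			return time.Now().After(restoreFinished), nil
		},
	}
	ctx := context.Background()
	for {
		if err := r.rebuildStatus(ctx, "target-cluster"); err != nil {
			fmt.Println(err)
			time.Sleep(time.Second)
			continue
		}
		break
	}
}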

Version of all relevant components (if applicable):
4.13
4.14

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
There is a chance of data loss

Is there any workaround available to the best of your knowledge?
There is one, but it is not pretty.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
5

Is this issue reproducible?
Possibly

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
This is not a regression; the issue has been present all along.

Steps to Reproduce:
1. To reproduce this reliably, stage the target cluster so that it has no access to the S3 store, then recover the hub (see the sketch below).
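
As a hedged illustration of that staging step, one way to model an unreachable S3 store in a test harness is shown below. The objectStorer interface and failingStore type are assumptions for this sketch, not Ramen's real types.

// Sketch of simulating "no access to the S3 store" for the repro;
// the interface and error path are assumptions, not Ramen's real API.
package main

import (
	"errors"
	"fmt"
)

// objectStorer stands in for whatever abstraction fronts the S3 store
// that holds the backed-up PV/PVC cluster data.
type objectStorer interface {
	DownloadPVs(keyPrefix string) ([]string, error)
}

// failingStore models a target cluster staged without S3 access:
// every download attempt fails, so the PV/PVC restore can never finish.
type failingStore struct{}

func (failingStore) DownloadPVs(keyPrefix string) ([]string, error) {
	return nil, errors.New("connection refused: s3 endpoint unreachable")
}

func main() {
	var store objectStorer = failingStore{}
	// With the bug, the DRPC status rebuild would generate a
	// PlacementDecision anyway; with the fix, this error keeps it waiting.
	if _, err := store.DownloadPVs("app-namespace/"); err != nil {
		fmt.Println("PV/PVC restore blocked:", err)
	}
}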



Actual results: New PVs/PVCs are created instead of the backed-up ones being restored.

Expected results: PVs/PVCs are restored from the S3 store before the application is redeployed.

Comment 3 Karolin Seeger 2023-11-08 10:14:14 UTC
Bringing this one back as a potential blocker for 4.14.z for now.

Comment 6 Mudit Agarwal 2024-01-02 06:05:47 UTC
Was this backported to 4.15?

Comment 7 Mudit Agarwal 2024-01-02 11:17:38 UTC
Found the backport PR

Comment 12 errata-xmlrpc 2024-03-19 15:28:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

