Bug 2247714

Summary: [RDR][Hub Recovery] Fixing DRPC after hub recovery for failover/relocate can lead to data loss
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Benamar Mekhissi <bmekhiss>
Component: odf-dr
Sub component: ramen
Assignee: Shyamsundar <srangana>
QA Contact: Aman Agrawal <amagrawa>
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
CC: amagrawa, kseeger, muagarwa
Version: 4.14
Target Milestone: ---
Target Release: ODF 4.15.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.15.0-102
Doc Type: No Doc Update
Story Points: ---
Cloned as: 2248821 (view as bug list)
Last Closed: 2024-03-19 15:28:17 UTC
Type: Bug
Bug Blocks: 2248821

Description Benamar Mekhissi 2023-11-02 22:39:23 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
When a workload has been failed over or relocated before hub recovery, the DRPC is restored from the hub backup without its previously known status. In this situation, the DRPC attempts to rebuild its status, which may involve generating the PlacementDecision before the PV/PVC restoration on the managed cluster is finished. This can result in a race condition where the application deploys before the PV/PVC restoration on the managed cluster completes, leading to the creation of a new PV instead of the restored one being used.
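The ordering constraint the fix must enforce can be sketched as follows. This is an illustrative model only: `DRPCView`, `PVRestoreDone`, and `nextAction` are assumed names for this sketch, not Ramen's actual API. It shows the guard: after hub recovery, the reconciler should requeue rather than emit a PlacementDecision until the managed cluster's PV/PVC restore from the S3 store has completed.

```go
package main

import "fmt"

// DRPCView is a hypothetical snapshot of the state the DRPC reconciler
// sees after hub recovery, once it starts rebuilding its lost status.
type DRPCView struct {
	PVRestoreDone bool // true once PVs/PVCs are restored on the target cluster
}

// nextAction decides what the reconciler should do on this pass.
func nextAction(v DRPCView) string {
	if !v.PVRestoreDone {
		// Emitting a PlacementDecision now would let the application
		// deploy and provision a fresh PV, losing the backed-up data.
		return "requeue: wait for PV/PVC restore"
	}
	return "emit PlacementDecision"
}

func main() {
	fmt.Println(nextAction(DRPCView{PVRestoreDone: false}))
	fmt.Println(nextAction(DRPCView{PVRestoreDone: true}))
}
```

Without such a gate, the two asynchronous operations (status rebuild on the recovered hub, PV/PVC restore on the managed cluster) race, which is exactly the failure mode described above.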

Version of all relevant components (if applicable):
4.13
4.14

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, there is a chance of data loss.

Is there any workaround available to the best of your knowledge?
There is one, but it is not pretty.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
5

Is this issue reproducible?
Possibly

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
This is not a regression; the issue has existed all along.

Steps to Reproduce:
1. To reproduce this reliably, stage the target cluster so that it has no access to the S3 store, then recover the hub.



Actual results: A new PV/PVC is created instead of the existing ones being restored.

Expected results: PV/PVC are restored from the S3 store before the application is redeployed.

Comment 3 Karolin Seeger 2023-11-08 10:14:14 UTC
Bringing this one back as a potential blocker for 4.14.z for now.

Comment 6 Mudit Agarwal 2024-01-02 06:05:47 UTC
Was this backported to 4.15?

Comment 7 Mudit Agarwal 2024-01-02 11:17:38 UTC
Found the backport PR

Comment 12 errata-xmlrpc 2024-03-19 15:28:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383