Bug 2248821

Summary: [4.14.z Backport][RDR][Hub Recovery] Fixing DRPC after hub recovery for failover/relocate can lead to data loss
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: krishnaram Karthick <kramdoss>
Component: odf-dr
Assignee: Shyamsundar <srangana>
odf-dr sub component: ramen
QA Contact: krishnaram Karthick <kramdoss>
Status: CLOSED WONTFIX
Docs Contact:
Severity: high
Priority: unspecified
CC: amagrawa, bmekhiss, kseeger, muagarwa, rtalur, srangana
Version: 4.14
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2247714
Environment:
Last Closed: 2024-01-02 15:49:57 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
Embargoed:
Bug Depends On: 2247714    
Bug Blocks:    

Description krishnaram Karthick 2023-11-09 08:03:57 UTC
Cloning this bug for 4.14 backport. 

+++ This bug was initially created as a clone of Bug #2247714 +++

Description of problem (please be as detailed as possible and provide log
snippets):
When a workload has failed over or been relocated before Hub Recovery, the DRPC is restored from the hub backup without its previous known status. In this situation, the DRPC attempts to rebuild its status, which may involve generating the PlacementDecision before the managed cluster's restoration of PV/PVC is finished. This can result in a race condition where the application deploys before the restoration of PV/PVC on the managed cluster is completed, leading to the creation of a new PV instead of using the restored one.
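
A minimal sketch of the ordering guard this description implies, assuming hypothetical type and field names that do not reflect Ramen's actual API; it only illustrates that placement must be held back until the cluster data restore on the managed cluster has completed:

package main

import "fmt"

// DRPCStatus is an illustrative stand-in for the state a restored DRPC has
// to rebuild after hub recovery; these fields are hypothetical.
type DRPCStatus struct {
	// StatusRebuilt is true once the status lost in the hub backup/restore
	// has been reconstructed.
	StatusRebuilt bool
	// ClusterDataRestored is true once the PV/PVC data has been restored
	// from the S3 store on the target managed cluster.
	ClusterDataRestored bool
}

// safeToPlace models the guard the bug calls for: the PlacementDecision that
// lets the application deploy must not be generated until the PV/PVC restore
// has finished, otherwise the application can race the restore and bind a
// freshly provisioned PV instead of the restored one.
func safeToPlace(s DRPCStatus) bool {
	return s.StatusRebuilt && s.ClusterDataRestored
}

func main() {
	// After hub recovery the restored DRPC has no prior status, so placement
	// is held back until both conditions are met.
	s := DRPCStatus{StatusRebuilt: true, ClusterDataRestored: false}
	fmt.Println("generate PlacementDecision:", safeToPlace(s)) // false
}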

Version of all relevant components (if applicable):
4.13
4.14

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
There is a chance of data loss

Is there any workaround available to the best of your knowledge?
There is, but it is not pretty.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
5

Is this issue reproducible?
Possibly

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
This is not a regression; the issue has been present all along.

Steps to Reproduce:
1. To reproduce this reliably, stage the target cluster so that it has no access to the S3 store, then recover the hub (one way to stage this is sketched below).
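
One possible way to stage step 1 (a sketch, not necessarily how it was done in the original report): apply a deny-all-egress NetworkPolicy to the namespace whose pods perform the S3 restore on the target cluster, assuming the cluster CNI enforces NetworkPolicy. The namespace name "openshift-dr-system" and the client-go approach are assumptions.

package main

import (
	"context"
	"fmt"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig of the target managed cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ns := "openshift-dr-system" // assumed namespace of the S3-accessing DR components

	// Deny all egress from pods in the namespace, which also cuts off the
	// S3 endpoint used for the PV/PVC metadata restore.
	np := &networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{Name: "block-s3-egress", Namespace: ns},
		Spec: networkingv1.NetworkPolicySpec{
			PodSelector: metav1.LabelSelector{}, // all pods in the namespace
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeEgress},
			Egress:      nil, // no allowed egress rules => deny all egress
		},
	}
	if _, err := client.NetworkingV1().NetworkPolicies(ns).Create(context.TODO(), np, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("created NetworkPolicy", np.Name, "in namespace", ns)
}

Deleting the NetworkPolicy restores S3 access once the scenario has been exercised.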



Actual results: A new PV/PVC is created instead of the backed-up PV/PVC being restored.

Expected results: PV/PVC are restored from the S3 store before the application is redeployed.

--- Additional comment from RHEL Program Management on 2023-11-02 22:39:29 UTC ---

This bug previously had no release flag set; the release flag 'odf-4.14.0' has now been set to '?', so the bug is being proposed for the ODF 4.14.0 release. Note that the three acks (pm_ack, devel_ack, qa_ack), if any were set while the release flag was missing, have now been reset, since acks must be set against a release flag.

--- Additional comment from Sunil Kumar Acharya on 2023-11-06 12:29:23 UTC ---

Moving the non-blocker BZs out of ODF 4.14.0. If you think this is a blocker issue for ODF 4.14.0, feel free to propose it as a blocker with a justification note.

--- Additional comment from Karolin Seeger on 2023-11-08 10:14:14 UTC ---

Bringing this one back as a potential blocker for 4.14.z for now.

Comment 8 krishnaram Karthick 2023-12-15 06:03:52 UTC
Moving the bug to 4.14.4, as we are doing a quick 4.14.3 to include a critical RGW fix (bug 2254303) before the shutdown.

Comment 12 Karolin Seeger 2024-01-02 15:49:57 UTC
We decided not to backport co-situated hub recovery issues to z-streams until qualification is complete.
Closing out this clone.