Bug 2248821 - [4.14.z Backport][RDR][Hub Recovery] Fixing DRPC after hub recovery for failover/relocate can lead to data loss
Summary: [4.14.z Backport][RDR][Hub Recovery] Fixing DRPC after hub recovery for failover/relocate can lead to data loss
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Shyamsundar
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On: 2247714
Blocks:
 
Reported: 2023-11-09 08:03 UTC by krishnaram Karthick
Modified: 2024-02-26 19:31 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2247714
Environment:
Last Closed: 2024-01-02 15:49:57 UTC
Embargoed:



Description krishnaram Karthick 2023-11-09 08:03:57 UTC
Cloning this bug for 4.14 backport. 

+++ This bug was initially created as a clone of Bug #2247714 +++

Description of problem (please be as detailed as possible and provide log snippets):
When a workload has been failed over or relocated before hub recovery, the DRPC is restored from the hub backup without its previously known status. The DRPC then attempts to rebuild its status, which may involve generating the PlacementDecision before the PV/PVC restore on the managed cluster has finished. This creates a race condition: the application can deploy before the restore completes, so a new PV is created instead of the restored one being used.
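
To make the required ordering concrete, here is a minimal illustrative sketch in Go (not the actual Ramen/DRPC implementation; all type, field, and condition names are hypothetical). It shows the guard that avoids the race: no PlacementDecision is produced until the target cluster reports that the PV/PVC restore has completed.

package main

import "fmt"

// restoreState is a hypothetical summary of the PV/PVC restore progress
// reported back from the target managed cluster.
type restoreState struct {
	ClusterDataRestored bool // true once PVs/PVCs were restored from the S3 store
}

// placementDecision stands in for the object that tells the workload where to deploy.
type placementDecision struct {
	TargetCluster string
}

// decidePlacement refuses to emit a decision until the restore has completed,
// so the application cannot be deployed against freshly provisioned (empty)
// PVs instead of the restored ones.
func decidePlacement(target string, rs restoreState) (*placementDecision, error) {
	if !rs.ClusterDataRestored {
		// A real reconciler would requeue and retry later instead of racing the restore.
		return nil, fmt.Errorf("PV/PVC restore on %s not complete; requeue", target)
	}
	return &placementDecision{TargetCluster: target}, nil
}

func main() {
	// Before the restore completes: no decision is produced.
	if _, err := decidePlacement("cluster-east", restoreState{}); err != nil {
		fmt.Println(err)
	}
	// After the restore completes: it is safe to place the workload.
	d, _ := decidePlacement("cluster-east", restoreState{ClusterDataRestored: true})
	fmt.Println("deploy to:", d.TargetCluster)
}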

Version of all relevant components (if applicable):
4.13
4.14

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
There is a chance of data loss

Is there any workaround available to the best of your knowledge?
There is, but it is not pretty.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
5

Is this issue reproducible?
Possibly

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
This is not a regression; the issue has been present all along.

Steps to Reproduce:
1. To reproduce this reliably, stage the target cluster so that it has no access to the S3 store (one possible approach is sketched below), then recover the hub.
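
As a hedged sketch of one way to stage this (not an official procedure; the namespace below is an assumption and depends on where the DR components that perform the PV/PVC restore run on the managed cluster), a deny-all egress NetworkPolicy created via client-go cuts that namespace's access to the S3 store:

package main

import (
	"context"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Assumed namespace: adjust to wherever the DR operator runs in your environment.
	const ns = "openshift-dr-system"

	// Declaring the Egress policy type with no egress rules denies all outbound
	// traffic from the selected pods, including traffic to the S3 store.
	deny := &networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{Name: "deny-s3-egress", Namespace: ns},
		Spec: networkingv1.NetworkPolicySpec{
			PodSelector: metav1.LabelSelector{}, // all pods in the namespace
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeEgress},
		},
	}

	if _, err := client.NetworkingV1().NetworkPolicies(ns).Create(
		context.TODO(), deny, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}

Any other method of cutting connectivity to the S3 endpoint (firewall rules, an unreachable endpoint in the s3 profile) serves the same purpose; the deny-all policy is simply the bluntest option.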



Actual results: A new PV/PVC is created instead of the existing ones being restored.

Expected results: PVs/PVCs are restored from the S3 store before the application is redeployed.

--- Additional comment from RHEL Program Management on 2023-11-02 22:39:29 UTC ---

This bug previously had no release flag set. The release flag 'odf-4.14.0' has now been set to '?', so the bug is proposed to be fixed in the ODF 4.14.0 release. Note that any of the 3 Acks (pm_ack, devel_ack, qa_ack) set while the release flag was missing have been reset, since Acks must be set against a release flag.

--- Additional comment from Sunil Kumar Acharya on 2023-11-06 12:29:23 UTC ---

Moving the non-blocker BZs out of ODF-4.14.0. If you think this is a blocker issue for ODF-4.14.0, feel free to propose it as a blocker with a justification note.

--- Additional comment from Karolin Seeger on 2023-11-08 10:14:14 UTC ---

Bringing this one back as a potential blocker for 4.14.z for now.

Comment 8 krishnaram Karthick 2023-12-15 06:03:52 UTC
Moving the bug to 4.14.4, as we are doing a quick 4.14.3 to include a critical RGW fix (2254303) before the shutdown.

Comment 12 Karolin Seeger 2024-01-02 15:49:57 UTC
We decided not to backport co-situated hub recovery issues to z-streams until qualification is complete.
Closing out this clone.

