Bug 2304182 - [RDR] [Hub recovery] [Co-situated] Unable to resolve DRPC State when the backed-up state differs from the VRG state [NEEDINFO]
Keywords:
Status: ON_QA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.16
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.18.0
Assignee: Benamar Mekhissi
QA Contact: Aman Agrawal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-08-12 18:29 UTC by Aman Agrawal
Modified: 2024-10-28 13:44 UTC
CC: 7 users

Fixed In Version: 4.17.0-118
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
muagarwa: needinfo? (bmekhiss)
sheggodu: needinfo? (bmekhiss)




Links
System ID | Status | Summary | Last Updated
Github RamenDR ramen pull 1584 | open | Disallow relocation execution when a cluster is unreachable | 2024-10-04 17:42:54 UTC
Github red-hat-storage ramen pull 370 | open | Bug 2304182: Disallow relocation execution when a cluster is unreachable | 2024-10-07 17:22:17 UTC
Red Hat Issue Tracker OCSBZM-8852 | None | None | 2024-08-26 10:49:39 UTC

Description Aman Agrawal 2024-08-12 18:29:37 UTC
Description of problem (please be as detailed as possible and provide log
snippets): This BZ is an extension of BZ 2302144 (it tracks one of the issues observed while executing/filing BZ 2302144) and will be handled separately here.


Version of all relevant components (if applicable):

Platform- VMware

OCP 4.16.0-0.nightly-2024-07-29-013917
ACM 2.11.1 GA'ed
MCE 2.6.1
OADP 1.4.0
ODF 4.16 GA'ed
Gitops 1.13.1
ceph version 18.2.1-194.el9cp (04a992766839cd3207877e518a1238cdbac3787e) reef (stable)
Submariner 0.18.0
VolSync 0.9.2

Does this issue impact your ability to continue working with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. On an RDR setup with multiple workloads (rbd-appset(pull)/sub, cephfs-appset(pull)/sub, and an imperative app) in the Deployed, FailedOver, and Relocated states, running on both managed clusters, configure it for hub recovery but do not start taking new backups.
2. Before backups are taken, ensure the above state is achieved.
3. Now start taking backups. Once 1 or 2 backups have completed successfully, either stop the backups or increase the backup interval so that no new backup is taken while performing the actions that follow.
Collect must-gather and record all other observations.
4. Now move the workloads across the managed clusters and achieve the same state as in Step 1.
That is, move the workloads that are primary on C1 to C2, and vice versa. Leave the workloads in the Deployed state as they are.
5. Make sure this latest state of the workloads and drpc is **NOT** backed up, as described in Step 3 above.

Collect must-gather along with the drpc state; a hedged sketch of these capture commands follows.
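A minimal sketch of the capture commands, assuming ACM's cluster-backup BackupSchedule drives the hub backups; the CR name (schedule-acm), its namespace, the never-matching cron string, and the must-gather image tag are illustrative assumptions, not values from this setup:

# Step 3 (one option): stretch the BackupSchedule cron so no new backup fires
# during the test window (Feb 31 never occurs, so this schedule never runs).
oc patch backupschedule schedule-acm -n open-cluster-management-backup \
  --type merge -p '{"spec":{"veleroSchedule":"0 0 31 2 *"}}'

# Capture the DRPC state from the hub.
oc get drpc -A -o wide > drpc-state-$(date +%Y%m%d-%H%M).txt

# Collect ODF must-gather (image tag assumed for 4.16).
oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.16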

Now perform a site failure (bring one of the managed clusters down along with the active hub cluster, but ensure that multiple workloads remain on both managed clusters in the same state), then perform hub recovery.

In my case, cluster C1 (amagrawa-12jul-c1) went down during the site failure.

6. After moving to the new hub, ensure the drpolicy is validated and the drpc is restored.
7. Check the drpc status (it should match the last backed-up drpc state from Step 3 above).
8. Check the deployment and PVC status of the various workloads on the surviving managed cluster.
9. After a few hours, recover the down managed cluster C1 and ensure it is successfully imported into the RHACM console of the new hub.
10. Repeat steps 7 and 8; a hedged sketch of these verification commands follows.
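A sketch of the verification commands for steps 6-9, run against the new hub unless noted; <policy-name> is a placeholder, and the workload namespace is taken from the outputs below:

# Step 6: confirm the drpolicy is validated and the drpc resources were restored.
oc get drpolicy
oc describe drpolicy <policy-name>   # inspect status.conditions for the validated condition

# Step 7: compare the restored DRPC status with the last backed-up state.
oc get drpc -A -o wide

# Step 8: check workload health on the surviving managed cluster (run there).
oc get deployment,pvc -n busybox-workloads-13

# Step 9: verify the recovered cluster is re-imported on the new hub.
oc get managedcluster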


Actual results:

================================================================================================================================================================
DRPC state when backup was taken (columns, per `oc get drpc -A -o wide`: NAMESPACE NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY; same column order in all the listings below):


busybox-workloads-13   cephfs-sub-busybox13-placement-1-drpc    6d1h   amagrawa-12jul-c2   amagrawa-12jul-c1   Relocate       Relocated      Completed     2024-07-29T09:41:28Z   32m29.415869422s     True


openshift-gitops       cephfs-appset-busybox11-placement-drpc   6d1h   amagrawa-12jul-c2   amagrawa-12jul-c1   Relocate       Relocated      Completed     2024-07-29T09:41:11Z   5m44.774301626s      True


================================================================================================================================================================
DRPC state after backup was stopped:


busybox-workloads-13   cephfs-sub-busybox13-placement-1-drpc    6d15h   amagrawa-12jul-c2   amagrawa-12jul-c1   Failover       FailedOver     Completed     2024-07-31T08:01:33Z   4m19.917032579s    True


openshift-gitops       cephfs-appset-busybox11-placement-drpc   6d15h   amagrawa-12jul-c2   amagrawa-12jul-c1   Failover       FailedOver     Completed     2024-07-31T08:00:55Z   4m27.721755836s    True


================================================================================================================================================================
DRPC state after hub recovery:

PEER READY became False and CURRENTSTATE reports Initiating, but Relocate cannot be performed because one of the managed clusters (C1) is down.



busybox-workloads-13   cephfs-sub-busybox13-placement-1-drpc    9h    amagrawa-12jul-c2   amagrawa-12jul-c1   Relocate       Initiating                            2024-07-31T10:33:25Z                  False


openshift-gitops       cephfs-appset-busybox11-placement-drpc   9h    amagrawa-12jul-c2   amagrawa-12jul-c1   Relocate       Initiating                            2024-07-31T10:33:22Z                  False

================================================================================================================================================================
DRPC state after the down managed cluster is recovered and successfully imported on the RHACM console of the new hub:


PEER READY is still False, CURRENTSTATE still reports Initiating, and the DRPCs remain stuck in this state forever.


busybox-workloads-13   cephfs-sub-busybox13-placement-1-drpc    11d   amagrawa-12jul-c2   amagrawa-12jul-c1   Relocate       Initiating                           2024-07-31T10:33:25Z                          False


openshift-gitops       cephfs-appset-busybox11-placement-drpc   11d   amagrawa-12jul-c2   amagrawa-12jul-c1   Relocate       Initiating                           2024-07-31T10:33:22Z                          False

================================================================================================================================================================

Logs collected before the backup was stopped (when all operations had successfully completed)- 
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/31july24-before-backup-stopped/



Logs collected before performing hub recovery- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/31july24-before-hub-recovery/



Logs collected after performing hub recovery- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/31july24-after-hub-recovery/




Expected results: After the down managed cluster is recovered and successfully imported into the RHACM console of the new hub, CURRENTSTATE for these workloads should report WaitForUser with PEER READY as True, and the admin should be able to relocate/failover them to the C2 managed cluster.
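A minimal check for this condition, assuming the PEER READY column maps to a PeerReady condition type in the DRPC status (resource names taken from the listings above):

# Print the current state and the PeerReady condition for one stuck DRPC;
# the expectation is phase WaitForUser and PeerReady status True.
oc get drpc cephfs-sub-busybox13-placement-1-drpc -n busybox-workloads-13 \
  -o jsonpath='{.status.phase}{"\n"}{.status.conditions[?(@.type=="PeerReady")]}{"\n"}'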


Additional info:

