Bug 2291126 - [RDR] [Discovered Apps] DRPC stuck in "Cleaning Up" progression during Relocate due to Failed to restore PVs/PVCs with backupFailedValidation
Keywords:
Status: CLOSED DUPLICATE of bug 2291305
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.16
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Shyamsundar
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-06-10 07:18 UTC by Sidhant Agrawal
Modified: 2024-06-14 12:28 UTC
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-06-14 12:28:28 UTC
Embargoed:



Description Sidhant Agrawal 2024-06-10 07:18:09 UTC
Description of problem (please be as detailed as possible and provide log snippets):

During the Relocate operation, the DRPC is stuck in the "Cleaning Up" progression, with the following error message in the DRPC status:
```
      message: 'Failed to restore PVs/PVCs: failed to restore PV/PVC for VolRep (failed
        to restore PVs and PVCs using profile list ([s3profile-sagrawal-c1-ocs-storagecluster
        s3profile-sagrawal-c2-ocs-storagecluster]): backupFailedValidation)'
      observedGeneration: 1
      reason: Error
      status: "False"
      type: ClusterDataReady
```
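
For reference, the condition above can be read directly from the DRPC on the hub; a minimal sketch, assuming the DRPC name and namespace shown in the outputs below (test-1 in openshift-dr-ops):
```
# List the DRPC status conditions on the hub cluster
oc get drpc test-1 -n openshift-dr-ops \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
```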

Version of all relevant components (if applicable):
OCP: 4.16.0-0.nightly-2024-05-30-021120
ODF: 4.16.0-113
ceph version 18.2.1-188.el9cp (b1ae9c989e2f41dcfec0e680c11d1d9465b1db0e) reef (stable)
ACM: 2.11.0-90 (acm-custom-registry:2.11.0-DOWNSTREAM-2024-05-23-15-16-26)
Submariner: 0.18.0 (Globalnet enabled) (iib:722673)
VolSync: 0.9.1
OADP: 1.3.1
GitOps: 1.12.3

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Configure an RDR setup
2. Deploy an imperative/discovered app on C1 (an RBD-based workload with 20 RWO PVCs)
3. Run IOs for a few days
4. Add a few sample workloads (Subscription- and ApplicationSet-based)
Wed Jun  5 05:48:45 UTC 2024
NAMESPACE          NAME                                AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION        PEER READY
openshift-dr-ops   test-1                              4d16h   sagrawal-c1                                         Deployed       Completed     2024-05-31T13:39:55Z   1.067654336s    True
openshift-gitops   rbd-appset-pull-placement-drpc      7m53s   sagrawal-c1                                         Deployed       Completed     2024-06-05T05:41:22Z   1.051815425s    True
openshift-gitops   rbd-appset-push-placement-drpc      7m36s   sagrawal-c1                                         Deployed       Completed     2024-06-05T05:41:27Z   13.042897361s   True
rbd-subscription   rbd-subscription-placement-1-drpc   7m20s   sagrawal-c1                                         Deployed       Completed     2024-06-05T05:41:43Z   13.046925011s   True

5. Perform failover of the imperative/discovered app deployed in step 2
Wed Jun  5 06:04:33 UTC 2024
NAMESPACE          NAME                                AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION           PEER READY
openshift-dr-ops   test-1                              4d16h   sagrawal-c1        sagrawal-c2       Failover       FailedOver     Completed     2024-06-05T05:48:56Z   15m28.938971131s   True

6. Perform Relocate with one of the ApplicationSet-based applications
Wed Jun  5 06:18:47 UTC 2024
NAMESPACE          NAME                                AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION           PEER READY
openshift-dr-ops   test-1                              4d16h   sagrawal-c1        sagrawal-c2       Failover       FailedOver     Completed     2024-06-05T05:48:56Z   15m28.938971131s   True
openshift-gitops   rbd-appset-pull-placement-drpc      37m     sagrawal-c1                                         Deployed       Completed     2024-06-05T05:41:22Z   1.051815425s       True
openshift-gitops   rbd-appset-push-placement-drpc      37m     sagrawal-c2        sagrawal-c1       Relocate       Relocated      Completed     2024-06-05T06:16:28Z   2m13.967984736s    True
rbd-subscription   rbd-subscription-placement-1-drpc   37m     sagrawal-c1                                         Deployed       Completed     2024-06-05T05:41:43Z   13.046925011s      True

7. Initiate Relocate for the imperative/discovered app
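A hedged CLI sketch of how this relocate can be triggered from the hub (the ACM console can be used instead); the action field is set on the DRPC shown above:
```
# Set the DRPC action to Relocate on the hub; Ramen then drives the relocate back to the preferred cluster
oc patch drpc test-1 -n openshift-dr-ops --type merge -p '{"spec":{"action":"Relocate"}}'
```
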
Wed Jun  5 07:17:19 UTC 2024
NAMESPACE          NAME                                AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION          START TIME             DURATION          PEER READY
openshift-dr-ops   test-1                              4d17h   sagrawal-c1        sagrawal-c2       Relocate       Initiating     PreparingFinalSync   2024-06-05T07:17:19Z                     True
openshift-gitops   rbd-appset-pull-placement-drpc      96m     sagrawal-c1                                         Deployed       Completed            2024-06-05T05:41:22Z   1.051815425s      True
openshift-gitops   rbd-appset-push-placement-drpc      96m     sagrawal-c2        sagrawal-c1       Relocate       Relocated      Completed            2024-06-05T06:16:28Z   2m13.967984736s   True
rbd-subscription   rbd-subscription-placement-1-drpc   95m     sagrawal-c1                                         Deployed       Completed            2024-06-05T05:41:43Z   13.046925011s     True
===
Wed Jun  5 07:19:44 UTC 2024
NAMESPACE          NAME                                AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION           START TIME             DURATION          PEER READY
openshift-dr-ops   test-1                              4d17h   sagrawal-c1        sagrawal-c2       Relocate       Relocating     WaitOnUserToCleanUp   2024-06-05T07:17:19Z                     False

> Run the oc delete command on C2 to delete the pods and PVCs
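
A hedged sketch of the cleanup performed on C2 (the namespace and label selector are illustrative placeholders, not values from this report):
```
# On C2: remove the discovered workload's pods and PVCs so Ramen can finish the relocate
# (namespace and selector below are placeholders)
oc delete pods -n <workload-namespace> -l <workload-label>
oc delete pvc -n <workload-namespace> -l <workload-label>
```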

Wed Jun  5 07:21:58 UTC 2024
NAMESPACE          NAME                                AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
openshift-dr-ops   test-1                              4d17h   sagrawal-c1        sagrawal-c2       Relocate       Relocated      Cleaning Up   2024-06-05T07:17:19Z                     False

8. Observe the following:
On C2:
  - Pods deleted
  - PVCs stuck in Terminating state
  - VR resources: both desired and current state show Secondary
  - VRG: both desired and current state show Secondary

On C1:
  - VR resources not created
  - VRG desired state is Primary and current state is Unknown
  - PVCs in Bound state
  - Pods in Init:0/1 state, with an error message similar to the following:
  MountVolume.MountDevice failed for volume "pvc-361841cd-9211-45ee-83c9-666483d89596" : rpc error: code = Internal desc = fail to check rbd image status: (cannot map image ocs-storagecluster-cephblockpool/csi-vol-e1843895-659c-49a1-bae8-9b7a56564dcc it is not primary)
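
The "not primary" error suggests the RBD image had not been promoted on C1 at mount time. A hedged sketch of how this could be inspected further (the rook-ceph toolbox deployment name and namespace are assumptions; the pool and image names are taken from the error above):
```
# On C1: check the Ramen/replication resources in the workload namespace (placeholder namespace)
oc get volumereplicationgroup,volumereplication -n <workload-namespace>

# From the rook-ceph toolbox (deployment name assumed), check the mirror state of the affected image
oc -n openshift-storage rsh deploy/rook-ceph-tools \
  rbd mirror image status ocs-storagecluster-cephblockpool/csi-vol-e1843895-659c-49a1-bae8-9b7a56564dcc
```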


Actual results:
DRPC is stuck in the "Cleaning Up" progression

Expected results:
The Relocate operation should complete successfully, without the DRPC getting stuck in the Cleaning Up progression.

