Bug 2264402 - [4.13 clone][MDR] Not able to relocate STS based applications
Summary: [4.13 clone][MDR] Not able to relocate STS based applications
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ODF 4.13.8
Assignee: rakesh-gm
QA Contact: Parikshith
URL:
Whiteboard:
Depends On: 2224325
Blocks: 2244409
 
Reported: 2024-02-15 10:22 UTC by Parikshith
Modified: 2024-04-03 07:03 UTC
CC List: 9 users

Fixed In Version: 4.13.8-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2224325
Environment:
Last Closed: 2024-04-03 07:03:21 UTC
Embargoed:




Links:
- GitHub red-hat-storage/ramen pull 196 (open): Bug 2264402: Add checks to ensure existing PVC matches PVC to restore (last updated 2024-02-20 11:56:36 UTC)
- Red Hat Product Errata RHBA-2024:1657 (last updated 2024-04-03 07:03:29 UTC)

Description Parikshith 2024-02-15 10:22:14 UTC
+++ This bug was initially created as a clone of Bug #2224325 +++

Description of problem (please be as detailed as possible and provide log
snippets):

Facing an issue while relocating the logwriter (StatefulSet) app from the c2 to the c1 managed cluster on an MDR 4.13 setup. I applied the workaround to manually delete the terminating logwriter PVCs after initiating relocate, as mentioned here: https://bugzilla.redhat.com/show_bug.cgi?id=2118270#c27. However, the PVCs are still stuck in the Terminating state (the oc delete pvc command hangs).

oc get drpc -n logwritter-sub-1 -owide
NAME                                AGE    PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION                 START TIME             DURATION   PEER READY
logwritter-sub-1-placement-1-drpc   178m   pbyregow-clu1      pbyregow-clu2     Relocate       Relocating     WaitingForResourceRestore   2023-07-20T09:05:17Z


oc get pvc -n logwritter-sub-1
NAME                            STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
logwriter-cephfs-many           Terminating   pvc-04a6351c-7d9e-4849-9340-0145de801349   10Gi       RWX            ocs-external-storagecluster-cephfs     146m
logwriter-rbd-logwriter-rbd-0   Terminating   pvc-13323f74-5b0e-415e-b11a-6b1d42cbdf45   10Gi       RWO            ocs-external-storagecluster-ceph-rbd   146m
logwriter-rbd-logwriter-rbd-1   Terminating   pvc-a3bd7e24-b21e-42dc-9e40-236818c6ed7f   10Gi       RWO            ocs-external-storagecluster-ceph-rbd   146m
logwriter-rbd-logwriter-rbd-2   Terminating   pvc-69ba12c8-5292-4524-849c-e3a32715d905   10Gi       RWO            ocs-external-storagecluster-ceph-rbd   146m

oc get vrg logwritter-sub-1-placement-1-drpc -n logwritter-sub-1
NAME                                DESIREDSTATE   CURRENTSTATE
logwritter-sub-1-placement-1-drpc   secondary      Secondary

Version of all relevant components (if applicable):
ODF/MCO: 4.13.0
ACM: 2.8

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
no

Is there any workaround available to the best of your knowledge?
will be updated

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
yes

Steps to Reproduce:
1. Configure MDR cluster
2. Create a StatefulSet-based application (subscription or ApplicationSet)
3. Failover the app from c1 to c2
4. Initiate relocate and apply the known WA https://bugzilla.redhat.com/show_bug.cgi?id=2118270#c27
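
For reference, steps 3 and 4 are typically driven from the ACM console; on the CLI they correspond roughly to setting the action on the DRPC. A minimal sketch, assuming the DRPC name, namespace and cluster names from this report and the upstream Ramen DRPlacementControl spec fields (action, failoverCluster); verify the exact field names on your build:

# Step 3: fail the app over from c1 to c2 (hypothetical CLI equivalent of the console action)
oc patch drpc logwritter-sub-1-placement-1-drpc -n logwritter-sub-1 \
  --type merge -p '{"spec":{"action":"Failover","failoverCluster":"pbyregow-clu2"}}'

# Step 4: initiate relocate back to the preferred cluster
oc patch drpc logwritter-sub-1-placement-1-drpc -n logwritter-sub-1 \
  --type merge -p '{"spec":{"action":"Relocate"}}'

# Watch progression (same command as in the description)
oc get drpc -n logwritter-sub-1 -owide -w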


Actual results:
The STS application gets stuck in the Relocating state

Expected results:
The STS application should be relocated successfully

Additional info:

--- Additional comment from Parikshith on 2023-07-20 12:23:37 UTC ---

logs at http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/logwriter/

--- Additional comment from RHEL Program Management on 2023-07-20 12:23:47 UTC ---

This bug, having had no release flag set previously, now has the release flag 'odf-4.14.0' set to '?', and so is being proposed to be fixed in the ODF 4.14.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.

--- Additional comment from Shyamsundar on 2023-07-20 12:33:23 UTC ---

The issue is as follows:

- VRG, on PVC restore, finds an existing PVC (STS PVCs are not deleted post failover, to reduce manual PVC management for the user)
- VRG further determines that the PVC was not restored by Ramen (the restore annotation is missing, as this was the initial PVC created on the Primary before the failover)
- VRG loops on reconcile, returning errors and never marking ClusterDataReady

Logs and analysis:

- DRPC reports "WaitingForResourceRestore"
- VRG status on preferredCluster reports

    - lastTransitionTime: "2023-07-20T09:06:50Z"
      message: 'Failed to restore PVs (failed to restore ClusterData for VolRep (failed
        to restore PVs and PVCs using profile list ([s3profile-pbyregow-clu1-ocs-external-storagecluster
        s3profile-pbyregow-clu2-ocs-external-storagecluster]): failed to restore all
        []client.Object. Total/Restored 4/1))'
      observedGeneration: 1
      reason: Error
      status: "False"
      type: ClusterDataReady

- The VRG on the preferredCluster is not progressing on restore, with the following errors in the log:

2023-07-20T08:56:16.191Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:1772  Warning: Mismatch in PV/PVC count 4/1 (failed to restore all []client.Object. Total/Restored 4/1) {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary"}
2023-07-20T08:56:16.422Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:1800  Found 4 PVs in s3 store using profile s3profile-pbyregow-clu2-ocs-external-storagecluster {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary"}
2023-07-20T08:56:16.429Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:1952  Existing PV matches and is bound to the same claim        {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary", "PV": "pvc-0e63cdbd-d38f-47a0-b506-c361ffa7c5b0"}
2023-07-20T08:56:16.437Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:1952  Existing PV matches and is bound to the same claim        {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary", "PV": "pvc-5e68b0be-64f5-4065-b291-58d2215a885d"}
2023-07-20T08:56:16.442Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:1952  Existing PV matches and is bound to the same claim        {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary", "PV": "pvc-d9800583-9367-408b-ac9f-ce7ee4943a98"}
2023-07-20T08:56:16.448Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:1952  Existing PV matches and is bound to the same claim        {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary", "PV": "pvc-dcbbb474-d228-4a1e-8092-d76980153da3"}
2023-07-20T08:56:16.448Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:1911  Restored 4 PV for VolRep        {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary"}
2023-07-20T08:56:16.591Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:1826  Found 4 PVCs in s3 store using profile s3profile-pbyregow-clu2-ocs-external-storagecluster        {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary"}
2023-07-20T08:56:16.596Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:2000  PVC exists and managed by Ramen   {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary", "PVC": {"apiVersion": "v1", "kind": "PersistentVolumeClaim", "namespace": "appset-logwriter-app-1", "name": "logwriter-cephfs-many"}}
2023-07-20T08:56:16.601Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:1888  Object exists. Ignoring and moving to next object {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary", "error": "found PVC object not restored by Ramen for PVC logwriter-rbd-logwriter-rbd-0"}
2023-07-20T08:56:16.606Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:1888  Object exists. Ignoring and moving to next object {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary", "error": "found PVC object not restored by Ramen for PVC logwriter-rbd-logwriter-rbd-1"}
2023-07-20T08:56:16.610Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:1888  Object exists. Ignoring and moving to next object {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary", "error": "found PVC object not restored by Ramen for PVC logwriter-rbd-logwriter-rbd-2"}
2023-07-20T08:56:16.610Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:1772  Warning: Mismatch in PV/PVC count 4/1 (failed to restore all []client.Object. Total/Restored 4/1) {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary"}
2023-07-20T08:56:16.610Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:1721  failed to restore PVs and PVCs using profile list ([s3profile-pbyregow-clu1-ocs-external-storagecluster s3profile-pbyregow-clu2-ocs-external-storagecluster])     {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary"}
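
One way to confirm the state described above on the preferredCluster is to dump the annotations on the stuck PVCs; per the analysis, the pre-existing STS PVCs will be missing Ramen's restore annotation. A minimal check, assuming the namespace from this report's description (the exact annotation key depends on the Ramen build, so this simply prints everything):

# Print all annotations on each PVC in the application namespace so the presence
# or absence of the Ramen restore annotation can be checked by eye.
for pvc in $(oc get pvc -n logwritter-sub-1 -o name); do
  echo "== ${pvc}"
  oc get "${pvc}" -n logwritter-sub-1 -o jsonpath='{.metadata.annotations}{"\n"}'
done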

--- Additional comment from Shyamsundar on 2023-07-20 12:35:48 UTC ---

Workaround:

- Delete PVCs on the preferredCluster when stuck in this phase as per DRPC on relocate: WaitingForResourceRestore
  - Technically we should ensure it is the PVC restore that is causing the error, and not blindly delete the PVCs. So the steps in that regard would be to delete only the PVCs that do not have the restored-by-Ramen annotation
- Relocate will then make the required progress

QE tested the above and ensured that this works as desired.
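
A minimal sketch of the workaround on the CLI, assuming the PVC names and namespace from this report's description; delete only the PVCs that do not carry the restored-by-Ramen annotation:

# Delete the pre-existing (non-Ramen-restored) PVCs on the preferredCluster;
# --wait=false returns immediately instead of hanging while the PVCs terminate.
oc delete pvc \
  logwriter-cephfs-many \
  logwriter-rbd-logwriter-rbd-0 \
  logwriter-rbd-logwriter-rbd-1 \
  logwriter-rbd-logwriter-rbd-2 \
  -n logwritter-sub-1 --wait=false

# Relocate should then progress; watch the DRPC until CURRENTSTATE reports Relocated.
oc get drpc -n logwritter-sub-1 -owide -w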

--- Additional comment from Shyamsundar on 2023-07-20 12:36:32 UTC ---

WIP upstream PR: https://github.com/RamenDR/ramen/pull/995

--- Additional comment from RHEL Program Management on 2023-07-21 06:49:05 UTC ---

This BZ is being approved for the ODF 4.14.0 release, upon receipt of the 3 ACKs (PM, Devel, QA) for the release flag 'odf-4.14.0'.

--- Additional comment from RHEL Program Management on 2023-07-21 06:49:05 UTC ---

Since this bug has been approved for the ODF 4.14.0 release, through release flag 'odf-4.14.0+', the Target Release is being set to 'ODF 4.14.0'.

--- Additional comment from Harish NV Rao on 2023-07-21 06:56:06 UTC ---

(In reply to Shyamsundar from comment #4)
> Workaround:
> 
> - Delete PVCs on the preferredCluster when stuck in this phase as per DRPC
> on relocate: WaitingForResourceRestore
>   - Well technically we should ensure it is the PVC restore that is causing
> the error, and not blindly delete the PVCs. So steps in that regard would be
> to delete PVCs that do not have the restored by ramen annotation
> - Relocate will make the required progress
> 
> QE tested the above and ensured that this works as desired.

This needs to be fixed in 4.13.z. Until then it should be part of the 4.13 RN as a known issue.

Shyam, IMO this BZ needs to be cloned for 4.13.z and made part of the RN until fixed. Is this fine?

--- Additional comment from Harish NV Rao on 2023-08-01 06:07:04 UTC ---

(In reply to Harish NV Rao from comment #8)
> (In reply to Shyamsundar from comment #4)
> > Workaround:
> > 
> > - Delete PVCs on the preferredCluster when stuck in this phase as per DRPC
> > on relocate: WaitingForResourceRestore
> >   - Well technically we should ensure it is the PVC restore that is causing
> > the error, and not blindly delete the PVCs. So steps in that regard would be
> > to delete PVCs that do not have the restored by ramen annotation
> > - Relocate will make the required progress
> > 
> > QE tested the above and ensured that this works as desired.
> 
> This needs to be fixed in 4.13.z. Until then it should be part of 4.13 RN as
> known issue.
> 
> Shyam, IMO this bz needs to be cloned for 4.13.z and make it part of RN till
> fixed. Is this fine?
I am setting the doc type as Known Issue for this BZ so it can get into the 4.13 RN.

--- Additional comment from errata-xmlrpc on 2023-08-03 06:57:25 UTC ---

This bug has been added to advisory RHBA-2023:115514 by ceph-build service account (ceph-build.COM)

--- Additional comment from Red Hat Bugzilla on 2023-08-03 08:28:57 UTC ---

Account disabled by LDAP Audit

--- Additional comment from Raghavendra Talur on 2023-09-25 12:44:53 UTC ---

Rtalur to update the test procedure for this bug.

--- Additional comment from avdhoot on 2023-10-02 07:36:13 UTC ---

Hi rtalur,

I observed that after applying the workaround on the secondary, I am able to relocate apps to the primary. It is not required to delete PVCs on the preferredCluster as mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2224325#c4

OCP- 4.14.0
ODF- 4.14.0-139
ACM- 2.9.0-165

--- Additional comment from avdhoot on 2023-10-03 06:43:01 UTC ---

Marking it as verified, as I am able to relocate the STS app using the steps mentioned in the description.

--- Additional comment from errata-xmlrpc on 2023-11-08 17:53:54 UTC ---

Bug report changed to RELEASE_PENDING status by Errata System.
Advisory RHSA-2023:115514-11 has been changed to PUSH_READY status.
https://errata.devel.redhat.com/advisory/115514

--- Additional comment from errata-xmlrpc on 2023-11-08 18:52:48 UTC ---

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832

Comment 11 errata-xmlrpc 2024-04-03 07:03:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.8 Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:1657

