+++ This bug was initially created as a clone of Bug #2224325 +++

Description of problem (please be detailed as possible and provide log snippets):

Facing an issue while relocating the logwriter (StatefulSet) app from the c2 to the c1 managed cluster on an MDR 4.13 setup. I applied the workaround of manually deleting the terminating logwriter PVCs after initiating the relocate, as mentioned here: https://bugzilla.redhat.com/show_bug.cgi?id=2118270#c27. However, the PVCs are still stuck in the Terminating state (the oc delete pvc command hangs; a quick check of what is holding the PVCs is sketched below, after the first comments).

oc get drpc -n logwritter-sub-1 -owide
NAME                                AGE    PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION                 START TIME             DURATION   PEER READY
logwritter-sub-1-placement-1-drpc   178m   pbyregow-clu1      pbyregow-clu2     Relocate       Relocating     WaitingForResourceRestore   2023-07-20T09:05:17Z

oc get pvc -n logwritter-sub-1
NAME                            STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
logwriter-cephfs-many           Terminating   pvc-04a6351c-7d9e-4849-9340-0145de801349   10Gi       RWX            ocs-external-storagecluster-cephfs     146m
logwriter-rbd-logwriter-rbd-0   Terminating   pvc-13323f74-5b0e-415e-b11a-6b1d42cbdf45   10Gi       RWO            ocs-external-storagecluster-ceph-rbd   146m
logwriter-rbd-logwriter-rbd-1   Terminating   pvc-a3bd7e24-b21e-42dc-9e40-236818c6ed7f   10Gi       RWO            ocs-external-storagecluster-ceph-rbd   146m
logwriter-rbd-logwriter-rbd-2   Terminating   pvc-69ba12c8-5292-4524-849c-e3a32715d905   10Gi       RWO            ocs-external-storagecluster-ceph-rbd   146m

oc get vrg logwritter-sub-1-placement-1-drpc -n logwritter-sub-1
NAME                                DESIREDSTATE   CURRENTSTATE
logwritter-sub-1-placement-1-drpc   secondary      Secondary

Version of all relevant components (if applicable):
ODF/MCO: 4.13.0
ACM: 2.8

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
no

Is there any workaround available to the best of your knowledge?
will be updated

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:
yes

Steps to Reproduce:
1. Configure an MDR cluster
2. Create a StatefulSet subscription/application-based app
3. Fail over the app from c1 to c2
4. Initiate relocate and apply the known workaround: https://bugzilla.redhat.com/show_bug.cgi?id=2118270#c27

Actual results:
The STS application gets stuck in the Relocating state

Expected results:
The STS application should be relocated

Additional info:

--- Additional comment from Parikshith on 2023-07-20 12:23:37 UTC ---

Logs at http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/logwriter/

--- Additional comment from RHEL Program Management on 2023-07-20 12:23:47 UTC ---

This bug, having no release flag set previously, is now set with release flag 'odf-4.14.0' to '?', and so is proposed to be fixed in the ODF 4.14.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.
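As background for the hanging delete described above: a PVC stays in Terminating until every finalizer on it has been cleared by its owning controller. A minimal diagnostic sketch, using the namespace and PVC names from this report (adjust to your environment), is:

# Show which finalizers are still holding the PVC; deletion completes only
# once the owning controllers remove them
oc get pvc logwriter-cephfs-many -n logwritter-sub-1 -o jsonpath='{.metadata.finalizers}{"\n"}'

# Check whether any pod still mounts the claim, which keeps the standard
# kubernetes.io/pvc-protection finalizer in place
oc get pods -n logwritter-sub-1 -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.volumes[*].persistentVolumeClaim.claimName}{"\n"}{end}'

This only narrows down why the delete hangs; whether it is safe to act on a finalizer by hand depends on the DR state, which the analysis in the comments below covers.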
--- Additional comment from Shyamsundar on 2023-07-20 12:33:23 UTC ---

The issue is as follows:
- VRG on PVC restore finds an existing PVC (as STS PVCs are not deleted post failover, to reduce user management of the same)
- VRG further determines the PVC as not being restored by Ramen (the restore annotation is missing, as this was the initial PVC created on the Primary before a failover)
- VRG loops on reconcile, returning errors and not marking ClusterDataReady

Logs and analysis:

- DRPC reports "WaitingForResourceRestore"

- VRG status on the preferredCluster reports:

  - lastTransitionTime: "2023-07-20T09:06:50Z"
    message: 'Failed to restore PVs (failed to restore ClusterData for VolRep (failed
      to restore PVs and PVCs using profile list ([s3profile-pbyregow-clu1-ocs-external-storagecluster
      s3profile-pbyregow-clu2-ocs-external-storagecluster]): failed to restore all
      []client.Object. Total/Restored 4/1))'
    observedGeneration: 1
    reason: Error
    status: "False"
    type: ClusterDataReady

- VRG on the preferredCluster is not progressing on restore, with the following errors in the log:

2023-07-20T08:56:16.191Z INFO controllers.VolumeReplicationGroup.vrginstance controllers/vrg_volrep.go:1772 Warning: Mismatch in PV/PVC count 4/1 (failed to restore all []client.Object. Total/Restored 4/1) {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary"}
2023-07-20T08:56:16.422Z INFO controllers.VolumeReplicationGroup.vrginstance controllers/vrg_volrep.go:1800 Found 4 PVs in s3 store using profile s3profile-pbyregow-clu2-ocs-external-storagecluster {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary"}
2023-07-20T08:56:16.429Z INFO controllers.VolumeReplicationGroup.vrginstance controllers/vrg_volrep.go:1952 Existing PV matches and is bound to the same claim {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary", "PV": "pvc-0e63cdbd-d38f-47a0-b506-c361ffa7c5b0"}
2023-07-20T08:56:16.437Z INFO controllers.VolumeReplicationGroup.vrginstance controllers/vrg_volrep.go:1952 Existing PV matches and is bound to the same claim {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary", "PV": "pvc-5e68b0be-64f5-4065-b291-58d2215a885d"}
2023-07-20T08:56:16.442Z INFO controllers.VolumeReplicationGroup.vrginstance controllers/vrg_volrep.go:1952 Existing PV matches and is bound to the same claim {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary", "PV": "pvc-d9800583-9367-408b-ac9f-ce7ee4943a98"}
2023-07-20T08:56:16.448Z INFO controllers.VolumeReplicationGroup.vrginstance controllers/vrg_volrep.go:1952 Existing PV matches and is bound to the same claim {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary", "PV": "pvc-dcbbb474-d228-4a1e-8092-d76980153da3"}
2023-07-20T08:56:16.448Z INFO controllers.VolumeReplicationGroup.vrginstance controllers/vrg_volrep.go:1911 Restored 4 PV for VolRep {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary"}
2023-07-20T08:56:16.591Z INFO controllers.VolumeReplicationGroup.vrginstance controllers/vrg_volrep.go:1826 Found 4 PVCs in s3 store using profile s3profile-pbyregow-clu2-ocs-external-storagecluster {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary"}
2023-07-20T08:56:16.596Z INFO controllers.VolumeReplicationGroup.vrginstance controllers/vrg_volrep.go:2000 PVC exists and managed by Ramen {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary", "PVC": {"apiVersion": "v1", "kind": "PersistentVolumeClaim", "namespace": "appset-logwriter-app-1", "name": "logwriter-cephfs-many"}}
2023-07-20T08:56:16.601Z INFO controllers.VolumeReplicationGroup.vrginstance controllers/vrg_volrep.go:1888 Object exists. Ignoring and moving to next object {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary", "error": "found PVC object not restored by Ramen for PVC logwriter-rbd-logwriter-rbd-0"}
2023-07-20T08:56:16.606Z INFO controllers.VolumeReplicationGroup.vrginstance controllers/vrg_volrep.go:1888 Object exists. Ignoring and moving to next object {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary", "error": "found PVC object not restored by Ramen for PVC logwriter-rbd-logwriter-rbd-1"}
2023-07-20T08:56:16.610Z INFO controllers.VolumeReplicationGroup.vrginstance controllers/vrg_volrep.go:1888 Object exists. Ignoring and moving to next object {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary", "error": "found PVC object not restored by Ramen for PVC logwriter-rbd-logwriter-rbd-2"}
2023-07-20T08:56:16.610Z INFO controllers.VolumeReplicationGroup.vrginstance controllers/vrg_volrep.go:1772 Warning: Mismatch in PV/PVC count 4/1 (failed to restore all []client.Object. Total/Restored 4/1) {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary"}
2023-07-20T08:56:16.610Z INFO controllers.VolumeReplicationGroup.vrginstance controllers/vrg_volrep.go:1721 failed to restore PVs and PVCs using profile list ([s3profile-pbyregow-clu1-ocs-external-storagecluster s3profile-pbyregow-clu2-ocs-external-storagecluster]) {"VolumeReplicationGroup": "appset-logwriter-app-1/logwriter-app-1-placement-drpc", "rid": "f67001dc-c106-4f36-af4b-277583c39d17", "State": "primary"}

--- Additional comment from Shyamsundar on 2023-07-20 12:35:48 UTC ---

Workaround:

- Delete the PVCs on the preferredCluster when stuck in this phase, as reported by the DRPC on relocate: WaitingForResourceRestore
- Technically, we should first ensure it is the PVC restore that is causing the error, and not blindly delete the PVCs. The step in that regard is to delete only the PVCs that do not carry the restored-by-Ramen annotation
- Relocate will then make the required progress

QE tested the above and ensured that this works as desired.
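To make the selective-delete step above concrete, a hedged sketch of how one might list and delete only the PVCs on the preferredCluster that lack Ramen's restore annotation. The exact annotation key is not spelled out in this bug, so the "ramendr" pattern below is an assumption; confirm the real key by inspecting a PVC that Ramen did restore before relying on this:

# Print the names of PVCs whose annotations contain no Ramen restore marker
# (the "ramendr" pattern is an assumption -- verify the actual key first)
oc get pvc -n logwritter-sub-1 -o json \
  | jq -r '.items[] | select((.metadata.annotations // {}) | keys | map(test("ramendr")) | any | not) | .metadata.name'

# Review the list, then delete only those PVCs, for example:
oc delete pvc logwriter-rbd-logwriter-rbd-0 -n logwritter-sub-1

Checking the annotation first keeps the deletion limited to PVCs that Ramen did not restore, as the comment above recommends, rather than deleting all PVCs in the namespace.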
--- Additional comment from Shyamsundar on 2023-07-20 12:36:32 UTC ---

WIP upstream PR: https://github.com/RamenDR/ramen/pull/995

--- Additional comment from RHEL Program Management on 2023-07-21 06:49:05 UTC ---

This BZ is being approved for the ODF 4.14.0 release, upon receipt of the 3 ACKs (PM, Devel, QA) for the release flag 'odf-4.14.0'.

--- Additional comment from RHEL Program Management on 2023-07-21 06:49:05 UTC ---

Since this bug has been approved for the ODF 4.14.0 release through release flag 'odf-4.14.0+', the Target Release is being set to 'ODF 4.14.0'.

--- Additional comment from Harish NV Rao on 2023-07-21 06:56:06 UTC ---

(In reply to Shyamsundar from comment #4)
> Workaround:
>
> - Delete PVCs on the preferredCluster when stuck in this phase as per DRPC
> on relocate: WaitingForResourceRestore
> - Well technically we should ensure it is the PVC restore that is causing
> the error, and not blindly delete the PVCs. So steps in that regard would be
> to delete PVCs that do not have the restored by ramen annotation
> - Relocate will make the required progress
>
> QE tested the above and ensured that this works as desired.

This needs to be fixed in 4.13.z. Until then, it should be part of the 4.13 RN as a known issue.

Shyam, IMO this BZ needs to be cloned for 4.13.z and made part of the RN until it is fixed. Is this fine?

--- Additional comment from Harish NV Rao on 2023-08-01 06:07:04 UTC ---

(In reply to Harish NV Rao from comment #8)
> (In reply to Shyamsundar from comment #4)
> > Workaround:
> >
> > - Delete PVCs on the preferredCluster when stuck in this phase as per DRPC
> > on relocate: WaitingForResourceRestore
> > - Well technically we should ensure it is the PVC restore that is causing
> > the error, and not blindly delete the PVCs. So steps in that regard would be
> > to delete PVCs that do not have the restored by ramen annotation
> > - Relocate will make the required progress
> >
> > QE tested the above and ensured that this works as desired.
>
> This needs to be fixed in 4.13.z. Until then it should be part of 4.13 RN as
> known issue.
>
> Shyam, IMO this bz needs to be cloned for 4.13.z and make it part of RN till
> fixed. Is this fine?

I am setting the doc type as Known Issue for this BZ so that it can get into the 4.13 RN.

--- Additional comment from errata-xmlrpc on 2023-08-03 06:57:25 UTC ---

This bug has been added to advisory RHBA-2023:115514 by the ceph-build service account (ceph-build.COM)

--- Additional comment from Red Hat Bugzilla on 2023-08-03 08:28:57 UTC ---

Account disabled by LDAP Audit

--- Additional comment from Raghavendra Talur on 2023-09-25 12:44:53 UTC ---

Rtalur to update the test procedure for this bug.

--- Additional comment from avdhoot on 2023-10-02 07:36:13 UTC ---

Hi rtalur, I observed that after applying the workaround on the secondary, I am able to relocate the app to the primary. It is not required to delete the PVCs on the preferredCluster as mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2224325#c4

OCP: 4.14.0
ODF: 4.14.0-139
ACM: 2.9.0-165

--- Additional comment from avdhoot on 2023-10-03 06:43:01 UTC ---

Marking it as verified, as I am able to relocate the STS app using the steps mentioned in the description.

--- Additional comment from errata-xmlrpc on 2023-11-08 17:53:54 UTC ---

Bug report changed to RELEASE_PENDING status by Errata System.
Advisory RHSA-2023:115514-11 has been changed to PUSH_READY status.
https://errata.devel.redhat.com/advisory/115514

--- Additional comment from errata-xmlrpc on 2023-11-08 18:52:48 UTC ---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.8 Bug Fix Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:1657