Bug 2279260 - [RDR] [Hub recovery] [Co-situated] Relocate operation and cleanup after failover remains stuck during the eviction period timeout
Summary: [RDR] [Hub recovery] [Co-situated] Relocate operation and cleanup after failo...
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Benamar Mekhissi
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-05-06 07:41 UTC by Aman Agrawal
Modified: 2025-05-08 07:28 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OCSBZM-8322 0 None None None 2024-11-04 06:40:00 UTC

Description Aman Agrawal 2024-05-06 07:41:50 UTC
Description of problem (please be as detailed as possible and provide log
snippets):


Version of all relevant components (if applicable):
ACM 2.10.2 GA'ed
MCE 2.5.2
ODF 4.15.2-1 GA'ed
ceph version 17.2.6-209.el9cp (e9529323dd7ab3b0e8cdf84e17a1b58c2b42948c) quincy (stable)
OCP 4.15.0-0.nightly-2024-04-30-234425
Submariner 0.17.1 GA'ed
VolSync 0.9.1


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
*****Active hub co-situated with primary managed cluster*****

1. Have multiple workloads (RBD and CephFS) of both Subscription and ApplicationSet types (pull model) in the Deployed state running on the primary managed cluster (C1). When C1 goes down along with the active hub cluster during a site failure at site-1, perform hub recovery and move to the passive hub at site-2 (which is co-situated with the secondary managed cluster C2).
2. Ensure the surviving managed cluster C2 is successfully imported on the RHACM console of the passive hub and that the DRPolicy gets validated.
3. After the DRPC is restored, recover the down managed cluster C1 and ensure it is successfully imported on the RHACM console.
4. Let IOs continue for some time (30 min-1 hr) and ensure data sync is progressing well.
5. With both managed clusters up and running, fail over some of the workloads and relocate the remaining ones to the C2 managed cluster during the eviction period timeout (currently set to 24 hrs); a CLI sketch for triggering failover/relocate follows these steps.
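
Failover/relocate in this run was driven from the ACM console; a minimal equivalent CLI sketch, assuming the DRPC names and namespaces shown in the outputs below and that patching spec.action directly is acceptable in the test environment:

# Fail over a workload to C2 (names taken from this setup)
oc patch drpc rbd-appset-busybox100-placement-drpc -n openshift-gitops --type merge \
  -p '{"spec":{"action":"Failover","failoverCluster":"amagrawa-c2-13apr"}}'

# Relocate a workload to its preferred cluster (C2 in this scenario)
oc patch drpc rbd-sub-busybox101-placement-1-drpc -n busybox-workloads-101 --type merge \
  -p '{"spec":{"action":"Relocate"}}'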


Actual results: [RDR] [Hub recovery] [Co-situated] Relocate operation and cleanup after failover remain stuck during the eviction period timeout


Hub-

oc get drpc -o wide -A
NAMESPACE               NAME                                      AGE   PREFERREDCLUSTER    FAILOVERCLUSTER     DESIREDSTATE   CURRENTSTATE   PROGRESSION                   START TIME             DURATION   PEER READY
busybox-workloads-101   rbd-sub-busybox101-placement-1-drpc       14h   amagrawa-c2-13apr   amagrawa-c1-13apr   Relocate       Relocating     EnsuringVolumesAreSecondary   2024-05-05T17:32:04Z              False
busybox-workloads-103   cephfs-sub-busybox103-placement-1-drpc    14h   amagrawa-c2-13apr   amagrawa-c1-13apr   Relocate       Relocating     RunningFinalSync              2024-05-05T17:31:53Z              True
openshift-gitops        cephfs-appset-busybox102-placement-drpc   14h   amagrawa-c2-13apr   amagrawa-c1-13apr   Relocate       Relocating     RunningFinalSync              2024-05-05T17:31:45Z              True
openshift-gitops        rbd-appset-busybox100-placement-drpc      14h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Cleaning Up                   2024-05-05T17:31:57Z              False


Failover of rbd-appset-busybox100-placement-drpc worked but its cleanup is stuck, and relocate of all the other workloads is stuck as well.
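
To see where a stuck relocate is held up, the DRPC progression and status conditions can be inspected on the hub. A sketch using one of the stuck DRPCs above; the jsonpath fields assume the usual Ramen DRPC status layout:

oc get drpc rbd-sub-busybox101-placement-1-drpc -n busybox-workloads-101 \
  -o jsonpath='{.status.progression}{"\n"}'
oc get drpc rbd-sub-busybox101-placement-1-drpc -n busybox-workloads-101 \
  -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{" reason="}{.reason}{"\n"}{end}'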


Workloads failed over/relocated from C1 to C2:


C2-


oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-100
NAME                                                                AGE   VOLUMEREPLICATIONCLASS                 PVCNAME          DESIREDSTATE   CURRENTSTATE
volumereplication.replication.storage.openshift.io/busybox-pvc-41   14h   rbd-volumereplicationclass-473128587   busybox-pvc-41   primary        Primary

NAME                                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE   VOLUMEMODE
persistentvolumeclaim/busybox-pvc-41   Bound    pvc-7c5e424d-b75a-495d-8745-4d3220fc48e6   42Gi       RWO            ocs-storagecluster-ceph-rbd   14h   Filesystem

NAME                                                                               DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-appset-busybox100-placement-drpc   primary        Primary

NAME                              READY   STATUS    RESTARTS   AGE   IP            NODE        NOMINATED NODE   READINESS GATES
pod/busybox-41-5c55b45d49-qh7v8   1/1     Running   0          13h   10.129.2.51   compute-2   <none>           <none>




oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-101
No resources found in busybox-workloads-101 namespace.




oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-102
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE     VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-542ab575-db38-4187-bbb0-70697ea232f3   94Gi       RWX            ocs-storagecluster-cephfs   3d16h   Filesystem

NAME                                                                                  DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset-busybox102-placement-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-mw285   1/1     Running   0          2m41s   10.129.2.156   compute-2   <none>           <none>




oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-103
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE     VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-3256f1e5-79e3-43ff-96cb-e0b727ffcc74   94Gi       RWX            ocs-storagecluster-cephfs   3d16h   Filesystem

NAME                                                                                 DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox103-placement-1-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-2q7zp   1/1     Running   0          2m46s   10.129.2.155   compute-2   <none>           <none>



C1-


oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-100
NAME                                                                AGE     VOLUMEREPLICATIONCLASS                 PVCNAME          DESIREDSTATE   CURRENTSTATE
volumereplication.replication.storage.openshift.io/busybox-pvc-41   3d16h   rbd-volumereplicationclass-473128587   busybox-pvc-41   primary        Primary

NAME                                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE     VOLUMEMODE
persistentvolumeclaim/busybox-pvc-41   Bound    pvc-7c5e424d-b75a-495d-8745-4d3220fc48e6   42Gi       RWO            ocs-storagecluster-ceph-rbd   3d16h   Filesystem

NAME                                                                               DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-appset-busybox100-placement-drpc   secondary      Primary

NAME                              READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/busybox-41-5c55b45d49-fngg2   1/1     Running   2          3d16h   10.128.3.196   compute-0   <none>           <none>
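
On C1 the VRG for busybox-workloads-100 still reports CURRENTSTATE Primary even though the desired state is secondary, and the application pod is still running, which matches the stuck cleanup. One way to dig further is to check the VRG conditions and the Ramen DR cluster operator log on C1; this is only a sketch, and the namespace/deployment names assume a default ODF DR install:

oc get vrg rbd-appset-busybox100-placement-drpc -n busybox-workloads-100 \
  -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{" reason="}{.reason}{"\n"}{end}'
oc logs -n openshift-dr-system deployment/ramen-dr-cluster-operator --since=1h | grep -i busybox-workloads-100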




oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-101
NAME                                                                AGE     VOLUMEREPLICATIONCLASS                 PVCNAME          DESIREDSTATE   CURRENTSTATE
volumereplication.replication.storage.openshift.io/busybox-pvc-41   3d16h   rbd-volumereplicationclass-473128587   busybox-pvc-41   primary        Primary

NAME                                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE     VOLUMEMODE
persistentvolumeclaim/busybox-pvc-41   Bound    pvc-ea52541b-acb4-4ecb-afd6-a00925bf3583   42Gi       RWO            ocs-storagecluster-ceph-rbd   3d16h   Filesystem

NAME                                                                              DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-sub-busybox101-placement-1-drpc   secondary      Primary

NAME                              READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/busybox-41-5c55b45d49-59tbp   1/1     Running   2          3d16h   10.128.3.198   compute-0   <none>           <none>




oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-102
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE     VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-46f34593-ba2f-435c-801d-66b7371fd359   94Gi       RWX            ocs-storagecluster-cephfs   3d16h   Filesystem

NAME                                                                                  DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset-busybox102-placement-drpc   primary        Primary

NAME                             READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/busybox-1-7f9b67dc95-wq4tr   1/1     Running   2          3d16h   10.128.3.208   compute-0   <none>           <none>




oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-103
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE     VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-40f27ae0-e3f5-4e74-822e-0eab289f3232   94Gi       RWX            ocs-storagecluster-cephfs   3d16h   Filesystem

NAME                                                                                 DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox103-placement-1-drpc   primary        Primary

NAME                             READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/busybox-1-7f9b67dc95-v6h4s   1/1     Running   2          3d16h   10.128.3.209   compute-0   <none>           <none>




Logs- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/05may24/


Expected results: The admin should be able to successfully fail over/relocate the workloads after hub recovery, independent of the eviction period timeout.


Additional info:

Comment 3 Aman Agrawal 2024-05-06 08:43:59 UTC
For ease of reference, the DRPolicy was validated on the passive hub around Sun May  5 17:01:00 UTC 2024, and failover/relocate was performed close to Sun May  5 17:32:24 UTC 2024.
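
If it helps to confirm the validation time, the Validated condition on the DRPolicy records it; a sketch, with the policy name as a placeholder:

oc get drpolicy <drpolicy-name> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{" lastTransitionTime="}{.lastTransitionTime}{"\n"}{end}'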

Comment 5 Aman Agrawal 2024-05-06 17:55:43 UTC
Relocate, and cleanup of the failed-over workload, completed successfully after the eviction period expired.

