Bug 2278849 - [RDR] [Hub recovery] [Co-situated] Workload deletion remains stuck forever reporting Deleting and does not progress
Summary: [RDR] [Hub recovery] [Co-situated] Workload deletion remains stuck forever reporting Deleting and does not progress
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Benamar Mekhissi
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-05-03 11:05 UTC by Aman Agrawal
Modified: 2024-05-06 21:00 UTC
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:



Description Aman Agrawal 2024-05-03 11:05:53 UTC
Description of problem (please be as detailed as possible and provide log
snippets):


Version of all relevant components (if applicable):
ACM 2.10.2 GA'ed
MCE 2.5.2
ODF 4.15.2-1 GA'ed
ceph version 17.2.6-209.el9cp (e9529323dd7ab3b0e8cdf84e17a1b58c2b42948c) quincy (stable)
OCP 4.15.0-0.nightly-2024-04-30-234425
Submariner 0.17.1 GA'ed
VolSync 0.9.1


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
*****Active hub co-situated with primary managed cluster*****

1. With multiple workloads (RBD and CephFS) of both Subscription and ApplicationSet (pull model) types, in different states (Deployed, FailedOver, Relocated), running on the primary managed cluster (C1): bring C1 down along with the active hub during a site failure at site-1, then perform hub recovery and move to the passive hub at site-2 (which is co-situated with the secondary managed cluster C2).
2. Ensure the available managed cluster C2 is successfully imported into the RHACM console of the passive hub and the DRPolicy gets validated.
3. After the DRPC is restored, fail over all the workloads to the available managed cluster C2.
4. When failover is successful, recover the down managed cluster C1 and ensure it is successfully cleaned up.
5. Let IOs continue for some time, then configure another hub cluster at site-1 to perform hub recovery one more time.
6. Deploy 1 RBD appset (pull)/sub workload and 1 CephFS appset (pull)/sub workload on C1 and fail them over to C2 (with both managed clusters up and running).
7. Now relocate some of the older workloads to managed cluster C1 (the cluster that was recovered post disaster) and leave the remaining workloads as they are on C2, in the FailedOver state.
8. After a successful relocate and cleanup, ensure new backups are taken, then perform hub recovery by bringing down the current active hub at site-2 and the C1 cluster at site-1. When moved to the new hub at site-1, ensure the available managed cluster C2 is successfully imported into the RHACM console of the passive hub and the DRPolicy gets validated.
9. When the DRPC is restored, check the Pods/PVCs/VRs/VRGs for the workloads that were running on the available cluster C2. Check their last action status on the RHACM console and try to fail them over.

So far the steps to reproduce are the same as BZ 2276222. Here, the primary workloads on C2 had become Secondary, so the workaround mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2276222#c9 was applied.

10. After failover to C2 completes (which happens as a result of applying the workaround), recover the down managed cluster C1 and ensure it is successfully cleaned up and that data sync resumes as expected.
11. Now configure another hub cluster for hub recovery, and perform hub recovery by bringing down the current active hub and the C1 cluster.
12. When moved to the new hub, ensure the C2 managed cluster is successfully imported, the DRPolicy is validated, and the VolumeSync.Delay alert is fired, since C1 is down and sync isn't progressing.
13. Now recover the down C1 managed cluster, let IOs continue for some time, and then delete all the workloads that are in the FailedOver state on C2 (see the sketch after this list).
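
For step 13, a minimal sketch of how the deletion and its progress can be driven from the CLI, assuming the applications themselves are removed via the RHACM console and the DRPC is then deleted on the hub (the namespace and DRPC name below are illustrative, taken from one of the stuck workloads):

    # On the hub: watch the CURRENTSTATE/PROGRESSION columns while deleting
    oc get drpc -A -o wide -w

    # Delete the DRPC for one Subscription workload (illustrative names)
    oc delete drpc rbd-sub-busybox10-placement-1-drpc -n busybox-workloads-10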



Actual results: Workload deletion remained stuck forever.

Hub-

amanagrawal@Amans-MacBook-Pro acm % drpc
NAMESPACE               NAME                                      AGE   PREFERREDCLUSTER    FAILOVERCLUSTER     DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION       PEER READY
busybox-workloads-10    rbd-sub-busybox10-placement-1-drpc        17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
busybox-workloads-101   rbd-sub-busybox101-placement-1-drpc       17h   amagrawa-c1-13apr                                      Deployed       Completed     2024-05-02T17:45:02Z   1.033378153s   True
busybox-workloads-103   cephfs-sub-busybox103-placement-1-drpc    17h   amagrawa-c1-13apr                                      Deployed       Completed     2024-05-02T17:45:05Z   578.362118ms   True
busybox-workloads-11    rbd-sub-busybox11-placement-1-drpc        17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
busybox-workloads-12    rbd-sub-busybox12-placement-1-drpc        17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
busybox-workloads-13    cephfs-sub-busybox13-placement-1-drpc     17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
busybox-workloads-14    cephfs-sub-busybox14-placement-1-drpc     17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
busybox-workloads-15    cephfs-sub-busybox15-placement-1-drpc     17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
busybox-workloads-16    cephfs-sub-busybox16-placement-1-drpc     17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
busybox-workloads-23    cephfs-sub-busybox23-placement-1-drpc     17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
busybox-workloads-24    rbd-sub-busybox24-placement-1-drpc        17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
busybox-workloads-27    cephfs-sub-busybox27-placement-1-drpc     17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
busybox-workloads-28    rbd-sub-busybox28-placement-1-drpc        17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
busybox-workloads-9     rbd-sub-busybox9-placement-1-drpc         17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
openshift-gitops        cephfs-appset-busybox102-placement-drpc   17h   amagrawa-c1-13apr                                      Deployed       Completed     2024-05-02T17:44:59Z   2.133922784s   True
openshift-gitops        cephfs-appset-busybox21-placement-drpc    17h   amagrawa-c2-13apr   amagrawa-c1-13apr   Failover       FailedOver     Deleting                                            True
openshift-gitops        cephfs-appset-busybox25-placement-drpc    17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
openshift-gitops        cephfs-appset-busybox5-placement-drpc     17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
openshift-gitops        cephfs-appset-busybox6-placement-drpc     17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
openshift-gitops        cephfs-appset-busybox8-placement-drpc     17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
openshift-gitops        rbd-appset-busybox1-placement-drpc        17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
openshift-gitops        rbd-appset-busybox100-placement-drpc      17h   amagrawa-c1-13apr                                      Deployed       Completed     2024-05-02T17:45:04Z   710.593214ms   True
openshift-gitops        rbd-appset-busybox2-placement-drpc        17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
openshift-gitops        rbd-appset-busybox22-placement-drpc       17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
openshift-gitops        rbd-appset-busybox26-placement-drpc       17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
openshift-gitops        rbd-appset-busybox3-placement-drpc        17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True
openshift-gitops        rbd-appset-busybox4-placement-drpc        17h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Deleting                                            True



No change of state was observed on the workloads: none of the resources entered the Terminating state, and the workload pods are still up and running.
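
A DRPC stuck in Deleting normally has its deletionTimestamp set while a Ramen finalizer is still pending; one way to confirm this from the hub, using standard Kubernetes metadata fields (the DRPC name below is illustrative):

    oc get drpc rbd-sub-busybox10-placement-1-drpc -n busybox-workloads-10 \
      -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'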

C2-


amanagrawal@Amans-MacBook-Pro c2 % busybox-3
Now using project "busybox-workloads-3" on server "https://api.amagrawa-c2-13apr.qe.rh-ocs.com:6443".
NAME                                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE   VOLUMEMODE
persistentvolumeclaim/busybox-pvc-41   Bound    pvc-6cc700d3-bc86-4f40-b217-262121d40589   42Gi       RWO            ocs-storagecluster-ceph-rbd   12d   Filesystem
persistentvolumeclaim/busybox-pvc-42   Bound    pvc-eb26f91a-2e22-43ed-b238-cad3e5eeb0a2   81Gi       RWO            ocs-storagecluster-ceph-rbd   12d   Filesystem
persistentvolumeclaim/busybox-pvc-43   Bound    pvc-3519f05a-e3bd-4772-9bb2-b1d3b5a231f0   28Gi       RWO            ocs-storagecluster-ceph-rbd   12d   Filesystem
persistentvolumeclaim/busybox-pvc-44   Bound    pvc-adde9d60-03ea-43c2-b59d-c15cbd3bfd6c   118Gi      RWO            ocs-storagecluster-ceph-rbd   12d   Filesystem

NAME                                                                AGE   VOLUMEREPLICATIONCLASS                 PVCNAME          DESIREDSTATE   CURRENTSTATE
volumereplication.replication.storage.openshift.io/busybox-pvc-41   12d   rbd-volumereplicationclass-539797778   busybox-pvc-41   primary        Primary
volumereplication.replication.storage.openshift.io/busybox-pvc-42   12d   rbd-volumereplicationclass-539797778   busybox-pvc-42   primary        Primary
volumereplication.replication.storage.openshift.io/busybox-pvc-43   12d   rbd-volumereplicationclass-539797778   busybox-pvc-43   primary        Primary
volumereplication.replication.storage.openshift.io/busybox-pvc-44   12d   rbd-volumereplicationclass-539797778   busybox-pvc-44   primary        Primary

NAME                                                                             DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-appset-busybox3-placement-drpc   primary        Primary

NAME                              READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
pod/busybox-41-5c55b45d49-vgh2x   1/1     Running   0          12d   10.128.3.142   compute-1   <none>           <none>
pod/busybox-42-6c6c94c475-pqq52   1/1     Running   0          12d   10.129.2.78    compute-2   <none>           <none>
pod/busybox-43-5b56997c7b-5hgld   1/1     Running   0          12d   10.129.2.77    compute-2   <none>           <none>
pod/busybox-44-57856dfdb-4v9tc    1/1     Running   0          12d   10.128.3.143   compute-1   <none>           <none>





amanagrawal@Amans-MacBook-Pro c2 % busybox-12
Now using project "busybox-workloads-12" on server "https://api.amagrawa-c2-13apr.qe.rh-ocs.com:6443".
NAME                                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE   VOLUMEMODE
persistentvolumeclaim/busybox-pvc-41   Bound    pvc-bffe79d1-e524-467d-9797-a48346a3a535   42Gi       RWO            ocs-storagecluster-ceph-rbd   12d   Filesystem
persistentvolumeclaim/busybox-pvc-42   Bound    pvc-16145311-ad20-4ec7-b3a4-2a5635eefad6   81Gi       RWO            ocs-storagecluster-ceph-rbd   12d   Filesystem
persistentvolumeclaim/busybox-pvc-43   Bound    pvc-c8fb5fcd-0e69-45ce-9f35-7b22d2b09767   28Gi       RWO            ocs-storagecluster-ceph-rbd   12d   Filesystem
persistentvolumeclaim/busybox-pvc-44   Bound    pvc-29ea5d4b-746b-4200-a7e7-cbb2e83984b4   118Gi      RWO            ocs-storagecluster-ceph-rbd   12d   Filesystem

NAME                                                                AGE   VOLUMEREPLICATIONCLASS                  PVCNAME          DESIREDSTATE   CURRENTSTATE
volumereplication.replication.storage.openshift.io/busybox-pvc-41   12d   rbd-volumereplicationclass-1625360775   busybox-pvc-41   primary        Primary
volumereplication.replication.storage.openshift.io/busybox-pvc-42   12d   rbd-volumereplicationclass-1625360775   busybox-pvc-42   primary        Primary
volumereplication.replication.storage.openshift.io/busybox-pvc-43   12d   rbd-volumereplicationclass-1625360775   busybox-pvc-43   primary        Primary
volumereplication.replication.storage.openshift.io/busybox-pvc-44   12d   rbd-volumereplicationclass-1625360775   busybox-pvc-44   primary        Primary

NAME                                                                             DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-sub-busybox12-placement-1-drpc   primary        Primary

NAME                              READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
pod/busybox-41-5c55b45d49-vq9tl   1/1     Running   0          12d   10.129.2.64    compute-2   <none>           <none>
pod/busybox-42-6c6c94c475-ftvww   1/1     Running   0          12d   10.129.2.66    compute-2   <none>           <none>
pod/busybox-43-5b56997c7b-gn6bt   1/1     Running   0          12d   10.131.0.179   compute-0   <none>           <none>
pod/busybox-44-57856dfdb-nkwn4    1/1     Running   0          12d   10.131.0.180   compute-0   <none>           <none>




amanagrawal@Amans-MacBook-Pro c2 % busybox-8 
Now using project "busybox-workloads-8" on server "https://api.amagrawa-c2-13apr.qe.rh-ocs.com:6443".
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE   VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-128441d5-0eb2-457e-aa0b-5f5052e96939   94Gi       RWX            ocs-storagecluster-cephfs   19d   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset-busybox8-placement-drpc   primary        Primary

NAME                             READY   STATUS    RESTARTS   AGE    IP             NODE        NOMINATED NODE   READINESS GATES
pod/busybox-1-7f9b67dc95-ps9ss   1/1     Running   0          5d1h   10.128.3.173   compute-1   <none>           <none>




amanagrawal@Amans-MacBook-Pro c2 % busybox-27
zsh: command not found: busybox-27
amanagrawal@Amans-MacBook-Pro c2 % oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-27
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE   VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-b57ddefe-b1f8-4985-b2d4-58d814579c80   94Gi       RWX            ocs-storagecluster-cephfs   15d   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox27-placement-1-drpc   primary        Primary

NAME                             READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
pod/busybox-1-7f9b67dc95-4ddh8   1/1     Running   0          12d   10.128.3.137   compute-1   <none>           <none>



Expected results: Workload deletion should be successful. Pods/PVCs/VRs/VRGs/PVs and their backing images should be cleaned up.
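
To verify the backing images are actually gone once cleanup completes, the rook-ceph toolbox on the managed cluster can be queried; a sketch assuming the default ODF pool/filesystem names and that the toolbox deployment is enabled:

    oc -n openshift-storage rsh deploy/rook-ceph-tools rbd ls ocs-storagecluster-cephblockpool
    oc -n openshift-storage rsh deploy/rook-ceph-tools ceph fs subvolume ls ocs-storagecluster-cephfilesystem --group_name csi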


Additional info:

Comment 8 Aman Agrawal 2024-05-04 09:22:54 UTC
I couldn't try the workaround, but yes, cleanup completed after the 24-hour eviction period, which starts as soon as the managed cluster successfully connects to the passive hub during/after hub recovery.
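
For reference, the eviction described above can be observed on the managed cluster through the AppliedManifestWork resources left behind by the old hub (a sketch; the exact resource names carry a per-hub hash):

    oc get appliedmanifestworks.work.open-cluster-management.io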

