Bug 2216676

Summary: [RDR][ACM-Tracker] Cleanup of primary cluster remains stuck for app-set when failover is performed
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Karolin Seeger <kseeger>
Component: documentation
Assignee: Karolin Seeger <kseeger>
Status: ON_QA
QA Contact: Aman Agrawal <amagrawa>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.13
CC: amagrawa, kramdoss, muagarwa, odf-bz-bot
Target Milestone: ---
Target Release: ODF 4.14.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Documentation: ---
Verified Versions:

Description Karolin Seeger 2023-06-22 08:03:24 UTC
Bug tracking the same issue for MDR: https://bugzilla.redhat.com/show_bug.cgi?id=2185953


Description of problem (please be as detailed as possible and provide log
snippets):
Continuing from this comment: https://bugzilla.redhat.com/show_bug.cgi?id=2184748#c3
The cluster had been in the same state for 4 days with the workloads running on C2. All pre-checks were then performed and a failover operation from C2 to C1 was triggered (a sketch of the hub-side pre-check follows; detailed steps to reproduce are listed further below).
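
A minimal sketch of the kind of hub-side pre-check referred to above, assuming the RamenDR DRPC resource; the DRPC name and namespace are placeholders, and the printed columns vary by ODF/ACM version:

# On the hub: confirm DRPC state and peer readiness before triggering failover
oc get drpc -A -o wide
oc get drpc <drpc-name> -n <drpc-namespace> -o jsonpath='{.status.phase}{"\n"}'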

Version of all relevant components (if applicable):
ACM 2.7.2
ODF 4.13.0-121.stable

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Ran IOs (app-set based workloads) on C2 for a few days as mentioned above in the Description.
2. Shut down all master nodes of C2
3. Scale down rbd-mirror daemon pod on C1
4. Edit drpc yaml from hub and trigger failover to C1
5. Scale up rbd-mirror daemon pod on C1 when failover completes
6. Bring up the C2 master nodes after a few hours (2-3 hrs)
7. Observe C2 and wait for cleanup to complete
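
For reference, a minimal command-level sketch of steps 3-5 above. The rbd-mirror deployment name (rook-ceph-rbd-mirror-a) is the typical default but may differ per environment, and the DRPC name/namespace are placeholders:

# Step 3 (on C1): scale down the rbd-mirror daemon
oc scale deployment rook-ceph-rbd-mirror-a -n openshift-storage --replicas=0

# Step 4 (on the hub): trigger failover by setting the DRPC action and failover cluster
oc patch drpc <drpc-name> -n <drpc-namespace> --type merge \
  -p '{"spec":{"action":"Failover","failoverCluster":"<C1-cluster-name>"}}'

# Step 5 (on C1): scale the rbd-mirror daemon back up once failover completes
oc scale deployment rook-ceph-rbd-mirror-a -n openshift-storage --replicas=1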


Actual results: Cleanup of primary cluster remains stuck for app-set when failover is performed

C2-

The VR and VRG resources were cleaned up in this case, but the pods/PVCs remain stuck indefinitely.

amagrawa:~$ oc get pods,pvc,vr,vrg
NAME                              READY   STATUS    RESTARTS   AGE
pod/busybox-41-6b687497df-25zdg   1/1     Running   0          4d20h
pod/busybox-42-5479f6d5dc-4xz5t   1/1     Running   0          4d20h
pod/busybox-43-6d57d9d898-lcf2c   1/1     Running   0          4d20h
pod/busybox-44-6985f98f44-sjtv6   1/1     Running   0          4d20h
pod/busybox-45-7879f49d7b-jh5fd   1/1     Running   0          4d20h
pod/busybox-46-54bc657fc4-s8v5w   1/1     Running   0          4d20h
pod/busybox-47-5bfdc6d579-qwscg   1/1     Running   0          4d20h
pod/busybox-48-58dd4fc4b4-wcp89   1/1     Running   0          4d20h
pod/busybox-49-799ddc584-hxm8w    1/1     Running   0          4d20h
pod/busybox-50-58588b9ffb-dxn2b   1/1     Running   0          4d20h
pod/busybox-51-54868dd48d-8q8hh   1/1     Running   0          4d20h
pod/busybox-52-5b64fb9cff-9g28m   1/1     Running   0          4d20h
pod/busybox-53-699dff5bd4-k5mqr   1/1     Running   0          4d20h
pod/busybox-54-788744468c-drwss   1/1     Running   0          4d20h
pod/busybox-55-6bc89678b4-4rckw   1/1     Running   0          4d20h
pod/busybox-56-db586d8c8-z4qzt    1/1     Running   0          4d20h
pod/busybox-57-759979888c-kx462   1/1     Running   0          4d20h
pod/busybox-58-84fb689c4f-bm6cp   1/1     Running   0          4d20h
pod/busybox-59-59b77d856c-jj5xq   1/1     Running   0          4d20h
pod/busybox-60-57d4ff68d-hq9cd    1/1     Running   0          4d20h

NAME                                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
persistentvolumeclaim/busybox-pvc-41   Bound    pvc-d0b72be5-22f5-45ba-bbf6-2281fdebefbf   42Gi       RWO            ocs-storagecluster-ceph-rbd   4d20h
persistentvolumeclaim/busybox-pvc-42   Bound    pvc-9b9cf55a-a75c-4d30-97d7-4b9ff2722431   81Gi       RWO            ocs-storagecluster-ceph-rbd   4d20h
persistentvolumeclaim/busybox-pvc-43   Bound    pvc-dcf4c325-adc5-48bd-8419-e09e6a787a39   28Gi       RWO            ocs-storagecluster-ceph-rbd   4d20h
persistentvolumeclaim/busybox-pvc-44   Bound    pvc-e00a4ac9-e813-4a51-8c22-56b8731e4bb7   118Gi      RWO            ocs-storagecluster-ceph-rbd   4d20h
persistentvolumeclaim/busybox-pvc-45   Bound    pvc-13f42be9-36c4-414f-b492-0c0340b29afa   19Gi       RWO            ocs-storagecluster-ceph-rbd   4d20h
persistentvolumeclaim/busybox-pvc-46   Bound    pvc-03235a82-37e8-4d6c-ad48-470e8e98fdd7   129Gi      RWO            ocs-storagecluster-ceph-rbd   4d20h
persistentvolumeclaim/busybox-pvc-47   Bound    pvc-c59917a7-1d32-46f6-a4b1-59855ea47070   43Gi       RWO            ocs-storagecluster-ceph-rbd   4d20h
persistentvolumeclaim/busybox-pvc-48   Bound    pvc-5a4f302b-2cad-470f-9b8c-108150635fdf   57Gi       RWO            ocs-storagecluster-ceph-rbd   4d20h
persistentvolumeclaim/busybox-pvc-49   Bound    pvc-e28d6929-9446-4b02-bd41-5a24f0a28d2d   89Gi       RWO            ocs-storagecluster-ceph-rbd   4d20h
persistentvolumeclaim/busybox-pvc-50   Bound    pvc-8eedf9cb-e49b-487d-a4e1-97a41a30099c   124Gi      RWO            ocs-storagecluster-ceph-rbd   4d20h
persistentvolumeclaim/busybox-pvc-51   Bound    pvc-f18cf24a-5b40-490c-994d-b12ed9600a45   95Gi       RWO            ocs-storagecluster-ceph-rbd   4d20h
persistentvolumeclaim/busybox-pvc-52   Bound    pvc-5560146e-8134-4e2c-b5b6-368d5294f30c   129Gi      RWO            ocs-storagecluster-ceph-rbd   4d20h
persistentvolumeclaim/busybox-pvc-53   Bound    pvc-dc442fe7-d45f-4d95-86d0-2139b71a5e04   51Gi       RWO            ocs-storagecluster-ceph-rbd   4d20h
persistentvolumeclaim/busybox-pvc-54   Bound    pvc-36da406f-42c9-4761-8139-0131ea9d951b   30Gi       RWO            ocs-storagecluster-ceph-rbd   4d20h
persistentvolumeclaim/busybox-pvc-55   Bound    pvc-ad09eb08-cac7-4902-ba2c-b22acc1c7586   102Gi      RWO            ocs-storagecluster-ceph-rbd   4d20h
persistentvolumeclaim/busybox-pvc-56   Bound    pvc-d879c820-f6c2-4a3c-b7e8-2112b703e936   40Gi       RWO            ocs-storagecluster-ceph-rbd   4d20h
persistentvolumeclaim/busybox-pvc-57   Bound    pvc-dfbd0f5f-4f7b-4700-80fc-58ae403abc42   146Gi      RWO            ocs-storagecluster-ceph-rbd   4d20h
persistentvolumeclaim/busybox-pvc-58   Bound    pvc-86da9780-49c6-4bbf-b7de-053b9886568d   63Gi       RWO            ocs-storagecluster-ceph-rbd   4d20h
persistentvolumeclaim/busybox-pvc-59   Bound    pvc-11f13a73-f21c-467f-b678-52e168471e66   118Gi      RWO            ocs-storagecluster-ceph-rbd   4d20h
persistentvolumeclaim/busybox-pvc-60   Bound    pvc-776748d4-ff0e-4a14-9598-31a9ef8019ab   25Gi       RWO            ocs-storagecluster-ceph-rbd   4d20h


No events are seen on the pods/PVCs running on C2.
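
A minimal sketch of the follow-up checks used to inspect the stuck resources (run in the workload namespace on C2; busybox-pvc-41 is one of the stuck PVCs listed above):

# Check whether the PVC is pending deletion and which finalizers remain
oc get pvc busybox-pvc-41 -o jsonpath='{.metadata.deletionTimestamp}{"  "}{.metadata.finalizers}{"\n"}'

# Check events on the PVC and its pod
oc describe pvc busybox-pvc-41
oc describe pod busybox-41-6b687497df-25zdg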

Expected results: Cleanup should complete


Additional info:

Comment 3 Karolin Seeger 2023-06-29 07:58:36 UTC
Documentation is available here and has been successfully tested for MDR: https://bugzilla.redhat.com/show_bug.cgi?id=2185953#c24.
Moving bug to ON_QA.

Comment 4 Aman Agrawal 2023-07-11 05:52:47 UTC
Verification of this bug is blocked by https://issues.redhat.com/browse/ACM-5796, as the Submariner connectivity issue occurs consistently.