Bug 2248664 - [RDR] [Hub recovery] A workload which was in failedover state before hub recovery goes to cleaning up on passive hub
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Shyamsundar
QA Contact: Aman Agrawal
URL:
Whiteboard:
Depends On:
Blocks: 2248826
 
Reported: 2023-11-08 08:45 UTC by Aman Agrawal
Modified: 2024-03-19 15:28 UTC
CC List: 3 users

Fixed In Version: 4.15.0-130
Doc Type: No Doc Update
Doc Text:
Clone Of:
Cloned to: 2248826
Environment:
Last Closed: 2024-03-19 15:28:28 UTC
Embargoed:




Links:
Github red-hat-storage/ramen pull 162 (Merged): Syncing latest changes from upstream main for ramen (last updated 2024-01-31 19:12:39 UTC)
Red Hat Product Errata RHSA-2024:1383 (last updated 2024-03-19 15:28:30 UTC)

Description Aman Agrawal 2023-11-08 08:45:06 UTC
Description of problem (please be as detailed as possible and provide log
snippets): The active hub was located at a neutral site.


Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-11-06-203803
advanced-cluster-management.v2.9.0-204
ACM 2.9.0-DOWNSTREAM-2023-11-03-14-27-40
Submariner brew.registry.redhat.io/rh-osbs/iib:615928
ODF 4.14.0-161
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
Latency 50ms RTT


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. On an RDR setup configured for hub recovery, deploy multiple workloads of both appset and subscription types, backed by RBD and CephFS. Failover (with nodes up) some of them to C2 and then back to C1. Relocate some of them to C2 and back to C1.
2. Leave a few workloads on C1 in the Deployed state (both types). Also deploy a few RBD and CephFS workloads on C2 and leave them in the Deployed state.
3. After each failover/relocate, ensure the progression reports Completed for all of them and that a new backup capturing the final state of all the workloads is taken on both the active and passive hub clusters.
4. Run all pre-checks, such as sync status, volumereplicationclass, ceph health, mirror status, lastGroupSyncTime, managedclusters -o wide status, alerts, odf pods, etc.
5. Collect drpc -o wide output from the active hub and then bring the active hub down.
6. Restore the backup on the passive hub and ensure both managed clusters are successfully imported.
7. Wait for the DRPolicy to get validated.
8. Check drpc -o wide on the passive hub and match it against the output taken from the active hub (a sketch of these commands follows the steps).
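For reference, a minimal sketch of the commands behind steps 4-8 (assuming standard oc access to each hub; only commands already named in the steps are shown):

# On the active hub, before bringing it down:
oc get drpc -o wide -A

# On the passive hub, after restoring the backup:
oc get managedclusters -o wide
oc get drpolicy
oc get drpc -o wide -A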


Actual results: Out of all the workloads, one of the appset-based CephFS workloads, which was in FailedOver state on the active hub, changed its progression from Completed to Cleaning Up on the passive hub.

From active hub- 

openshift-gitops       appset-cephfs1-placement-drpc           3h3m   amagrawa-m1-7nov   amagrawa-m1-7nov   Failover       FailedOver     Completed     2023-11-07T18:38:48Z   2m52.538302007s   True


From passive hub-

openshift-gitops       appset-cephfs1-placement-drpc           21m   amagrawa-m1-7nov   amagrawa-m1-7nov   Failover       FailedOver     Cleaning Up  


It's running in NS busybox-workloads-3
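(busybox-3 in the outputs below appears to be the reporter's shell alias; judging by the output it switches to the busybox-workloads-3 project and lists the PVCs, the VRG, and the pods, roughly equivalent to: oc get pvc,vrg,pod -o wide -n busybox-workloads-3.)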

C1 (amagrawa-m1-7nov)-

amagrawa:~$ busybox-3
Already on project "busybox-workloads-3" on server "https://api.amagrawa-m1-7nov.qe.rh-ocs.com:6443".
NAME                                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE    VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1    Bound    pvc-ee53ddcf-fbd7-495f-aa13-c23a31e61203   94Gi       RWX            ocs-storagecluster-cephfs   4h6m   Filesystem
persistentvolumeclaim/busybox-pvc-10   Bound    pvc-3ee14172-ee31-4c9f-9610-c0a858fdd427   87Gi       RWX            ocs-storagecluster-cephfs   4h6m   Filesystem
persistentvolumeclaim/busybox-pvc-11   Bound    pvc-8f155598-856c-4f4a-bd15-e3c0e4500d98   33Gi       RWX            ocs-storagecluster-cephfs   4h6m   Filesystem
persistentvolumeclaim/busybox-pvc-12   Bound    pvc-76d04364-6344-4cc6-b98e-5aad949421d1   147Gi      RWX            ocs-storagecluster-cephfs   4h6m   Filesystem
persistentvolumeclaim/busybox-pvc-13   Bound    pvc-915b7392-2852-4a1c-89f6-8a1b1a54f752   77Gi       RWX            ocs-storagecluster-cephfs   4h6m   Filesystem
persistentvolumeclaim/busybox-pvc-14   Bound    pvc-6e84ce0b-52f8-4d5d-ba64-b344783a9e70   70Gi       RWX            ocs-storagecluster-cephfs   4h6m   Filesystem
persistentvolumeclaim/busybox-pvc-15   Bound    pvc-c258d8ec-8bd4-4fa0-8234-d82186bd4cad   131Gi      RWX            ocs-storagecluster-cephfs   4h6m   Filesystem
persistentvolumeclaim/busybox-pvc-16   Bound    pvc-76f25ddc-d28d-400d-a9b3-72f0d6e4bc25   127Gi      RWX            ocs-storagecluster-cephfs   4h6m   Filesystem
persistentvolumeclaim/busybox-pvc-17   Bound    pvc-1b689b65-d21f-4f15-b826-a0f00fe5f05e   58Gi       RWX            ocs-storagecluster-cephfs   4h6m   Filesystem
persistentvolumeclaim/busybox-pvc-18   Bound    pvc-0f3b765f-2981-413d-95ef-6a956e9c5bb3   123Gi      RWX            ocs-storagecluster-cephfs   4h6m   Filesystem
persistentvolumeclaim/busybox-pvc-19   Bound    pvc-18bb8470-e1ab-4135-9586-fade1d01acfa   61Gi       RWX            ocs-storagecluster-cephfs   4h6m   Filesystem
persistentvolumeclaim/busybox-pvc-2    Bound    pvc-94a731b0-7460-453c-bb7e-3967f1d2f745   44Gi       RWX            ocs-storagecluster-cephfs   4h6m   Filesystem
persistentvolumeclaim/busybox-pvc-20   Bound    pvc-6acd5f0a-8c29-441b-8172-04b7b69497ed   33Gi       RWX            ocs-storagecluster-cephfs   4h6m   Filesystem
persistentvolumeclaim/busybox-pvc-3    Bound    pvc-ad829d43-edd6-40c0-9938-fc7c1ada3c00   76Gi       RWX            ocs-storagecluster-cephfs   4h6m   Filesystem
persistentvolumeclaim/busybox-pvc-4    Bound    pvc-ffcd0520-3a0b-4040-ba64-31b06279619e   144Gi      RWX            ocs-storagecluster-cephfs   4h6m   Filesystem
persistentvolumeclaim/busybox-pvc-5    Bound    pvc-abc844ef-4d0b-4a73-bb13-abdc5df2e172   107Gi      RWX            ocs-storagecluster-cephfs   4h6m   Filesystem
persistentvolumeclaim/busybox-pvc-6    Bound    pvc-81022044-f1cc-4dd0-a400-8f028b50970a   123Gi      RWX            ocs-storagecluster-cephfs   4h6m   Filesystem
persistentvolumeclaim/busybox-pvc-7    Bound    pvc-c7a1e894-6340-4afc-a912-18f967d27999   90Gi       RWX            ocs-storagecluster-cephfs   4h6m   Filesystem
persistentvolumeclaim/busybox-pvc-8    Bound    pvc-361b194d-2606-4871-b12a-3557d6e02da9   91Gi       RWX            ocs-storagecluster-cephfs   4h6m   Filesystem
persistentvolumeclaim/busybox-pvc-9    Bound    pvc-1b533598-db26-429e-a183-2cb9b7239edd   111Gi      RWX            ocs-storagecluster-cephfs   4h6m   Filesystem

NAME                                                                        DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/appset-cephfs1-placement-drpc   secondary      Secondary

NAME                              READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
pod/busybox-1-7f7bf8c5d9-skq4d    1/1     Running   0          81m   10.131.3.17    compute-5   <none>           <none>
pod/busybox-10-7b7bddddf8-pnfkj   1/1     Running   0          81m   10.131.1.93    compute-1   <none>           <none>
pod/busybox-11-6c4cf4bfb-k2rv2    1/1     Running   0          81m   10.131.3.15    compute-5   <none>           <none>
pod/busybox-12-7968f7d4bb-wtszw   1/1     Running   0          81m   10.129.2.206   compute-2   <none>           <none>
pod/busybox-13-674b97b564-ddwfn   1/1     Running   0          81m   10.129.2.205   compute-2   <none>           <none>
pod/busybox-14-f59899658-hhnh7    1/1     Running   0          81m   10.131.3.16    compute-5   <none>           <none>
pod/busybox-15-867dd79cbd-dr4t2   1/1     Running   0          81m   10.131.1.96    compute-1   <none>           <none>
pod/busybox-16-866d576d54-x4pxw   1/1     Running   0          81m   10.131.1.95    compute-1   <none>           <none>
pod/busybox-17-8d7df8b76-b5dkr    1/1     Running   0          81m   10.128.4.214   compute-3   <none>           <none>
pod/busybox-18-75cdf6f4c4-9t2rd   1/1     Running   0          81m   10.130.2.246   compute-4   <none>           <none>
pod/busybox-19-6bcbc84d68-2fmdf   1/1     Running   0          81m   10.129.2.204   compute-2   <none>           <none>
pod/busybox-2-5cffb67686-cv7bz    1/1     Running   0          81m   10.130.2.244   compute-4   <none>           <none>
pod/busybox-20-fdbd78dbd-bfv4b    1/1     Running   0          81m   10.128.4.212   compute-3   <none>           <none>
pod/busybox-3-7ffc7c8fbb-5krv6    1/1     Running   0          81m   10.131.1.94    compute-1   <none>           <none>
pod/busybox-4-66688c494b-zdwsz    1/1     Running   0          81m   10.129.2.203   compute-2   <none>           <none>
pod/busybox-5-56978ff94-lr4sb     1/1     Running   0          81m   10.131.3.14    compute-5   <none>           <none>
pod/busybox-6-57544b458b-5zntb    1/1     Running   0          81m   10.128.4.213   compute-3   <none>           <none>
pod/busybox-7-77ff998b8b-mj66t    1/1     Running   0          81m   10.130.2.245   compute-4   <none>           <none>
pod/busybox-8-6d5cdc5678-9ljxx    1/1     Running   0          81m   10.129.2.202   compute-2   <none>           <none>
pod/busybox-9-79c789995d-zs66w    1/1     Running   0          81m   10.129.2.207   compute-2   <none>           <none>


C2 (amagrawa-m2-7nov)-

amagrawa:~$ busybox-3
Already on project "busybox-workloads-3" on server "https://api.amagrawa-m2-7nov.qe.rh-ocs.com:6443".
NAME                                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE    VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1    Bound    pvc-1675b311-82f1-4d64-907c-71e63bae43d6   94Gi       RWX            ocs-storagecluster-cephfs   4h3m   Filesystem
persistentvolumeclaim/busybox-pvc-10   Bound    pvc-263924c4-89a6-49ca-a128-2fc44294e738   87Gi       RWX            ocs-storagecluster-cephfs   4h3m   Filesystem
persistentvolumeclaim/busybox-pvc-11   Bound    pvc-7fecdbf9-cd7e-47d0-8085-dca4b0391969   33Gi       RWX            ocs-storagecluster-cephfs   4h3m   Filesystem
persistentvolumeclaim/busybox-pvc-12   Bound    pvc-0c7f9f9b-5796-4b0f-9a11-7d885105b856   147Gi      RWX            ocs-storagecluster-cephfs   4h3m   Filesystem
persistentvolumeclaim/busybox-pvc-13   Bound    pvc-bd4ee9eb-07be-4ee1-a767-cd16284ef9a0   77Gi       RWX            ocs-storagecluster-cephfs   4h3m   Filesystem
persistentvolumeclaim/busybox-pvc-14   Bound    pvc-7e52d0be-7110-4535-8fba-3354d1f201ea   70Gi       RWX            ocs-storagecluster-cephfs   4h3m   Filesystem
persistentvolumeclaim/busybox-pvc-15   Bound    pvc-e9eb54ab-2736-4e71-8f39-d7ee3fc0d5b3   131Gi      RWX            ocs-storagecluster-cephfs   4h3m   Filesystem
persistentvolumeclaim/busybox-pvc-16   Bound    pvc-d6242a28-a9c5-4b69-8b97-82844b925b11   127Gi      RWX            ocs-storagecluster-cephfs   4h3m   Filesystem
persistentvolumeclaim/busybox-pvc-17   Bound    pvc-d9eebab7-45d2-4fea-bce4-944b5ea62286   58Gi       RWX            ocs-storagecluster-cephfs   4h3m   Filesystem
persistentvolumeclaim/busybox-pvc-18   Bound    pvc-02ff5b23-65f6-4858-a5f4-89eefe996228   123Gi      RWX            ocs-storagecluster-cephfs   4h3m   Filesystem
persistentvolumeclaim/busybox-pvc-19   Bound    pvc-cdcbb948-1cbe-4450-a316-7f58049c0847   61Gi       RWX            ocs-storagecluster-cephfs   4h3m   Filesystem
persistentvolumeclaim/busybox-pvc-2    Bound    pvc-9a2e8fe2-b9fa-4302-875d-892f8e40f005   44Gi       RWX            ocs-storagecluster-cephfs   4h3m   Filesystem
persistentvolumeclaim/busybox-pvc-20   Bound    pvc-4cb1565c-ae78-4149-bff9-635a9cb5e7b3   33Gi       RWX            ocs-storagecluster-cephfs   4h3m   Filesystem
persistentvolumeclaim/busybox-pvc-3    Bound    pvc-80b59ec4-7538-44cb-ba0d-23cef425d3f6   76Gi       RWX            ocs-storagecluster-cephfs   4h3m   Filesystem
persistentvolumeclaim/busybox-pvc-4    Bound    pvc-da4b7af5-f8e4-4570-bd8f-b846eba8ca37   144Gi      RWX            ocs-storagecluster-cephfs   4h3m   Filesystem
persistentvolumeclaim/busybox-pvc-5    Bound    pvc-9cc23ce8-9972-4af0-8567-edf5c23380b4   107Gi      RWX            ocs-storagecluster-cephfs   4h3m   Filesystem
persistentvolumeclaim/busybox-pvc-6    Bound    pvc-48e1e666-a0fb-4e05-afec-284341518040   123Gi      RWX            ocs-storagecluster-cephfs   4h3m   Filesystem
persistentvolumeclaim/busybox-pvc-7    Bound    pvc-7ba675ed-620d-40e8-b647-4ced9705c03f   90Gi       RWX            ocs-storagecluster-cephfs   4h3m   Filesystem
persistentvolumeclaim/busybox-pvc-8    Bound    pvc-aa7c324b-34dd-4ce7-a98b-027333166369   91Gi       RWX            ocs-storagecluster-cephfs   4h3m   Filesystem
persistentvolumeclaim/busybox-pvc-9    Bound    pvc-71d43337-a656-4747-b7e3-2dfbd6a99286   111Gi      RWX            ocs-storagecluster-cephfs   4h3m   Filesystem

NAME                                                                        DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/appset-cephfs1-placement-drpc   secondary      Secondary

NAME                                             READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-nd7fg    1/1     Running   0          4m55s   10.128.2.131   compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-10-j8p5j   1/1     Running   0          5m4s    10.131.1.11    compute-2   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-11-kw5n7   1/1     Running   0          4m45s   10.128.2.136   compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-12-dm4bh   1/1     Running   0          5m4s    10.131.1.10    compute-2   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-13-69fxm   1/1     Running   0          4m54s   10.131.1.12    compute-2   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-14-6cr48   1/1     Running   0          5m1s    10.128.2.130   compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-15-bwhp7   1/1     Running   0          5m10s   10.131.1.9     compute-2   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-16-pxmc8   1/1     Running   0          4m52s   10.128.2.134   compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-17-7gq8d   1/1     Running   0          4m43s   10.131.1.14    compute-2   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-18-6w6rv   1/1     Running   0          4m43s   10.128.2.137   compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-19-hnjv4   1/1     Running   0          5m16s   10.128.2.127   compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-2-gx2qs    1/1     Running   0          5m1s    10.128.2.129   compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-20-76tz4   1/1     Running   0          4m46s   10.128.2.135   compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-3-7wlwx    1/1     Running   0          4m40s   10.131.1.15    compute-2   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-4-2fq79    1/1     Running   0          4m54s   10.128.2.132   compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-5-m8b9z    1/1     Running   0          5m10s   10.128.2.128   compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-6-5m5dr    1/1     Running   0          5m13s   10.131.1.8     compute-2   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-7-wzqjk    1/1     Running   0          4m53s   10.128.2.133   compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-8-rx4sz    1/1     Running   0          4m33s   10.131.1.16    compute-2   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-9-8hmsz    1/1     Running   0          4m52s   10.131.1.13    compute-2   <none>           <none>


The VRG should be primary on C1; however, the VRGs on both sides were marked as secondary. If we look at the pods, src pods are being created on C1 and dst pods on C2, which is fine.
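A quick way to confirm the mismatch (a sketch; it assumes the VRG fields backing the DESIREDSTATE/CURRENTSTATE columns above are spec.replicationState and status.state):

# Run against each managed cluster:
oc get volumereplicationgroup appset-cephfs1-placement-drpc -n busybox-workloads-3 \
  -o jsonpath='{.spec.replicationState}{"\t"}{.status.state}{"\n"}'
# A healthy failed-over workload should report primary/Primary on C1; here both clusters report secondary/Secondary.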

Since it goes to the Cleaning Up state, further failover/relocate cannot be performed and data sync stops for this workload.
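To watch whether the workload ever leaves this state, the progression can be polled directly (a sketch; it assumes the status.progression field behind the PROGRESSION column of drpc -o wide):

oc get drpc appset-cephfs1-placement-drpc -n openshift-gitops \
  -o jsonpath='{.status.progression}{"\n"}'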


Expected results: The workload should be reconciled to the right state post hub recovery. 


Additional info:

Comment 12 errata-xmlrpc 2024-03-19 15:28:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

