Bug 2276222 - [RDR] [Hub recovery] [Co-situated] Primary workloads become secondary, UI also shows incorrect information [NEEDINFO]
Keywords:
Status: ON_QA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: Benamar Mekhissi
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2024-04-20 18:10 UTC by Aman Agrawal
Modified: 2024-06-13 13:24 UTC
CC: 3 users

Fixed In Version: 4.16.0-126
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
bmekhiss: needinfo?


Attachments


Links:
Github RamenDR ramen pull 1440 (open): Ignore Secondary VRG Post-HubRecovery to Ensure Proper Reconciliation (last updated 2024-06-05 12:16:55 UTC)
Github red-hat-storage ramen pull 296 (open): Bug 2276222: Ignore Secondary VRG Post-HubRecovery to Ensure Proper Reconciliation (last updated 2024-06-11 14:05:22 UTC)

Description Aman Agrawal 2024-04-20 18:10:49 UTC
Created attachment 2028078: Image-1

Description of problem (please be as detailed as possible and provide log snippets):


Version of all relevant components (if applicable):

ACM 2.10.1 GA'ed
MCE 2.5.2
ODF 4.15.1-1
ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)
OCP 4.15.0-0.nightly-2024-04-07-120427
Submariner 0.17.0 GA'ed
VolSync 0.9.1

Platform: VMware

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
*****Active hub co-situated with primary managed cluster*****

1. With multiple workloads (RBD and CephFS) of both Subscription and ApplicationSet (pull model) types, in different states (Deployed, FailedOver, Relocated), running on the primary managed cluster (C1), bring C1 down along with the active hub during a site failure at site-1, then perform hub recovery and move to the passive hub at site-2 (which is co-situated with the secondary managed cluster C2).
2. Ensure the available managed cluster C2 is successfully imported on the RHACM console of the passive hub, and the DRPolicy gets validated.
3. After the DRPC is restored, fail over all the workloads to the available managed cluster C2.
4. When failover is successful, recover the down managed cluster C1 and ensure it is successfully cleaned up.
5. Let IOs continue for some time, then configure another hub cluster at site-1 in order to perform hub recovery one more time.
6. Deploy 1 RBD appset (pull)/sub workload and 1 CephFS appset (pull)/sub workload on C1 and fail them over to C2 (with both managed clusters up and running).
7. Now relocate some of the older workloads to managed cluster C1 (the cluster that was recovered post disaster) and leave the remaining workloads as they are on C2 in the FailedOver state.
8. After successful relocation and cleanup, ensure new backups are taken, and then perform hub recovery one more time by bringing down the current active hub at site-2 together with cluster C1 at site-1. When moved to the new hub at site-1, ensure the available managed cluster C2 is successfully imported on the RHACM console of the passive hub, and the DRPolicy gets validated.
9. When the DRPC is restored, check the Pods/PVCs/VRs/VRGs for the workloads that were running on the available cluster C2. Check their last action status on the RHACM console and try to fail them over (a verification sketch follows these steps).
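
A minimal verification sketch for step 9, using the same commands whose outputs appear under Actual results below (busybox-workloads-13 is just one example namespace from this setup):

# On the new active hub: DRPC state for all protected workloads
oc get drpc -o wide -A

# On the surviving managed cluster C2: per-workload DR resources
oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-13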


Actual results: 

Hub-

For steps 8 and 9: the DRPolicy was validated on the new hub at site-1 at around
amanagrawal@Amans-MacBook-Pro ~ % date -u
Sat Apr 20 12:48:46 UTC 2024


amanagrawal@Amans-MacBook-Pro ~ % oc get drpc -o wide -A|grep -v Cleaning
NAMESPACE              NAME                                     AGE     PREFERREDCLUSTER    FAILOVERCLUSTER     DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION   PEER READY
busybox-workloads-13   cephfs-sub-busybox13-placement-1-drpc    4h36m   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover                      Paused                                          False
busybox-workloads-14   cephfs-sub-busybox14-placement-1-drpc    4h36m   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover                      Paused                                          False
busybox-workloads-15   cephfs-sub-busybox15-placement-1-drpc    4h36m   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover                      Paused                                          False
busybox-workloads-16   cephfs-sub-busybox16-placement-1-drpc    4h36m   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover                      Paused                                          False
busybox-workloads-23   cephfs-sub-busybox23-placement-1-drpc    4h36m   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover                      Paused                                          False
openshift-gitops       cephfs-appset-busybox21-placement-drpc   4h36m   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover                      Paused                                          False
openshift-gitops       cephfs-appset-busybox5-placement-drpc    4h36m   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover                      Paused                                          False
openshift-gitops       cephfs-appset-busybox6-placement-drpc    4h36m   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover                      Paused                                          False
openshift-gitops       cephfs-appset-busybox8-placement-drpc    4h36m   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover                      Paused                                          False


All of these workloads should be Primary on C2; however, they are marked as Secondary (while C1 is down).
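
A compact way to confirm the VRG state across all affected namespaces on C2 in one shot (a sketch using standard oc custom-columns; the spec.replicationState and status.state field paths are my assumption, inferred from the DESIREDSTATE/CURRENTSTATE columns printed in the outputs below):

oc get vrg -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,DESIRED:.spec.replicationState,CURRENT:.status.state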


C2-


amanagrawal@Amans-MacBook-Pro c2 % oc get applications -A
NAMESPACE          NAME                                        SYNC STATUS   HEALTH STATUS
openshift-gitops   cephfs-appset-busybox25-amagrawa-c2-13apr   Synced        Healthy
openshift-gitops   rbd-appset-busybox1-amagrawa-c2-13apr       Synced        Healthy
openshift-gitops   rbd-appset-busybox2-amagrawa-c2-13apr       Synced        Healthy
openshift-gitops   rbd-appset-busybox22-amagrawa-c2-13apr      Synced        Healthy
openshift-gitops   rbd-appset-busybox26-amagrawa-c2-13apr      Synced        Healthy
openshift-gitops   rbd-appset-busybox3-amagrawa-c2-13apr       Synced        Healthy
openshift-gitops   rbd-appset-busybox4-amagrawa-c2-13apr       Synced        Healthy


amanagrawal@Amans-MacBook-Pro c2 % oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-13 

NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE    VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-dea0dafa-3256-4127-9907-fd17db157162   94Gi       RWX            ocs-storagecluster-cephfs   5d9h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox13-placement-1-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-xfgx4   1/1     Running   0          4h22m   10.128.3.127   compute-1   <none>           <none>




amanagrawal@Amans-MacBook-Pro c2 % oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-14
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE    VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-0be4e225-2cef-463a-bebc-aa2d4792c415   94Gi       RWX            ocs-storagecluster-cephfs   5d9h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox14-placement-1-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-gpmwk   1/1     Running   0          4h24m   10.128.3.120   compute-1   <none>           <none>






amanagrawal@Amans-MacBook-Pro c2 % oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-15
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE    VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-712e358d-254e-4883-afa7-16f615ddcba8   94Gi       RWX            ocs-storagecluster-cephfs   5d9h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox15-placement-1-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-p784q   1/1     Running   0          4h29m   10.128.3.122   compute-1   <none>           <none>





amanagrawal@Amans-MacBook-Pro c2 % oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-16
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE    VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-784774a0-9e8f-4161-be10-014656e40dd4   94Gi       RWX            ocs-storagecluster-cephfs   5d9h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox16-placement-1-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-cgz69   1/1     Running   0          4h29m   10.128.3.128   compute-1   <none>           <none>





amanagrawal@Amans-MacBook-Pro c2 % oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-23
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE    VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-359661fa-407e-47c9-a52a-c0eddb0c13a7   94Gi       RWX            ocs-storagecluster-cephfs   3d3h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox23-placement-1-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-xjktv   1/1     Running   0          4h29m   10.128.3.126   compute-1   <none>           <none>





amanagrawal@Amans-MacBook-Pro c2 % oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-21
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE    VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-2fc0bad5-fdeb-4456-8963-9beb185ed0df   94Gi       RWX            ocs-storagecluster-cephfs   3d3h   Filesystem

NAME                                                                                 DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset-busybox21-placement-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-crf9m   1/1     Running   0          4h30m   10.128.3.118   compute-1   <none>           <none>





amanagrawal@Amans-MacBook-Pro c2 % oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-5 
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE     VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-3e82eb80-4edb-43b0-bc8a-4d5e84b9cd5c   94Gi       RWX            ocs-storagecluster-cephfs   6d21h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset-busybox5-placement-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-pbtqq   1/1     Running   0          4h30m   10.128.3.121   compute-1   <none>           <none>





amanagrawal@Amans-MacBook-Pro c2 % oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-6
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE     VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-3f7fd931-f881-4b79-9f6b-b39b606fb3b8   94Gi       RWX            ocs-storagecluster-cephfs   6d21h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset-busybox6-placement-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-jkq62   1/1     Running   0          4h30m   10.128.3.119   compute-1   <none>           <none>





amanagrawal@Amans-MacBook-Pro c2 % oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-8
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE     VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-128441d5-0eb2-457e-aa0b-5f5052e96939   94Gi       RWX            ocs-storagecluster-cephfs   6d21h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset-busybox8-placement-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-jmx5h   1/1     Running   0          4h30m   10.128.3.123   compute-1   <none>           <none>


Please note: I only had CephFS workloads on C2 in this case.

Also, there are two issues in the UI:

1. The Failover/Relocate status for these workloads is empty, meaning the UI interprets that no action was performed on these workloads; however, the drpc output above shows that their prior action was Failover.


Screencast: https://drive.google.com/file/d/1z1TeBeS3MZU9-4BUFIf4JV-tH3webwe2/view?usp=sharing
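
The last requested action is recorded in the DRPC spec, which is presumably what the UI should surface here; a minimal sketch to list it per workload (standard oc JSONPath; the spec.action field path follows the Ramen DRPC CRD and is my assumption):

oc get drpc -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.action}{"\n"}{end}'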


2. PEER READY is marked False (which is correct, as these workloads should be Primary on C2 and their peer cluster is C1, which is down), so we cannot fail these workloads over from the UI.

But if we still try to fail them over, the Target cluster for appsets defaults to amagrawa-c2-13apr (C2), which is incorrect.

<<Image-1>>

It should actually be cluster C1, which is down.
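
The PEER READY column maps to a condition on the DRPC status; a sketch to inspect it directly for one of the affected workloads (assuming the condition type is named PeerReady, as the column label suggests):

oc get drpc cephfs-sub-busybox13-placement-1-drpc -n busybox-workloads-13 -o jsonpath='{.status.conditions[?(@.type=="PeerReady")].status}'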


For subscription workloads, since the selection is manual, I will attach screenshots of each cluster selected as the target cluster for better understanding.

<<Image-2>> and <<Image-3>>


Expected results: 
1. Workloads should retain their original state (they should be Primary on cluster C2).
2. The UI should show the correct information about the failover status.
3. The UI should show the correct target cluster selection for apps running on cluster C2.


Additional info:

Comment 11 Benamar Mekhissi 2024-04-29 11:49:32 UTC
Aside from the workaround, we'll provide a fix in 4.16.

Comment 16 Aman Agrawal 2024-05-15 09:27:22 UTC
As discussed, proposing it back to 4.16

