Created attachment 2028078 [details]
Image-1

Description of problem (please be as detailed as possible and provide log snippets):

Version of all relevant components (if applicable):
ACM 2.10.1 GA'ed
MCE 2.5.2
ODF 4.15.1-1
ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)
OCP 4.15.0-0.nightly-2024-04-07-120427
Submariner 0.17.0 GA'ed
VolSync 0.9.1
Platform- VMware

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:

*****Active hub co-situated with primary managed cluster*****

1. With multiple workloads (RBD and CephFS) of both Subscription and ApplicationSet (pull model) types in different states (Deployed, FailedOver, Relocated) running on the primary managed cluster (C1), bring down C1 along with the active hub during a site failure at site-1, then perform hub recovery and move to the passive hub at site-2 (which is co-situated with the secondary managed cluster C2).
2. Ensure the available managed cluster C2 is successfully imported on the RHACM console of the passive hub, and that the DRPolicy gets validated.
3. After the DRPC is restored, failover all the workloads to the available managed cluster C2.
4. When failover is successful, recover the down managed cluster C1 and ensure it is successfully cleaned up.
5. Let IOs continue for some time and configure another hub cluster at site-1 to perform hub recovery one more time.
6. Deploy 1 RBD appset (pull)/sub and 1 CephFS appset (pull)/sub workload on C1 and failover them to C2 (with both managed clusters up and running).
7. Relocate some of the older workloads to managed cluster C1 (the cluster which was recovered post disaster) and leave the remaining workloads as-is on C2 in the FailedOver state.
8. After successful relocate and cleanup, ensure new backups are taken and then perform hub recovery once more by bringing down the current active hub at site-2 and the C1 cluster at site-1. After moving to the new hub at site-1, ensure the available managed cluster C2 is successfully imported on the RHACM console of the passive hub, and that the DRPolicy gets validated.
9. When the DRPC is restored, check the Pods/PVCs/VRs/VRG for the workloads which were running on the available cluster C2. Check their last action status on the RHACM console and try to failover them (a sketch of the CLI checks and the failover trigger follows these steps).
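For reference, a minimal sketch of the CLI checks and the failover trigger used in the steps above (a sketch only; the DRPC patch assumes the standard Ramen spec.action/spec.failoverCluster fields, and the cluster and workload names are the ones from this test bed):

# On the passive hub, after restore: confirm cluster import, DRPolicy validation and DRPC restore
oc get managedcluster
oc get drpolicy
oc get drpc -o wide -A

# On the managed cluster: per-workload resources checked in the last step
oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-13

# One way to drive a failover from the CLI (equivalent to the UI action), assuming the DRPC spec fields above
oc patch drpc cephfs-sub-busybox13-placement-1-drpc -n busybox-workloads-13 \
  --type merge -p '{"spec":{"action":"Failover","failoverCluster":"amagrawa-c2-13apr"}}'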
Actual results:

Hub-

For step 8, DRPolicy was validated on the new hub at site-1 around:

amanagrawal@Amans-MacBook-Pro ~ % date -u
Sat Apr 20 12:48:46 UTC 2024

amanagrawal@Amans-MacBook-Pro ~ % oc get drpc -o wide -A|grep -v Cleaning
NAMESPACE              NAME                                     AGE     PREFERREDCLUSTER    FAILOVERCLUSTER     DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME   DURATION   PEER READY
busybox-workloads-13   cephfs-sub-busybox13-placement-1-drpc    4h36m   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover                      Paused                                False
busybox-workloads-14   cephfs-sub-busybox14-placement-1-drpc    4h36m   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover                      Paused                                False
busybox-workloads-15   cephfs-sub-busybox15-placement-1-drpc    4h36m   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover                      Paused                                False
busybox-workloads-16   cephfs-sub-busybox16-placement-1-drpc    4h36m   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover                      Paused                                False
busybox-workloads-23   cephfs-sub-busybox23-placement-1-drpc    4h36m   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover                      Paused                                False
openshift-gitops       cephfs-appset-busybox21-placement-drpc   4h36m   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover                      Paused                                False
openshift-gitops       cephfs-appset-busybox5-placement-drpc    4h36m   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover                      Paused                                False
openshift-gitops       cephfs-appset-busybox6-placement-drpc    4h36m   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover                      Paused                                False
openshift-gitops       cephfs-appset-busybox8-placement-drpc    4h36m   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover                      Paused                                False

All these workloads should be Primary on C2; however, they are marked as Secondary (while C1 is down).

C2-

amanagrawal@Amans-MacBook-Pro c2 % oc get applications -A
NAMESPACE          NAME                                        SYNC STATUS   HEALTH STATUS
openshift-gitops   cephfs-appset-busybox25-amagrawa-c2-13apr   Synced        Healthy
openshift-gitops   rbd-appset-busybox1-amagrawa-c2-13apr       Synced        Healthy
openshift-gitops   rbd-appset-busybox2-amagrawa-c2-13apr       Synced        Healthy
openshift-gitops   rbd-appset-busybox22-amagrawa-c2-13apr      Synced        Healthy
openshift-gitops   rbd-appset-busybox26-amagrawa-c2-13apr      Synced        Healthy
openshift-gitops   rbd-appset-busybox3-amagrawa-c2-13apr       Synced        Healthy
openshift-gitops   rbd-appset-busybox4-amagrawa-c2-13apr       Synced        Healthy

amanagrawal@Amans-MacBook-Pro c2 % oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-13
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE    VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-dea0dafa-3256-4127-9907-fd17db157162   94Gi       RWX            ocs-storagecluster-cephfs   5d9h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox13-placement-1-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-xfgx4   1/1     Running   0          4h22m   10.128.3.127   compute-1   <none>           <none>

amanagrawal@Amans-MacBook-Pro c2 % oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-14
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE    VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-0be4e225-2cef-463a-bebc-aa2d4792c415   94Gi       RWX            ocs-storagecluster-cephfs   5d9h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox14-placement-1-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-gpmwk   1/1     Running   0          4h24m   10.128.3.120   compute-1   <none>           <none>

amanagrawal@Amans-MacBook-Pro c2 % oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-15
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE    VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-712e358d-254e-4883-afa7-16f615ddcba8   94Gi       RWX            ocs-storagecluster-cephfs   5d9h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox15-placement-1-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-p784q   1/1     Running   0          4h29m   10.128.3.122   compute-1   <none>           <none>

amanagrawal@Amans-MacBook-Pro c2 % oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-16
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE    VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-784774a0-9e8f-4161-be10-014656e40dd4   94Gi       RWX            ocs-storagecluster-cephfs   5d9h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox16-placement-1-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-cgz69   1/1     Running   0          4h29m   10.128.3.128   compute-1   <none>           <none>

amanagrawal@Amans-MacBook-Pro c2 % oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-23
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE    VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-359661fa-407e-47c9-a52a-c0eddb0c13a7   94Gi       RWX            ocs-storagecluster-cephfs   3d3h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox23-placement-1-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-xjktv   1/1     Running   0          4h29m   10.128.3.126   compute-1   <none>           <none>

amanagrawal@Amans-MacBook-Pro c2 % oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-21
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE    VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-2fc0bad5-fdeb-4456-8963-9beb185ed0df   94Gi       RWX            ocs-storagecluster-cephfs   3d3h   Filesystem

NAME                                                                                 DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset-busybox21-placement-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-crf9m   1/1     Running   0          4h30m   10.128.3.118   compute-1   <none>           <none>

amanagrawal@Amans-MacBook-Pro c2 % oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-5
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE     VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-3e82eb80-4edb-43b0-bc8a-4d5e84b9cd5c   94Gi       RWX            ocs-storagecluster-cephfs   6d21h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset-busybox5-placement-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-pbtqq   1/1     Running   0          4h30m   10.128.3.121   compute-1   <none>           <none>

amanagrawal@Amans-MacBook-Pro c2 % oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-6
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE     VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-3f7fd931-f881-4b79-9f6b-b39b606fb3b8   94Gi       RWX            ocs-storagecluster-cephfs   6d21h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset-busybox6-placement-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-jkq62   1/1     Running   0          4h30m   10.128.3.119   compute-1   <none>           <none>

amanagrawal@Amans-MacBook-Pro c2 % oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-8
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE     VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-128441d5-0eb2-457e-aa0b-5f5052e96939   94Gi       RWX            ocs-storagecluster-cephfs   6d21h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset-busybox8-placement-drpc   secondary      Secondary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-jmx5h   1/1     Running   0          4h30m   10.128.3.123   compute-1   <none>           <none>

Please note, I only had CephFS workloads on C2 in this case.

Also, there are two issues on the UI:

1. The Failover/Relocate status for these workloads is empty, meaning the UI interprets that no action was performed on these workloads; however, the drpc output above shows their prior action was Failover.

Screencast- https://drive.google.com/file/d/1z1TeBeS3MZU9-4BUFIf4JV-tH3webwe2/view?usp=sharing

2. PEER READY is marked as False (which is correct, as these workloads should be Primary on C2 and their peer cluster is C1, which is down), so we cannot failover these workloads from the UI. But if we still try to failover them, the Target cluster is by default set to amagrawa-c2-13apr (C2) for appsets, which is incorrect. <<Image-1>> It should actually be cluster C1, which is down. For subscriptions, since the selection is manual, I will attach screenshots for each cluster selection as target cluster for better understanding. <<Image-2>> and <<Image-3>>

Expected results:
1. Workloads should retain their original state (they should be Primary on cluster C2)
2. UI should show the correct information about failover status
3. UI should show the correct Target cluster selection for apps running on cluster C2

Additional info:
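For anyone triaging, a quick way to compare what the hub records against what the surviving managed cluster reports (a sketch only; the jsonpath field names spec.action, spec.replicationState and status.state are assumptions based on the Ramen CRDs backing the columns shown above):

# On the hub: last action and target cluster recorded on one of the affected DRPCs
oc get drpc cephfs-sub-busybox13-placement-1-drpc -n busybox-workloads-13 \
  -o jsonpath='{.spec.action}{" -> "}{.spec.failoverCluster}{"\n"}'

# On C2: desired vs. current replication state of the VRG (expected Primary here, observed Secondary)
oc get vrg -n busybox-workloads-13 \
  -o jsonpath='{range .items[*]}{.metadata.name}{" spec="}{.spec.replicationState}{" status="}{.status.state}{"\n"}{end}'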
Aside from the workaround, we'll provide a fix in 4.16.
As discussed, proposing to move this back to 4.16.