Bug 2247537 - [RDR] [UI] [Hub recovery] Current UI cannot initiate failover of workloads which were in any other state than deployed before hub recovery was performed
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: Benamar Mekhissi
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-11-01 21:14 UTC by Aman Agrawal
Modified: 2024-05-06 12:39 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-05-06 12:39:06 UTC
Embargoed:



Description Aman Agrawal 2023-11-01 21:14:30 UTC
Description of problem (please be as detailed as possible and provide log
snippets):


Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-10-30-170011
advanced-cluster-management.v2.9.0-188 
ODF 4.14.0-157
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
ACM 2.9.0-DOWNSTREAM-2023-10-18-17-59-25
Submariner brew.registry.redhat.io/rh-osbs/iib:607438


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. On a hub recovery RDR setup, ensure backups are being created on the active and passive hub clusters. Fail over and relocate different workloads so that they are running on the primary managed cluster after the failover and relocate operations complete. Ensure the latest backups are taken and that no action on any of the workloads (cephfs or rbd, appset or subscription type) is in progress.
2. Collect the drpc status. Bring the primary managed cluster down, and then bring the active hub down.
3. Ensure the secondary managed cluster is properly imported on the passive hub and that the DRPolicy then gets validated.
4. Check the drpc status from the passive hub and compare it with the output taken from the active hub while it was up. After hub recovery, a sanity check runs for all workloads that were failed over or relocated: the action last performed from the active hub is performed again on those workloads, which marks Peer Ready as false for them.

From the active hub:

NAMESPACE             NAME                                   AGE   PREFERREDCLUSTER    FAILOVERCLUSTER     DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION             PEER READY
busybox-workloads-2   subscription-cephfs-placement-1-drpc   9h    amagrawa-31o-prim   amagrawa-passivee   Relocate       Relocated      Completed     2023-11-01T17:54:21Z   30.282249722s        True
busybox-workloads-5   subscription-rbd1-placement-1-drpc     9h    amagrawa-31o-prim   amagrawa-31o-prim   Failover       FailedOver     Completed     2023-11-01T13:57:37Z   47m3.364814169s      True
busybox-workloads-6   subscription-rbd2-placement-1-drpc     9h    amagrawa-31o-prim   amagrawa-passivee   Relocate       Relocated      Completed     2023-11-01T14:16:28Z   3h17m50.318760845s   True
openshift-gitops      appset-cephfs-placement-drpc           9h    amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     Completed     2023-11-01T13:20:45Z   5m59.4021061s        True
openshift-gitops      appset-rbd1-placement-drpc             9h    amagrawa-31o-prim   amagrawa-31o-prim   Failover       FailedOver     Completed     2023-11-01T14:15:30Z   41m2.588884417s      True
openshift-gitops      appset-rbd2-placement-drpc             9h    amagrawa-passivee                                      Deployed       Completed                                                 True


From the passive hub:

amagrawa:~$ drpc
NAMESPACE             NAME                                   AGE   PREFERREDCLUSTER    FAILOVERCLUSTER     DESIREDSTATE   CURRENTSTATE   PROGRESSION                           START TIME             DURATION   PEER READY
busybox-workloads-2   subscription-cephfs-placement-1-drpc   57m   amagrawa-31o-prim   amagrawa-passivee   Relocate       Relocating                                           2023-11-01T18:59:35Z              False
busybox-workloads-5   subscription-rbd1-placement-1-drpc     57m   amagrawa-31o-prim   amagrawa-31o-prim   Failover       FailingOver    WaitForStorageMaintenanceActivation   2023-11-01T18:59:36Z              False
busybox-workloads-6   subscription-rbd2-placement-1-drpc     57m   amagrawa-31o-prim   amagrawa-passivee   Relocate                                                                                              True
openshift-gitops      appset-cephfs-placement-drpc           57m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     EnsuringVolSyncSetup                                                    True
openshift-gitops      appset-rbd1-placement-drpc             57m   amagrawa-31o-prim   amagrawa-31o-prim   Failover       FailingOver    FailingOverToCluster                  2023-11-01T18:59:36Z              False
openshift-gitops      appset-rbd2-placement-drpc             57m   amagrawa-passivee                                      Deployed       Completed                                                               True


Since Peer Ready is now marked as false by the sanity check, subscription-cephfs-placement-1-drpc, subscription-rbd1-placement-1-drpc, and appset-rbd1-placement-drpc cannot be failed over in this example.
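
To spot the affected workloads without reading the table by eye, the drpc listing can be parsed by column offset. The following is only a rough sketch: the helper names `parse_drpc_table` and `not_peer_ready` are made up for illustration, and it assumes the input starts at the header line and keeps the fixed-width alignment of the original command output.

```python
# Column titles as they appear in the drpc listing.
COLUMNS = [
    "NAMESPACE", "NAME", "AGE", "PREFERREDCLUSTER", "FAILOVERCLUSTER",
    "DESIREDSTATE", "CURRENTSTATE", "PROGRESSION", "START TIME",
    "DURATION", "PEER READY",
]


def parse_drpc_table(text, columns=COLUMNS):
    """Parse fixed-width drpc output into a list of dicts keyed by column title."""
    lines = [line for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]

    # Locate each column by the offset of its title in the header line.
    # Searching from the end of the previous title keeps "NAME" from
    # matching inside "NAMESPACE".
    starts, pos = [], 0
    for title in columns:
        pos = header.index(title, pos)
        starts.append(pos)
        pos += len(title)
    ends = starts[1:] + [None]

    # Slice every row at the same offsets so that empty cells stay empty
    # instead of shifting later values to the left.
    return [
        {title: row[s:e].strip() for title, s, e in zip(columns, starts, ends)}
        for row in rows
    ]


def not_peer_ready(rows):
    """Return (namespace, name) for every DRPC whose PEER READY is not True."""
    return [(r["NAMESPACE"], r["NAME"]) for r in rows if r["PEER READY"] != "True"]
```

Slicing by header offset rather than splitting on whitespace matters here because several cells (CURRENTSTATE, PROGRESSION, START TIME, DURATION) are blank after hub recovery.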

This sanity check is needed per the k8s recommended guidelines, and we should not back up the current state of the workloads, as confirmed by @bmekhiss, so the issue will always persist.

As of now, the only option is to trigger a failover by editing the drpc YAML from the CLI. Hence a force failover option is needed in the UI for this case, with a caution that it may cause data loss or data corruption, which would need to be tested.
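
For reference, the CLI route could look roughly like the sketch below, which only builds the `oc patch` command and does not run it. It rests on assumptions not stated in this report: that the DRPC spec exposes `action` and `failoverCluster` fields and that `drpc` resolves as a resource name on the hub (check with `oc explain drpc.spec`). The function name `build_failover_command` is hypothetical. The same data loss and data corruption caution applies.

```python
import json


def build_failover_command(namespace, name, failover_cluster):
    """Return the argv for an `oc patch` that requests a failover on a DRPC.

    Assumption: the DRPC spec has `action` and `failoverCluster` fields;
    confirm this on the cluster before using the command.
    """
    patch = {"spec": {"action": "Failover", "failoverCluster": failover_cluster}}
    return [
        "oc", "patch", "drpc", name,
        "-n", namespace,
        "--type", "merge",
        "-p", json.dumps(patch),
    ]
```

The returned list can be reviewed and then passed to `subprocess.run(cmd, check=True)` on the passive hub, with the placeholder arguments replaced by the real DRPC name, its namespace, and the surviving managed cluster.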

Currently this is blocked by BZ2246084, which we were able to reproduce again and which will be updated later.


Actual results: The current UI cannot initiate failover of workloads that were in any state other than Deployed before hub recovery was performed

Expected results: Allow a force failover of workloads post hub recovery where peer ready is false



Additional info:

Comment 6 Aman Agrawal 2024-01-16 09:11:17 UTC
Tested with
OCP 4.15.0-0.nightly-2024-01-03-015912
ACM GA'ed 2.9.1
ODF 4.15.0-104
ceph version 17.2.6-167.el9cp (5ef1496ea3e9daaa9788809a172bd5a1c3192cf7) quincy (stable)

Active hub co-located with the primary managed cluster

After the site failure and the move to the passive hub, Peer Ready was set to True for all the workloads, which means we could trigger a failover operation via the ACM console.


from active hub

amagrawa:hub$ drpc
NAMESPACE              NAME                                     AGE     PREFERREDCLUSTER   FAILOVERCLUSTER    DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
busybox-workloads-1    cephfs-sub-busybox1-placement-1-drpc     8d      amagrawa-c1-3jan   amagrawa-c1-3jan   Failover       FailedOver     Completed     2024-01-12T12:18:52Z   2m55.286783085s   True
busybox-workloads-10   rbd-sub-busybox10-placement-1-drpc       7h11m   amagrawa-c1-3jan                                     Deployed       Completed     2024-01-12T09:17:04Z   2.056054748s      True
busybox-workloads-11   rbd-sub-busybox11-placement-1-drpc       7h10m   amagrawa-c1-3jan                                     Deployed       Completed     2024-01-12T09:17:49Z   21.043985378s     True
busybox-workloads-12   rbd-sub-busybox12-placement-1-drpc       7h8m    amagrawa-c2-3jan                                     Deployed       Completed     2024-01-12T09:19:19Z   88.077165ms       True
busybox-workloads-15   cephfs-sub-busybox15-placement-1-drpc    7h4m    amagrawa-c2-3jan                                     Deployed       Completed     2024-01-12T09:23:37Z   52.119470621s     True
busybox-workloads-2    rbd-sub-busybox2-placement-1-drpc        8d      amagrawa-c1-3jan   amagrawa-c1-3jan   Failover       FailedOver     Completed     2024-01-12T12:19:42Z   6m5.284279162s    True
busybox-workloads-5    rbd-sub-busybox5-placement-1-drpc        4d5h    amagrawa-c1-3jan                      Relocate       Relocated      Completed     2024-01-12T12:19:49Z   2m58.152248084s   True
busybox-workloads-6    cephfs-sub-busybox6-placement-1-drpc     7h18m   amagrawa-c1-3jan                      Relocate       Relocated      Completed     2024-01-12T12:18:59Z   2m53.478551336s   True
busybox-workloads-7    cephfs-sub-busybox7-placement-1-drpc     7h16m   amagrawa-c1-3jan                                     Deployed       Completed     2024-01-12T09:11:18Z   32.120615573s     True
openshift-gitops       cephfs-appset-busybox16-placement-drpc   7h3m    amagrawa-c2-3jan   amagrawa-c1-3jan   Failover       FailedOver     Completed     2024-01-12T12:19:33Z   2m37.50568738s    True
openshift-gitops       cephfs-appset-busybox3-placement-drpc    4d6h    amagrawa-c2-3jan   amagrawa-c2-3jan   Failover       FailedOver     Completed     2024-01-12T12:22:54Z   2m36.257186541s   True
openshift-gitops       cephfs-appset-busybox8-placement-drpc    7h14m   amagrawa-c1-3jan                      Relocate       Relocated      Completed     2024-01-12T12:19:20Z   4m10.339668753s   True
openshift-gitops       cephfs-appset-busybox9-placement-drpc    7h13m   amagrawa-c1-3jan                                     Deployed       Completed     2024-01-12T09:15:06Z   32.175780774s     True
openshift-gitops       rbd-appset-busybox13-placement-drpc      7h6m    amagrawa-c1-3jan                      Relocate       Relocated      Completed     2024-01-12T12:20:02Z   6m7.188328151s    True
openshift-gitops       rbd-appset-busybox14-placement-drpc      7h5m    amagrawa-c2-3jan   amagrawa-c1-3jan   Failover       FailedOver     Completed     2024-01-12T12:20:12Z   5m29.864938194s   True
openshift-gitops       rbd-appset-busybox17-placement-drpc      4h3m    amagrawa-c2-3jan                                     Deployed       Completed     2024-01-12T12:24:43Z   15.046600381s     True
openshift-gitops       rbd-appset-busybox4-placement-drpc       4d6h    amagrawa-c2-3jan   amagrawa-c1-3jan   Failover       FailedOver     Completed     2024-01-12T12:01:19Z   8m27.272353624s   True


from passive hub (when the active hub and primary managed cluster are down after site failure)

amagrawa:acm$ drpc
NAMESPACE              NAME                                     AGE   PREFERREDCLUSTER   FAILOVERCLUSTER    DESIREDSTATE   CURRENTSTATE   PROGRESSION            START TIME             DURATION       PEER READY
busybox-workloads-1    cephfs-sub-busybox1-placement-1-drpc     39m   amagrawa-c1-3jan   amagrawa-c1-3jan   Failover                      Paused                                                       True
busybox-workloads-10   rbd-sub-busybox10-placement-1-drpc       39m   amagrawa-c1-3jan                                                    Paused                                                       True
busybox-workloads-11   rbd-sub-busybox11-placement-1-drpc       39m   amagrawa-c1-3jan                                                    Paused                                                       True
busybox-workloads-12   rbd-sub-busybox12-placement-1-drpc       39m   amagrawa-c2-3jan                                     Deployed       Completed              2024-01-12T16:52:33Z   963.058163ms   True
busybox-workloads-15   cephfs-sub-busybox15-placement-1-drpc    39m   amagrawa-c2-3jan                                     Deployed       EnsuringVolSyncSetup   2024-01-12T16:53:31Z                  True
busybox-workloads-2    rbd-sub-busybox2-placement-1-drpc        39m   amagrawa-c1-3jan   amagrawa-c1-3jan   Failover                      Paused                                                       True
busybox-workloads-5    rbd-sub-busybox5-placement-1-drpc        39m   amagrawa-c1-3jan                      Relocate                      Paused                                                       True
busybox-workloads-6    cephfs-sub-busybox6-placement-1-drpc     39m   amagrawa-c1-3jan                      Relocate                      Paused                                                       True
busybox-workloads-7    cephfs-sub-busybox7-placement-1-drpc     39m   amagrawa-c1-3jan                                                    Paused                                                       True
openshift-gitops       cephfs-appset-busybox16-placement-drpc   39m   amagrawa-c2-3jan   amagrawa-c1-3jan   Failover                      Paused                                                       True
openshift-gitops       cephfs-appset-busybox3-placement-drpc    39m   amagrawa-c2-3jan   amagrawa-c2-3jan   Failover       FailedOver     Cleaning Up                                                  True
openshift-gitops       cephfs-appset-busybox8-placement-drpc    39m   amagrawa-c1-3jan                      Relocate                      Paused                                                       True
openshift-gitops       cephfs-appset-busybox9-placement-drpc    39m   amagrawa-c1-3jan                                                    Paused                                                       True
openshift-gitops       rbd-appset-busybox13-placement-drpc      39m   amagrawa-c1-3jan                      Relocate                      Paused                                                       True
openshift-gitops       rbd-appset-busybox14-placement-drpc      39m   amagrawa-c2-3jan   amagrawa-c1-3jan   Failover                      Paused                                                       True
openshift-gitops       rbd-appset-busybox17-placement-drpc      39m   amagrawa-c2-3jan                                     Deployed       Completed              2024-01-12T16:52:32Z   1.263215882s   True
openshift-gitops       rbd-appset-busybox4-placement-drpc       39m   amagrawa-c2-3jan   amagrawa-c1-3jan   Failover                      Paused                                                       True


Benamar, do you think we can close this bug based upon this observation?

Comment 7 Mudit Agarwal 2024-01-23 10:16:00 UTC
Not a 4.15 blocker

Comment 11 Benamar Mekhissi 2024-05-06 12:22:40 UTC
@amagrawa we can close this one.

Comment 12 Aman Agrawal 2024-05-06 12:39:06 UTC
Closing based upon the observation in Comment 6 and the confirmation in Comment 11.

