2240908 – [RDR] [UI] DR UI allows Relocate and Failover to same peer causing the Failover to get stuck in Wait_For_Fencing progression


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	odf-dr
Sub Component:
Version:	4.14
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	ODF 4.15.0
Assignee:	Benamar Mekhissi
QA Contact:	Sidhant Agrawal
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2023-09-27 08:15 UTC by Sidhant Agrawal
Modified:	2024-03-19 15:25 UTC
CC List:	8 users
Fixed In Version:	4.15.0-147
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2024-03-19 15:25:02 UTC
Embargoed:


Links
System	ID	Priority	Status	Summary	Last Updated
Github	red-hat-storage odf-console pull 1153	None	open	[RDR] [UI] DR UI allows Relocate and Failover to same peer causing the failover stuck - WIP	2024-01-03 19:03:07 UTC
Github	red-hat-storage odf-console pull 1169	None	open	Bug 2240908: [release-4.15] [RDR] [UI] DR UI allows Relocate and Failover to same peer causing the failover stuck	2024-01-11 07:32:08 UTC
Github	red-hat-storage odf-console pull 1170	None	open	Bug 2240908: [release-4.15-compatible] [RDR] [UI] DR UI allows Relocate and Failover to same peer causing the failover s...	2024-01-10 12:45:51 UTC
Github	red-hat-storage ramen pull 193	None	open	Bug 2240908: Check for metro policy using policy clusters	2024-02-16 02:39:01 UTC
Red Hat Product Errata	RHSA-2024:1383	None	None	None	2024-03-19 15:25:05 UTC

Description Sidhant Agrawal 2023-09-27 08:15:14 UTC

Description of problem (please be as detailed as possible and provide log
snippets):
This bug is being raised after the discussion in Bug 2138855


On an RDR setup, after a Relocate action is initiated and until PeerReady is set to False, the UI allows a Failover action to be triggered to the same peer cluster.
At that point, both PreferredCluster and FailoverCluster point to the same cluster.
This misconfiguration causes the Failover to get stuck in the WaitForFencing progression. See comment https://bugzilla.redhat.com/show_bug.cgi?id=2138855#c43 for more details.
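
The window described above can be written as a small guard. This is a minimal sketch in Python, not the actual odf-console or Ramen fix: the function names are hypothetical, the dict layout simply mirrors the DRPC `spec` and `status` fields shown in this report, and treating any phase other than "Relocated" as "Relocate still in flight" is an assumption. It only covers the case reported here, a Failover requested to the peer that an unfinished Relocate is already targeting.

```python
from typing import Any, Mapping

# Phase name taken from the CURRENTSTATE column in this report. Treating every
# other phase as "still in flight" is an assumption of this sketch.
RELOCATE_DONE_PHASE = "Relocated"


def relocate_in_progress(drpc: Mapping[str, Any]) -> bool:
    """Return True when a Relocate has been requested but has not finished."""
    spec = drpc.get("spec", {})
    status = drpc.get("status", {})
    return spec.get("action") == "Relocate" and status.get("phase") != RELOCATE_DONE_PHASE


def failover_target_allowed(drpc: Mapping[str, Any], target: str) -> bool:
    """Hypothetical guard: refuse a Failover to the peer a Relocate is moving to."""
    preferred = drpc.get("spec", {}).get("preferredCluster")
    if relocate_in_progress(drpc) and target == preferred:
        # Accepting this would leave preferredCluster == failoverCluster,
        # which is the state this report shows stuck in WaitForFencing.
        return False
    return True
```

With a DRPC whose `spec.action` is `Relocate`, `spec.preferredCluster` is `sagrawal-nc1` and `status.phase` is `Relocating`, the guard rejects `sagrawal-nc1` as a Failover target and accepts `sagrawal-nc2`.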

Version of all relevant components (if applicable):
OCP: 4.14.0-ec.4
ODF: 4.14.0-134

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Failover gets stuck in the WaitForFencing progression.

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue reproduce from the UI?
Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. On an RDR setup, deploy an ApplicationSet-based application.

NAMESPACE          NAME                       AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME   DURATION   PEER READY
openshift-gitops   busybox-1-placement-drpc   9m20s   sagrawal-nc1                                        Deployed       Completed                             True


2. Relocate from C1 to C2 and wait for it to complete.

NAMESPACE          NAME                       AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
openshift-gitops   busybox-1-placement-drpc   13m   sagrawal-nc2                         Relocate       Relocated      Completed     2023-09-27T06:48:27Z   4m12.200921726s   True


3. Initiate Relocate again, this time from C2 to C1, and when the DRPC shows the progression as RunningFinalSync, initiate a Failover action from C2 to C1.

Wed Sep 27 06:53:28 UTC 2023
NAMESPACE          NAME                       AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION          START TIME             DURATION   PEER READY
openshift-gitops   busybox-1-placement-drpc   14m   sagrawal-nc1                         Relocate       Initiating     PreparingFinalSync   2023-09-27T06:53:28Z              True

Wed Sep 27 06:53:37 UTC 2023
NAMESPACE          NAME                       AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION        START TIME             DURATION   PEER READY
openshift-gitops   busybox-1-placement-drpc   14m   sagrawal-nc1                         Relocate       Relocating     RunningFinalSync   2023-09-27T06:53:28Z              True

Wed Sep 27 06:53:40 UTC 2023
NAMESPACE          NAME                       AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION      START TIME             DURATION   PEER READY
openshift-gitops   busybox-1-placement-drpc   14m   sagrawal-nc1       sagrawal-nc1      Failover       FailingOver    WaitForFencing   2023-09-27T06:53:28Z              False


Actual results:
The Failover operation remains stuck indefinitely in WaitForFencing.

Expected results:
The Failover operation should proceed and complete successfully.

Additional info:

DRPC YAML output while Failover is stuck:
---
$ oc get drpc -A -o yaml
apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRPlacementControl
  metadata:
    annotations:
      drplacementcontrol.ramendr.openshift.io/last-app-deployment-cluster: sagrawal-nc2
    creationTimestamp: "2023-09-27T06:39:07Z"
    finalizers:
    - drpc.ramendr.openshift.io/finalizer
    generation: 4
    labels:
      cluster.open-cluster-management.io/backup: resource
    name: busybox-1-placement-drpc
    namespace: openshift-gitops
    ownerReferences:
    - apiVersion: cluster.open-cluster-management.io/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: Placement
      name: busybox-1-placement
      uid: 90edda56-75e9-409a-9fb1-c81fbd389966
    resourceVersion: "2615018"
    uid: fd1d105d-6415-4606-aa29-524d9ca693e2
  spec:
    action: Failover
    drPolicyRef:
      apiVersion: ramendr.openshift.io/v1alpha1
      kind: DRPolicy
      name: odr-policy-10m
    failoverCluster: sagrawal-nc1
    placementRef:
      apiVersion: cluster.open-cluster-management.io/v1beta1
      kind: Placement
      name: busybox-1-placement
      namespace: openshift-gitops
    preferredCluster: sagrawal-nc1
    pvcSelector:
      matchLabels:
        appname: busybox_app1
  status:
    actionStartTime: "2023-09-27T06:53:28Z"
    conditions:
    - lastTransitionTime: "2023-09-27T06:53:39Z"
      message: current home cluster sagrawal-nc1 is not fenced
      observedGeneration: 4
      reason: FailingOver
      status: "False"
      type: Available
    - lastTransitionTime: "2023-09-27T06:53:39Z"
      message: Started failover to cluster "sagrawal-nc1"
      observedGeneration: 4
      reason: NotStarted
      status: "False"
      type: PeerReady
    lastGroupSyncBytes: 6316032
    lastGroupSyncDuration: 0s
    lastGroupSyncTime: "2023-09-27T06:40:00Z"
    lastUpdateTime: "2023-09-27T06:53:39Z"
    phase: FailingOver
    preferredDecision:
      clusterName: sagrawal-nc1
      clusterNamespace: sagrawal-nc1
    progression: WaitForFencing
    resourceConditions:
      conditions:
      - lastTransitionTime: "2023-09-27T06:51:57Z"
        message: PVCs in the VolumeReplicationGroup are ready for use
        observedGeneration: 1
        reason: Ready
        status: "True"
        type: DataReady
      - lastTransitionTime: "2023-09-27T06:51:48Z"
        message: VolumeReplicationGroup is replicating
        observedGeneration: 1
        reason: Replicating
        status: "False"
        type: DataProtected
      - lastTransitionTime: "2023-09-27T06:51:41Z"
        message: Restored cluster data
        observedGeneration: 1
        reason: Restored
        status: "True"
        type: ClusterDataReady
      - lastTransitionTime: "2023-09-27T06:51:57Z"
        message: Cluster data of all PVs are protected
        observedGeneration: 1
        reason: Uploaded
        status: "True"
        type: ClusterDataProtected
      resourceMeta:
        generation: 1
        kind: VolumeReplicationGroup
        name: busybox-1-placement-drpc
        namespace: busybox-1
        protectedpvcs:
        - busybox-pvc-1
        - busybox-pvc-10
        - busybox-pvc-11
        - busybox-pvc-12
        - busybox-pvc-13
        - busybox-pvc-14
        - busybox-pvc-15
        - busybox-pvc-16
        - busybox-pvc-17
        - busybox-pvc-18
        - busybox-pvc-19
        - busybox-pvc-2
        - busybox-pvc-20
        - busybox-pvc-3
        - busybox-pvc-4
        - busybox-pvc-5
        - busybox-pvc-6
        - busybox-pvc-7
        - busybox-pvc-8
        - busybox-pvc-9
kind: List
metadata:
  resourceVersion: ""
---
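
A dump like the one above can be scanned for this state automatically. The following is a minimal sketch in Python, assuming the same list is fetched as JSON (`oc get drpc -A -o json`) rather than YAML so that only the standard library is needed. The function names are hypothetical; the field names are the ones visible in the dump. It flags a DRPC only when all three symptoms from this report coincide: `spec.action` is `Failover`, `spec.failoverCluster` equals `spec.preferredCluster`, and `status.progression` is `WaitForFencing`.

```python
import json
from typing import Any, Dict, List


def find_conflicting_failovers(drpc_list: Dict[str, Any]) -> List[str]:
    """Return "namespace/name" for each DRPC matching the stuck state in this report."""
    hits: List[str] = []
    for item in drpc_list.get("items", []):
        spec = item.get("spec", {})
        status = item.get("status", {})
        failover_cluster = spec.get("failoverCluster")
        # Both fields set and pointing at the same cluster.
        same_peer = (
            failover_cluster is not None
            and failover_cluster == spec.get("preferredCluster")
        )
        if (
            spec.get("action") == "Failover"
            and same_peer
            and status.get("progression") == "WaitForFencing"
        ):
            meta = item.get("metadata", {})
            hits.append(f"{meta.get('namespace')}/{meta.get('name')}")
    return hits


def find_from_json_text(text: str) -> List[str]:
    """Parse the JSON text of a DRPC list and report matching entries."""
    return find_conflicting_failovers(json.loads(text))
```

Applied to the DRPC shown in this report, the check would report `openshift-gitops/busybox-1-placement-drpc`. A DRPC whose `spec.action` is `Relocate`, or whose `failoverCluster` is unset, is not flagged.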

Comment 15 errata-xmlrpc 2024-03-19 15:25:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383
