Bug 2240908

Summary: [RDR] [UI] DR UI allows Relocate and Failover to same peer causing the Failover to get stuck in Wait_For_Fencing progression
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Sidhant Agrawal <sagrawal>
Component: odf-dr Assignee: Benamar Mekhissi <bmekhiss>
odf-dr sub component: ramen QA Contact: Sidhant Agrawal <sagrawal>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: unspecified CC: bmekhiss, gshanmug, kramdoss, kseeger, muagarwa, nthomas, odf-bz-bot, skatiyar
Version: 4.14   
Target Milestone: ---   
Target Release: ODF 4.15.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.15.0-147 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2024-03-19 15:25:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Sidhant Agrawal 2023-09-27 08:15:14 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
This bug is being raised after the discussion in Bug 2138855


On an RDR setup, after a Relocate action is initiated and until PeerReady is set to False, the UI allows a Failover action to be triggered to the same peer cluster.
At that point, both PreferredCluster and FailoverCluster point to the same cluster.
This misconfiguration causes the Failover to get stuck in the WaitForFencing progression. See comment https://bugzilla.redhat.com/show_bug.cgi?id=2138855#c43 for more details.
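The condition described above can be expressed as a small pre-flight check. This is only a sketch of the kind of guard a UI could apply before offering Failover; it is not the fix that shipped. The function name `failover_block_reason` is hypothetical, and it assumes the DRPC is available as a dict with the `spec.action`, `spec.preferredCluster` and `status.progression` fields seen in this report, with `spec.action` reading `Relocate` while the relocate is in flight (as the DESIREDSTATE column suggests).

```python
from typing import Optional


def failover_block_reason(drpc: dict, target_cluster: str) -> Optional[str]:
    """Return why a Failover to target_cluster looks unsafe, or None.

    Hypothetical guard: it only models the situation in this report,
    where a Relocate is still in progress and the requested failover
    target is the same cluster the Relocate is already moving to.
    """
    spec = drpc.get("spec", {})
    status = drpc.get("status", {})

    # Assumption: a Relocate counts as "in progress" until the
    # progression reaches Completed.
    relocate_in_progress = (
        spec.get("action") == "Relocate"
        and status.get("progression") != "Completed"
    )
    # The failover target matches the relocate target (preferredCluster).
    same_peer = spec.get("preferredCluster") == target_cluster

    if relocate_in_progress and same_peer:
        return (
            f"Relocate to {target_cluster} is still in progress "
            f"(progression: {status.get('progression')})"
        )
    return None
```

With the state captured in step 3 (Relocate to sagrawal-nc1, progression RunningFinalSync), a Failover request targeting sagrawal-nc1 would be rejected by this check, while a request made after the Relocate has completed would pass.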

Version of all relevant components (if applicable):
OCP: 4.14.0-ec.4
ODF: 4.14.0-134

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Failover will get stuck in the WaitForFencing progression.

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. On an RDR setup, deploy an ApplicationSet-based application.

NAMESPACE          NAME                       AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME   DURATION   PEER READY
openshift-gitops   busybox-1-placement-drpc   9m20s   sagrawal-nc1                                        Deployed       Completed                             True


2. Relocate from C1 to C2 and wait for it to complete.

NAMESPACE          NAME                       AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
openshift-gitops   busybox-1-placement-drpc   13m   sagrawal-nc2                         Relocate       Relocated      Completed     2023-09-27T06:48:27Z   4m12.200921726s   True


3. Initiate Relocate again, this time from C2 to C1, and when the DRPC shows progression as RunningFinalSync, initiate a Failover action from C2 to C1.

Wed Sep 27 06:53:28 UTC 2023
NAMESPACE          NAME                       AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION          START TIME             DURATION   PEER READY
openshift-gitops   busybox-1-placement-drpc   14m   sagrawal-nc1                         Relocate       Initiating     PreparingFinalSync   2023-09-27T06:53:28Z              True

Wed Sep 27 06:53:37 UTC 2023
NAMESPACE          NAME                       AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION        START TIME             DURATION   PEER READY
openshift-gitops   busybox-1-placement-drpc   14m   sagrawal-nc1                         Relocate       Relocating     RunningFinalSync   2023-09-27T06:53:28Z              True

Wed Sep 27 06:53:40 UTC 2023
NAMESPACE          NAME                       AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION      START TIME             DURATION   PEER READY
openshift-gitops   busybox-1-placement-drpc   14m   sagrawal-nc1       sagrawal-nc1      Failover       FailingOver    WaitForFencing   2023-09-27T06:53:28Z              False


Actual results:
The Failover operation remains stuck forever in WaitForFencing.

Expected results:
The Failover operation should proceed and complete successfully.

Additional info:

DRPC YAML output when Failover is stuck:
---
$ oc get drpc -A -o yaml
apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRPlacementControl
  metadata:
    annotations:
      drplacementcontrol.ramendr.openshift.io/last-app-deployment-cluster: sagrawal-nc2
    creationTimestamp: "2023-09-27T06:39:07Z"
    finalizers:
    - drpc.ramendr.openshift.io/finalizer
    generation: 4
    labels:
      cluster.open-cluster-management.io/backup: resource
    name: busybox-1-placement-drpc
    namespace: openshift-gitops
    ownerReferences:
    - apiVersion: cluster.open-cluster-management.io/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: Placement
      name: busybox-1-placement
      uid: 90edda56-75e9-409a-9fb1-c81fbd389966
    resourceVersion: "2615018"
    uid: fd1d105d-6415-4606-aa29-524d9ca693e2
  spec:
    action: Failover
    drPolicyRef:
      apiVersion: ramendr.openshift.io/v1alpha1
      kind: DRPolicy
      name: odr-policy-10m
    failoverCluster: sagrawal-nc1
    placementRef:
      apiVersion: cluster.open-cluster-management.io/v1beta1
      kind: Placement
      name: busybox-1-placement
      namespace: openshift-gitops
    preferredCluster: sagrawal-nc1
    pvcSelector:
      matchLabels:
        appname: busybox_app1
  status:
    actionStartTime: "2023-09-27T06:53:28Z"
    conditions:
    - lastTransitionTime: "2023-09-27T06:53:39Z"
      message: current home cluster sagrawal-nc1 is not fenced
      observedGeneration: 4
      reason: FailingOver
      status: "False"
      type: Available
    - lastTransitionTime: "2023-09-27T06:53:39Z"
      message: Started failover to cluster "sagrawal-nc1"
      observedGeneration: 4
      reason: NotStarted
      status: "False"
      type: PeerReady
    lastGroupSyncBytes: 6316032
    lastGroupSyncDuration: 0s
    lastGroupSyncTime: "2023-09-27T06:40:00Z"
    lastUpdateTime: "2023-09-27T06:53:39Z"
    phase: FailingOver
    preferredDecision:
      clusterName: sagrawal-nc1
      clusterNamespace: sagrawal-nc1
    progression: WaitForFencing
    resourceConditions:
      conditions:
      - lastTransitionTime: "2023-09-27T06:51:57Z"
        message: PVCs in the VolumeReplicationGroup are ready for use
        observedGeneration: 1
        reason: Ready
        status: "True"
        type: DataReady
      - lastTransitionTime: "2023-09-27T06:51:48Z"
        message: VolumeReplicationGroup is replicating
        observedGeneration: 1
        reason: Replicating
        status: "False"
        type: DataProtected
      - lastTransitionTime: "2023-09-27T06:51:41Z"
        message: Restored cluster data
        observedGeneration: 1
        reason: Restored
        status: "True"
        type: ClusterDataReady
      - lastTransitionTime: "2023-09-27T06:51:57Z"
        message: Cluster data of all PVs are protected
        observedGeneration: 1
        reason: Uploaded
        status: "True"
        type: ClusterDataProtected
      resourceMeta:
        generation: 1
        kind: VolumeReplicationGroup
        name: busybox-1-placement-drpc
        namespace: busybox-1
        protectedpvcs:
        - busybox-pvc-1
        - busybox-pvc-10
        - busybox-pvc-11
        - busybox-pvc-12
        - busybox-pvc-13
        - busybox-pvc-14
        - busybox-pvc-15
        - busybox-pvc-16
        - busybox-pvc-17
        - busybox-pvc-18
        - busybox-pvc-19
        - busybox-pvc-2
        - busybox-pvc-20
        - busybox-pvc-3
        - busybox-pvc-4
        - busybox-pvc-5
        - busybox-pvc-6
        - busybox-pvc-7
        - busybox-pvc-8
        - busybox-pvc-9
kind: List
metadata:
  resourceVersion: ""
---
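
For triage, the fields of interest in a dump like the one above can be pulled out with a short script. The sketch below assumes the same list was captured with `-o json` instead of `-o yaml`; the function name `summarize_drpcs` and the shape of the returned summary are illustrative choices, not part of any ODF or Ramen tooling.

```python
import json


def summarize_drpcs(list_json: str) -> list:
    """Summarize each DRPlacementControl in a `kind: List` JSON document.

    For every item, report the requested action, both cluster fields,
    the current progression, and any condition whose status is "False".
    """
    summaries = []
    for item in json.loads(list_json).get("items", []):
        meta = item.get("metadata", {})
        spec = item.get("spec", {})
        status = item.get("status", {})
        # Keep only the conditions that are not satisfied, with their message.
        false_conditions = [
            f"{cond.get('type')}: {cond.get('message')}"
            for cond in status.get("conditions", [])
            if cond.get("status") == "False"
        ]
        summaries.append(
            {
                "name": f"{meta.get('namespace')}/{meta.get('name')}",
                "action": spec.get("action"),
                "preferredCluster": spec.get("preferredCluster"),
                "failoverCluster": spec.get("failoverCluster"),
                "progression": status.get("progression"),
                "falseConditions": false_conditions,
            }
        )
    return summaries
```

Applied to the DRPC above, the summary would show action Failover, preferredCluster and failoverCluster both set to sagrawal-nc1, progression WaitForFencing, and the two False conditions (Available and PeerReady) with their messages.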

Comment 15 errata-xmlrpc 2024-03-19 15:25:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383