Bug 2056871

Summary: [DR] When failover and relocate are done within a few minutes, the VolumeReplication desired state on both managed clusters is marked as Secondary and relocate doesn't happen
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Aman Agrawal <amagrawa>
Component: odf-dr Assignee: jmishra
odf-dr sub component: ramen QA Contact: Aman Agrawal <amagrawa>
Status: CLOSED CANTFIX Docs Contact:
Severity: high    
Priority: unspecified CC: bmekhiss, ebenahar, jcall, jespy, jmishra, kramdoss, madam, mmuench, muagarwa, ocs-bugs, odf-bz-bot, olakra, owasserm, prsurve, rtalur, sagrawal, sheggodu, srangana
Version: 4.10   
Target Milestone: ---   
Target Release: ODF 4.12.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.12.0-74 Doc Type: Known Issue
Doc Text:
.Relocation fails when failover and relocate are performed within a few minutes
When the user starts relocating an application from one cluster to another before the `PeerReady` condition status is `TRUE`, the relocation stalls. The condition status can be seen through the DRPC YAML file or by running the following `oc` command: `oc get drpc -o yaml -n busybox-workloads-1`, where `busybox-workloads-1` is the namespace in which the workloads for the sample application are deployed. If the relocation is initiated before the peer (target cluster) is in a clean state, the relocation will stall forever.
Workaround: Change the DRPC `.Spec.Action` back to `Failover` and wait until the `PeerReady` condition status is `TRUE`. After applying the workaround, change the action to `Relocate`, and the relocation will take effect.
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-01-04 12:45:25 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2094357    

Comment 3 Mudit Agarwal 2022-03-08 13:45:36 UTC
Shyam, should this be a blocker for 4.10?

Comment 4 Shyamsundar 2022-03-08 17:27:59 UTC
(In reply to Mudit Agarwal from comment #3)
> Shyam, should this be a blocker for 4.10?

Yes, fix is WIP and should land this week. Changing assignee as well to Jolly.

Comment 7 Mudit Agarwal 2022-03-15 12:12:03 UTC
Please backport this to release-4.10

Comment 11 jmishra 2022-04-11 02:16:13 UTC
Please provide must-gather logs.
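For reference, a minimal sketch of collecting these logs with the ODF must-gather image; the image tag is an assumption and should match the installed ODF version:
```
# Collect ODF logs into ./must-gather using the ODF-specific image
# (tag v4.10 assumed here to match the reported version).
oc adm must-gather \
  --image=registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.10 \
  --dest-dir=must-gather
```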

Comment 13 Benamar Mekhissi 2022-04-11 10:55:10 UTC
1. Multiple primaries are seen in the log because the failover cleanup was never completed. If you look at the DRPC status conditions, you will see the following (a query sketch follows after this list):
```
      lastTransitionTime: "2022-04-10T12:24:55Z"
      message: Started failover to cluster "amagrawa-c2-8ap"
      observedGeneration: 4
      reason: NotStarted
      status: "False"
      type: PeerReady
```

2. Attempting a relocation at that point will not work, and it actually messes things up. The action should be set back to `Failover` until the condition above is set to true.

3. The PVCs stuck in a terminating state are separate from the issue in (1) and need to be looked at separately, which I'll do next.
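As referenced in (1), a minimal sketch for querying the PeerReady condition directly; the DRPC name `busybox-drpc` is hypothetical, and the namespace is taken from the reported setup:
```
# Print the PeerReady condition of the DRPC via a JSONPath filter.
# busybox-drpc is a placeholder; substitute the real DRPC name.
oc get drpc busybox-drpc -n busybox-workloads-1 \
  -o jsonpath='{.status.conditions[?(@.type=="PeerReady")]}'
```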

Comment 14 Benamar Mekhissi 2022-04-11 11:13:09 UTC
The PVCs are stuck in a terminating state because the request to set the VRG on C1 to Secondary was never issued, due to issue (1) above. In other words, that's normal behavior.

To get out of this stuck state, change the Action back to Failover and then wait for the DRPC PeerReady condition status to change to True.
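A minimal sketch of this workaround using `oc patch` (the DRPC name `busybox-drpc` and the namespace are assumptions based on the setup above):
```
# Step 1: set the DRPC action back to Failover so cleanup can complete.
oc patch drpc busybox-drpc -n busybox-workloads-1 \
  --type merge -p '{"spec":{"action":"Failover"}}'

# Step 2: after the PeerReady condition status turns True, relocate again.
oc patch drpc busybox-drpc -n busybox-workloads-1 \
  --type merge -p '{"spec":{"action":"Relocate"}}'
```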

You might want to open a low-priority BZ against the DRPC logs logging at verbose level, which makes it difficult to diagnose issues when we have 100s of PVCs.

Comment 15 Mudit Agarwal 2022-04-11 11:40:31 UTC
Moving DR BZs out of 4.10

Comment 21 Mudit Agarwal 2022-06-29 13:30:13 UTC
What is the plan for this BZ in 4.11? It has had no updates for 20 days.

Comment 24 Mudit Agarwal 2022-07-05 10:13:12 UTC
Not a TP blocker; we have a workaround. Moving it out of 4.11, please revert if my understanding is wrong.

Comment 28 jmishra 2022-08-09 15:56:02 UTC
The user needs to wait for the failover to finish, which includes the cleanup completing. The DRPC status condition PeerReady should be TRUE before proceeding with the relocate. This is the normal behavior now; no code fix can be done for this.
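For completeness, a minimal sketch of waiting on that condition before initiating the relocate; the DRPC name is a placeholder, and `oc wait` is assumed to apply here since the DRPC exposes standard status conditions:
```
# Block until the DRPC reports PeerReady=True, or give up after 30 minutes.
oc wait drpc/busybox-drpc -n busybox-workloads-1 \
  --for=condition=PeerReady --timeout=30m
```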