Bug 2056871 - [DR] When failover and relocate is done within few minutes, volumereplication desired state on both the managed clusters are marked as Secondary and relocate doesn't happen
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.12.0
Assignee: jmishra
QA Contact: Aman Agrawal
URL:
Whiteboard:
Depends On:
Blocks: 2094357
 
Reported: 2022-02-22 09:02 UTC by Aman Agrawal
Modified: 2023-08-09 17:00 UTC (History)
18 users

Fixed In Version: 4.12.0-74
Doc Type: Known Issue
Doc Text:
.Relocation fails when failover and relocate are performed within a few minutes

This issue occurs when the user starts relocating an application from one cluster to another before the `PeerReady` condition status is `TRUE`. The condition status can be checked in the DRPC YAML, or by running the following `oc` command:

`oc get drpc -o yaml -n busybox-workloads-1`

where `busybox-workloads-1` is the namespace containing the workloads for the deployed sample application. If the relocation is initiated before the peer (target cluster) is in a clean state, the relocation stalls indefinitely.

Workaround: Change the DRPC `.Spec.Action` back to `Failover`, and wait until the `PeerReady` condition status is `TRUE`. After applying the workaround, change the Action to `Relocate`, and the relocation will take effect.
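The `PeerReady` check described above can also be scripted instead of reading the YAML by eye. The following is a minimal sketch (the helper names and the use of `oc get drpc -o json` output are assumptions, not part of the product documentation) that parses the DRPC status conditions:

```python
import json
import subprocess


def peer_ready(drpc: dict) -> bool:
    """Return True if the DRPC's PeerReady condition status is "True"."""
    conditions = drpc.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "PeerReady" and c.get("status") == "True"
        for c in conditions
    )


def fetch_drpc(name: str, namespace: str) -> dict:
    """Fetch a DRPC as JSON via oc (assumes oc is on PATH and logged in)."""
    out = subprocess.check_output(
        ["oc", "get", "drpc", name, "-n", namespace, "-o", "json"]
    )
    return json.loads(out)
```

A caller would poll `peer_ready(fetch_drpc("busybox-drpc", "busybox-workloads-1"))` (DRPC name and namespace are placeholders) and only set the action to `Relocate` once it returns `True`.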
Clone Of:
Environment:
Last Closed: 2023-01-04 12:45:25 UTC
Embargoed:




Links
System ID | Status | Summary | Last Updated
Github RamenDR ramen pull 403 | open | Relocation stuck issue 398: Made changes in drpc | 2022-03-14 13:10:18 UTC
Github red-hat-storage ramen pull 20 | open | Bug 2056871: Relocation stuck issue 398: Made initial changes in drpc | 2022-03-15 13:46:07 UTC

Comment 3 Mudit Agarwal 2022-03-08 13:45:36 UTC
Shyam, should this be a blocker for 4.10?

Comment 4 Shyamsundar 2022-03-08 17:27:59 UTC
(In reply to Mudit Agarwal from comment #3)
> Shyam, should this be a blocker for 4.10?

Yes, fix is WIP and should land this week. Changing assignee as well to Jolly.

Comment 7 Mudit Agarwal 2022-03-15 12:12:03 UTC
Please backport this to release-4.10

Comment 11 jmishra 2022-04-11 02:16:13 UTC
Please provide must-gather logs.

Comment 13 Benamar Mekhissi 2022-04-11 10:55:10 UTC
1. Multiple primaries in the log are seen because the failover cleanup was never completed.  If you look at the DRPC status condition you will see the following:
```
      lastTransitionTime: "2022-04-10T12:24:55Z"
      message: Started failover to cluster "amagrawa-c2-8ap"
      observedGeneration: 4
      reason: NotStarted
      status: "False"
      type: PeerReady
```

2. Attempting a relocation at that point will not work and actually, it messes things up.  The action should be put back to `Failover` until the condition above is set to true.

3. The PVCs stuck in terminating state are separate from the issue in (1) and need to be looked at separately which I'll do next.

Comment 14 Benamar Mekhissi 2022-04-11 11:13:09 UTC
The PVC stuck in a terminating state is because the request to set the VRG on C1 to secondary was never issued because of the issue above in (1). In other words, that's normal behavior.

To get out of this stuck state, change the Action back to Failover and then wait for the DRPC PeerReady condition status to change to True.
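The recovery sequence above can be expressed as a small decision helper (a sketch with hypothetical names, not Ramen code): keep the action at `Failover` until the `PeerReady` condition is True, and only then switch to `Relocate`.

```python
def next_action(conditions: list) -> str:
    """Decide the DRPC action per the workaround: stay on Failover
    until the PeerReady condition status is "True", then it is safe
    to switch to Relocate."""
    ready = any(
        c.get("type") == "PeerReady" and c.get("status") == "True"
        for c in conditions
    )
    return "Relocate" if ready else "Failover"
```

The chosen action could then be applied with something like `oc patch drpc <name> -n <namespace> --type merge -p '{"spec":{"action":"Relocate"}}'` (the exact field casing should be verified against the DRPC CRD for your ODF version).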

You might want to open a low-priority BZ against DRPC logging important details only at the verbose level, which makes it difficult to diagnose issues when we have 100s of PVCs.

Comment 15 Mudit Agarwal 2022-04-11 11:40:31 UTC
Moving DR BZs out of 4.10

Comment 21 Mudit Agarwal 2022-06-29 13:30:13 UTC
What is the plan for this BZ in 4.11? It has had no updates for 20 days.

Comment 24 Mudit Agarwal 2022-07-05 10:13:12 UTC
Not a TP blocker, we have a workaround. Moving it out of 4.11, please revert if my understanding is wrong.

Comment 28 jmishra 2022-08-09 15:56:02 UTC
The user needs to wait for the failover to finish, which includes the cleanup completing. The DRPC status condition PeerReady should be TRUE before proceeding with relocate. This is now the normal behavior; no code fix can be made for this.

