Bug 2056871
| Summary: | [DR] When failover and relocate is done within few minutes, volumereplication desired state on both the managed clusters are marked as Secondary and relocate doesn't happen | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Aman Agrawal <amagrawa> |
| Component: | odf-dr | Assignee: | jmishra |
| odf-dr sub component: | ramen | QA Contact: | Aman Agrawal <amagrawa> |
| Status: | CLOSED CANTFIX | Docs Contact: | |
| Severity: | high | ||
| Priority: | unspecified | CC: | bmekhiss, ebenahar, jcall, jespy, jmishra, kramdoss, madam, mmuench, muagarwa, ocs-bugs, odf-bz-bot, olakra, owasserm, prsurve, rtalur, sagrawal, sheggodu, srangana |
| Version: | 4.10 | ||
| Target Milestone: | --- | ||
| Target Release: | ODF 4.12.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | 4.12.0-74 | Doc Type: | Known Issue |
| Doc Text: |
.Relocation fails when failover and relocate are performed within a few minutes
If the user starts relocating an application from one cluster to another before the `PeerReady` condition status is `TRUE`, that is, before the peer (target cluster) is in a clean state, the relocation stalls forever. The condition status can be seen in the DRPC YAML or by running the following `oc` command: `oc get drpc -o yaml -n busybox-workloads-1`, where `busybox-workloads-1` is the namespace in which the workloads for the sample application are deployed (see the command sketch after the table below).
Workaround: Change the DRPC `.Spec.Action` back to `Failover` and wait until the `PeerReady` condition status is `TRUE`. Then change the action to `Relocate`, and the relocation takes effect.
|
| Story Points: | --- | ||
| Clone Of: | Environment: | ||
| Last Closed: | 2023-01-04 12:45:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 2094357 | ||
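As an illustration of the condition check described in the Doc Text above, here is a minimal command-line sketch; the DRPC name `busybox-drpc` is a placeholder (not taken from this bug), and the namespace is the `busybox-workloads-1` example from the Doc Text:

```
# List the DRPlacementControl (DRPC) resources in the workload namespace.
oc get drpc -n busybox-workloads-1

# Dump the full status of one DRPC ("busybox-drpc" is a placeholder name).
oc get drpc busybox-drpc -n busybox-workloads-1 -o yaml

# Extract just the PeerReady condition status with a jsonpath filter.
oc get drpc busybox-drpc -n busybox-workloads-1 \
  -o jsonpath='{.status.conditions[?(@.type=="PeerReady")].status}'
```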
Comment 3
Mudit Agarwal
2022-03-08 13:45:36 UTC
(In reply to Mudit Agarwal from comment #3)
> Shyam, should this be a blocker for 4.10?

Yes, the fix is WIP and should land this week. Changing the assignee to Jolly as well.

Please backport this to release-4.10.

Please provide must-gather logs.

1. Multiple primaries are seen in the log because the failover cleanup was never completed. If you look at the DRPC status condition, you will see the following:
```
lastTransitionTime: "2022-04-10T12:24:55Z"
message: Started failover to cluster "amagrawa-c2-8ap"
observedGeneration: 4
reason: NotStarted
status: "False"
type: PeerReady
```
2. Attempting a relocation at that point will not work; in fact, it makes things worse. The action should be set back to `Failover` until the condition above is `True` (see the command sketch after this list).
3. The PVCs stuck in the Terminating state are separate from the issue in (1) and need to be looked at separately, which I will do next.
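Not part of the original comment, but as a sketch of the step described in item 2: assuming the DRPC is named `busybox-drpc` (a placeholder) in the `busybox-workloads-1` namespace, and that the `.Spec.Action` field from the Doc Text serializes as `spec.action`, reverting the action could look like this:

```
# Put the DRPC action back to Failover so the interrupted failover can finish
# its cleanup ("busybox-drpc" and the namespace are assumed placeholder values).
oc patch drpc busybox-drpc -n busybox-workloads-1 \
  --type merge -p '{"spec":{"action":"Failover"}}'
```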
The PVC stuck in a Terminating state is there because the request to set the VRG on C1 to Secondary was never issued, due to the issue described in (1). In other words, that is expected behavior. To get out of this stuck state, change the action back to `Failover` and then wait for the DRPC `PeerReady` condition status to change to `True`.

You might want to open a low-priority BZ against the DRPC log logging at the verbose level, which makes it difficult to diagnose issues when we have hundreds of PVCs.

Moving DR BZs out of 4.10.

What is the plan for this BZ in 4.11? It has had no update for 20 days.

Not a TP blocker; we have a workaround. Moving it out of 4.11, please revert if my understanding is wrong.

The user needs to wait for the failover to finish, which includes the cleanup completing. The DRPC status `PeerReady` should be `TRUE` before proceeding with the relocate. This is the normal behavior now, and no code fix can be done for it.
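A sketch of the complete workaround sequence summarized above, under the same assumptions (placeholder DRPC name `busybox-drpc`, example namespace `busybox-workloads-1`, action field serialized as `spec.action`):

```
# 1. Wait for the failover cleanup to complete, i.e. for PeerReady to become True.
oc wait drpc/busybox-drpc -n busybox-workloads-1 \
  --for=condition=PeerReady --timeout=30m

# 2. Only then switch the desired action to Relocate.
oc patch drpc busybox-drpc -n busybox-workloads-1 \
  --type merge -p '{"spec":{"action":"Relocate"}}'

# 3. Confirm the action and the condition states.
oc get drpc busybox-drpc -n busybox-workloads-1 -o yaml
```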