Bug 2264765 - [RDR] [Hub recovery] [Neutral] CephFS workload changes its state from Relocated to Relocating on node failure [NEEDINFO]
Summary: [RDR] [Hub recovery] [Neutral] CephFS workload changes its state from Relocated to Relocating on node failure
Keywords:
Status: POST
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.17.0
Assignee: Benamar Mekhissi
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks: 2246375
 
Reported: 2024-02-18 19:54 UTC by Aman Agrawal
Modified: 2024-10-04 11:20 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: The DR controller executes full reconciliation as needed. When a cluster becomes inaccessible, the DR controller performs a sanity check. If the workload has already been relocated, this sanity check disables the PeerReady flag associated with the workload, and the check does not complete because the cluster is offline.
Consequence: Disabling the PeerReady flag prevents users from changing the action to Failover.
Workaround: In this scenario, users must use the CLI to address this issue.
Result: Using the CLI enables users to change the DR action to Failover despite the disabled PeerReady flag.
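Note: the Doc Text describes a CLI workaround without showing the commands. The following is a minimal sketch of how it might be applied, assuming the DRPlacementControl (drpc) spec fields `action` and `failoverCluster`; resource and cluster names are taken from the drpc output later in this report, not from a verified procedure:

# Assumption: setting spec.action via `oc patch` is how the CLI workaround is applied.
oc patch drpc sub-cephfs-busybox5-placement-1-drpc \
  -n busybox-workloads-5 \
  --type merge \
  -p '{"spec":{"action":"Failover","failoverCluster":"amagrawa-odf2"}}'

# Confirm the desired state and progression afterwards:
oc get drpc sub-cephfs-busybox5-placement-1-drpc -n busybox-workloads-5 -o wide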
Clone Of:
Environment:
Last Closed:
Embargoed:
kseeger: needinfo? (bmekhiss)
sheggodu: needinfo? (bmekhiss)
edonnell: needinfo? (bmekhiss)




Links
Github RamenDR ramen pull 1479 (open): Add operational-mode annotation to resolve VRG state ambiguity during… (last updated 2024-07-02 11:06:00 UTC)

Description Aman Agrawal 2024-02-18 19:54:00 UTC
Description of problem (please be as detailed as possible and provide log
snippets):


Version of all relevant components (if applicable):
ODF 4.15.0-132.stable
OCP 4.15.0-0.nightly-2024-02-13-231030
ACM 2.9.2 GA'ed
Submariner 0.16.3
ceph version 17.2.6-194.el9cp (d9f4aedda0fc0d99e7e0e06892a69523d2eb06dc) quincy (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
**Active hub at neutral site**

1. Deploy multiple RBD- and CephFS-backed workloads of both AppSet and Subscription types.
2. Fail over and relocate them such that they finally run on the primary managed cluster (which is expected to host all the workloads and may go down in a disaster): the apps that were failed over from C1 to C2 are relocated back to C1, and the apps that were relocated to C2 are failed over to C1 (with all nodes up and running).
3. Ensure that all the workload combinations are in distinct states (Deployed, FailedOver, Relocated) on C1, with a few workloads in the Deployed state on C2 as well.
4. Let at least one fresh backup be taken for all the different workload states (when progression is completed and no action is in progress on any of the workloads). Also ensure that sync for all the workloads is working fine while on the active hub and that the cluster is healthy. Note drpc -o wide output, lastGroupSyncTime, download backups from S3, etc. (see the verification sketch after these steps).
5. Bring the active hub completely down and move to the passive hub. Restore backups and ensure the Velero backup reports successful restoration. Make sure both managed clusters are successfully reported and the drpolicy gets validated.
6. Wait for drpc to be restored and check whether all the workloads are in their last backed-up state.
They seem to have retained the last state that was backed up, so everything is fine so far.
7. Let IOs continue for a few hours (20-30 hrs). Fail over the CephFS workloads running on C2 to C1 with all nodes of C2 up and running.
8. After successful failover and cleanup, wait for sync to resume, and after some time bring the primary cluster down (all nodes). Bring it up after a few hours.
9. Check whether the drpc state is still the same and data sync for all workloads resumes as expected.
10. After a few hours, bring the master nodes of the primary cluster down and check drpc again.

(Older hub remains down forever and is completely unreachable).
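The exact commands behind the checks in steps 4-6 are not included in the report; a minimal verification sketch, assuming the `oc` client, the `drpc` short name, and the status.lastGroupSyncTime field referenced in step 4 (the Velero restore namespace is an assumption based on a standard ACM backup setup):

# Step 4: capture DRPC state and the last group sync time for every workload.
oc get drpc -A -o wide
oc get drpc -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.lastGroupSyncTime}{"\n"}{end}'

# Step 5 (passive hub): check Velero restore status after hub recovery.
# Assumption: backups are restored into the open-cluster-management-backup namespace.
oc get restores.velero.io -n open-cluster-management-backup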


The CephFS app below changes its state from Relocated to Relocating without any action being taken on it.

Before==>

busybox-workloads-5    sub-cephfs-busybox5-placement-1-drpc    2d9h   amagrawa-prim      amagrawa-odf2     Relocate       Relocated      Completed     2024-02-17T14:56:29Z   18h39m39.024430842s   True


After==>

busybox-workloads-5    sub-cephfs-busybox5-placement-1-drpc    2d9h   amagrawa-prim      amagrawa-odf2     Relocate       Relocating                      2024-02-18T19:16:07Z                     False

Since the controller assumes that a relocate operation for this workload is in progress, the workload could not be failed over because PEER READY became false, while all other workloads running on the primary were successfully failed over after the master nodes of the primary cluster went down.

amagrawa:hub$ drpc
NAMESPACE              NAME                                    AGE    PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION        START TIME             DURATION       PEER READY
busybox-workloads-13   sub-rbd-busybox13-placement-1-drpc      2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     WaitForReadiness   2024-02-18T19:20:17Z                  False
busybox-workloads-14   sub-rbd-busybox14-placement-1-drpc      2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     WaitForReadiness   2024-02-18T19:20:25Z                  False
busybox-workloads-15   sub-rbd-busybox15-placement-1-drpc      2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     WaitForReadiness   2024-02-18T19:20:33Z                  False
busybox-workloads-16   sub-rbd-busybox16-placement-1-drpc      2d9h   amagrawa-odf2                                       Deployed       Completed          2024-02-16T10:12:51Z   660.371688ms   True
busybox-workloads-5    sub-cephfs-busybox5-placement-1-drpc    2d9h   amagrawa-prim      amagrawa-odf2     Relocate       Relocating                        2024-02-18T19:16:07Z                  False
busybox-workloads-6    sub-cephfs-busybox6-placement-1-drpc    2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     Cleaning Up        2024-02-18T19:19:49Z                  False
busybox-workloads-7    sub-cephfs-busybox7-placement-1-drpc    2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     Cleaning Up        2024-02-18T19:19:59Z                  False
busybox-workloads-8    sub-cephfs-busybox8-placement-1-drpc    2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     Cleaning Up        2024-02-18T19:20:06Z                  False
openshift-gitops       appset-cephfs-busybox1-placement-drpc   2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     Cleaning Up        2024-02-18T19:18:33Z                  False
openshift-gitops       appset-cephfs-busybox2-placement-drpc   2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     Cleaning Up        2024-02-18T19:18:38Z                  False
openshift-gitops       appset-cephfs-busybox3-placement-drpc   2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     Cleaning Up        2024-02-18T19:18:43Z                  False
openshift-gitops       appset-cephfs-busybox4-placement-drpc   2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     Cleaning Up        2024-02-18T19:18:48Z                  False
openshift-gitops       appset-rbd-busybox10-placement-drpc     2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     WaitForReadiness   2024-02-18T19:18:52Z                  False
openshift-gitops       appset-rbd-busybox11-placement-drpc     2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     WaitForReadiness   2024-02-18T19:18:58Z                  False
openshift-gitops       appset-rbd-busybox12-placement-drpc     2d9h   amagrawa-odf2                                       Deployed       Completed          2024-02-16T10:13:47Z   571.259493ms   True
openshift-gitops       appset-rbd-busybox9-placement-drpc      2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     WaitForReadiness   2024-02-18T19:19:17Z                  False


This leaves the application inaccessible.
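The wide output above does not show why PEER READY is False for the affected workload; a minimal sketch of how the conditions might be inspected, assuming the DRPC surfaces a PeerReady entry in status.conditions (resource names are taken from the table above):

# Assumption: PeerReady is exposed as a condition type in the DRPC status.
oc get drpc sub-cephfs-busybox5-placement-1-drpc -n busybox-workloads-5 \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{": "}{.message}{"\n"}{end}'

# Compare with a workload that still reports PEER READY=True:
oc get drpc sub-rbd-busybox16-placement-1-drpc -n busybox-workloads-16 \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'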

Actual results: [RDR] [Hub recovery] CephFS workload changes its state from Relocated to Relocating on node failure

Expected results: Applications should retain their original state after node failure so that they can be failed over.


Additional info:

Comment 12 Benamar Mekhissi 2024-03-04 20:37:22 UTC
Updated doc text section

Comment 17 Aman Agrawal 2024-05-15 09:27:44 UTC
As discussed, proposing it back to 4.16

Comment 22 Aman Agrawal 2024-05-30 15:32:41 UTC
Just for reference, this issue was hit again while testing with the following versions:

OCP 4.16.0-0.nightly-2024-05-23-173505

ACM 2.11.0-DOWNSTREAM-2024-05-23-15-16-26

MCE 2.6.0-104 

ODF 4.16.0-108.stable

Gitops v1.12.3

