Bug 2250152 - [RDR] [Hub recovery] Sync for all cephfs workloads stopped post hub recovery [NEEDINFO]
Summary: [RDR] [Hub recovery] Sync for all cephfs workloads stopped post hub recovery
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Benamar Mekhissi
QA Contact: Aman Agrawal
URL:
Whiteboard:
Depends On:
Blocks: 2251205
 
Reported: 2023-11-16 19:05 UTC by Aman Agrawal
Modified: 2024-03-19 15:29 UTC
CC List: 2 users

Fixed In Version: 4.15.0-102
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2251205 (view as bug list)
Environment:
Last Closed: 2024-03-19 15:29:02 UTC
Embargoed:
sheggodu: needinfo? (bmekhiss)




Links
System ID Private Priority Status Summary Last Updated
Github RamenDR ramen pull 1144 0 None open Add label to volsync secret for inclusion in hub recovery backup 2023-11-21 19:33:00 UTC
Red Hat Product Errata RHSA-2024:1383 0 None None None 2024-03-19 15:29:05 UTC
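
The Ramen PR linked above adds a label to the VolSync secret so that it is included in the hub recovery backup. As a hedged illustration only (the secret name and namespace below are placeholders, and the exact label value applied by the PR is not shown here), ACM's cluster-backup operator backs up resources carrying the cluster.open-cluster-management.io/backup label:

# Hedged sketch: labeling a secret so the ACM hub backup picks it up.
# "volsync-secret-example" and "example-namespace" are illustrative names,
# not the actual objects the PR labels.
oc label secret volsync-secret-example -n example-namespace \
  cluster.open-cluster-management.io/backup=""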

Description Aman Agrawal 2023-11-16 19:05:24 UTC
Description of problem (please be as detailed as possible and provide log snippets):


Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-11-09-204811
Volsync 0.8.0
Submariner 0.16.2
ACM quay.io:443/acm-d/acm-custom-registry:v2.9.0-RC2        
odf-multicluster-orchestrator.v4.14.1-rhodf  
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
Latency 50ms RTT


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:

**Active hub at neutral site**

1. Deployed multiple rbd- and cephfs-backed workloads of both appset and subscription types.
2. Failed over and relocated them such that they end up running on the primary managed cluster (which is expected to host all the workloads and could be taken down by a disaster). A few of them are exceptions; check the drpc -o wide status in step 3.
3. Ensure that the workloads are in distinct states such as Deployed, FailedOver, Relocated, etc.

Here, amagrawa-10n-1 is the C1 (primary) managed cluster for me:

From active hub-

amagrawa:hub$ drpc
NAMESPACE              NAME                                                AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
busybox-workloads-12   cephfs-sub-busybox-workloads-12-placement-1-drpc    7h18m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:54:29Z   5m59.196575462s   True
busybox-workloads-13   cephfs-sub-busybox-workloads-13-placement-1-drpc    7h17m   amagrawa-10n-1                       Relocate       Relocated      Completed     2023-11-16T12:12:36Z   5m58.842880173s   True
busybox-workloads-14   cephfs-sub-busybox-workloads-14-placement-1-drpc    7h16m   amagrawa-10n-1     amagrawa-10n-2    Failover       FailedOver     Completed     2023-11-16T08:29:07Z   3m19.098202668s   True
busybox-workloads-6    rbd-sub-busybox-workloads-6-placement-1-drpc        7h35m   amagrawa-10n-1                                      Deployed       Completed                                              True
busybox-workloads-7    rbd-sub-busybox-workloads-7-placement-1-drpc        7h34m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:53:38Z   9m59.85663627s    True
busybox-workloads-8    rbd-sub-busybox-workloads-8-placement-1-drpc        7h32m   amagrawa-10n-1                       Relocate       Relocated      Completed     2023-11-16T08:21:05Z   4m13.272955733s   True
openshift-gitops       cephfs-appset-busybox-workloads-10-placement-drpc   7h22m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:15:50Z   3m22.540081438s   True
openshift-gitops       cephfs-appset-busybox-workloads-11-placement-drpc   7h20m   amagrawa-10n-1                       Relocate       Relocated      Completed     2023-11-16T08:00:32Z   5m38.794985745s   True
openshift-gitops       cephfs-appset-busybox-workloads-9-placement-drpc    7h24m   amagrawa-10n-2                       Relocate       Relocated      Completed     2023-11-16T08:28:59Z   8m47.541429779s   True
openshift-gitops       rbd-appset-busybox-workloads-1-placement-drpc       7h43m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:16:14Z   8m31.330049487s   True
openshift-gitops       rbd-appset-busybox-workloads-2-placement-drpc       7h42m   amagrawa-10n-1                       Relocate       Relocated      Completed     2023-11-16T08:16:28Z   7m59.477897296s   True
openshift-gitops       rbd-appset-busybox-workloads-3-placement-drpc       7h41m   amagrawa-10n-2     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:27:18Z   7m4.760183798s    True
openshift-gitops       rbd-appset-busybox-workloads-4-placement-drpc       7h39m   amagrawa-10n-1                                      Deployed       Completed                                              True

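The drpc and group commands shown in this report appear to be local shell shorthands rather than standard CLI commands. A minimal equivalent for the drpc output above, assuming it is taken on the hub (the reporter's actual alias is not shown):

# Hedged sketch: list DRPlacementControl resources across all namespaces,
# as in the output above ("drpc" is the CRD short name).
oc get drpc -o wide -A
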
4. Let at least 1 or 2 of the latest backups be taken (one every hour) for all the different states of the workloads (when progression is Completed and no action is in progress on any of the workloads). Also ensure that sync for all the workloads is working fine while on the active hub and that the cluster is healthy. Note the drpc -o wide output and lastGroupSyncTime, download backups from S3, etc.

amagrawa:hub$ group|grep SyncTime
    lastGroupSyncTime: "2023-11-16T14:01:32Z"
    lastGroupSyncTime: "2023-11-16T14:06:09Z"
    lastGroupSyncTime: "2023-11-16T14:01:03Z"
    lastGroupSyncTime: "2023-11-16T13:45:09Z"
    lastGroupSyncTime: "2023-11-16T13:50:51Z"
    lastGroupSyncTime: "2023-11-16T13:50:40Z"
    lastGroupSyncTime: "2023-11-16T14:00:51Z"
    lastGroupSyncTime: "2023-11-16T14:06:12Z"
    lastGroupSyncTime: "2023-11-16T13:01:45Z"
    lastGroupSyncTime: "2023-11-16T13:50:36Z"
    lastGroupSyncTime: "2023-11-16T13:45:16Z"
    lastGroupSyncTime: "2023-11-16T13:56:22Z"
    lastGroupSyncTime: "2023-11-16T13:45:11Z"

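A hedged sketch of how such lastGroupSyncTime values can be collected, assuming they are read from the DRPC status on the hub as the grep above suggests (the reporter's actual "group" shorthand is not shown):

# Hedged sketch: print namespace/name and lastGroupSyncTime for every DRPC.
oc get drpc -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.lastGroupSyncTime}{"\n"}{end}'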

amagrawa:hub$ date -u
Thursday 16 November 2023 02:12:11 PM UTC

5. Bring the active hub completely down and move to the passive hub. Restore backups and ensure the Velero backup reports successful restoration. Make sure both managed clusters are successfully reported and the drpolicy gets validated (a hedged verification sketch follows step 6).
6. Wait for the drpc resources to be restored and check whether all the workloads are in their last backed-up state.
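
A hedged verification sketch for steps 5 and 6, assuming the ACM cluster-backup operator's default namespace; resource and condition names in other environments may differ:

# Hedged sketch: verify the restore and DR configuration on the passive hub.
oc get restore.cluster.open-cluster-management.io -n open-cluster-management-backup
oc get managedcluster          # both managed clusters should be Joined/Available
oc get drpolicy -o yaml | grep -A 2 'type: Validated'   # condition name assumed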

They seem to have retained their last backed-up state, so everything is fine so far.

amagrawa:~$ drpc
NAMESPACE              NAME                                                AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME   DURATION   PEER READY
busybox-workloads-12   cephfs-sub-busybox-workloads-12-placement-1-drpc    4h16m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed                             True
busybox-workloads-13   cephfs-sub-busybox-workloads-13-placement-1-drpc    4h16m   amagrawa-10n-1                       Relocate       Relocated      Completed                             True
busybox-workloads-14   cephfs-sub-busybox-workloads-14-placement-1-drpc    4h16m   amagrawa-10n-1     amagrawa-10n-2    Failover       FailedOver     Completed                             True
busybox-workloads-6    rbd-sub-busybox-workloads-6-placement-1-drpc        4h16m   amagrawa-10n-1                                      Deployed       Completed                             True
busybox-workloads-7    rbd-sub-busybox-workloads-7-placement-1-drpc        4h16m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed                             True
busybox-workloads-8    rbd-sub-busybox-workloads-8-placement-1-drpc        4h16m   amagrawa-10n-1                       Relocate       Relocated      Completed                             True
openshift-gitops       cephfs-appset-busybox-workloads-10-placement-drpc   4h16m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed                             True
openshift-gitops       cephfs-appset-busybox-workloads-11-placement-drpc   4h16m   amagrawa-10n-1                       Relocate       Relocated      Completed                             True
openshift-gitops       cephfs-appset-busybox-workloads-9-placement-drpc    4h16m   amagrawa-10n-2                       Relocate       Relocated      Completed                             True
openshift-gitops       rbd-appset-busybox-workloads-1-placement-drpc       4h16m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed                             True
openshift-gitops       rbd-appset-busybox-workloads-2-placement-drpc       4h16m   amagrawa-10n-1                       Relocate       Relocated      Completed                             True
openshift-gitops       rbd-appset-busybox-workloads-3-placement-drpc       4h16m   amagrawa-10n-2     amagrawa-10n-1    Failover       FailedOver     Completed                             True
openshift-gitops       rbd-appset-busybox-workloads-4-placement-drpc       4h16m   amagrawa-10n-1                                      Deployed       Completed                             True

7. Let IOs continue for a few hours. We observed that data sync for rbd-based workloads was progressing just fine, but sync stopped for all the cephfs-based workloads, whether of subscription or appset type (see the sketch below).
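
A hedged sketch of one such check, assuming the cephfs workloads replicate through VolSync ReplicationSource resources on the managed cluster currently running them:

# Hedged sketch: on the managed cluster running the workloads, confirm whether
# VolSync syncs are still completing (lastSyncTime should keep advancing).
oc get replicationsource -A
oc get replicationsource -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.lastSyncTime}{"\n"}{end}'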


Actual results: Sync for all cephfs workloads stopped post hub recovery.

Logs- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/16nov23/logs/

The VolumeSynchronizationDelay alert fires on the passive hub for all cephfs workloads when the monitoring label is applied.

Expected results: Sync for all cephfs workloads should continue without any issues post hub recovery.


Additional info:

Comment 12 errata-xmlrpc 2024-03-19 15:29:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

