Bug 2252756 - [RDR] [Hub recovery] [CephFS] volumes are lost on the secondary managed cluster after hub recovery
Summary: [RDR] [Hub recovery] [CephFS] volumes are lost on the secondary managed cluster after hub recovery
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Benamar Mekhissi
QA Contact: Aman Agrawal
URL:
Whiteboard:
Depends On:
Blocks: 2252880
 
Reported: 2023-12-04 14:26 UTC by Aman Agrawal
Modified: 2024-07-18 04:25 UTC
CC List: 3 users

Fixed In Version: 4.15.0-112
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2252880 (view as bug list)
Environment:
Last Closed: 2024-03-19 15:29:19 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github RamenDR ramen pull 1157 0 None open [Hub Recovery] ManifestWork regeneration bug and ACM eviction consequences 2023-12-04 20:33:53 UTC
Github RamenDR ramen pull 1159 0 None open Exclude VolSync secret Policy from hub backup 2023-12-05 16:28:11 UTC
Red Hat Product Errata RHSA-2024:1383 0 None None None 2024-03-19 15:29:21 UTC

Description Aman Agrawal 2023-12-04 14:26:13 UTC
Description of problem (please be as detailed as possible and provide log snippets):


Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-11-30-174049
ACM 2.9.0 GA'ed (from OperatorHub)
ODF 4.14.1-15
ceph version 17.2.6-161.el9cp (7037a43d8f7fa86659a0575b566ec10080df0d71) quincy (stable)
Submariner 0.16.2
VolSync 0.8.0


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
** Active hub is at a neutral site **

1. Deploy multiple RBD- and CephFS-backed workloads of both appset and subscription types.
2. Fail over and relocate them such that they are finally running on the primary managed cluster (which is expected to host all the workloads and may go down in a disaster).
3. Ensure that the workloads are in distinct states such as Deployed, FailedOver, Relocated, etc.
4. Let at least 1 or 2 of the scheduled backups (taken every 1 hour) complete for all the different workload states (when progression is completed and no action is in progress on any of the workloads). Also ensure that sync for all the workloads is working fine while on the active hub and that the cluster is healthy. Note drpc -o wide, lastGroupSyncTime, etc. (see the command sketch after these steps).
5. Bring the active hub completely down and move to the passive hub. Restore the backups and ensure Velero reports a successful restoration. Make sure both managed clusters are successfully imported and the drpolicy gets validated.
6. Wait for the drpc resources to be restored and check whether all the workloads are in their last backed-up state.
They seem to have retained the last state that was backed up, so everything is fine so far. Label cluster-monitoring on the hub cluster so that VolumeSync.DelayAlert is fired if data sync is affected for any workload.
7. Let IOs continue and check lastGroupSyncTime and the VolumeSync.DelayAlert alert. Sync for the RBD-based workloads was progressing fine, as it was for the other CephFS-backed workloads, except for appset-cephfs-busybox9-placement-drpc in NS busybox-workloads-9.
8. Upon further validation, it was found that the dst pods and PVCs were lost from the secondary managed cluster for this workload.
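
For reference, a rough sketch of the checks referred to in steps 4-6 (placeholder names in angle brackets; the restore resources and the monitoring label are assumptions based on the default ACM backup namespace and the ODF DR monitoring documentation, and may differ per setup):

# On the active hub (step 4): record DRPC state and the last successful group sync time
oc get drpc -A -o wide
oc get drpc <drpc-name> -n <namespace> -o jsonpath='{.status.lastGroupSyncTime}'

# On the passive hub (step 5): confirm the ACM and Velero restores completed and the DRPolicy is validated
oc get restores.cluster.open-cluster-management.io -n open-cluster-management-backup
oc get restores.velero.io -n open-cluster-management-backup
oc get drpolicy

# On the passive hub (step 6): assumed labelling to enable cluster monitoring so the VolumeSync delay alert can fire
oc label namespace openshift-operators openshift.io/cluster-monitoring='true'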

(The older hub remains down permanently and is completely unreachable.)


Actual results: dst pods and PVCs are lost from the secondary managed cluster for appset-cephfs-busybox9-placement-drpc in NS busybox-workloads-9 

From C2 (the secondary managed cluster):

amagrawa:~$ oc get pods,vrg,vr,pvc -o wide -n busybox-workloads-9
NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/appset-cephfs-busybox9-placement-drpc   secondary      
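
As a hedged follow-up on C2 (not captured above), the VRG could be dumped to inspect which PVCs, if any, it still reports as protected; the status field names referenced in the comment are taken from the RamenDR VRG API and are assumptions here:

oc get vrg appset-cephfs-busybox9-placement-drpc -n busybox-workloads-9 -o yaml
# inspect .status.conditions and .status.protectedPVCs for the workload PVCs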


From the passive hub:

drpc:
openshift-gitops       appset-cephfs-busybox9-placement-drpc    23h   amagrawa-1-1d                                       Deployed       Completed                                           True
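
A hedged way to confirm from the passive hub that sync has stalled for this workload (lastGroupSyncTime as noted in step 4; the exact field path is an assumption):

oc get drpc appset-cephfs-busybox9-placement-drpc -n openshift-gitops -o jsonpath='{.status.lastGroupSyncTime}'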


We also found a number of NonCompliant policies post hub recovery; however, all of these policies were Compliant on the older hub.

amagrawa:~$ oc get policy -A | grep NonCompliant
amagrawa-1-1d                    busybox-workloads-12.vs-secret-9b0006378bc4b3dde4a82c04dd7dd3f6                        NonCompliant       23h
amagrawa-1-1d                    openshift-gitops.vs-secret-e00de88ac33afae138ce0bc1dc989ce1                            NonCompliant       23h
amagrawa-2-1d                    busybox-workloads-10.vs-secret-674962f86373128a3005a9a581929a8c                        NonCompliant       23h
amagrawa-2-1d                    busybox-workloads-11.vs-secret-991de86951c29d86a7b4f21afcc222a0                        NonCompliant       23h
amagrawa-2-1d                    busybox-workloads-16.vs-secret-875a04e3829a1b8e35315f7c4a6e0c66                        NonCompliant       23h
amagrawa-2-1d                    openshift-gitops.vs-secret-1a2537d31f7cda5a558e40664f973bd4                            NonCompliant       23h
amagrawa-2-1d                    openshift-gitops.vs-secret-7e65a08e63102201af8f9ada3062686b                            NonCompliant       23h
amagrawa-2-1d                    openshift-gitops.vs-secret-e00de88ac33afae138ce0bc1dc989ce1                            NonCompliant       23h
amagrawa-2-1d                    openshift-gitops.vs-secret-e3b95572f8ad4080e8aec77f9b19d4e4                            NonCompliant       23h
busybox-workloads-10             vs-secret-674962f86373128a3005a9a581929a8c                                             NonCompliant       23h
busybox-workloads-11             vs-secret-991de86951c29d86a7b4f21afcc222a0                                             NonCompliant       23h
busybox-workloads-12             vs-secret-9b0006378bc4b3dde4a82c04dd7dd3f6                                             NonCompliant       23h
busybox-workloads-16             vs-secret-875a04e3829a1b8e35315f7c4a6e0c66                                             NonCompliant       23h
local-cluster                    open-cluster-management-backup.backup-restore-enabled             inform               NonCompliant       28h
open-cluster-management-backup   backup-restore-enabled                                            inform               NonCompliant       28h
openshift-gitops                 vs-secret-1a2537d31f7cda5a558e40664f973bd4                                             NonCompliant       23h
openshift-gitops                 vs-secret-7e65a08e63102201af8f9ada3062686b                                             NonCompliant       23h
openshift-gitops                 vs-secret-e00de88ac33afae138ce0bc1dc989ce1                                             NonCompliant       23h
openshift-gitops                 vs-secret-e3b95572f8ad4080e8aec77f9b19d4e4                                             NonCompliant       23h
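
As an illustration, any one of the NonCompliant VolSync secret policies (name taken from the listing above) can be described on the passive hub to see the per-cluster compliance details:

oc describe policy vs-secret-674962f86373128a3005a9a581929a8c -n busybox-workloads-10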


Due to the missing volumes on the secondary site, data sync isn't progressing for this workload. If the primary site goes down, or if there is a need to relocate this workload (which is still possible), it can lead to complete loss of data, assuming the workload pods won't come up on the secondary cluster.
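
Before attempting a relocate in this state, a minimal sanity check (a sketch, assuming C2 is the current secondary) would be to confirm the destination PVCs and pods actually exist there and that the DRPC on the hub still reports the peer as ready:

# Against the secondary managed cluster (C2)
oc get pvc,pods -n busybox-workloads-9

# Against the passive hub (the wide output includes the progression and peer-ready columns)
oc get drpc appset-cephfs-busybox9-placement-drpc -n openshift-gitops -o wide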


Logs collected some time after moving to the passive hub can be downloaded from https://drive.google.com/file/d/16aUyq1tbkKpumnE6Bmzx1PzBbvuBMwbI/view?usp=drive_link

Please note it cannot be unzipped as it's on Google Drive and not on the QE server (which is currently down).

Expected results: volumes should not be lost on the secondary site, and data sync should continue without issues for all the DR-protected CephFS-backed workloads.


Additional info:

Comment 13 errata-xmlrpc 2024-03-19 15:29:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

Comment 14 Red Hat Bugzilla 2024-07-18 04:25:14 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

