Bug 2208558 - [MDR] ramen-dr-cluster-operator pod crashes during failover
Summary: [MDR] ramen-dr-cluster-operator pod crashes during failover
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.13.0
Assignee: Benamar Mekhissi
QA Contact: Parikshith
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-05-19 14:10 UTC by Parikshith
Modified: 2023-08-09 17:00 UTC
7 users

Fixed In Version: 4.13.0-203
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-21 15:25:37 UTC
Embargoed:


Attachments


Links
Github RamenDR ramen pull 890 (Merged): Add concurrent access protection to the event recorder. Last Updated: 2023-05-22 09:09:34 UTC
Red Hat Product Errata RHBA-2023:3742. Last Updated: 2023-06-21 15:25:48 UTC

Description Parikshith 2023-05-19 14:10:57 UTC
Created attachment 1965714 [details]
ramen-dr-op log

Description of problem (please be as detailed as possible and provide log
snippets):
On an MDR cluster, during failover of apps from c1 to c2, the ramen-dr-cluster-operator pod crashes and restarts in a loop.

2023-05-19T11:55:51.526Z	INFO	controllers.VolumeReplicationGroup.vrginstance	controllers/volumereplicationgroup_controller.go:604	VRG's ClusterDataReady condition found. PV restore must have already been applied	{"VolumeReplicationGroup": "job-sub-2/job-sub-2-placement-1-drpc", "rid": "d70b3bce-7b18-41b2-86d9-07f7f24a9005", "State": "primary"}
fatal error: concurrent map read and map write

goroutine 669 [running]:
github.com/ramendr/ramen/controllers/util.ReportIfNotPresent(0xc0004c2c40, {0x2bdacb0, 0xc00125cb40}, {0x2679dcc, 0x7}, {0x2684926, 0xc}, {0xc00cd5a000, 0x1ad})
	/remote-source/app/controllers/util/events.go:126 +0x185
github.com/ramendr/ramen/controllers.(*VRGInstance).UploadPVandPVCtoS3Stores(0xc002370500, 0xc00280c780, {{0x40d70f?, 0x0?}, 0xc00249c750?})
	/remote-source/app/controllers/vrg_volrep.go:621 +0x23b
github.com/ramendr/ramen/controllers.(*VRGInstance).uploadPVandPVCtoS3Stores(0xc002370500, 0xc00280c780, {{0x2bf7878?, 0xc00297bf80?}, 0x0?})
	/remote-source/app/controllers/vrg_volrep.go:541 +0x2b6
github.com/ramendr/ramen/controllers.(*VRGInstance).reconcileVolRepsAsPrimary(0xc002370500, 0xc0023f5950)
	/remote-source/app/controllers/vrg_volrep.go:77 +0x385
github.com/ramendr/ramen/controllers.(*VRGInstance).reconcileAsPrimary(0xc002370500)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:903 +0x70
github.com/ramendr/ramen/controllers.(*VRGInstance).processAsPrimary(0xc002370500)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:876 +0x392
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRGActions(0xc002370500)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:542 +0x24d
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRG(0xc002370500)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:515 +0x67a
github.com/ramendr/ramen/controllers.(*VolumeReplicationGroupReconciler).Reconcile(0xc0004bd490, {0x2bf2fb0?, 0xc002927500}, {{{0xc0019a1a46, 0x7}, {0xc00158f1a0, 0x18}}})
	/remote-source/app/controllers/volumereplicationgroup_controller.go:404 +0xb3a
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x2bf2fb0?, {0x2bf2fb0?, 0xc002927500?}, {{{0xc0019a1a46?, 0x21d5ea0?}, {0xc00158f1a0?, 0x0?}}})
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.5/pkg/internal/controller/controller.go:122 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00066ce60, {0x2bf2f08, 0xc00062c0c0}, {0x23b8760?, 0xc0014113a0?})
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.5/pkg/internal/controller/controller.go:323 +0x38f
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00066ce60, {0x2bf2f08, 0xc00062c0c0})
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.5/pkg/internal/controller/controller.go:274 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.5/pkg/internal/controller/controller.go:235 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.5/pkg/internal/controlle

The full ramen-dr-op pod log is attached.
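
For context on the linked fix (Github RamenDR ramen pull 890, "Add concurrent access protection to the event recorder"): the fatal error above is the Go runtime detecting an unsynchronized read and write of the same map, here inside controllers/util.ReportIfNotPresent while several VRG reconciles run in parallel. The sketch below is illustrative only and does not show the actual ramen code; the eventRecorder type, its seen map, and the reportIfNotPresent helper are hypothetical names used to show the unsafe pattern and the kind of mutex-based protection the PR title describes.

package util

import "sync"

// Hypothetical recorder that remembers which events were already reported,
// so an event is emitted only once per key.
type eventRecorder struct {
    mu   sync.Mutex          // the added protection: serializes access to seen
    seen map[string]struct{} // read and written by concurrent reconciles
}

func newEventRecorder() *eventRecorder {
    return &eventRecorder{seen: map[string]struct{}{}}
}

// reportIfNotPresent emits an event only if it has not been reported before.
// Without taking mu, two reconcile goroutines calling this at the same time
// can trigger "fatal error: concurrent map read and map write".
func (r *eventRecorder) reportIfNotPresent(key string, emit func()) {
    r.mu.Lock()
    defer r.mu.Unlock()

    if _, ok := r.seen[key]; ok { // map read
        return
    }
    r.seen[key] = struct{}{} // map write
    emit()
}

A sync.RWMutex or sync.Map would work just as well; the key point is that Go's built-in map is not safe for concurrent use, and when the runtime detects such a race it aborts the whole process, which is why the pod crash-loops instead of just logging an error.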

Version of all relevant components (if applicable):
ODF/MCO 4.13.0-199
ocp: 4.13.0-0.nightly-2023-05-10-112355
ACM: 2.7.3

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Failover does complete even though the ramen-dr pod crashes.

Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes, reproduced 3/3 times.

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
Yes. This issue was not seen in 4.12 builds and, AFAIK, not in prior 4.13 builds either.

Steps to Reproduce:
1. Deploy and configure an MDR cluster: c1, c2, and hub.
2. Create multiple Subscription- and ApplicationSet-based apps on c1 and c2.
3. Fence the c1 cluster.
4. Fail over all c1 apps to c2 (around 10 apps were failed over from c1 to c2).

Actual results:
After initiating failover, the ramen-dr-cluster-operator pod on c2 goes into CrashLoopBackOff.

Expected results:
After initiating failover, the ramen-dr-cluster-operator pod on c2 should not go into CrashLoopBackOff and restart.


Additional info:
1. Failover of apps to c2 does complete even though the ramen-dr pod crashes and restarts.
2. The ramen-dr-op pod on c1 or c2 does not crash during relocate.

Comment 14 errata-xmlrpc 2023-06-21 15:25:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742

