Bug 2238922 - [4.12][MDR] ramen-dr-cluster-operator pod crashes during failover
Summary: [4.12][MDR] ramen-dr-cluster-operator pod crashes during failover
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.12.8
Assignee: rakesh-gm
QA Contact: Parikshith
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-09-14 10:49 UTC by Parikshith
Modified: 2023-09-27 14:49 UTC
CC List: 4 users

Fixed In Version: 4.12.8-2
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-27 14:48:44 UTC
Embargoed:




Links:
- GitHub: red-hat-storage/ramen pull 140 (Merged) - Bug 2238922: Add concurrent access protection to the event recorder (last updated 2023-09-14 14:21:11 UTC)
- Red Hat Product Errata: RHBA-2023:5377 (last updated 2023-09-27 14:49:06 UTC)

Description Parikshith 2023-09-14 10:49:18 UTC
Created attachment 1988810 [details]
ramen-dr-op pod log

Description of problem (please be as detailed as possible and provide log
snippets):
On an MDR 4.12.7 cluster, during failover of apps from c1 to c2, the ramen-dr-cluster-operator pod on the c2 cluster crashes and restarts in a loop:

fatal error: concurrent map read and map write

goroutine 623 [running]:
runtime.throw({0x1bfc894?, 0x4e8225?})
	/usr/lib/golang/src/runtime/panic.go:992 +0x71 fp=0xc000d4cf68 sp=0xc000d4cf38 pc=0x43d8d1
runtime.mapaccess1_faststr(0x1bd001e?, 0x8?, {0xc006494240, 0x24})
	/usr/lib/golang/src/runtime/map_faststr.go:22 +0x3a5 fp=0xc000d4cfd0 sp=0xc000d4cf68 pc=0x419125
github.com/ramendr/ramen/controllers/util.ReportIfNotPresent(0xc0005f15c0, {0x1f1cc60, 0xc001745b00}, {0x1bcee98, 0x7}, {0x1bdaba8, 0xf}, {0xc006396000, 0x10b})
	/remote-source/app/controllers/util/events.go:126 +0x185 fp=0xc000d4d090 sp=0xc000d4cfd0 pc=0x142fe65
github.com/ramendr/ramen/controllers.(*VRGInstance).vrgObjectProtect(0xc000d4da70, 0xc000d4d3d0, {0xc000ee6000, 0x2, 0xc005a92020?})
	/remote-source/app/controllers/vrg_kubeobjects.go:326 +0x351 fp=0xc000d4d368 sp=0xc000d4d090 pc=0x179d7b1
github.com/ramendr/ramen/controllers.(*VRGInstance).kubeObjectsProtect(0xc000a53a70?, 0xc000a533d0)
	/remote-source/app/controllers/vrg_kubeobjects.go:90 +0x91 fp=0xc000d4d3c0 sp=0xc000d4d368 pc=0x179aa71
github.com/ramendr/ramen/controllers.(*VRGInstance).reconcileAsPrimary(0xc000a53a70)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:822 +0x58 fp=0xc000d4d3f0 sp=0xc000d4d3c0 pc=0x17993f8
github.com/ramendr/ramen/controllers.(*VRGInstance).processAsPrimary(0xc000d4da70)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:795 +0x392 fp=0xc000d4d570 sp=0xc000d4d3f0 pc=0x17990f2
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRGActions(0xc000d15a70)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:466 +0x247 fp=0xc000d4d6c8 sp=0xc000d4d570 pc=0x1795b27
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRG(0xc000d15a70)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:439 +0x67b fp=0xc000d4d890 sp=0xc000d4d6c8 pc=0x17958bb
github.com/ramendr/ramen/controllers.(*VolumeReplicationGroupReconciler).Reconcile(0xc00075d730, {0x1f2ebf0?, 0xc001982cf0}, {{{0xc00114b306, 0x9}, {0xc0084c8300, 0x1a}}})
	/remote-source/app/controllers/volumereplicationgroup_controller.go:336 +0xab2 fp=0xc000d4dcd8 sp=0xc000d4d890 pc=0x1794e72
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1f2eb48?, {0x1f2ebf0?, 0xc001982cf0?}, {{{0xc00114b306?, 0x1aed1e0?}, {0xc0084c8300?, 0x4095f4?}}})
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/internal/controller/controller.go:121 +0xc8 fp=0xc000d4dd78 sp=0xc000d4dcd8 pc=0x13018a8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0002fcbe0, {0x1f2eb48, 0xc000798e80}, {0x19e8b40?, 0xc001904a20?})
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/internal/controller/controller.go:320 +0x33c fp=0xc000d4dee0 sp=0xc000d4dd78 pc=0x130399c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0002fcbe0, {0x1f2eb48, 0xc000798e80})
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/internal/controller/controller.go:273 +0x1d9 fp=0xc000d4df80 sp=0xc000d4dee0 pc=0x13031d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/internal/controller/controller.go:234 +0x85 fp=0xc000d4dfe0 sp=0xc000d4df80 pc=0x1302c25
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc000d4dfe8 sp=0xc000d4dfe0 pc=0x46d8a1
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/internal/controller/controller.go:230 +0x325

goroutine 1 [select]:
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).Start(0xc0008bd380, {0x1f2eb48, 0xc000844100})
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/manager/internal.go:500 +0x5e7
main.main()
	/remote-source/app/main.go:198 +0x1c5


The full ramen-dr-op pod log is attached.
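For context, "fatal error: concurrent map read and map write" is the Go runtime aborting the process because a plain map was accessed from multiple goroutines without synchronization; unlike a panic, it cannot be caught with recover(), which is why the operator keeps crash-looping. The sketch below is a minimal, hypothetical reproducer of that failure class only; the type and function names are illustrative and are not the actual ramen code in controllers/util/events.go.

package main

import "fmt"

// naiveRecorder is a hypothetical sketch, NOT the ramen implementation: an
// event deduplication cache backed by a plain map with no synchronization.
type naiveRecorder struct {
	reported map[string]struct{}
}

// reportIfNotPresent records an event key only once. The map read and the
// map write here are unsynchronized, so calling this from concurrent
// reconcile workers is unsafe.
func (r *naiveRecorder) reportIfNotPresent(key string) bool {
	if _, ok := r.reported[key]; ok { // unsynchronized read
		return false
	}
	r.reported[key] = struct{}{} // unsynchronized write
	return true
}

func main() {
	r := &naiveRecorder{reported: map[string]struct{}{}}

	// Writer goroutine: keeps inserting fresh keys, like a worker reporting
	// new events.
	go func() {
		for n := 0; ; n++ {
			r.reportIfNotPresent(fmt.Sprintf("event-%d", n))
		}
	}()

	// Reader loop on the main goroutine: another worker re-checking a key
	// that is already present. The runtime detects the race and aborts with
	// "fatal error: concurrent map read and map write".
	for {
		r.reportIfNotPresent("event-0")
	}
}

Running a reproducer like this under "go run -race" also reports the underlying data race directly.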

Version of all relevant components (if applicable):
ODF/MCO 4.12.7
ocp: 4.12.32
ACM: 2.7.7

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
The ramen-dr pod goes into CrashLoopBackOff (CLBO), but failover still completes even though the pod crashes.

Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
1/1

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
Not sure. Noticed while preparing a 4.12.7 MDR cluster with apps in failed-over and relocated states before upgrading to 4.12.8 builds.
 
Steps to Reproduce:
1. Deploy and configure an MDR cluster: c1, c2, and hub.
2. Create multiple subscription apps on c1 and c2.
3. Fence the c1 cluster.
4. Fail over all c1 apps to c2 (5 apps were failed over from c1 to c2).

Actual results:
After initiating failover, the ramen-dr-cluster-operator pod on c2 goes into CrashLoopBackOff (CLBO).

Expected results:
After initiating failover, the ramen-dr-cluster-operator pod on c2 should not go into CrashLoopBackOff.


Additional info:
Similar to https://bugzilla.redhat.com/show_bug.cgi?id=2208558
Failover of the apps completed successfully from c1 to c2 even though the ramen-dr pod crashed.
I could create new apps and assign policies while the ramen-dr pod was in CrashLoopBackOff.
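For reference, the linked fix (red-hat-storage/ramen pull 140, "Add concurrent access protection to the event recorder") serializes access to the recorder's shared state. The snippet below is only an illustrative sketch of that general approach, a mutex held across the check-and-insert on the dedup map; it is not the actual patch, and the type and field names are assumptions.

package main

import (
	"fmt"
	"sync"
)

// guardedRecorder is an illustrative sketch (not the actual ramen patch) of
// an event deduplication cache whose shared map is protected by a mutex,
// the general shape of the "concurrent access protection" named in the
// linked PR title. All names here are hypothetical.
type guardedRecorder struct {
	mu       sync.Mutex
	reported map[string]struct{}
}

// reportIfNotPresent returns true only the first time a key is seen.
// Holding the mutex across both the read and the write makes the
// check-then-insert atomic, so concurrent reconcile workers can no longer
// trigger the runtime's "concurrent map read and map write" abort.
func (r *guardedRecorder) reportIfNotPresent(key string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()

	if _, ok := r.reported[key]; ok {
		return false
	}
	r.reported[key] = struct{}{}
	return true
}

func main() {
	r := &guardedRecorder{reported: map[string]struct{}{}}

	var wg sync.WaitGroup
	for w := 0; w < 4; w++ { // four workers racing on the same keys
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := 0; n < 1000; n++ {
				r.reportIfNotPresent(fmt.Sprintf("event-%d", n%10))
			}
		}()
	}
	wg.Wait()
	fmt.Println("all workers done, no concurrent map fault")
}

A sync.RWMutex or sync.Map would also work; the key point is that the read and the conditional write happen under a single lock.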

Comment 14 errata-xmlrpc 2023-09-27 14:48:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.12.8 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:5377

