Created attachment 1988810 [details]
ramen-dr-op pod log

Description of problem (please be detailed as possible and provide log snippets):

On an MDR 4.12.7 cluster, during failover of apps from c1 to c2, the ramen-dr-cluster-operator pod on the c2 cluster crashes and restarts (loops):

fatal error: concurrent map read and map write

goroutine 623 [running]:
runtime.throw({0x1bfc894?, 0x4e8225?})
	/usr/lib/golang/src/runtime/panic.go:992 +0x71 fp=0xc000d4cf68 sp=0xc000d4cf38 pc=0x43d8d1
runtime.mapaccess1_faststr(0x1bd001e?, 0x8?, {0xc006494240, 0x24})
	/usr/lib/golang/src/runtime/map_faststr.go:22 +0x3a5 fp=0xc000d4cfd0 sp=0xc000d4cf68 pc=0x419125
github.com/ramendr/ramen/controllers/util.ReportIfNotPresent(0xc0005f15c0, {0x1f1cc60, 0xc001745b00}, {0x1bcee98, 0x7}, {0x1bdaba8, 0xf}, {0xc006396000, 0x10b})
	/remote-source/app/controllers/util/events.go:126 +0x185 fp=0xc000d4d090 sp=0xc000d4cfd0 pc=0x142fe65
github.com/ramendr/ramen/controllers.(*VRGInstance).vrgObjectProtect(0xc000d4da70, 0xc000d4d3d0, {0xc000ee6000, 0x2, 0xc005a92020?})
	/remote-source/app/controllers/vrg_kubeobjects.go:326 +0x351 fp=0xc000d4d368 sp=0xc000d4d090 pc=0x179d7b1
github.com/ramendr/ramen/controllers.(*VRGInstance).kubeObjectsProtect(0xc000a53a70?, 0xc000a533d0)
	/remote-source/app/controllers/vrg_kubeobjects.go:90 +0x91 fp=0xc000d4d3c0 sp=0xc000d4d368 pc=0x179aa71
github.com/ramendr/ramen/controllers.(*VRGInstance).reconcileAsPrimary(0xc000a53a70)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:822 +0x58 fp=0xc000d4d3f0 sp=0xc000d4d3c0 pc=0x17993f8
github.com/ramendr/ramen/controllers.(*VRGInstance).processAsPrimary(0xc000d4da70)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:795 +0x392 fp=0xc000d4d570 sp=0xc000d4d3f0 pc=0x17990f2
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRGActions(0xc000d15a70)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:466 +0x247 fp=0xc000d4d6c8 sp=0xc000d4d570 pc=0x1795b27
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRG(0xc000d15a70)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:439 +0x67b fp=0xc000d4d890 sp=0xc000d4d6c8 pc=0x17958bb
github.com/ramendr/ramen/controllers.(*VolumeReplicationGroupReconciler).Reconcile(0xc00075d730, {0x1f2ebf0?, 0xc001982cf0}, {{{0xc00114b306, 0x9}, {0xc0084c8300, 0x1a}}})
	/remote-source/app/controllers/volumereplicationgroup_controller.go:336 +0xab2 fp=0xc000d4dcd8 sp=0xc000d4d890 pc=0x1794e72
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1f2eb48?, {0x1f2ebf0?, 0xc001982cf0?}, {{{0xc00114b306?, 0x1aed1e0?}, {0xc0084c8300?, 0x4095f4?}}})
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/internal/controller/controller.go:121 +0xc8 fp=0xc000d4dd78 sp=0xc000d4dcd8 pc=0x13018a8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0002fcbe0, {0x1f2eb48, 0xc000798e80}, {0x19e8b40?, 0xc001904a20?})
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/internal/controller/controller.go:320 +0x33c fp=0xc000d4dee0 sp=0xc000d4dd78 pc=0x130399c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0002fcbe0, {0x1f2eb48, 0xc000798e80})
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/internal/controller/controller.go:273 +0x1d9 fp=0xc000d4df80 sp=0xc000d4dee0 pc=0x13031d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/internal/controller/controller.go:234 +0x85 fp=0xc000d4dfe0 sp=0xc000d4df80 pc=0x1302c25
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc000d4dfe8 sp=0xc000d4dfe0 pc=0x46d8a1
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/internal/controller/controller.go:230 +0x325

goroutine 1 [select]:
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).Start(0xc0008bd380, {0x1f2eb48, 0xc000844100})
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/manager/internal.go:500 +0x5e7
main.main()
	/remote-source/app/main.go:198 +0x1c5

The full ramen-dr-op pod log is attached.

Version of all relevant components (if applicable):
ODF/MCO: 4.12.7
OCP: 4.12.32
ACM: 2.7.7

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
The ramen-dr pod goes into CLBO (CrashLoopBackOff), but failover does complete even though the pod crashes.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Can this issue reproducible?
1/1

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:
Not sure. Noticed while preparing a 4.12.7 MDR cluster with apps in failed-over and relocated states before upgrading to 4.12.8 builds.

Steps to Reproduce:
1. Deploy and configure an MDR cluster: c1, c2, and hub.
2. Create multiple subscription apps on c1 and c2.
3. Fence the c1 cluster.
4. Fail over all c1 apps to c2 (failed over 5 apps from c1 to c2).

Actual results:
After initiating failover, the ramen-dr-cluster-operator pod on c2 goes into CLBO.

Expected results:
After initiating failover, the ramen-dr-cluster-operator pod on c2 should not go into CLBO.

Additional info:
Similar to https://bugzilla.redhat.com/show_bug.cgi?id=2208558
Failover of apps completed successfully from c1 to c2 even though the ramen-dr pod crashed. I could create new apps and assign DR policies while the ramen-dr pod was in CLBO.
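For context on the crash mode: the top of the trace shows util.ReportIfNotPresent reading a plain Go map (runtime.mapaccess1_faststr) while another reconcile goroutine writes to it, which the Go runtime treats as a fatal, unrecoverable error rather than a catchable panic. Below is a minimal illustrative sketch of the general fix pattern for this class of bug (guarding a shared read-then-write map with a mutex). The type and function names here are hypothetical and do not reflect the actual ramen code or the fix shipped in the errata.

```go
package main

import (
	"fmt"
	"sync"
)

// reporter sketches an event-deduplication map of the kind implied by
// ReportIfNotPresent. A plain map accessed from several reconcile
// goroutines without a lock triggers Go's
// "fatal error: concurrent map read and map write".
type reporter struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

// reportIfNotPresent records an event key only once. The mutex makes the
// read-then-write sequence atomic, so concurrent callers cannot race.
func (r *reporter) reportIfNotPresent(key string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.seen[key]; ok {
		return false // already reported
	}
	r.seen[key] = struct{}{}
	return true // first report for this key
}

func main() {
	r := &reporter{seen: map[string]struct{}{}}
	var wg sync.WaitGroup
	var firsts sync.Map // collects keys that were reported exactly once

	// Simulate several reconcilers racing to report the same small event set,
	// the situation that crashes an unsynchronized map.
	for g := 0; g < 8; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 10; j++ {
				key := fmt.Sprintf("event-%d", j)
				if r.reportIfNotPresent(key) {
					firsts.Store(key, true)
				}
			}
		}()
	}
	wg.Wait()

	count := 0
	firsts.Range(func(_, _ any) bool { count++; return true })
	fmt.Println("unique events reported:", count)
}
```

With the mutex in place the program runs to completion and each event is reported exactly once; removing the lock and running under `go run -race` (or simply under load) reproduces the same class of fatal runtime error seen in the attached log.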
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.12.8 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:5377