Created attachment 1965714 [details]
ramen-dr-op log

Description of problem (please be detailed as possible and provide log snippets):

On an MDR (Metro-DR) cluster, during failover of apps from c1 to c2, the ramen-dr-cluster-operator pod crashes and restarts in a loop.

2023-05-19T11:55:51.526Z INFO controllers.VolumeReplicationGroup.vrginstance controllers/volumereplicationgroup_controller.go:604 VRG's ClusterDataReady condition found. PV restore must have already been applied {"VolumeReplicationGroup": "job-sub-2/job-sub-2-placement-1-drpc", "rid": "d70b3bce-7b18-41b2-86d9-07f7f24a9005", "State": "primary"}

fatal error: concurrent map read and map write

goroutine 669 [running]:
github.com/ramendr/ramen/controllers/util.ReportIfNotPresent(0xc0004c2c40, {0x2bdacb0, 0xc00125cb40}, {0x2679dcc, 0x7}, {0x2684926, 0xc}, {0xc00cd5a000, 0x1ad})
	/remote-source/app/controllers/util/events.go:126 +0x185
github.com/ramendr/ramen/controllers.(*VRGInstance).UploadPVandPVCtoS3Stores(0xc002370500, 0xc00280c780, {{0x40d70f?, 0x0?}, 0xc00249c750?})
	/remote-source/app/controllers/vrg_volrep.go:621 +0x23b
github.com/ramendr/ramen/controllers.(*VRGInstance).uploadPVandPVCtoS3Stores(0xc002370500, 0xc00280c780, {{0x2bf7878?, 0xc00297bf80?}, 0x0?})
	/remote-source/app/controllers/vrg_volrep.go:541 +0x2b6
github.com/ramendr/ramen/controllers.(*VRGInstance).reconcileVolRepsAsPrimary(0xc002370500, 0xc0023f5950)
	/remote-source/app/controllers/vrg_volrep.go:77 +0x385
github.com/ramendr/ramen/controllers.(*VRGInstance).reconcileAsPrimary(0xc002370500)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:903 +0x70
github.com/ramendr/ramen/controllers.(*VRGInstance).processAsPrimary(0xc002370500)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:876 +0x392
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRGActions(0xc002370500)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:542 +0x24d
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRG(0xc002370500)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:515 +0x67a
github.com/ramendr/ramen/controllers.(*VolumeReplicationGroupReconciler).Reconcile(0xc0004bd490, {0x2bf2fb0?, 0xc002927500}, {{{0xc0019a1a46, 0x7}, {0xc00158f1a0, 0x18}}})
	/remote-source/app/controllers/volumereplicationgroup_controller.go:404 +0xb3a
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x2bf2fb0?, {0x2bf2fb0?, 0xc002927500?}, {{{0xc0019a1a46?, 0x21d5ea0?}, {0xc00158f1a0?, 0x0?}}})
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.5/pkg/internal/controller/controller.go:122 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00066ce60, {0x2bf2f08, 0xc00062c0c0}, {0x23b8760?, 0xc0014113a0?})
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.5/pkg/internal/controller/controller.go:323 +0x38f
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00066ce60, {0x2bf2f08, 0xc00062c0c0})
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.5/pkg/internal/controller/controller.go:274 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.5/pkg/internal/controller/controller.go:235 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.5/pkg/internal/controlle

The ramen-dr-op pod's full log is attached.
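The trace points at an unsynchronized map inside ReportIfNotPresent being read by one reconcile goroutine while another writes it; controller-runtime runs multiple VRG reconciles concurrently, so a plain map shared between them needs locking. Below is a minimal sketch of the crash pattern and one conventional fix, assuming the reporter keeps a plain map of already-emitted events. This is not ramen's actual code; the type and field names are hypothetical.

package main

import (
	"fmt"
	"sync"
)

// racyReporter reproduces the crash pattern: a plain map with no locking.
// Calling ReportIfNotPresent from several goroutines can trigger Go's
// "fatal error: concurrent map read and map write".
type racyReporter struct {
	seen map[string]bool // hypothetical event de-duplication map
}

func (r *racyReporter) ReportIfNotPresent(key string) {
	if !r.seen[key] { // concurrent read ...
		r.seen[key] = true // ... races with this write
	}
}

// safeReporter guards the map with a mutex, one conventional fix
// (sync.Map or funneling updates through a single goroutine also work).
type safeReporter struct {
	mu   sync.Mutex
	seen map[string]bool
}

func (r *safeReporter) ReportIfNotPresent(key string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.seen[key] {
		r.seen[key] = true
	}
}

func main() {
	r := &safeReporter{seen: map[string]bool{}}
	var wg sync.WaitGroup
	// Simulate many concurrent VRG reconciles reporting overlapping events.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			r.ReportIfNotPresent(fmt.Sprintf("event-%d", i%10))
		}(i)
	}
	wg.Wait()
	fmt.Println("reported", len(r.seen), "distinct events")
}

Races like this are timing-dependent in production but deterministic under Go's race detector: a unit test that exercises the unsynchronized reporter from multiple goroutines and is run with "go test -race" flags the access immediately, instead of the operator crashing under failover load.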
Version of all relevant components (if applicable):
ODF/MCO: 4.13.0-199
OCP: 4.13.0-0.nightly-2023-05-10-112355
ACM: 2.7.3

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Failover does get completed even though the ramen-dr pod crashes.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Can this issue be reproduced?
Yes, 3/3

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:
Yes. Did not see this issue in 4.12 builds and, AFAIK, not in prior 4.13 builds either.

Steps to Reproduce:
1. Deploy and configure an MDR cluster: c1, c2, and hub.
2. Create multiple Subscription- and ApplicationSet-based apps on c1 and c2.
3. Fence the c1 cluster.
4. Failover all c1 apps to c2 (around 10 apps were failed over from c1 to c2; see the sketch after this section).

Actual results:
After initiating failover, the ramen-dr-cluster-operator pod on c2 goes into CrashLoopBackOff.

Expected results:
After initiating failover, the ramen-dr-cluster-operator pod on c2 should not go into CrashLoopBackOff and restart.

Additional info:
1. Failover of apps to c2 does get completed even though its ramen-dr pod crashes and restarts.
2. Also, the ramen-dr-op pod on c1 or c2 does not crash during relocate.
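For reference, the per-app failover in step 4 amounts to setting the DRPlacementControl action on the hub. The sketch below uses the dynamic client so no ramen types are required; the DRPC name and namespace are taken from the log snippet above, and the spec.action / spec.failoverCluster field names follow my understanding of the Ramen DRPC API, so treat them as assumptions rather than the exact commands used in this run.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Kubeconfig for the hub cluster, where DRPlacementControl resources live.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	drpcGVR := schema.GroupVersionResource{
		Group:    "ramendr.openshift.io",
		Version:  "v1alpha1",
		Resource: "drplacementcontrols",
	}

	// Merge-patch the DRPC: action=Failover, failoverCluster=c2.
	patch := []byte(`{"spec":{"action":"Failover","failoverCluster":"c2"}}`)
	_, err = client.Resource(drpcGVR).
		Namespace("job-sub-2"). // app namespace from the log snippet
		Patch(context.TODO(), "job-sub-2-placement-1-drpc",
			types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("failover to c2 initiated")
}

Repeating this patch for each DRPC (about 10 here) is what drives many VRG reconciles on c2 concurrently, which is the load that exposed the map race.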
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742