Bug 2055359

Summary: [DR] ramen-dr-cluster-operator pod go into CrashLoopBackOff state during failover or relocate operation
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Sidhant Agrawal <sagrawal>
Component: odf-drAssignee: Shyamsundar <srangana>
odf-dr sub component: ramen QA Contact: Sidhant Agrawal <sagrawal>
Status: CLOSED CURRENTRELEASE Docs Contact:
Severity: urgent    
Priority: unspecified CC: kramdoss, madam, mmuench, muagarwa, ocs-bugs, odf-bz-bot, prsurve, rperiyas, srangana
Version: 4.10Keywords: AutomationBackLog
Target Milestone: ODF Blockers Only (Development Freeze)   
Target Release: ODF 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.10.0-167 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-04-21 09:12:46 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Sidhant Agrawal 2022-02-16 18:38:09 UTC
Description of problem (please be detailed as possible and provide log
snippests):
After initiating failover or relocate operation, ramen-dr-cluster-operator pod goes into CrashLoopBackOff state and restarts ~6 times before coming to Running state.
Also observed the panic "panic: odd number of arguments passed as key-value pairs for logging" in the pod logs.

Output from one of the managed cluster: 
```
2022-02-16T18:22:18.943Z	DPANIC	controllers.VolumeReplicationGroup.vrginstance	controllers/volumereplicationgroup_controller.go:1671	odd number of arguments passed as key-value pairs for logging	{"VolumeReplicationGroup": "busybox-sagrawal-c1/busybox-drpc", "State": "secondary", "pvc": "busybox-sagrawal-c1/busybox-pvc-2", "ignored key": "secondary"}
github.com/ramendr/ramen/controllers.(*VRGInstance).updateVR
	/remote-source/app/controllers/volumereplicationgroup_controller.go:1671
github.com/ramendr/ramen/controllers.(*VRGInstance).createOrUpdateVR
	/remote-source/app/controllers/volumereplicationgroup_controller.go:1640
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRAsSecondary
	/remote-source/app/controllers/volumereplicationgroup_controller.go:1586
github.com/ramendr/ramen/controllers.(*VRGInstance).reconcileVRAsSecondary
	/remote-source/app/controllers/volumereplicationgroup_controller.go:1162
github.com/ramendr/ramen/controllers.(*VRGInstance).reconcileVRsAsSecondary
	/remote-source/app/controllers/volumereplicationgroup_controller.go:1128
github.com/ramendr/ramen/controllers.(*VRGInstance).handleVRGMode
	/remote-source/app/controllers/volumereplicationgroup_controller.go:983
github.com/ramendr/ramen/controllers.(*VRGInstance).processAsSecondary
	/remote-source/app/controllers/volumereplicationgroup_controller.go:1082
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRGActions
	/remote-source/app/controllers/volumereplicationgroup_controller.go:412
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRG
	/remote-source/app/controllers/volumereplicationgroup_controller.go:398
github.com/ramendr/ramen/controllers.(*VolumeReplicationGroupReconciler).Reconcile
	/remote-source/app/controllers/volumereplicationgroup_controller.go:312
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0-beta.4/pkg/internal/controller/controller.go:298
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0-beta.4/pkg/internal/controller/controller.go:253
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0-beta.4/pkg/internal/controller/controller.go:214
2022-02-16T18:22:19.032Z	INFO	controllers.VolumeReplicationGroup.vrginstance	zapcore/entry.go:234	Exiting processing VolumeReplicationGroup	{"VolumeReplicationGroup": "busybox-sagrawal-c1/busybox-drpc", "State": "secondary"}
2022-02-16T18:22:19.032Z	INFO	controllers.VolumeReplicationGroup	zapcore/entry.go:234	Exiting reconcile loop	{"VolumeReplicationGroup": "busybox-sagrawal-c1/busybox-drpc"}
panic: odd number of arguments passed as key-value pairs for logging

goroutine 565 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000af8480, 0xc00099c540, 0x1, 0x1)
	/remote-source/deps/gomod/pkg/mod/go.uber.org/zap.1/zapcore/entry.go:234 +0x532
go.uber.org/zap.(*Logger).DPanic(0xc000c9aae0, 0x1dc24e8, 0x3d, 0xc00099c540, 0x1, 0x1)
	/remote-source/deps/gomod/pkg/mod/go.uber.org/zap.1/logger.go:220 +0x85
github.com/go-logr/zapr.handleFields(0xc000c9aae0, 0xc000d17830, 0x3, 0x3, 0x0, 0x0, 0x0, 0x10, 0x1a5d760, 0x1)
	/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:100 +0x5e5
github.com/go-logr/zapr.(*zapLogger).Info(0xc00014aab0, 0x1d940f6, 0x29, 0xc000d17830, 0x3, 0x3)
	/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:127 +0xb0
github.com/ramendr/ramen/controllers.(*VRGInstance).updateVR(0xc000af8cc0, 0xc000db3180, 0x1d5da48, 0x9, 0x207dfc8, 0xc00014aab0, 0xd, 0x20989e0, 0xc000db3180)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:1671 +0x6e3
github.com/ramendr/ramen/controllers.(*VRGInstance).createOrUpdateVR(0xc000af8cc0, 0xc0005d53c8, 0x13, 0xc0002f1550, 0xd, 0x1d5da48, 0x9, 0x207dfc8, 0xc00014aab0, 0xc0009f9518, ...)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:1640 +0x957
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRAsSecondary(...)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:1586
github.com/ramendr/ramen/controllers.(*VRGInstance).reconcileVRAsSecondary(0xc000af8cc0, 0xc000517d98, 0x207dfc8, 0xc00014aab0, 0x2070000)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:1162 +0xd2
github.com/ramendr/ramen/controllers.(*VRGInstance).reconcileVRsAsSecondary(0xc000af8cc0, 0x0)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:1128 +0x26b
github.com/ramendr/ramen/controllers.(*VRGInstance).handleVRGMode(0xc000af8cc0, 0x1d5da48, 0x9, 0x0)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:983 +0x115
github.com/ramendr/ramen/controllers.(*VRGInstance).processAsSecondary(0xc000af8cc0, 0xc0005b4d00, 0x0, 0x0, 0x0)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:1082 +0x3fe
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRGActions(0xc000af8cc0, 0x0, 0x0, 0x431ae7, 0x585585585585)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:412 +0x1c5
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRG(0xc000af8cc0, 0xc000af8cf0, 0xc000c82d80, 0xc00061abe8, 0x13)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:398 +0x74e
github.com/ramendr/ramen/controllers.(*VolumeReplicationGroupReconciler).Reconcile(0xc000bb8300, 0x2073c78, 0xc000c82d80, 0xc00061abe8, 0x13, 0xc000b872d0, 0xc, 0xc000c82d00, 0x0, 0x0, ...)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:312 +0x710
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000b43ea0, 0x2073bd0, 0xc000d30000, 0x1b72b60, 0xc000d2a440)
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0-beta.4/pkg/internal/controller/controller.go:298 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000b43ea0, 0x2073bd0, 0xc000d30000, 0xc000594d00)
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0-beta.4/pkg/internal/controller/controller.go:253 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2(0xc0002f04b0, 0xc000b43ea0, 0x2073bd0, 0xc000d30000)
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0-beta.4/pkg/internal/controller/controller.go:214 +0x6b
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0-beta.4/pkg/internal/controller/controller.go:210 +0x425
```
	
Version of all relevant components (if applicable):
OCP: 4.10.0-0.nightly-2022-02-11-123954
ODF: 4.10.0-156

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Can this issue reproducible?
Yes

Can this issue reproduce from the UI?
Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Configure RDR setup with 1 ACM hub and 2 managed clusters
2. Deploy workload
3. Initiate failover or relocate action and observe the ramen-dr-cluster-operator pod where the workload was running


Actual results:
ramen-dr-cluster-operator pod goes into CrashLoopBackOff state due to panic during failover or relocate operations

Expected results:
ramen-dr-cluster-operator pod should not go into CrashLoopBackOff state during failover or relocate operations

Additional info:

Comment 8 Mudit Agarwal 2022-02-23 02:56:07 UTC
Please backport to 4.10