Description
Sidhant Agrawal, 2022-10-06 08:36:24 UTC
Description of problem (please be as detailed as possible and provide log snippets):
In an RDR setup, the ramen-dr-cluster-operator pod on the managed clusters goes into CrashLoopBackOff state and restarts continuously. The logs are also filled with this error:
```
ERROR controller-runtime.source source/source.go:139 if kind is a CRD, it should be installed before calling Start {"kind": "Backup.velero.io", "error": "no matches for kind \"Backup\" in version \"velero.io/v1\""}
```
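This error comes from controller-runtime's source.Kind start path: a watch was registered for Backup.velero.io (and Restore.velero.io), but the managed cluster's API server does not serve the velero.io/v1 group, so the RESTMapper reports "no matches for kind". Below is a minimal, illustrative Go sketch (not the actual Ramen code) of how a client can confirm whether velero.io/v1 is served at all, which an operator could use to decide whether to set up those watches; the kubeconfig loading path is an assumption for running it from a workstation.
```go
// Minimal sketch, assuming a kubeconfig at the default location: check
// whether the velero.io/v1 group reported in the error is served by the
// API server before an operator sets up watches on Backup/Restore.
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig the same way kubectl/oc does (path is an assumption;
	// an in-cluster operator would use rest.InClusterConfig instead).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// If velero.io/v1 is not served, this returns an error, which
	// corresponds to the RESTMapper's "no matches for kind" failures
	// seen in the pod log.
	resources, err := dc.ServerResourcesForGroupVersion("velero.io/v1")
	if err != nil {
		fmt.Println("velero.io/v1 not served:", err)
		return
	}
	for _, r := range resources.APIResources {
		fmt.Println("served:", r.Kind)
	}
}
```
On a managed cluster in this state, such a check would fail, matching the repeated errors in the log below.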
Output from the pod logs on one of the managed clusters:
```
...
2022-10-06T08:26:11.829Z ERROR controller-runtime.source source/source.go:139 if kind is a CRD, it should be installed before calling Start {"kind": "Restore.velero.io", "error": "no matches for kind \"Restore\" in version \"velero.io/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/source/source.go:139
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/wait.go:235
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/wait.go:662
k8s.io/apimachinery/pkg/util/wait.poll
/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/wait.go:596
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/wait.go:547
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/source/source.go:132
2022-10-06T08:26:16.879Z ERROR controller-runtime.source source/source.go:139 if kind is a CRD, it should be installed before calling Start {"kind": "Backup.velero.io", "error": "no matches for kind \"Backup\" in version \"velero.io/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/source/source.go:139
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/wait.go:235
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/wait.go:662
k8s.io/apimachinery/pkg/util/wait.poll
/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/wait.go:596
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/wait.go:547
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/source/source.go:132
I1006 08:26:18.524896 1 request.go:601] Waited for 1.250961075s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/ramendr.openshift.io/v1alpha1?timeout=32s
2022-10-06T08:26:22.021Z ERROR controller-runtime.source source/source.go:139 if kind is a CRD, it should be installed before calling Start {"kind": "Restore.velero.io", "error": "no matches for kind \"Restore\" in version \"velero.io/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/source/source.go:139
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/wait.go:235
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/wait.go:662
k8s.io/apimachinery/pkg/util/wait.poll
/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/wait.go:596
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/wait.go:547
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/source/source.go:132
2022-10-06T08:26:22.426Z ERROR controller/controller.go:210 Could not wait for Cache to sync {"controller": "volumereplicationgroup", "controllerGroup": "ramendr.openshift.io", "controllerKind": "VolumeReplicationGroup", "error": "failed to wait for volumereplicationgroup caches to sync: timed out waiting for cache to be synced"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1
/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/internal/controller/controller.go:210
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/internal/controller/controller.go:215
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/internal/controller/controller.go:241
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/manager/runnable_group.go:219
2022-10-06T08:26:22.426Z INFO manager/internal.go:567 Stopping and waiting for non leader election runnables
2022-10-06T08:26:22.426Z INFO manager/internal.go:571 Stopping and waiting for leader election runnables
2022-10-06T08:26:22.426Z INFO controller/controller.go:247 Shutdown signal received, waiting for all workers to finish {"controller": "protectedvolumereplicationgrouplist", "controllerGroup": "ramendr.openshift.io", "controllerKind": "ProtectedVolumeReplicationGroupList"}
2022-10-06T08:26:22.426Z INFO controller/controller.go:249 All workers finished {"controller": "protectedvolumereplicationgrouplist", "controllerGroup": "ramendr.openshift.io", "controllerKind": "ProtectedVolumeReplicationGroupList"}
2022-10-06T08:26:22.426Z INFO manager/internal.go:577 Stopping and waiting for caches
2022-10-06T08:26:22.426Z INFO manager/internal.go:581 Stopping and waiting for webhooks
2022-10-06T08:26:22.427Z INFO manager/internal.go:585 Wait completed, proceeding to shutdown the manager
2022-10-06T08:26:22.427Z ERROR setup app/main.go:210 problem running manager {"error": "failed to wait for volumereplicationgroup caches to sync: timed out waiting for cache to be synced"}
main.main
/remote-source/app/main.go:210
runtime.main
/usr/lib/golang/src/runtime/proc.go:250
```
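The tail of the log shows how this turns into a CrashLoopBackOff: the volumereplicationgroup controller cannot finish syncing its caches because the Velero watches never start, mgr.Start returns an error, and the process exits non-zero, so the kubelet keeps restarting the container. As a minimal sketch, assuming the standard kubebuilder/controller-runtime scaffolding rather than Ramen's actual main.go, the flow looks like this:
```go
// Sketch of a typical controller-runtime main loop (standard kubebuilder
// scaffolding, not the literal Ramen main.go): when a controller's cache
// cannot sync, mgr.Start returns an error, main exits non-zero, and the
// kubelet restarts the container, which surfaces as CrashLoopBackOff.
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	ctrl.SetLogger(zap.New())
	setupLog := ctrl.Log.WithName("setup")

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		setupLog.Error(err, "unable to start manager")
		os.Exit(1)
	}

	// Controllers and their watches would be registered here.

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		// Corresponds to the "problem running manager" error at the
		// end of the captured log, followed by a non-zero exit.
		setupLog.Error(err, "problem running manager")
		os.Exit(1)
	}
}
```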
Version of all relevant components (if applicable):
OCP: 4.12.0-0.nightly-2022-09-28-204419
ODF: 4.12.0-70
ACM: 2.6.1
Submariner: 0.13.0
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Is there any workaround available to the best of your knowledge?
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2
Is this issue reproducible?
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Configure an RDR setup with 1 ACM hub and 2 managed clusters
2. Check the status of the ramen-dr-cluster-operator pod on the managed clusters (a client-go sketch of this check follows these steps)
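For step 2, the status can be checked with oc/kubectl or programmatically; below is a hedged client-go sketch that lists pods and their restart counts. The namespace name openshift-dr-system is an assumption about where the DR cluster operator runs; adjust it for the actual deployment.
```go
// Hedged sketch for step 2: list pods in the namespace where the DR cluster
// operator is assumed to run and print phase and restart counts.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// "openshift-dr-system" is an assumed namespace for this sketch.
	pods, err := clientset.CoreV1().Pods("openshift-dr-system").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		restarts := int32(0)
		for _, st := range p.Status.ContainerStatuses {
			restarts += st.RestartCount
		}
		fmt.Printf("%s phase=%s restarts=%d\n", p.Name, p.Status.Phase, restarts)
	}
}
```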
Actual results:
ramen-dr-cluster-operator pod goes into CrashLoopBackOff state
Expected results:
ramen-dr-cluster-operator pod should not go into CrashLoopBackOff state
Additional info: