Bug 2276344 - [RDR][MDR] [Discovered Apps] ramen-dr-cluster-operator pod in CrashLoopBackOff state
Summary: [RDR][MDR] [Discovered Apps] ramen-dr-cluster-operator pod in CrashLoopBackOff state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.16
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: Raghavendra Talur
QA Contact: Sidhant Agrawal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-04-22 06:55 UTC by Sidhant Agrawal
Modified: 2024-07-17 13:20 UTC
CC: 4 users

Fixed In Version: 4.16.0-94
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-07-17 13:20:08 UTC
Embargoed:


Links:
GitHub red-hat-storage/recipe pull 9 (Merged): config: add downstream metadata - Last Updated 2024-05-06 08:23:32 UTC
Red Hat Product Errata RHSA-2024:4591 - Last Updated 2024-07-17 13:20:10 UTC

Description Sidhant Agrawal 2024-04-22 06:55:39 UTC
Description of problem (please be as detailed as possible and provide log snippets):
In an RDR setup, after creating a DRPolicy on the hub cluster, the ramen-dr-cluster-operator pod on the managed clusters goes into CrashLoopBackOff state because the Recipe and Velero CRDs are missing.

$ oc get pod -n openshift-dr-system
NAME                                         READY   STATUS             RESTARTS      AGE
ramen-dr-cluster-operator-59d8dd9fd4-qnv84   1/2     CrashLoopBackOff   9 (12s ago)   44m


Error messages from pod logs:
```
2024-04-18T16:25:10.824Z	ERROR	controller-runtime.source.EventHandler	source/kind.go:68	failed to get informer from cache	{"error": "failed to get API group resources: unable to retrieve the complete list of server APIs: velero.io/v1: the server could not find the requested resource"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
	/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/loop.go:53
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
	/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/loop.go:54
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
	/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/source/kind.go:56
2024-04-18T16:25:10.826Z	ERROR	controller-runtime.source.EventHandler	source/kind.go:63	if kind is a CRD, it should be installed before calling Start	{"kind": "Recipe.ramendr.openshift.io", "error": "no matches for kind \"Recipe\" in version \"ramendr.openshift.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
	/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/loop.go:53
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
	/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/loop.go:54
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
	/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/source/kind.go:56
2024-04-18T16:25:10.827Z	ERROR	controller-runtime.source.EventHandler	source/kind.go:68	failed to get informer from cache	{"error": "failed to get API group resources: unable to retrieve the complete list of server APIs: velero.io/v1: the server could not find the requested resource"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
	/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/loop.go:53
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
	/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/loop.go:54
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
	/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/source/kind.go:56
2024-04-18T16:25:10.936Z	INFO	pvcmap.VolumeReplicationGroup	controllers/volumereplicationgroup_controller.go:178	Create event for PersistentVolumeClaim
2024-04-18T16:25:13.021Z	INFO	configmap.VolumeReplicationGroup	controllers/volumereplicationgroup_controller.go:144	Update in ramen-dr-cluster-operator-config configuration map
2024-04-18T16:25:13.037Z	INFO	controller/controller.go:220	Starting workers	{"controller": "protectedvolumereplicationgrouplist", "controllerGroup": "ramendr.openshift.io", "controllerKind": "ProtectedVolumeReplicationGroupList", "worker count": 1}
2024-04-18T16:25:13.038Z	INFO	pvcmap.VolumeReplicationGroup	controllers/volumereplicationgroup_controller.go:178	Create event for PersistentVolumeClaim
2024-04-18T16:25:13.038Z	INFO	pvcmap.VolumeReplicationGroup	controllers/volumereplicationgroup_controller.go:178	Create event for PersistentVolumeClaim
2024-04-18T16:25:13.038Z	INFO	pvcmap.VolumeReplicationGroup	controllers/volumereplicationgroup_controller.go:178	Create event for PersistentVolumeClaim
2024-04-18T16:25:13.038Z	INFO	pvcmap.VolumeReplicationGroup	controllers/volumereplicationgroup_controller.go:178	Create event for PersistentVolumeClaim
2024-04-18T16:25:13.039Z	INFO	pvcmap.VolumeReplicationGroup	controllers/volumereplicationgroup_controller.go:178	Create event for PersistentVolumeClaim
2024-04-18T16:25:13.039Z	INFO	pvcmap.VolumeReplicationGroup	controllers/volumereplicationgroup_controller.go:178	Create event for PersistentVolumeClaim
2024-04-18T16:25:20.821Z	ERROR	controller-runtime.source.EventHandler	source/kind.go:63	if kind is a CRD, it should be installed before calling Start	{"kind": "Recipe.ramendr.openshift.io", "error": "no matches for kind \"Recipe\" in version \"ramendr.openshift.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
	/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
	/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
	/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/source/kind.go:56
```
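
The two kinds named in the errors above can be checked directly on the affected managed cluster. A minimal diagnostic sketch (not part of the original report) follows; the Recipe CRD name is taken from the error message and the API group from the standard velero.io/v1 resources:
```
# Is the Recipe CRD from the ramendr.openshift.io group installed?
oc get crd recipes.ramendr.openshift.io

# Are any velero.io APIs served on this cluster?
oc api-resources --api-group=velero.io

# Quick overall check for both CRD families
oc get crd | grep -E 'ramendr|velero'
```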


Version of all relevant components (if applicable):
OCP: 4.16.0-0.nightly-2024-04-16-195622
ODF: 4.16.0-79.stable

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Configure an RDR setup with 1 ACM hub and 2 managed clusters
2. Install MCO on the hub cluster and then create a DRPolicy (a minimal sketch follows these steps)
3. Observe the ramen-dr-cluster-operator pod status on the managed clusters
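
For reference, step 2 amounts to creating a DRPolicy on the hub cluster. The sketch below only illustrates that step and is not the exact policy from this setup: the policy name, cluster names, and the 5m interval are placeholders.
```
# Hypothetical DRPolicy; names and interval are placeholders for illustration.
cat <<EOF | oc apply -f -
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPolicy
metadata:
  name: example-drpolicy
spec:
  drClusters:
    - managed-cluster-1
    - managed-cluster-2
  schedulingInterval: 5m   # async replication interval (RDR); typically empty for MDR/sync
EOF

# Step 3: watch the operator pod on each managed cluster
oc get pod -n openshift-dr-system -w
```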


Actual results:
ramen-dr-cluster-operator pod goes into CrashLoopBackOff state

Expected results:
ramen-dr-cluster-operator pod should not go into CrashLoopBackOff state


Additional info:

Comment 4 avdhoot 2024-04-25 09:11:31 UTC
Facing a similar issue with MDR as well.

OCP: 4.16
ODF: 4.16.0-85

➜  clust1 oc get pod -n openshift-dr-system
NAME                                         READY   STATUS             RESTARTS        AGE
ramen-dr-cluster-operator-759cc88f66-4l6n2   1/2     CrashLoopBackOff   117 (65s ago)   14h

Error messages from pod logs:

```
/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/source/kind.go:56
2024-04-25T08:08:48.407Z	ERROR	controller-runtime.source.EventHandler	source/kind.go:68	failed to get informer from cache	{"error": "failed to get API group resources: unable to retrieve the complete list of server APIs: velero.io/v1: the server could not find the requested resource"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
	/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
	/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
	/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.0/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/source/kind.go:56
2024-04-25T08:08:52.581Z	ERROR	controller/controller.go:203	Could not wait for Cache to sync	{"controller": "volumereplicationgroup", "controllerGroup": "ramendr.openshift.io", "controllerKind": "VolumeReplicationGroup", "error": "failed to wait for volumereplicationgroup caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.Recipe"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:203
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:208
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:234
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/manager/runnable_group.go:223
2024-04-25T08:08:52.581Z	INFO	manager/internal.go:516	Stopping and waiting for non leader election runnables
2024-04-25T08:08:52.581Z	INFO	manager/internal.go:520	Stopping and waiting for leader election runnables
2024-04-25T08:08:52.581Z	INFO	controller/controller.go:240	Shutdown signal received, waiting for all workers to finish	{"controller": "protectedvolumereplicationgrouplist", "controllerGroup": "ramendr.openshift.io", "controllerKind": "ProtectedVolumeReplicationGroupList"}
2024-04-25T08:08:52.581Z	INFO	controller/controller.go:242	All workers finished	{"controller": "protectedvolumereplicationgrouplist", "controllerGroup": "ramendr.openshift.io", "controllerKind": "ProtectedVolumeReplicationGroupList"}
2024-04-25T08:08:52.581Z	INFO	manager/internal.go:526	Stopping and waiting for caches
2024-04-25T08:08:52.581Z	INFO	manager/internal.go:530	Stopping and waiting for webhooks
2024-04-25T08:08:52.581Z	INFO	manager/internal.go:533	Stopping and waiting for HTTP servers
2024-04-25T08:08:52.581Z	INFO	controller-runtime.metrics	server/server.go:231	Shutting down metrics server with timeout of 1 minute
2024-04-25T08:08:52.581Z	INFO	manager/server.go:43	shutting down server	{"kind": "health probe", "addr": "[::]:8081"}
2024-04-25T08:08:52.582Z	INFO	manager/internal.go:537	Wait completed, proceeding to shutdown the manager
2024-04-25T08:08:52.582Z	ERROR	setup	app/main.go:247	problem running manager	{"error": "failed to wait for volumereplicationgroup caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.Recipe"}
main.main
	/remote-source/app/main.go:247
runtime.main
	/usr/lib/golang/src/runtime/proc.go:267
```
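
Since both RDR and MDR clusters hit the same missing-CRD errors, one way to see whether the installed bundle is expected to ship those CRDs is to list the CRDs owned by the CSVs in the DR namespace. This is an illustrative check, not a step from the report, and the namespace choice is an assumption:
```
# List each CSV in the DR namespace together with the CRDs it claims to own
oc get csv -n openshift-dr-system \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.customresourcedefinitions.owned[*].name}{"\n"}{end}'
```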

Comment 12 errata-xmlrpc 2024-07-17 13:20:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

