Issue 2

Description of the problem:
Currently, a number of ocs-consumers are failing. The trigger is the ocs-operator pod getting OOMKilled because it demands more memory than the memory limits allocated to it. The root cause is that ocs-operator relies on the default "AllNamespaces" cache in controller-runtime, which syncs all Kubernetes resources into the cache when the operator starts running for the first time. This initial informer cache sync is so large that it causes a sudden massive spike in the operator's memory usage, and the spike grows with the number of resources present in the underlying Kubernetes/OpenShift cluster. The memory limits configured for ocs-operator usually absorb this spike, but only by a narrow margin; in some situations the spike exceeds the limits and the ocs-operator pod is OOMKilled.

Reproducing the issue:
There is currently no reliable way to reproduce this issue other than somehow making the ocs-operator pod consume more memory. You can start with a cluster that already has a large number of resources and then install ocs-operator on it.

Component controlling this:
ocs-operator 4.10.4

Proposed Mitigation:
This is not a solution but only a mitigation of the existing problem (the solution is in the next section): bump the memory limit of ocs-operator from 200Mi to 800Mi. We arrived at 800Mi from our experience of manually fixing clusters facing this issue. Specifically, this line (https://github.com/red-hat-storage/ocs-osd-deployer/blob/02ebe3916210326d00fae53bf55cbfef53ac1edb/utils/resources.go#L90) has to be changed to resource.MustParse("800Mi"). A hedged sketch of this change is included after this report.

Proposed solution:
The ideal solution would be to replace the default "AllNamespaces" informer cache with a dedicated caching mechanism that only cache-syncs the resources / custom resources which ocs-operator cares about. ocs-operator's code would have to use controller-runtime's cache package to build a custom cache builder for itself. For example: https://github.com/openshift/addon-operator/blob/main/cmd/addon-operator-manager/main.go#L179-L188. A sketch of this approach is also included after this report.

Expected Results:
ocs-operator pods running without any OOMKill failures and without needing any restarts.

Actual Results:
❯ oc get pods -n openshift-storage ocs-operator-5bf7c58cc9-pbjtj
NAME                            READY   STATUS             RESTARTS         AGE
ocs-operator-5bf7c58cc9-pbjtj   0/1     CrashLoopBackOff   9520 (74s ago)   37d

The ocs-operator pod is stuck in CrashLoopBackOff because it gets OOMKilled and restarts itself in the hope of working properly.
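Sketch of the proposed mitigation. This is illustrative only: the variable name and surrounding layout of utils/resources.go are assumptions, and only the 200Mi -> 800Mi memory limit bump is taken from the report above.

package utils

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// ocsOperatorResources is an illustrative stand-in for the ocs-operator
// resource requirements defined in utils/resources.go.
var ocsOperatorResources = corev1.ResourceRequirements{
	Limits: corev1.ResourceList{
		// Bumped from 200Mi to 800Mi so the ocs-operator pod can absorb
		// the initial informer cache sync spike instead of being OOMKilled.
		corev1.ResourceMemory: resource.MustParse("800Mi"),
	},
}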
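Sketch of the proposed solution, assuming a controller-runtime release that exposes cache.BuilderWithOptions and SelectorsByObject (newer releases express the same idea through cache.Options.ByObject). The namespace, label selector, and set of object types here are assumptions for illustration, not the actual list of resources ocs-operator needs to cache.

package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

func main() {
	// Hypothetical label assumed to be present on the objects ocs-operator manages.
	managedSelector := labels.SelectorFromSet(labels.Set{"app": "ocs-operator"})

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// Cache only the operator's own namespace instead of every namespace.
		Namespace: "openshift-storage",
		// Replace the default "AllNamespaces" cache with one whose informers
		// only list/watch label-selected objects for high-cardinality types.
		NewCache: cache.BuilderWithOptions(cache.Options{
			SelectorsByObject: cache.SelectorsByObject{
				&corev1.Pod{}:       {Label: managedSelector},
				&corev1.Secret{}:    {Label: managedSelector},
				&corev1.ConfigMap{}: {Label: managedSelector},
			},
		}),
	})
	if err != nil {
		panic(err)
	}

	// ... register the ocs-operator controllers against mgr here ...

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}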
- Interim fix will be delivered via https://bugzilla.redhat.com/show_bug.cgi?id=2125254
*** Bug 2161650 has been marked as a duplicate of this bug. ***
Jilju, could you please take a look? I see that you were QA assignee on https://bugzilla.redhat.com/show_bug.cgi?id=2125254.
(In reply to Filip Balák from comment #10)
> Jilju, could you please take a look? I see that you were QA assignee on
> https://bugzilla.redhat.com/show_bug.cgi?id=2125254.

Hi Filip,
According to comment #12, the bug is fixed in ODF 4.12. We do not have ODF 4.12 in ODF MS yet, so this bug is not ready for verification.