Bug 2125254

Summary: ocs-operator pods getting OOMKilled, failing ocs-consumer installations on the respective clusters due to low memory limits
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Yashvardhan Kukreja <ykukreja>
Component: odf-managed-service Assignee: Dhruv Bindra <dbindra>
Status: CLOSED CURRENTRELEASE QA Contact: Jilju Joy <jijoy>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 4.10 CC: aeyal, dbindra, godas, jijoy, ocs-bugs, odf-bz-bot
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 2.0.6-1 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-09-20 13:33:35 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Yashvardhan Kukreja 2022-09-08 12:51:05 UTC
Description of the problem:

Currently, a number of ocs-consumer installations are failing. The trigger is the ocs-operator pod getting OOMKilled because it needs more memory than the limit allocated to it.
The root cause is that ocs-operator relies on the default "AllNamespaces" cache in controller-runtime, which syncs all watched Kubernetes resources across the cluster when the operator starts running for the first time.

The problem here is that the initial informer cache sync is so large that it causes a sudden, massive spike in the operator's memory usage. Worse, the size of this spike is directly proportional to the number of resources present in the underlying Kubernetes/OpenShift cluster.
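To illustrate the caching behaviour described above, here is a minimal controller-runtime sketch (assuming the v0.11-era manager Options API used around ODF 4.10; this is not the actual ocs-operator wiring):

package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Default "AllNamespaces" behaviour: nothing restricts the cache, so the
	// manager lists and watches objects in every namespace.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// Namespace: "openshift-storage", // scoping the cache like this would
		// bound the initial informer sync and the resulting memory spike.
	})
	if err != nil {
		panic(err)
	}

	// Start() triggers the initial cache sync; on a large cluster this is
	// where the memory spike described in this bug shows up.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}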

The memory limit configured for ocs-operator usually absorbs this spike, but only by a narrow margin; in some situations the spike exceeds the limit and the ocs-operator pod gets OOMKilled.

Reproducing the issue:

Currently, there is no sure-shot way to reproduce this issue other than making the ocs-operator pod consume more memory. A starting point is to create a large number of resources on the cluster and then install ocs-operator on it.
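As a rough starting point, a hypothetical load-generation sketch (assuming the operator's cached client watches ConfigMaps cluster-wide; the namespace, object count, and payload size are arbitrary illustrations, not values taken from this bug):

package main

import (
	"context"
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Create many ConfigMaps so the operator's initial informer sync has far
	// more data to pull in and cache, pushing its memory usage upward.
	for i := 0; i < 5000; i++ {
		cm := &corev1.ConfigMap{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("cache-bloat-%d", i)},
			Data:       map[string]string{"payload": strings.Repeat("x", 4096)},
		}
		if _, err := client.CoreV1().ConfigMaps("default").Create(
			context.TODO(), cm, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}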

Component controlling this:

ocs-operator 4.10.4

Currently proposed solution:

At the moment, the immediate solution is to bump the memory limit of ocs-operator from 200Mi to 800Mi. We arrived at 800Mi based on our experience of manually fixing clusters that hit this issue.

Specifically, this line (https://github.com/red-hat-storage/ocs-osd-deployer/blob/02ebe3916210326d00fae53bf55cbfef53ac1edb/utils/resources.go#L90) has to be modified to resource.MustParse("800Mi").
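For reference, a hedged sketch of what the bumped definition could look like (the variable name and layout are illustrative, not copied from utils/resources.go; the 200m CPU / 800Mi values match the verification in Comment 2 below):

// Illustrative only; variable name and layout are assumptions, not the
// actual contents of utils/resources.go in ocs-osd-deployer.
package utils

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

var OcsOperatorResources = corev1.ResourceRequirements{
	Limits: corev1.ResourceList{
		corev1.ResourceCPU: resource.MustParse("200m"),
		// Bumped from 200Mi to 800Mi so the initial informer cache sync
		// no longer pushes the pod over its memory limit.
		corev1.ResourceMemory: resource.MustParse("800Mi"),
	},
	Requests: corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("200m"),
		corev1.ResourceMemory: resource.MustParse("800Mi"),
	},
}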

Expected Results:
ocs-operator pods running without facing any OOMKill failures and without needing any restarts.

Actual Results:
❯ oc get pods -n openshift-storage ocs-operator-5bf7c58cc9-pbjtj
NAME                            READY   STATUS             RESTARTS         AGE
ocs-operator-5bf7c58cc9-pbjtj   0/1     CrashLoopBackOff   9520 (74s ago)   37d
The ocs-operator pod is stuck in CrashLoopBackOff: it keeps getting OOMKilled and restarting in the hope of coming up properly.

Comment 1 Yashvardhan Kukreja 2022-09-08 12:51:38 UTC
The long-term solution, to be pursued after the above bug is fixed, is tracked here - https://bugzilla.redhat.com/show_bug.cgi?id=2121329

Comment 2 Jilju Joy 2022-09-15 06:22:56 UTC
Verified that the memory limit and request of the ocs-operator pod on provider and consumer clusters are 800Mi on clusters upgraded from v2.0.5 to v2.0.6.

    resources:
      limits:
        cpu: 200m
        memory: 800Mi
      requests:
        cpu: 200m
        memory: 800Mi

ocs-operator pod yaml on provider cluster upgraded from v2.0.5 to v2.0.6 : http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-pl-pr/jijoy-pl-pr_20220912T053302/logs/testcases_1663044258/ocs_must_gather/quay-io-ocs-dev-ocs-must-gather-sha256-1944ed7522c208c0f804bef8ca5d3d524f0442a3ad308f0911982a4dfe93e8c0/namespaces/openshift-storage/pods/ocs-operator-5c77756ddd-k8tsn/ocs-operator-5c77756ddd-k8tsn.yaml

ocs-operator pod yaml on consumer cluster upgraded from v2.0.5 to v2.0.6: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-c1/jijoy-c1_20220912T050355/logs/testcases_1663001318/ocs_must_gather/quay-io-ocs-dev-ocs-must-gather-sha256-1944ed7522c208c0f804bef8ca5d3d524f0442a3ad308f0911982a4dfe93e8c0/namespaces/openshift-storage/pods/ocs-operator-6f745dd7c9-6skhd/ocs-operator-6f745dd7c9-6skhd.yaml


Verified the same on newly deployed provider and consumer clusters with v2.0.6. Deleted the ocs-operator pod to verify that the new pod also has the same memory settings. Moving this bug to the verified state.