Description of problem (please be as detailed as possible and provide log snippets):

We (managed service team) have 20 TiB production clusters on OCP 4.10.15/4.10.16. When we upgrade them to OCP 4.11, the rook-ceph-operator goes into the OOMKilled state, right after the ceph detect-version pod terminates. On the 4.10.16 cluster, restarting the rook-ceph-operator after the upgrade fixed it; on the 4.10.15 cluster, the operator still goes into the OOMKilled state after the upgrade and a restart. We also upgraded a 4 TiB cluster from OCP 4.10.52 to 4.11.38, where we did not have to restart the rook-ceph-operator. We looked at the rook-ceph-operator logs and found only INFO-level messages.

We have resource limits defined by the ocs-osd-deployer (https://github.com/red-hat-storage/ocs-osd-deployer/blob/main/utils/resources.go):

    "rook-ceph-operator": {
        Limits: corev1.ResourceList{
            "cpu":    resource.MustParse("300m"),
            "memory": resource.MustParse("200Mi"),
        },
        Requests: corev1.ResourceList{
            "cpu":    resource.MustParse("300m"),
            "memory": resource.MustParse("200Mi"),
        },
    },

OHSS ticket for the incident: https://issues.redhat.com/browse/OHSS-21838

Version of all relevant components (if applicable):
OCP version 4.11.38
ODF version 4.10.9

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced? Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:
The rook-ceph-operator goes into the OOMKilled state.

Expected results:
The rook-ceph-operator should be in the Running state.

Additional info:
200Mi is rather low for the operator to burst and perform some operations. Upstream we have recommended a 512Mi memory limit for the operator: https://github.com/rook/rook/blob/master/deploy/charts/rook-ceph/values.yaml#L23-L27 Since the operator is getting OOM-killed, you'll need to increase the memory limit.
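For reference, a minimal sketch of what the straightforward change could look like in the ocs-osd-deployer resources map quoted above, with the memory limit raised from 200Mi to the upstream-recommended 512Mi. The 512Mi figure and leaving the request at 200Mi are assumptions for illustration, not confirmed values for the managed-service configuration.

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
    )

    // Hypothetical adjusted entry: memory limit raised to 512Mi per the
    // upstream recommendation; the request is left at 200Mi here only
    // for illustration.
    var rookCephOperatorResources = corev1.ResourceRequirements{
        Limits: corev1.ResourceList{
            "cpu":    resource.MustParse("300m"),
            "memory": resource.MustParse("512Mi"),
        },
        Requests: corev1.ResourceList{
            "cpu":    resource.MustParse("300m"),
            "memory": resource.MustParse("200Mi"),
        },
    }

    func main() {
        lim := rookCephOperatorResources.Limits["memory"]
        req := rookCephOperatorResources.Requests["memory"]
        fmt.Printf("rook-ceph-operator memory: request %s, limit %s\n", req.String(), lim.String())
    }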
Thank you @tnielsen for the answer. I would like to mention that we have been running with this setup for at least a year on 5 different production clusters, and on countless more staging clusters, and we have never encountered this OOM issue. The only change here is the ROSA (OCP) version, from 4.10.z to 4.11.z; we did not upgrade or change ODF, OCS, or Rook. So it is really strange that this situation manifests itself only now. Is there a way to figure out which operation is trying to consume the memory? I don't believe in raising memory limits without an explicit explanation of the need, mainly because we might run into something like this again, even after raising the limits, if we don't know the underlying reason. Could you help us identify the root cause?
4.11 is where we also saw the need to increase the memory limits upstream, and no further increase has been necessary since then. The change was made upstream in https://github.com/rook/rook/pull/10195, based on the upstream issue https://github.com/rook/rook/issues/10192. A full memory analysis was not done at the time, but the most likely related change was the update of the controller-runtime library that Rook was using in that release. It is possible some Rook feature in that release triggered the increase in resource usage, but nothing stands out.
@tnielsen We are running ODF 4.10, which brings Rook 4.10; the upgrade was to the OCP version, not to the ODF version. There is a planned update from ODF 4.10 to ODF 4.11, but we are blocked by this issue because the deployer, which is responsible for updating ODF, is scaled down. On top of that, we have no headroom to jump to 512Mi, as we are running on ROSA with a very tight fit. We would need to lower some other pod's memory requests and limits to get there, and I am not sure we have any we can touch. What would you suggest?
Since bursts are so important to the operator, upstream we have set the limits to 4x the requests for memory. Any reason this pattern wouldn't work for you?
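For illustration, the 4x pattern described above applied to this operator might look roughly like the sketch below, with a 512Mi memory limit over a 128Mi request. The 128Mi request is chosen here only so that the limit is 4x the request, and the CPU values are carried over from the deployer entry quoted earlier; none of these are confirmed values for this deployment.

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
    )

    // Hypothetical "limit = 4x request" shape for the operator's memory:
    // 128Mi requested, 512Mi limit, so the pod can burst during upgrades
    // and version detection without being sized for the burst at steady state.
    var burstableOperatorResources = corev1.ResourceRequirements{
        Requests: corev1.ResourceList{
            "cpu":    resource.MustParse("300m"),
            "memory": resource.MustParse("128Mi"),
        },
        Limits: corev1.ResourceList{
            "cpu":    resource.MustParse("300m"),
            "memory": resource.MustParse("512Mi"),
        },
    }

    func main() {
        req := burstableOperatorResources.Requests["memory"]
        lim := burstableOperatorResources.Limits["memory"]
        fmt.Printf("memory request %s, limit %s\n", req.String(), lim.String())
    }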
ROSA guidelines, and a requirement for addon-based software, prevent us from setting a memory limit that differs from the memory request for any pod in our deployment.
Yes, we have increased the memory from the Managed Service side and it is working. We can close this.