Bug 2203788

Summary: [FaaS Agent] Rook ceph operator goes to OOMKilled state
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: odf-managed-service
Version: 4.11
Status: ON_QA
Severity: urgent
Priority: unspecified
Reporter: suchita <sgatfane>
Assignee: Rewant <resoni>
QA Contact: suchita <sgatfane>
CC: dbindra, muagarwa, nberry, odf-bz-bot, omitrani, owasserm, resoni, sgatfane, tnielsen
Keywords: Tracking
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Clone Of: 2193367
Bug Depends On: 2193367
Bug Blocks:

Description suchita 2023-05-15 09:04:38 UTC
Cloning this BZ to track the Fix in FaaS

+++ This bug was initially created as a clone of Bug #2193367 +++

Description of problem (please be as detailed as possible and provide log
snippets):

We (the managed service team) have 20 TiB production clusters on OCP 4.10.15/4.10.16. When we upgrade them to OCP 4.11, the rook-ceph-operator goes into the OOMKilled state.

The OOMKilled state appears right after the ceph detect-version pod is Terminated.

On the 4.10.16 cluster, we restarted the rook-ceph-operator after the upgrade and it recovered. On the 4.10.15 cluster, the operator still goes into the OOMKilled state after the upgrade and a restart.

We also upgraded a 4 TiB cluster from 4.10.52 to 4.11.38; on that cluster we did not have to restart the rook-ceph-operator.

We looked into the rook-ceph-operator logs and found only INFO-level entries.

We have resource limits defined by the ocs-osd deployer:
https://github.com/red-hat-storage/ocs-osd-deployer/blob/main/utils/resources.go

	"rook-ceph-operator": {
		Limits: corev1.ResourceList{
			"cpu":    resource.MustParse("300m"),
			"memory": resource.MustParse("200Mi"),
		},
		Requests: corev1.ResourceList{
			"cpu":    resource.MustParse("300m"),
			"memory": resource.MustParse("200Mi"),
		},
	},

OHSS ticket for the incident: https://issues.redhat.com/browse/OHSS-21838.

Version of all relevant components (if applicable):
OCP version 4.11.38
ODF version 4.10.9

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:
The rook-ceph operator goes into the OOMKilled state.

Expected results:
The rook-ceph operator should be in the Running state.

Additional info:

--- Additional comment from RHEL Program Management on 2023-05-05 12:07:37 UTC ---

This bug previously had no release flag set; the release flag 'odf-4.13.0' has now been set to '?', so the bug is proposed to be fixed in the ODF 4.13.0 release. Note that the three acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2023-05-05 12:07:37 UTC ---

Since this bug has severity set to 'urgent', it is being proposed as a blocker for the currently set release flag. Please resolve ASAP.

--- Additional comment from Rewant on 2023-05-05 12:13:31 UTC ---



--- Additional comment from Rewant on 2023-05-05 12:13:54 UTC ---



--- Additional comment from Rewant on 2023-05-05 12:14:14 UTC ---



--- Additional comment from Travis Nielsen on 2023-05-05 15:37:04 UTC ---

200Mi is rather low for the operator, which needs memory headroom to burst and perform some operations.

Upstream, we recommend a 512Mi memory limit for the operator:
https://github.com/rook/rook/blob/master/deploy/charts/rook-ceph/values.yaml#L23-L27

Since the operator is getting OOM-killed, you'll need to increase the memory.
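
For illustration, a minimal sketch (an assumption, not a change agreed on in this bug) of what the deployer's entry in utils/resources.go could look like with memory raised to the upstream-recommended 512Mi, keeping the deployer's existing requests-equal-to-limits pattern; the variable name is illustrative and CPU values are left unchanged:

    package utils

    import (
        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
    )

    // Hypothetical adjustment: memory raised to the 512Mi recommended upstream.
    // Requests stay equal to limits, matching the deployer's current pattern.
    var rookCephOperatorResources = corev1.ResourceRequirements{
        Limits: corev1.ResourceList{
            "cpu":    resource.MustParse("300m"),
            "memory": resource.MustParse("512Mi"),
        },
        Requests: corev1.ResourceList{
            "cpu":    resource.MustParse("300m"),
            "memory": resource.MustParse("512Mi"),
        },
    }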

--- Additional comment from Ohad on 2023-05-06 09:06:25 UTC ---

Thank you @tnielsen for the answer.

But I would like to mention that we have been running with this setup for at least a year on 5 different production clusters and countless more on staging, and we never encountered this OOM issue.
The only change here is the version of ROSA (OCP), from 4.10.z to 4.11.z. We didn't upgrade or change ODF, OCS, or Rook.

So it is really strange that this situation manifests itself just now.
Is there a way to figure out what operation is trying to consume the memory? I don't believe in just raising memory limits without an explicit explanation of the need,
mainly because we might encounter something like this again, even after raising the memory limits, if we don't know the underlying reason.

Could you help us identify the root cause?
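
One way to see which operation is holding the memory, assuming the operator binary can be rebuilt or run with profiling enabled (this is not something the shipped image is known to expose, so treat it as a sketch), is Go's built-in heap profiler via net/http/pprof:

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
    )

    func main() {
        // Expose the Go runtime's heap profile so a snapshot can be grabbed
        // shortly before the pod reaches its memory limit, for example with:
        //   go tool pprof http://localhost:6060/debug/pprof/heap
        // The listen address and embedding this in the operator are assumptions
        // made for illustration, not an existing Rook or ODF facility.
        go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()

        // ...the operator's normal startup would continue here...
        select {}
    }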

--- Additional comment from Travis Nielsen on 2023-05-08 17:02:01 UTC ---

4.11 is where we also saw the need to increase the memory limits, and there has not been another update necessary since then. 
The change was made upstream here: https://github.com/rook/rook/pull/10195
Based on the upstream issue: https://github.com/rook/rook/issues/10192

A full memory analysis was not done at the time, but the most likely related change was the update of the controller-runtime library that Rook was using in that release.
It's possible there was some rook feature in that release that triggered the increase in resources, but nothing stands out.

--- Additional comment from Ohad on 2023-05-08 19:18:55 UTC ---

@tnielsen 

We are running ODF 4.10 which brings Rook 4.10. The upgrade was to the OCP version, not the ODF version. 
There is a planned update from ODF 4.10 to ODF 4.11 but we are blocked by this issue, as the deployer, which is responsible for updating ODF, is scaled down.

On top of that, we have no headroom to jump to 512Mi as we are running on ROSA with a very tight fit. 
We would need to lower some other pods' memory requests and limits to get there, and I am not sure we have any we can touch.

What would you suggest?

--- Additional comment from Travis Nielsen on 2023-05-08 19:30:20 UTC ---

Since bursts are so important to the operator, upstream we have set the limits to 4x the requests for memory. Any reason this pattern wouldn't work for you?
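
Concretely, the upstream burst pattern described here (memory limit set to 4x the request, with the 512Mi limit mentioned earlier in this thread) would look roughly like the fragment below in the deployer's format; the 128Mi request is inferred from the 4x ratio rather than quoted from upstream:

	"rook-ceph-operator": {
		// Illustrative only: memory limit = 4x the memory request,
		// using the 512Mi limit recommended earlier in this thread.
		Requests: corev1.ResourceList{
			"memory": resource.MustParse("128Mi"),
		},
		Limits: corev1.ResourceList{
			"memory": resource.MustParse("512Mi"),
		},
	},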

--- Additional comment from Ohad on 2023-05-08 19:34:16 UTC ---

ROSA guidelines, and a requirement for addon-based software, limit our ability to set memory limits different from memory requests for each and every pod of our deployment.

--- Additional comment from Rewant on 2023-05-09 06:40:00 UTC ---

ocs-osd-controller-manager-7dfc54b4f5-vg2b2                       2/3     Running           0          34m
rook-ceph-operator-799967b457-4zdbl                               0/1     OOMKilled         0          78s
rook-ceph-operator-799967b457-4zdbl                               1/1     Running           1 (2s ago)   79s
ceph-file-controller-detect-version-2qnzs                         0/1     Pending           0            0s
ceph-file-controller-detect-version-2qnzs                         0/1     Pending           0            0s

Captured the previous debug logs after the operator went into the OOMKilled state.

--- Additional comment from Travis Nielsen on 2023-05-09 15:17:45 UTC ---

The original (non-debug) operator log from comment 5 showed that the operator died while it was just starting to reconcile the mons.

From the latest debug logs, the operator was able to reconcile the mons, mgr, and then died while reconciling the OSDs.

This new repro was able to reconcile almost a minute longer than the first repro. 
The common line I looked for to compare between logs was "parsing mon endpoints".

The new repro is not reconciling any cephfs subvolume groups. It appears the earlier repro had more to reconcile in parallel, so hit the memory limit sooner. 
The OSD reconcile does use more memory to process the OSDs in parallel, so it doesn't surprise me that the limit was hit during OSD reconcile.

The operator is reconciling its CRs as expected, and the controllers are running in parallel as expected. If it helped, we could look at reconciling the OSDs more serially to remove some memory pressure, but that doesn't help with the controllers that are running in parallel. I don't see what else could be changed to run under lower memory constraints. The operator by design needs to burst.
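
For illustration only (this is not Rook's actual code): "reconciling the OSDs more serially" essentially means bounding the number of goroutines doing per-OSD work, trading reconcile time for peak memory. A minimal sketch using a channel as a semaphore; the function name and IDs are hypothetical:

    package main

    import (
        "fmt"
        "sync"
    )

    // reconcileOSD stands in for the per-OSD work; name and signature are
    // hypothetical, used only to illustrate bounding parallelism.
    func reconcileOSD(id int) {
        fmt.Printf("reconciling OSD %d\n", id)
    }

    func main() {
        osdIDs := []int{0, 1, 2, 3, 4, 5}

        // A buffered channel acts as a semaphore: at most maxParallel OSD
        // reconciles run at once, which caps the memory held in flight.
        // maxParallel = 1 would make the loop fully serial.
        const maxParallel = 2
        sem := make(chan struct{}, maxParallel)

        var wg sync.WaitGroup
        for _, id := range osdIDs {
            wg.Add(1)
            sem <- struct{}{} // acquire a slot before starting the work
            go func(id int) {
                defer wg.Done()
                defer func() { <-sem }() // release the slot when done
                reconcileOSD(id)
            }(id)
        }
        wg.Wait()
    }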