Bug 1989710
| Summary: | Catalog operator wastes memory by caching complete copied CSVs | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ben Luddy <bluddy> |
| Component: | OLM | Assignee: | Ben Luddy <bluddy> |
| OLM sub component: | OLM | QA Contact: | Jian Zhang <jiazha> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | tflannag |
| Version: | 4.9 | Keywords: | Triaged |
| Target Milestone: | --- | | |
| Target Release: | 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-18 17:44:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
1. Create an OCP 4.9 cluster with 3 masters and 3 workers.
[cloud-user@preserve-olm-env jian]$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-136-98.us-west-2.compute.internal Ready master 50m v1.21.1+8268f88
ip-10-0-148-50.us-west-2.compute.internal Ready worker 43m v1.21.1+8268f88
ip-10-0-171-254.us-west-2.compute.internal Ready worker 41m v1.21.1+8268f88
ip-10-0-191-69.us-west-2.compute.internal Ready master 50m v1.21.1+8268f88
ip-10-0-243-209.us-west-2.compute.internal Ready worker 43m v1.21.1+8268f88
ip-10-0-249-78.us-west-2.compute.internal Ready master 50m v1.21.1+8268f88
[cloud-user@preserve-olm-env jian]$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.9.0-0.nightly-2021-08-04-204446 True False 28m Cluster version is 4.9.0-0.nightly-2021-08-04-204446
[cloud-user@preserve-olm-env jian]$ oc exec catalog-operator-5df5bf4dbf-8bpqd -- olm --version
OLM version: 0.18.3
git commit: 9889a1ad78d06bab260519fcef529bf6aff029cb
2. Check the memory usage before creating the CSV.
[cloud-user@preserve-olm-env jian]$ oc -n openshift-operator-lifecycle-manager adm top pod catalog-operator-5df5bf4dbf-8bpqd
W0805 06:10:11.128071 3850407 top_pod.go:140] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
NAME CPU(cores) MEMORY(bytes)
catalog-operator-5df5bf4dbf-8bpqd 0m 139Mi
3. Create a CSV in the "openshift-operators" project.
[cloud-user@preserve-olm-env jian]$ oc get csv -n openshift-operators
NAME DISPLAY VERSION REPLACES PHASE
etcdoperator.v0.9.4-clusterwide etcd 0.9.4-clusterwide etcdoperator.v0.9.2-clusterwide Pending
4. Create 2000 namespaces.
[cloud-user@preserve-olm-env jian]$ for l in {1..2000}; do oc adm new-project test${l}; sleep 1; done;
Created project test1
...
5. Check the memory usage again. Memory usage doesn't increase much. Looks good; marking it verified.
[cloud-user@preserve-olm-env jian]$ oc adm top pod catalog-operator-5df5bf4dbf-8bpqd
W0805 07:06:28.778609 3872399 top_pod.go:140] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
NAME CPU(cores) MEMORY(bytes)
catalog-operator-5df5bf4dbf-8bpqd 7m 226Mi
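The growth between the two `oc adm top pod` readings above can be checked with a quick shell computation (the values are copied from the transcript):

```shell
# Memory readings for the catalog-operator pod, in MiB, taken from the
# transcript: before creating the CSV and 2000 namespaces, and after.
before=139
after=226
echo "catalog-operator grew by $((after - before))Mi"
```

An increase of 87Mi across 2000 namespaces is modest compared to the pre-fix behavior of caching a full CSV copy per namespace, which is why the result was accepted as verification.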
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759 |
Description of problem:

CSV objects can be large (tens or even hundreds of kB), and olm-operator creates clones of each CSV in all of its target namespaces. On clusters with many namespaces -- sometimes as many as tens of thousands -- there can easily exist tens of thousands of CSVs, most of them nearly identical. The catalog operator watches CSVs in order to be aware of currently-installed operators, but it does not modify them or read many of their fields, so caching the complete contents of all CSVs can demand significant amounts of memory in situations like the one described above. Replacing the "CSV copying" mechanism with something more scalable will eventually resolve this problem, but it's possible to make significant and immediate improvements to the memory cost by caching only a view of the CSV fields used by the catalog operator.

Steps to reproduce:
1. Create 5000 namespaces.
2. Create an OperatorGroup in one namespace that targets all namespaces.
3. Create a CSV (by directly applying a CSV manifest, not by creating a Subscription) that supports the AllNamespaces install mode in the same namespace as the all-namespace OperatorGroup.
4. Wait for the olm-operator to make copies of the CSV in all targeted namespaces.
5. Look at the amount of memory in use by the catalog-operator pod ("oc adm top pod" would be one way to find this).
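The mitigation described above -- caching only a view of the CSV fields the catalog operator actually reads -- can be sketched in Go. This is an illustrative sketch only: the `csv` and `csvView` types below are hypothetical stand-ins, not the real operator-framework API types or the actual patch; the point is that the cache stores a small projection and drops the bulky spec.

```go
package main

import "fmt"

// csv is a hypothetical stand-in for a full ClusterServiceVersion object.
// The Spec slice stands in for the large fields (install strategy,
// descriptors, icons) that the catalog operator never reads.
type csv struct {
	Name      string
	Namespace string
	Phase     string
	Spec      []byte
}

// csvView keeps only the fields the catalog operator needs to know which
// operators are installed. Caching this instead of the full object is
// what saves memory per copied CSV.
type csvView struct {
	Name      string
	Namespace string
	Phase     string
}

// toView mirrors a cache transform: project the incoming object down to
// the small view before it is stored, so the large Spec is never retained.
func toView(in csv) csvView {
	return csvView{Name: in.Name, Namespace: in.Namespace, Phase: in.Phase}
}

func main() {
	// A full CSV carrying ~1 MiB of spec data; only the view is cached.
	full := csv{
		Name:      "etcdoperator.v0.9.4",
		Namespace: "openshift-operators",
		Phase:     "Succeeded",
		Spec:      make([]byte, 1<<20),
	}
	v := toView(full)
	fmt.Printf("%s/%s %s\n", v.Namespace, v.Name, v.Phase)
}
```

With tens of thousands of nearly identical copied CSVs, retaining the projection instead of the full object turns a per-copy cost of tens or hundreds of kB into a few short strings.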