Description of problem:

CSV objects can be large (tens or even hundreds of kB), and olm-operator creates clones of each CSV into all of its target namespaces. On clusters with many namespaces -- sometimes as many as tens of thousands -- there can easily exist tens of thousands of CSVs, most of them nearly identical. The catalog operator watches CSVs in order to be aware of currently-installed operators, but it does not modify them or read many of their fields, so caching the complete contents of all CSVs can demand significant amounts of memory in situations like the one described above.

Replacing the "CSV copying" mechanism with something more scalable will eventually resolve this problem, but it's possible to make significant and immediate improvements to the memory cost by caching only a view of the CSV fields used by the catalog operator.

Steps to reproduce:

1. Create 5000 namespaces.
2. Create an OperatorGroup in one namespace that targets all namespaces.
3. Create a CSV (by directly applying a CSV manifest, not by creating a Subscription) that supports the AllNamespaces install mode in the same namespace as the all-namespace OperatorGroup.
4. Wait for the olm-operator to make copies of the CSV in all targeted namespaces.
5. Look at the amount of memory in use by the catalog-operator pod ("oc adm top pod" would be one way to find this).
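The reproduction steps above could be scripted roughly as follows. This is only a sketch: the namespace prefix, the OperatorGroup name "global-group", and the helper function names are assumptions made for illustration and do not come from the report. It relies on an OperatorGroup without target namespaces or a selector selecting all namespaces. The CSV manifest itself (step 3) still has to be supplied and applied separately.

```shell
# Sketch of reproduction steps 1, 2 and 4. All names are illustrative.
# OC can be overridden, e.g. OC=echo for a dry run that only prints commands.
OC="${OC:-oc}"

# Step 1: create COUNT namespaces named PREFIX1 .. PREFIXCOUNT.
create_namespaces() {
  ns_count="$1"
  ns_prefix="${2:-csv-scale-}"
  ns_i=1
  while [ "$ns_i" -le "$ns_count" ]; do
    "$OC" create namespace "${ns_prefix}${ns_i}"
    ns_i=$((ns_i + 1))
  done
}

# Step 2: print an OperatorGroup manifest for namespace $1.
# An empty spec (no targetNamespaces, no selector) targets all namespaces.
operatorgroup_manifest() {
  cat <<EOF
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: global-group
  namespace: $1
spec: {}
EOF
}

# Step 4: count how many namespaces currently hold a CSV named $1.
count_csv_copies() {
  "$OC" get csv --all-namespaces --no-headers |
    awk -v name="$1" '$2 == name {n++} END {print n + 0}'
}

# Example usage (not executed here):
#   create_namespaces 5000
#   operatorgroup_manifest my-operators | oc apply -f -
#   count_csv_copies etcdoperator.v0.9.4-clusterwide
```

Polling count_csv_copies until it reaches the number of targeted namespaces is one way to tell when step 4 has finished before reading the memory usage in step 5.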
1. Create an OCP 4.9 cluster with 3 masters and 3 workers.

[cloud-user@preserve-olm-env jian]$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-136-98.us-west-2.compute.internal    Ready    master   50m   v1.21.1+8268f88
ip-10-0-148-50.us-west-2.compute.internal    Ready    worker   43m   v1.21.1+8268f88
ip-10-0-171-254.us-west-2.compute.internal   Ready    worker   41m   v1.21.1+8268f88
ip-10-0-191-69.us-west-2.compute.internal    Ready    master   50m   v1.21.1+8268f88
ip-10-0-243-209.us-west-2.compute.internal   Ready    worker   43m   v1.21.1+8268f88
ip-10-0-249-78.us-west-2.compute.internal    Ready    master   50m   v1.21.1+8268f88

[cloud-user@preserve-olm-env jian]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-08-04-204446   True        False         28m     Cluster version is 4.9.0-0.nightly-2021-08-04-204446

[cloud-user@preserve-olm-env jian]$ oc exec catalog-operator-5df5bf4dbf-8bpqd -- olm --version
OLM version: 0.18.3
git commit: 9889a1ad78d06bab260519fcef529bf6aff029cb

2. Check the memory usage before creating the CSV.

[cloud-user@preserve-olm-env jian]$ oc -n openshift-operator-lifecycle-manager adm top pod catalog-operator-5df5bf4dbf-8bpqd
W0805 06:10:11.128071 3850407 top_pod.go:140] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
NAME                                CPU(cores)   MEMORY(bytes)
catalog-operator-5df5bf4dbf-8bpqd   0m           139Mi

3. Create a CSV in the "openshift-operators" project.

[cloud-user@preserve-olm-env jian]$ oc get csv -n openshift-operators
NAME                              DISPLAY   VERSION             REPLACES                          PHASE
etcdoperator.v0.9.4-clusterwide   etcd      0.9.4-clusterwide   etcdoperator.v0.9.2-clusterwide   Pending

4. Create 1000 namespaces.

[cloud-user@preserve-olm-env jian]$ for l in {1..2000}; do oc adm new-project test${l}; sleep 1; done;
Created project test1
...

5. Check the memory usage again. It doesn't increase too much. Looks good, verify it.

[cloud-user@preserve-olm-env jian]$ oc adm top pod catalog-operator-5df5bf4dbf-8bpqd
W0805 07:06:28.778609 3872399 top_pod.go:140] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
NAME                                CPU(cores)   MEMORY(bytes)
catalog-operator-5df5bf4dbf-8bpqd   7m           226Mi
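The before/after comparison in the verification above can be made explicit with a small helper. This is a minimal sketch: the function names are hypothetical, and it assumes the three-column "oc adm top pod" table shown above and memory quantities with a Ki, Mi or Gi suffix.

```shell
# Pick the MEMORY(bytes) column for pod $1 out of "oc adm top pod" output on stdin.
top_mem() {
  awk -v pod="$1" '$1 == pod {print $3}'
}

# Convert a quantity such as "139Mi" to whole MiB.
# Only the Ki, Mi and Gi suffixes are handled; anything else is an error.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Print the growth in MiB between two readings: mem_delta_mib BEFORE AFTER.
mem_delta_mib() {
  delta_before=$(to_mib "$1") || return 1
  delta_after=$(to_mib "$2") || return 1
  echo $(( delta_after - delta_before ))
}

# The two readings from the verification steps above.
mem_delta_mib 139Mi 226Mi   # prints 87
```

For the readings recorded here, the catalog-operator pod grew by 87 MiB between the first and second measurement.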
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759