Bug 1989710
| Summary: | Catalog operator wastes memory by caching complete copied CSVs | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ben Luddy <bluddy> |
| Component: | OLM | Assignee: | Ben Luddy <bluddy> |
| OLM sub component: | OLM | QA Contact: | Jian Zhang <jiazha> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | tflannag |
| Version: | 4.9 | Keywords: | Triaged |
| Target Milestone: | --- | | |
| Target Release: | 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-18 17:44:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
1. Create an OCP 4.9 cluster with 3 masters and 3 workers.
[cloud-user@preserve-olm-env jian]$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-136-98.us-west-2.compute.internal Ready master 50m v1.21.1+8268f88
ip-10-0-148-50.us-west-2.compute.internal Ready worker 43m v1.21.1+8268f88
ip-10-0-171-254.us-west-2.compute.internal Ready worker 41m v1.21.1+8268f88
ip-10-0-191-69.us-west-2.compute.internal Ready master 50m v1.21.1+8268f88
ip-10-0-243-209.us-west-2.compute.internal Ready worker 43m v1.21.1+8268f88
ip-10-0-249-78.us-west-2.compute.internal Ready master 50m v1.21.1+8268f88
[cloud-user@preserve-olm-env jian]$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.9.0-0.nightly-2021-08-04-204446 True False 28m Cluster version is 4.9.0-0.nightly-2021-08-04-204446
[cloud-user@preserve-olm-env jian]$ oc exec catalog-operator-5df5bf4dbf-8bpqd -- olm --version
OLM version: 0.18.3
git commit: 9889a1ad78d06bab260519fcef529bf6aff029cb
2. Check the memory usage before creating the CSV.
[cloud-user@preserve-olm-env jian]$ oc -n openshift-operator-lifecycle-manager adm top pod catalog-operator-5df5bf4dbf-8bpqd
W0805 06:10:11.128071 3850407 top_pod.go:140] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
NAME CPU(cores) MEMORY(bytes)
catalog-operator-5df5bf4dbf-8bpqd 0m 139Mi
3. Create a CSV in the "openshift-operators" project.
[cloud-user@preserve-olm-env jian]$ oc get csv -n openshift-operators
NAME DISPLAY VERSION REPLACES PHASE
etcdoperator.v0.9.4-clusterwide etcd 0.9.4-clusterwide etcdoperator.v0.9.2-clusterwide Pending
4. Create 2000 namespaces.
[cloud-user@preserve-olm-env jian]$ for l in {1..2000}; do oc adm new-project test${l}; sleep 1; done;
Created project test1
...
5. Check the memory usage again. Memory usage doesn't increase much. Looks good; marking it verified.
[cloud-user@preserve-olm-env jian]$ oc adm top pod catalog-operator-5df5bf4dbf-8bpqd
W0805 07:06:28.778609 3872399 top_pod.go:140] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
NAME CPU(cores) MEMORY(bytes)
catalog-operator-5df5bf4dbf-8bpqd 7m 226Mi
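The growth between the two `oc adm top pod` readings above can be checked with a quick shell computation (the values are copied from the transcript):

```shell
# Memory readings for the catalog-operator pod, in MiB, taken from the
# transcript: before creating the CSV and 2000 namespaces, and after.
before=139
after=226
echo "catalog-operator grew by $((after - before))Mi"
```

An increase of 87Mi across 2000 namespaces is modest compared to the pre-fix behavior of caching a full CSV copy per namespace, which is why the result was accepted as verification.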
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759 |
Description of problem:

CSV objects can be large (tens or even hundreds of kB), and olm-operator creates clones of each CSV in all of its target namespaces. On clusters with many namespaces -- sometimes as many as tens of thousands -- there can easily exist tens of thousands of CSVs, most of them nearly identical. The catalog operator watches CSVs in order to be aware of currently-installed operators, but it does not modify them or read many of their fields, so caching the complete contents of all CSVs can demand significant amounts of memory in situations like the one described above. Replacing the "CSV copying" mechanism with something more scalable will eventually resolve this problem, but it's possible to make significant and immediate improvements to the memory cost by caching only a view of the CSV fields used by the catalog operator.

Steps to reproduce:
1. Create 5000 namespaces.
2. Create an OperatorGroup in one namespace that targets all namespaces.
3. Create a CSV (by directly applying a CSV manifest, not by creating a Subscription) that supports the AllNamespaces install mode in the same namespace as the all-namespace OperatorGroup.
4. Wait for the olm-operator to make copies of the CSV in all targeted namespaces.
5. Look at the amount of memory in use by the catalog-operator pod ("oc adm top pod" would be one way to find this).
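The mitigation described above -- caching only a view of the CSV fields the catalog operator actually reads -- can be sketched in Go. This is an illustrative sketch only: the `csv` and `csvView` types below are hypothetical stand-ins, not the real operator-framework API types or the actual patch; the point is that the cache stores a small projection and drops the bulky spec.

```go
package main

import "fmt"

// csv is a hypothetical stand-in for a full ClusterServiceVersion object.
// The Spec slice stands in for the large fields (install strategy,
// descriptors, icons) that the catalog operator never reads.
type csv struct {
	Name      string
	Namespace string
	Phase     string
	Spec      []byte
}

// csvView keeps only the fields the catalog operator needs to know which
// operators are installed. Caching this instead of the full object is
// what saves memory per copied CSV.
type csvView struct {
	Name      string
	Namespace string
	Phase     string
}

// toView mirrors a cache transform: project the incoming object down to
// the small view before it is stored, so the large Spec is never retained.
func toView(in csv) csvView {
	return csvView{Name: in.Name, Namespace: in.Namespace, Phase: in.Phase}
}

func main() {
	// A full CSV carrying ~1 MiB of spec data; only the view is cached.
	full := csv{
		Name:      "etcdoperator.v0.9.4",
		Namespace: "openshift-operators",
		Phase:     "Succeeded",
		Spec:      make([]byte, 1<<20),
	}
	v := toView(full)
	fmt.Printf("%s/%s %s\n", v.Namespace, v.Name, v.Phase)
}
```

With tens of thousands of nearly identical copied CSVs, retaining the projection instead of the full object turns a per-copy cost of tens or hundreds of kB into a few short strings.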