Created attachment 1603513 [details]
Description of problem:
The "catalog-operator" pod on OSD v4.1 clusters shows memory consumption above 11GB RSS over time; it takes about 12 hours to reach that point. The latest metrics on a cluster that's not doing anything active show about 11GB RSS. Screenshot will be attached.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.9     True        False         5d2h    Cluster version is 4.1.9
This appears to be a consistent problem on at least 4 clusters; I have not confirmed it everywhere.
Steps to Reproduce:
1. Install cluster.
Actual results:
catalog-operator consumes 11GB+ memory.

Expected results:
catalog-operator has a small memory footprint.
I have a cluster that has been running for almost one day, but I don't see this issue. I will keep an eye on it.
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.11    True        False         16h     Cluster version is 4.1.11
mac:~ jianzhang$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-129-68.us-east-2.compute.internal    Ready    master   19h   v1.13.4+d81afa6ba
ip-10-0-141-168.us-east-2.compute.internal   Ready    worker   19h   v1.13.4+d81afa6ba
ip-10-0-153-224.us-east-2.compute.internal   Ready    worker   19h   v1.13.4+d81afa6ba
ip-10-0-155-205.us-east-2.compute.internal   Ready    master   19h   v1.13.4+d81afa6ba
ip-10-0-164-116.us-east-2.compute.internal   Ready    worker   19h   v1.13.4+d81afa6ba
ip-10-0-174-123.us-east-2.compute.internal   Ready    master   19h   v1.13.4+d81afa6ba
mac:~ jianzhang$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-64556ffff5-wq99z   1/1     Running   0          16h
olm-operator-6ff7dbf564-zvw92       1/1     Running   0          16h
olm-operators-85r4c                 1/1     Running   0          16h
packageserver-6fd666d6b9-gw9mm      1/1     Running   0          26m
packageserver-6fd666d6b9-mpdf4      1/1     Running   0          26m
mac:~ jianzhang$ oc adm top pod catalog-operator-64556ffff5-wq99z
NAME                                CPU(cores)   MEMORY(bytes)
catalog-operator-64556ffff5-wq99z   1m           36Mi
Created attachment 1603802 [details]
oc get clusterversion version -o yaml
ClusterVersion attached for context on cluster age and upgrades over time.
Created attachments 1608623 through 1608654 [details] (30 must-gather attachments)
Created attachment 1609008 [details]
Pod top output, per Brenton's request
Marking must-gather parts as obsolete as Brenton noted they do not contain the data needed to investigate.
Still seeing this on an OSD 4.1.14 cluster.
On version 4.1.9 the cluster showed steady-state memory usage for the catalog-operator pod. Since upgrading to 4.1.13, and now to 4.1.14, the catalog-operator pod's memory usage grows until it is OOMKilled.
I'll attach a graph covering the last 2 weeks, along with the cluster's ClusterVersion for upgrade history.
Evan, can you update on expectations for a fix?
Created attachment 1614994 [details]
container_memory_rss for OSD 4.1.14 prod cluster
Created attachment 1614996 [details]
clusterversion for OSD 4.1.14 prod cluster
We believe that the source of the memory leak is an issue in the grpc libraries, and that those leaks get triggered very frequently on 4.1 due to the way we were managing grpc connections.
We have already backported grpc library updates in 4.1.15 that should address the source of the memory leak. In 4.2 we have refactored the way we use those libraries to reduce the chance of triggering the leaks in the first place.
If the issue still occurs on 4.1.15, the next step would be to backport the refactored grpc connection handling from 4.2. But in theory, 4.1.15 should be fixed, and we'd like to avoid backporting the refactor if possible.
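For context on the 4.2 refactor mentioned above, the general pattern is to cache and reuse one connection per catalog source instead of dialing a fresh connection on every sync, so that abandoned connections can't pile up. The sketch below is a minimal illustration of that pattern with a stand-in `conn` type; it is not OLM's actual code, and the names (`connPool`, `dial`) are invented for this example:

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a stand-in for a grpc.ClientConn; the real operator talks
// to catalog-source registry pods over gRPC.
type conn struct{ target string }

func dial(target string) *conn { return &conn{target: target} }
func (c *conn) Close()         {}

// connPool caches one connection per target. Dialing a new connection
// on every sync loop and never closing the old one is the kind of
// pattern that can trigger leaks in the underlying library.
type connPool struct {
	mu    sync.Mutex
	conns map[string]*conn
}

func newConnPool() *connPool { return &connPool{conns: map[string]*conn{}} }

// get returns the cached connection for target, dialing only on first use.
func (p *connPool) get(target string) *conn {
	p.mu.Lock()
	defer p.mu.Unlock()
	if c, ok := p.conns[target]; ok {
		return c
	}
	c := dial(target)
	p.conns[target] = c
	return c
}

func main() {
	pool := newConnPool()
	// Two syncs against the same catalog source reuse one connection.
	a := pool.get("catalog-source:50051")
	b := pool.get("catalog-source:50051")
	fmt.Println(a == b) // true: no new connection per sync
}
```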
I will move to modified since the grpc library backport should fix it.
@Evan, should the pod have requests and limits set? We see none set right now on 4.1.18 clusters.
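For reference, requests and limits on the catalog-operator container would look like the sketch below. The values are hypothetical placeholders, not recommendations; real numbers would need profiling against actual usage:

```yaml
# Hypothetical sketch -- values are illustrative only.
# This stanza would go on the catalog-operator container spec.
resources:
  requests:
    cpu: 10m
    memory: 80Mi
  limits:
    memory: 200Mi
```

A memory limit would at least bound the blast radius of a leak (the pod gets OOMKilled and restarted instead of consuming 11GB+ on the node).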
*** Bug 1757924 has been marked as a duplicate of this bug. ***
Checked all OSD v4 production clusters; they're running 4.1.21 or greater. Ran this query:
All clusters returned empty results, meaning memory consumption is low.
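The query itself wasn't captured above. For illustration only, a check of this general shape would flag high-RSS catalog-operator containers (this is not the query that was actually run, and label names vary by version — clusters of this era may expose `container_name` rather than `container`):

```promql
# Illustrative sketch only -- not the actual query used for verification.
container_memory_rss{container="catalog-operator"} > 1024 * 1024 * 1024
```

An empty result set from a query like this means no catalog-operator container exceeds the threshold.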
From the OSD point of view I am calling this verified. Thanks!!
Marking verified on 4.1.21. @nmalik, thanks for the assist with verification.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.