Created attachment 1603513 [details]
container_memory_rss{namespace="openshift-operator-lifecycle-manager",container_name!="",container_name!="POD",container_name!="marketplace-operator"}>4*1024*1024*1024

Description of problem:
The "catalog-operator" pod on OSD v4.1 clusters shows memory consumption above 11GB RSS over a period of time. It takes about 12 hours to get to this point. The latest metrics on a cluster that is not doing anything active show consumption of about 11GB of RSS. Screenshot will be attached.

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.9     True        False         5d2h    Cluster version is 4.1.9

How reproducible:
Appears consistently on at least 4 clusters; I have not confirmed everywhere.

Steps to Reproduce:
1. Install cluster.
2. Wait.

Actual results:
catalog-operator consumes 11GB+ memory.

Expected results:
catalog-operator has a small memory footprint.

Additional info:
I have a cluster that has been running for almost one day, but I don't see this issue. I will keep an eye on it.

mac:~ jianzhang$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.11    True        False         16h     Cluster version is 4.1.11

mac:~ jianzhang$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-129-68.us-east-2.compute.internal    Ready    master   19h   v1.13.4+d81afa6ba
ip-10-0-141-168.us-east-2.compute.internal   Ready    worker   19h   v1.13.4+d81afa6ba
ip-10-0-153-224.us-east-2.compute.internal   Ready    worker   19h   v1.13.4+d81afa6ba
ip-10-0-155-205.us-east-2.compute.internal   Ready    master   19h   v1.13.4+d81afa6ba
ip-10-0-164-116.us-east-2.compute.internal   Ready    worker   19h   v1.13.4+d81afa6ba
ip-10-0-174-123.us-east-2.compute.internal   Ready    master   19h   v1.13.4+d81afa6ba

mac:~ jianzhang$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-64556ffff5-wq99z   1/1     Running   0          16h
olm-operator-6ff7dbf564-zvw92       1/1     Running   0          16h
olm-operators-85r4c                 1/1     Running   0          16h
packageserver-6fd666d6b9-gw9mm      1/1     Running   0          26m
packageserver-6fd666d6b9-mpdf4      1/1     Running   0          26m

mac:~ jianzhang$ oc adm top pod catalog-operator-64556ffff5-wq99z
NAME                                CPU(cores)   MEMORY(bytes)
catalog-operator-64556ffff5-wq99z   1m           36Mi
Created attachment 1603802 [details]
oc get clusterversion version -o yaml

ClusterVersion attached for context on cluster age and upgrades over time.
Created attachment 1608623 [details] must-gather.partaa
Created attachment 1608624 [details] must-gather.partah
Created attachment 1608625 [details] must-gather.partao
Created attachment 1608626 [details] must-gather.partav
Created attachment 1608627 [details] must-gather.partbc
Created attachment 1608628 [details] must-gather.partab
Created attachment 1608629 [details] must-gather.partai
Created attachment 1608630 [details] must-gather.partap
Created attachment 1608631 [details] must-gather.partaw
Created attachment 1608633 [details] must-gather.partbd
Created attachment 1608634 [details] must-gather.partac
Created attachment 1608635 [details] must-gather.partaj
Created attachment 1608636 [details] must-gather.partaq
Created attachment 1608637 [details] must-gather.partax
Created attachment 1608638 [details] must-gather.partad
Created attachment 1608639 [details] must-gather.partak
Created attachment 1608640 [details] must-gather.partar
Created attachment 1608641 [details] must-gather.partay
Created attachment 1608642 [details] must-gather.partae
Created attachment 1608643 [details] must-gather.partal
Created attachment 1608644 [details] must-gather.partas
Created attachment 1608645 [details] must-gather.partaz
Created attachment 1608646 [details] must-gather.partaf
Created attachment 1608647 [details] must-gather.partam
Created attachment 1608648 [details] must-gather.partat
Created attachment 1608649 [details] must-gather.partba
Created attachment 1608650 [details] must-gather.partag
Created attachment 1608652 [details] must-gather.partan
Created attachment 1608653 [details] must-gather.partau
Created attachment 1608654 [details] must-gather.partbb
Created attachment 1609008 [details]
Pod top output, per Brenton's request

Marking the must-gather parts as obsolete, since Brenton noted they do not contain the data needed to investigate.
Still seeing this on an OSD 4.1.14 cluster. On version 4.1.9 the cluster showed steady-state memory usage for the catalog-operator pod. Since upgrading to 4.1.13, and now to 4.1.14, the catalog-operator pod's memory usage grows until it is OOMKilled. I'll attach a graph covering the last 2 weeks, along with the ClusterVersion for the cluster's upgrade history. Evan, can you give an update on expectations for a fix?
Created attachment 1614994 [details] container_memory_rss for OSD 4.1.14 prod cluster
Created attachment 1614996 [details] clusterversion for OSD 4.1.14 prod cluster
We believe the source of the memory leak is an issue in the grpc libraries, and that the leak gets triggered very frequently on 4.1 due to the way we were managing grpc connections. We have already backported grpc library updates in 4.1.15 that should address the source of the leak. In 4.2 we have refactored the way we use those libraries to reduce the chance of triggering the leak in the first place.

If the issue still occurs on 4.1.15, the next step would be to backport the refactored grpc connection handling from 4.2. But in theory 4.1.15 should be fixed, and we'd like to avoid backporting the refactor if possible. I will move this to MODIFIED since the grpc library backport should fix it.
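For context on what "the way we use those libraries" means in practice, here is a minimal Go sketch of the connection-handling pattern the 4.2 refactor moves toward: one long-lived gRPC ClientConn per catalog source, closed when the source is removed, rather than a fresh Dial on every sync. This is illustrative only; the connPool type and its methods are hypothetical and are not the actual OLM code.

// Illustrative sketch only (not the actual OLM code): cache one long-lived
// gRPC client connection per catalog source address instead of dialing a new
// connection on every sync, and close it when the source is removed.
package grpcpool

import (
	"sync"

	"google.golang.org/grpc"
)

// connPool caches one ClientConn per catalog source address (hypothetical helper).
type connPool struct {
	mu    sync.Mutex
	conns map[string]*grpc.ClientConn
}

func newConnPool() *connPool {
	return &connPool{conns: make(map[string]*grpc.ClientConn)}
}

// get returns the cached connection for addr, dialing only on first use.
func (p *connPool) get(addr string) (*grpc.ClientConn, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if conn, ok := p.conns[addr]; ok {
		return conn, nil
	}
	conn, err := grpc.Dial(addr, grpc.WithInsecure())
	if err != nil {
		return nil, err
	}
	p.conns[addr] = conn
	return conn, nil
}

// release closes and forgets the connection when its catalog source goes away,
// so idle connections do not accumulate across syncs.
func (p *connPool) release(addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if conn, ok := p.conns[addr]; ok {
		conn.Close()
		delete(p.conns, addr)
	}
}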
@Evan, should the pod have resource requests and limits set? We see none set right now on 4.1.18 clusters.
*** Bug 1757924 has been marked as a duplicate of this bug. ***
Checked on all OSD v4 production clusters; they're running 4.1.21 or greater. Ran this query:

container_memory_rss{namespace="openshift-marketplace",container_name!="",container_name!="POD",container_name!="marketplace-operator"}/1024/1024/1024>0.1

All clusters returned empty results, meaning low memory consumption. From the OSD point of view I am calling this verified. Thanks!!
Marking verified on 4.1.21. @nmalik, thanks for the assist with verification.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:3875