Bug 1740937 - Pods for marketplace CatalogSource and CatalogSourceConfig consuming large amounts of memory
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.1.z
Assignee: Evan Cordell
QA Contact: Fan Jia
URL:
Whiteboard:
Depends On: 1746197
Blocks:
 
Reported: 2019-08-13 22:00 UTC by Naveen Malik
Modified: 2019-09-30 18:55 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1746197
Environment:
Last Closed: 2019-09-25 07:27:53 UTC
Target Upstream Version:


Attachments (Terms of Use)
cblecker-4x: last 2 weeks of metric (143.54 KB, image/png) - 2019-08-13 22:06 UTC, Naveen Malik
cblecker-4x: from when the 4.1.9 upgrade completed until memory stabilized (85.29 KB, image/png) - 2019-08-13 22:07 UTC, Naveen Malik
cluster-with-operators: last 2 weeks of query (127.27 KB, image/png) - 2019-08-13 22:12 UTC, Naveen Malik
example catalog-operator log from 4.1.15 cluster with this problem (6.70 MB, text/plain) - 2019-09-19 12:58 UTC, Naveen Malik


Links
Red Hat Product Errata RHBA-2019:2820 - Last Updated 2019-09-25 07:28:01 UTC

Description Naveen Malik 2019-08-13 22:00:07 UTC
Description of problem:
On clusters without operators installed, the marketplace CatalogSource pods are each consuming 1.5G RSS as reported by the container_memory_rss metric. There are 3 of these pods, so total consumption is over 4.5G. On startup the pods do not use much (around 64M), but they grow steadily over 8 hours and cap above 1.5G each.

Note that on some clusters these pods are seen consuming above 3G each.

In addition, the "installed operators" pods consume similar amounts of memory when operators are installed. One example cluster is consuming 8.8G RSS total across the following containers in openshift-marketplace (a sketch of how to take this reading follows the list):
- certified-operators
- community-operators
- installed-redhat-openshift-logging
- installed-openshift-operators
- redhat-operators
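
For anyone trying to confirm the same numbers, a point-in-time reading of per-pod memory can be taken with oc; a minimal sketch, assuming the cluster metrics API is available so that oc adm top works:

  # Current CPU/memory of the pods in the marketplace namespace
  oc adm top pods -n openshift-marketplace

The growth over time is easier to see with the container_memory_rss query captured in the comments below.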


Version-Release number of selected component (if applicable):
4.1.9

How reproducible:
Seems to happen on any long-lived cluster that has been upgraded over the last month. A fresh cluster installed today does not show more than 500M used.

Steps to Reproduce:
1. Install OCP
2. Let the cluster age or upgrade it? It is unclear what causes the maximum memory consumed to be higher on these older clusters. (A rough monitoring loop is sketched below.)
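
The following is only a sketch of how that growth could be tracked while the cluster ages; it samples oc adm top output once an hour and appends it to a local file (marketplace-mem.log is an arbitrary name):

  # Sample per-pod memory in openshift-marketplace once an hour
  while true; do
    date >> marketplace-mem.log
    oc adm top pods -n openshift-marketplace >> marketplace-mem.log
    sleep 3600
  done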


Actual results:
Heavy memory consumption.

Expected results:
Low memory consumption.



Additional info:
Some long-lived clusters showed high memory consumption in the past but lower consumption since late last week. Memory consumption does not directly correspond with the cluster being upgraded to 4.1.9. I'll put additional details in comments for each of the scenarios I can document.

Comment 1 Naveen Malik 2019-08-13 22:06:20 UTC
Cluster: cblecker-4x
Env: stage
Created: 6/5/2019 3:43:09 PM
Current version: 4.1.9
History from ClusterVersion for latest version:
    - completionTime: "2019-08-07T17:51:15Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:27fd24c705d1107cc73cb7dda8257fe97900e130b68afc314d0ef0e31bcf9b8e
      startedTime: "2019-08-07T17:12:00Z"
      state: Completed
      verified: true
      version: 4.1.9
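
For completeness, the history above can be pulled from any cluster with oc; a minimal sketch (the upgrade history lives under .status.history of the ClusterVersion object):

  # Full upgrade history of the cluster
  oc get clusterversion version -o yaml
  # or only the versions
  oc get clusterversion version -o jsonpath='{.status.history[*].version}'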

Query in screenshots: container_memory_rss{namespace="openshift-marketplace",container_name!="",container_name!="POD",container_name!="marketplace-operator"}
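
The same query can also be run outside the console against the in-cluster Prometheus HTTP API; a minimal sketch, assuming the default prometheus-k8s route in openshift-monitoring and a logged-in oc session with permission to view metrics:

  # Resolve the Prometheus route and run the query above
  PROM_HOST=$(oc get route prometheus-k8s -n openshift-monitoring -o jsonpath='{.spec.host}')
  TOKEN=$(oc whoami -t)
  curl -sk -G "https://$PROM_HOST/api/v1/query" \
    -H "Authorization: Bearer $TOKEN" \
    --data-urlencode 'query=container_memory_rss{namespace="openshift-marketplace",container_name!="",container_name!="POD",container_name!="marketplace-operator"}'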

Screenshots:

cblecker-4x-2w.png - last 2 weeks of query

cblecker-4x-upgrade-to-stable.png - from when the 4.1.9 upgrade completed until memory stabilized


This shows the containers were growing post-upgrade, were restarted a few times, and eventually stabilized at a reasonable memory consumption.

Comment 2 Naveen Malik 2019-08-13 22:06:51 UTC
Created attachment 1603556 [details]
cblecker-4x: last 2 weeks of metric

Comment 3 Naveen Malik 2019-08-13 22:07:20 UTC
Created attachment 1603557 [details]
cblecker-4x: from when the 4.1.9 upgrade completed until memory stabilized

Comment 4 Naveen Malik 2019-08-13 22:11:41 UTC
Cluster: example with operators installed
Env: production
Created: 2019-07-16T22:03:14Z
Current version: 4.1.9
History from ClusterVersion for latest version:
    - completionTime: "2019-08-08T20:48:40Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:27fd24c705d1107cc73cb7dda8257fe97900e130b68afc314d0ef0e31bcf9b8e
      startedTime: "2019-08-08T14:36:40Z"
      state: Completed
      verified: true
      version: 4.1.9

Query in screenshots: container_memory_rss{namespace="openshift-marketplace",container_name!="",container_name!="POD",container_name!="marketplace-operator"}

Screenshots:

cluster-with-operators-2w.png - last 2 weeks of query


This shows the containers growing post-upgrade. In addition, the cluster was used to install operators via OperatorHub late last week. Each line is a pod in the openshift-marketplace namespace:
- certified-operators
- community-operators
- installed-redhat-openshift-logging
- installed-openshift-operators
- redhat-operators

Comment 5 Naveen Malik 2019-08-13 22:12:14 UTC
Created attachment 1603558 [details]
cluster-with-operators: last 2 weeks of query

Comment 6 Jian Zhang 2019-08-14 02:44:19 UTC
Hi Naveen,

Thanks for reporting this issue. We have a cluster that has been running for one day, but we don't see the issue there. We will keep an eye on it, thanks!

mac:~ jianzhang$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.11    True        False         16h     Cluster version is 4.1.11
mac:~ jianzhang$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-129-68.us-east-2.compute.internal    Ready    master   19h   v1.13.4+d81afa6ba
ip-10-0-141-168.us-east-2.compute.internal   Ready    worker   19h   v1.13.4+d81afa6ba
ip-10-0-153-224.us-east-2.compute.internal   Ready    worker   19h   v1.13.4+d81afa6ba
ip-10-0-155-205.us-east-2.compute.internal   Ready    master   19h   v1.13.4+d81afa6ba
ip-10-0-164-116.us-east-2.compute.internal   Ready    worker   19h   v1.13.4+d81afa6ba
ip-10-0-174-123.us-east-2.compute.internal   Ready    master   19h   v1.13.4+d81afa6ba
mac:~ jianzhang$ oc adm top pods
NAME                                    CPU(cores)   MEMORY(bytes)   
certified-operators-6bcdc96b-lzvd9      2m           22Mi            
community-operators-655bb9cd-h9fn7      2m           68Mi            
marketplace-operator-7df66dbf67-d7829   2m           14Mi            
redhat-operators-7c4b9f9f6f-b978p       3m           40Mi

Comment 8 Evan Cordell 2019-09-14 15:00:32 UTC
We think the gRPC library backports we performed fix this in 4.1.15. Moving to MODIFIED for that reason.

Comment 11 Naveen Malik 2019-09-18 13:09:21 UTC
Still seeing this issue on a cluster upgraded to 4.1.15.

Comment 12 Naveen Malik 2019-09-19 12:58:35 UTC
Created attachment 1616712 [details]
example catalog-operator log from 4.1.15 cluster with this problem
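
For anyone collecting the same data, a log like the attached one can usually be pulled straight from the cluster; a minimal sketch, assuming the default OLM namespace and deployment name used by OCP 4.1:

  # Dump the current catalog-operator log to a file
  oc logs -n openshift-operator-lifecycle-manager deployment/catalog-operator > catalog-operator.log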

Comment 14 errata-xmlrpc 2019-09-25 07:27:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2820

