Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1740937

Summary: Pods for marketplace CatalogSource and CatalogSourceConfig consuming large amounts of memory
Product: OpenShift Container Platform
Component: OLM
OLM sub component: OperatorHub
Version: 4.1.z
Target Milestone: ---
Target Release: 4.1.z
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Naveen Malik <nmalik>
Assignee: Evan Cordell <ecordell>
QA Contact: Fan Jia <jfan>
CC: bandrade, cblecker, chuo, jeder, scolange
Doc Type: If docs needed, set a value
Type: Bug
Cloned as: 1746197 (view as bug list)
Last Closed: 2019-09-25 07:27:53 UTC
Bug Depends On: 1746197
Bug Blocks: ---
Attachments:
- cblecker-4x: last 2 weeks of metric
- cblecker-4x: from when the 4.1.9 upgrade completed until memory stabilized
- cluster-with-operators: last 2 weeks of query
- example catalog-operator log from 4.1.15 cluster with this problem

Description Naveen Malik 2019-08-13 22:00:07 UTC
Description of problem:
On clusters without operators installed, the marketplace CatalogSource pods are each consuming 1.5G RSS as reported by the container_memory_rss metric.  There are 3 of these pods, so total consumption is over 4.5G.  On startup each pod uses little memory (about 64M), but grows steadily over 8 hours, capping above 1.5G.

Note that on some clusters these pods are seen consuming above 3G each.

In addition, the "installed operators" pods consume similar amounts of memory when operators are installed.  One example cluster is consuming 8.8G RSS total across the following containers in openshift-marketplace:
- certified-operators
- community-operators
- installed-redhat-openshift-logging
- installed-openshift-operators
- redhat-operators


Version-Release number of selected component (if applicable):
4.1.9

How reproducible:
This seems to happen on any long-lived cluster that has been upgraded over the last month.  A fresh cluster installed today does not show more than 500M used.

Steps to Reproduce:
1. Install OCP
2. Let the cluster age or upgrade it?  It is unclear what causes the maximum memory consumption to be higher on these older clusters.
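The steady growth described above can be confirmed by sampling pod memory twice, some hours apart, and computing a growth rate. A minimal sketch, not from this report: the `growth_rate` helper is hypothetical, and the samples (in MiB) would come from something like `oc adm top pods -n openshift-marketplace`:

```shell
#!/bin/sh
# Estimate per-hour RSS growth from two samples (in MiB) taken $3 hours apart.
# Samples can be read manually from: oc adm top pods -n openshift-marketplace
growth_rate() {
  awk -v s="$1" -v e="$2" -v h="$3" 'BEGIN { printf "%.0f\n", (e - s) / h }'
}

# Figures from the description above: ~64 MiB at startup,
# ~1536 MiB (1.5G) after 8 hours.
growth_rate 64 1536 8   # MiB of growth per hour for one catalog pod; prints 184
```

A healthy catalog pod should show a growth rate near zero once its initial cache is built; a consistently positive rate over many hours is the leak pattern reported here.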


Actual results:
Heavy memory consumption.

Expected results:
Low memory consumption.



Additional info:
Some long-lived clusters showed high memory consumption in the past but lower consumption since late last week.  Memory consumption does not directly correspond with the cluster being upgraded to 4.1.9.  I'll put additional details in comments for each of the scenarios I can document.

Comment 1 Naveen Malik 2019-08-13 22:06:20 UTC
Cluster: cblecker-4x
Env: stage
Created: 6/5/2019 3:43:09 PM
Current version: 4.1.9
History from ClusterVersion for latest version:
    - completionTime: "2019-08-07T17:51:15Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:27fd24c705d1107cc73cb7dda8257fe97900e130b68afc314d0ef0e31bcf9b8e
      startedTime: "2019-08-07T17:12:00Z"
      state: Completed
      verified: true
      version: 4.1.9

Query in screenshots: container_memory_rss{namespace="openshift-marketplace",container_name!="",container_name!="POD",container_name!="marketplace-operator"}
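The same query can also be run against the cluster's Prometheus API from a terminal instead of the console. A hedged sketch, not from this report: it assumes the default OCP 4 monitoring route (`prometheus-k8s` in `openshift-monitoring`) and uses `python3` for percent-encoding the PromQL expression:

```shell
#!/bin/sh
# Percent-encode a PromQL expression for use as a query-string parameter.
urlencode() {
  python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$1"
}

QUERY='container_memory_rss{namespace="openshift-marketplace",container_name!="",container_name!="POD",container_name!="marketplace-operator"}'

# Assumed names, not from this report: the default prometheus-k8s route
# in the openshift-monitoring namespace on OCP 4.x.
if command -v oc >/dev/null 2>&1; then
  HOST=$(oc get route prometheus-k8s -n openshift-monitoring -o jsonpath='{.spec.host}')
  curl -sk -H "Authorization: Bearer $(oc whoami -t)" \
    "https://${HOST}/api/v1/query?query=$(urlencode "$QUERY")"
fi
```

The JSON response contains one time series per container, which is what the attached screenshots graph over time.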

Screenshots:

cblecker-4x-2w.png - last 2 weeks of query

cblecker-4x-upgrade-to-stable.png - from when the 4.1.9 upgrade completed until memory stabilized


This shows the containers were growing post-upgrade, were restarted a few times, and eventually stabilized at a reasonable level of memory consumption.

Comment 2 Naveen Malik 2019-08-13 22:06:51 UTC
Created attachment 1603556 [details]
cblecker-4x: last 2 weeks of metric

Comment 3 Naveen Malik 2019-08-13 22:07:20 UTC
Created attachment 1603557 [details]
cblecker-4x: from when the 4.1.9 upgrade completed until memory stabilized

Comment 4 Naveen Malik 2019-08-13 22:11:41 UTC
Cluster: example with operators installed
Env: production
Created: 2019-07-16T22:03:14Z
Current version: 4.1.9
History from ClusterVersion for latest version:
    - completionTime: "2019-08-08T20:48:40Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:27fd24c705d1107cc73cb7dda8257fe97900e130b68afc314d0ef0e31bcf9b8e
      startedTime: "2019-08-08T14:36:40Z"
      state: Completed
      verified: true
      version: 4.1.9

Query in screenshots: container_memory_rss{namespace="openshift-marketplace",container_name!="",container_name!="POD",container_name!="marketplace-operator"}

Screenshots:

cluster-with-operators-2w.png - last 2 weeks of query


This shows the containers growing post-upgrade.  In addition, the cluster was used to install operators via OperatorHub late last week.  Each of the lines is a pod in the openshift-marketplace namespace:
- certified-operators
- community-operators
- installed-redhat-openshift-logging
- installed-openshift-operators
- redhat-operators

Comment 5 Naveen Malik 2019-08-13 22:12:14 UTC
Created attachment 1603558 [details]
cluster-with-operators: last 2 weeks of query

Comment 6 Jian Zhang 2019-08-14 02:44:19 UTC
Hi Naveen,

Thanks for reporting this issue. We have a cluster that has been running for one day, but we don't see this issue there. We will keep an eye on it, thanks!

mac:~ jianzhang$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.11    True        False         16h     Cluster version is 4.1.11
mac:~ jianzhang$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-129-68.us-east-2.compute.internal    Ready    master   19h   v1.13.4+d81afa6ba
ip-10-0-141-168.us-east-2.compute.internal   Ready    worker   19h   v1.13.4+d81afa6ba
ip-10-0-153-224.us-east-2.compute.internal   Ready    worker   19h   v1.13.4+d81afa6ba
ip-10-0-155-205.us-east-2.compute.internal   Ready    master   19h   v1.13.4+d81afa6ba
ip-10-0-164-116.us-east-2.compute.internal   Ready    worker   19h   v1.13.4+d81afa6ba
ip-10-0-174-123.us-east-2.compute.internal   Ready    master   19h   v1.13.4+d81afa6ba
mac:~ jianzhang$ oc adm top pods
NAME                                    CPU(cores)   MEMORY(bytes)   
certified-operators-6bcdc96b-lzvd9      2m           22Mi            
community-operators-655bb9cd-h9fn7      2m           68Mi            
marketplace-operator-7df66dbf67-d7829   2m           14Mi            
redhat-operators-7c4b9f9f6f-b978p       3m           40Mi

Comment 8 Evan Cordell 2019-09-14 15:00:32 UTC
We think the grpc library backports we performed fix this in 4.1.15. Moving to MODIFIED for that reason.

Comment 11 Naveen Malik 2019-09-18 13:09:21 UTC
Still seeing this issue on a cluster upgraded to 4.1.15.

Comment 12 Naveen Malik 2019-09-19 12:58:35 UTC
Created attachment 1616712 [details]
example catalog-operator log from 4.1.15 cluster with this problem

Comment 14 errata-xmlrpc 2019-09-25 07:27:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2820