Created attachment 1603513 [details]
Description of problem:
The "catalog-operator" pod on OSD v4.1 clusters shows memory consumption above 11GB RSS over time; it takes about 12 hours to reach that point. The latest metrics on a cluster that's not doing anything active show about 11GB RSS. Screenshot will be attached.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.9     True        False         5d2h    Cluster version is 4.1.9
This appears to be a consistent problem on at least 4 clusters; I have not confirmed it everywhere.
Steps to Reproduce:
1. Install cluster.
Actual results:
catalog-operator consumes 11GB+ memory.

Expected results:
catalog-operator has a small memory footprint.
I have a cluster that has been running for almost one day, but I don't see this issue. I will keep an eye on it.
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.11    True        False         16h     Cluster version is 4.1.11
mac:~ jianzhang$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-129-68.us-east-2.compute.internal    Ready    master   19h   v1.13.4+d81afa6ba
ip-10-0-141-168.us-east-2.compute.internal   Ready    worker   19h   v1.13.4+d81afa6ba
ip-10-0-153-224.us-east-2.compute.internal   Ready    worker   19h   v1.13.4+d81afa6ba
ip-10-0-155-205.us-east-2.compute.internal   Ready    master   19h   v1.13.4+d81afa6ba
ip-10-0-164-116.us-east-2.compute.internal   Ready    worker   19h   v1.13.4+d81afa6ba
ip-10-0-174-123.us-east-2.compute.internal   Ready    master   19h   v1.13.4+d81afa6ba
mac:~ jianzhang$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-64556ffff5-wq99z   1/1     Running   0          16h
olm-operator-6ff7dbf564-zvw92       1/1     Running   0          16h
olm-operators-85r4c                 1/1     Running   0          16h
packageserver-6fd666d6b9-gw9mm      1/1     Running   0          26m
packageserver-6fd666d6b9-mpdf4      1/1     Running   0          26m
mac:~ jianzhang$ oc adm top pod catalog-operator-64556ffff5-wq99z
NAME                                CPU(cores)   MEMORY(bytes)
catalog-operator-64556ffff5-wq99z   1m           36Mi
Created attachment 1603802 [details]
oc get clusterversion version -o yaml
ClusterVersion attached for context on cluster age and upgrades over time.
Created attachments 1608623 through 1608654 [details] (30 must-gather attachments)
Created attachment 1609008 [details]
Pod top output, per Brenton's request
Marking must-gather parts as obsolete as Brenton noted they do not contain the data needed to investigate.
Still seeing this on an OSD 4.1.14 cluster.
On version 4.1.9 the cluster showed steady-state memory usage for the catalog-operator pod. Since upgrading to 4.1.13, and now to 4.1.14, the catalog-operator pod's memory usage grows until it is OOMKilled.
I'll attach a graph covering the last 2 weeks, along with the cluster's ClusterVersion for upgrade history.
Evan, can you update on expectations for a fix?
Created attachment 1614994 [details]
container_memory_rss for OSD 4.1.14 prod cluster
Created attachment 1614996 [details]
clusterversion for OSD 4.1.14 prod cluster
We believe that the source of the memory leak is an issue in the grpc libraries, and that those leaks get triggered very frequently on 4.1 due to the way we were managing grpc connections.
We have already backported grpc library updates in 4.1.15 that should address the source of the memory leak. In 4.2 we have refactored the way we use those libraries to reduce the chance of triggering the leaks in the first place.
If the issue still occurs on 4.1.15, the next step would be to backport the refactored grpc connection handling from 4.2. But in theory, 4.1.15 should be fixed, and we'd like to avoid backporting the refactor if possible.
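For context on the 4.2 refactor mentioned above, the general pattern is to cache and reuse one connection per catalog source instead of dialing a fresh connection on every sync, so that abandoned connections can't pile up. The sketch below is a minimal illustration of that pattern with a stand-in `conn` type; it is not OLM's actual code, and the names (`connPool`, `dial`) are invented for this example:

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a stand-in for a grpc.ClientConn; the real operator talks
// to catalog-source registry pods over gRPC.
type conn struct{ target string }

func dial(target string) *conn { return &conn{target: target} }
func (c *conn) Close()         {}

// connPool caches one connection per target. Dialing a new connection
// on every sync loop and never closing the old one is the kind of
// pattern that can trigger leaks in the underlying library.
type connPool struct {
	mu    sync.Mutex
	conns map[string]*conn
}

func newConnPool() *connPool { return &connPool{conns: map[string]*conn{}} }

// get returns the cached connection for target, dialing only on first use.
func (p *connPool) get(target string) *conn {
	p.mu.Lock()
	defer p.mu.Unlock()
	if c, ok := p.conns[target]; ok {
		return c
	}
	c := dial(target)
	p.conns[target] = c
	return c
}

func main() {
	pool := newConnPool()
	// Two syncs against the same catalog source reuse one connection.
	a := pool.get("catalog-source:50051")
	b := pool.get("catalog-source:50051")
	fmt.Println(a == b) // true: no new connection per sync
}
```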
I will move to modified since the grpc library backport should fix it.
@Evan, should the pod have requests and limits set? We see none set right now on 4.1.18 clusters.
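For reference, requests and limits on the catalog-operator container would look like the sketch below. The values are hypothetical placeholders, not recommendations; real numbers would need profiling against actual usage:

```yaml
# Hypothetical sketch -- values are illustrative only.
# This stanza would go on the catalog-operator container spec.
resources:
  requests:
    cpu: 10m
    memory: 80Mi
  limits:
    memory: 200Mi
```

A memory limit would at least bound the blast radius of a leak (the pod gets OOMKilled and restarted instead of consuming 11GB+ on the node).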
*** Bug 1757924 has been marked as a duplicate of this bug. ***
Checked all OSD v4 production clusters; they're running 4.1.21 or greater. Ran this query:
All clusters returned empty results, meaning memory consumption is low.
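The query itself wasn't captured above. For illustration only, a check of this general shape would flag high-RSS catalog-operator containers (this is not the query that was actually run, and label names vary by version — clusters of this era may expose `container_name` rather than `container`):

```promql
# Illustrative sketch only -- not the actual query used for verification.
container_memory_rss{container="catalog-operator"} > 1024 * 1024 * 1024
```

An empty result set from a query like this means no catalog-operator container exceeds the threshold.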
From the OSD point of view I am calling this verified. Thanks!!
Marking verified on 4.1.21. @nmalik, thanks for the assist with verification.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.