Description of problem:
On clusters without operators installed, the marketplace CatalogSource pods are each consuming 1.5G RSS as reported by the container_memory_rss metric. There are 3 of these pods, so total consumption is over 4.5G. On startup the pods do not use much (around 64M), but they grow steadily over 8 hours and cap above 1.5G each. Note that on some clusters this is seen to exceed 3G each.

In addition, the "installed operators" pods consume similar amounts of memory when operators are installed. One example cluster is consuming 8.8G RSS total across the following containers in openshift-marketplace:
- certified-operators
- community-operators
- installed-redhat-openshift-logging
- installed-openshift-operators
- redhat-operators

Version-Release number of selected component (if applicable):
4.1.9

How reproducible:
Seems to be happening on any long-lived cluster that has been upgraded over the last month. A fresh cluster installed today does not show more than 500M used.

Steps to Reproduce:
1. Install OCP
2. Let the cluster age or upgrade it? Unsure what is causing the maximum memory consumed to be higher on these older clusters.

Actual results:
Heavy memory consumption.

Expected results:
Low memory consumption.

Additional info:
Some long-lived clusters showed high memory consumption in the past but lower consumption since late last week. Memory consumption does not directly correspond with the cluster being upgraded to 4.1.9. I'll put additional details in comments for each of the scenarios I can document.
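For anyone trying to reproduce, a quick spot check of current per-pod usage (assuming the cluster metrics API is available) can be done with:

  oc adm top pods -n openshift-marketplace

The slow growth over hours is easier to see with the container_memory_rss queries shown in the comments below.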
Cluster: cblecker-4x
Env: stage
Created: 6/5/2019 3:43:09 PM
Current version: 4.1.9

History from ClusterVersion for latest version:
- completionTime: "2019-08-07T17:51:15Z"
  image: quay.io/openshift-release-dev/ocp-release@sha256:27fd24c705d1107cc73cb7dda8257fe97900e130b68afc314d0ef0e31bcf9b8e
  startedTime: "2019-08-07T17:12:00Z"
  state: Completed
  verified: true
  version: 4.1.9

Query in screenshots:
container_memory_rss{namespace="openshift-marketplace",container_name!="",container_name!="POD",container_name!="marketplace-operator"}

Screenshots:
- cblecker-4x-2w.png - last 2 weeks of the query
- cblecker-4x-upgrade-to-stable.png - from when the 4.1.9 upgrade completed until memory stabilized

This shows the containers were growing post-upgrade, were restarted a few times, and eventually stabilized at reasonable memory consumption.
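For reference, the history block above can be pulled from the cluster with something like the following (the history list in .status is most-recent-first):

  oc get clusterversion version -o yaml   # see .status.history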
Created attachment 1603556 [details] cblecker-4x: last 2 weeks of metric
Created attachment 1603557 [details] cblecker-4x: from when the 4.1.9 upgrade completed until memory stabilized
Cluster: example with operators installed
Env: production
Created: 2019-07-16T22:03:14Z
Current version: 4.1.9

History from ClusterVersion for latest version:
- completionTime: "2019-08-08T20:48:40Z"
  image: quay.io/openshift-release-dev/ocp-release@sha256:27fd24c705d1107cc73cb7dda8257fe97900e130b68afc314d0ef0e31bcf9b8e
  startedTime: "2019-08-08T14:36:40Z"
  state: Completed
  verified: true
  version: 4.1.9

Query in screenshots:
container_memory_rss{namespace="openshift-marketplace",container_name!="",container_name!="POD",container_name!="marketplace-operator"}

Screenshots:
- cluster-with-operators-2w.png - last 2 weeks of the query

This shows the containers are growing post-upgrade. In addition, the cluster was used to install operators via OperatorHub late last week. Each of the lines is a pod in the openshift-marketplace namespace:
- certified-operators
- community-operators
- installed-redhat-openshift-logging
- installed-openshift-operators
- redhat-operators
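As a side note, a namespace-wide total (e.g., the 8.8G figure mentioned in the description) can be approximated by wrapping the same selector in sum(); this is a sketch, not necessarily the exact query used to arrive at that number:

  sum(container_memory_rss{namespace="openshift-marketplace",container_name!="",container_name!="POD",container_name!="marketplace-operator"})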
Created attachment 1603558 [details] cluster-with-operators: last 2 weeks of query
Hi Naveen,

Thanks for reporting this issue. We have a cluster that has been running for one day, but we don't see this issue on it. We will keep an eye on it, thanks!

mac:~ jianzhang$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.11    True        False         16h     Cluster version is 4.1.11

mac:~ jianzhang$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-129-68.us-east-2.compute.internal    Ready    master   19h   v1.13.4+d81afa6ba
ip-10-0-141-168.us-east-2.compute.internal   Ready    worker   19h   v1.13.4+d81afa6ba
ip-10-0-153-224.us-east-2.compute.internal   Ready    worker   19h   v1.13.4+d81afa6ba
ip-10-0-155-205.us-east-2.compute.internal   Ready    master   19h   v1.13.4+d81afa6ba
ip-10-0-164-116.us-east-2.compute.internal   Ready    worker   19h   v1.13.4+d81afa6ba
ip-10-0-174-123.us-east-2.compute.internal   Ready    master   19h   v1.13.4+d81afa6ba

mac:~ jianzhang$ oc adm top pods
NAME                                    CPU(cores)   MEMORY(bytes)
certified-operators-6bcdc96b-lzvd9      2m           22Mi
community-operators-655bb9cd-h9fn7      2m           68Mi
marketplace-operator-7df66dbf67-d7829   2m           14Mi
redhat-operators-7c4b9f9f6f-b978p       3m           40Mi
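Since the growth reportedly takes around 8 hours to plateau, a single snapshot may not show it; sampling periodically with a simple loop like this (interval arbitrary) should make the trend visible even without Prometheus access:

  while true; do date; oc adm top pods -n openshift-marketplace; sleep 1800; done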
We think the gRPC library backports we performed fix this in 4.1.15. Moving to MODIFIED for that reason.
Still seeing this issue on a cluster upgraded to 4.1.15.
Created attachment 1616712 [details] example catalog-operator log from 4.1.15 cluster with this problem
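For anyone gathering the same data, a log like the one attached can be collected from the catalog-operator pod; something along these lines should work (namespace assumed to be the default OLM namespace in 4.1):

  oc logs -n openshift-operator-lifecycle-manager deployment/catalog-operator > catalog-operator.log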
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2820