Bug 1711402 - CCO repeatedly OOMKilled in cluster with 2350 projects
Summary: CCO repeatedly OOMKilled in cluster with 2350 projects
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Credential Operator
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Devan Goodwin
QA Contact: Mike Fiedler
URL:
Whiteboard: aos-scalability-41
Depends On:
Blocks: 1723892
 
Reported: 2019-05-17 17:14 UTC by Mike Fiedler
Modified: 2019-10-16 06:29 UTC (History)
CC: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A memory limit was set on the operator pod. Consequence: the Cloud Credential Operator could be OOM-killed on clusters with large numbers of projects/namespaces. Fix: Remove the memory limit. Result: The operator no longer crashes, and memory is managed by the cluster itself.
Clone Of:
: 1723892 (view as bug list)
Environment:
Last Closed: 2019-10-16 06:29:06 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
cloud-credential-operator memory usage (21.31 KB, image/png)
2019-05-17 17:25 UTC, Mike Fiedler


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:2922 0 None None None 2019-10-16 06:29:17 UTC

Description Mike Fiedler 2019-05-17 17:14:16 UTC
Description of problem:

While creating 2000+ projects for a large data upgrade test, cloud-credential-operator started crash-looping due to OOM kills.

openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   1/1   Running   7     4h50m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   OOMKilled   7     4h51m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   CrashLoopBackOff   7     4h51m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   1/1   Running   8     4h56m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   OOMKilled   8     4h56m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   CrashLoopBackOff   8     4h56m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   1/1   Running   9     5h1m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   OOMKilled   9     5h1m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   CrashLoopBackOff   9     5h1m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   1/1   Running   10    5h6m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   Error   10    5h6m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   CrashLoopBackOff   10    5h6m
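The OOM kills in the listing above can be confirmed from the container's last termination state, which the kubelet records in the standard Kubernetes pod status. A minimal sketch (the pod JSON below is illustrative sample data in the standard shape; in a real cluster it would come from `oc get pod <name> -n openshift-cloud-credential-operator -o json`):

```python
import json

# Illustrative pod status in the standard Kubernetes shape, mimicking
# the crash-looping CCO pod from this report.
pod_json = json.dumps({
    "status": {
        "containerStatuses": [
            {
                "name": "cloud-credential-operator",
                "restartCount": 10,
                "lastState": {
                    # An OOM-killed container terminates with exit code 137
                    # and reason "OOMKilled".
                    "terminated": {"reason": "OOMKilled", "exitCode": 137}
                },
            }
        ]
    }
})

def oom_killed_containers(pod: str) -> list[str]:
    """Return names of containers whose last termination was an OOM kill."""
    status = json.loads(pod).get("status", {})
    return [
        cs["name"]
        for cs in status.get("containerStatuses", [])
        if cs.get("lastState", {}).get("terminated", {}).get("reason") == "OOMKilled"
    ]

print(oom_killed_containers(pod_json))  # ['cloud-credential-operator']
```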


Version-Release number of selected component (if applicable): 4.1.0-0.nightly-2019-05-16-223922


How reproducible: Unknown; once so far. I can try deleting projects and recreating if needed.


Steps to Reproduce:
1. Install AWS IPI cluster with xlarge workers (instead of large)
2. Create 2350 projects, each with 2 deployments, 0 pods, 6 builds, 1 image stream, 3 routes, and 20 secrets

Actual results:

cloud-credential-operator starts crash looping due to OOMKills


Additional info:

attached tarball contains
1. CCO pod log
2. oc describe of the master node where CCO is running
3. oc adm must-gather output

Comment 2 Devan Goodwin 2019-05-17 17:24:24 UTC
We do monitor namespaces because of ordering problems at creation time: CredentialsRequests live in our namespace, but their target secrets live in component namespaces. We watch namespaces so we can act as soon as one is created and have the credential secret ready for that component as soon as possible.

Very likely this is just running into the current 500MB memory limit. We can drop the limit entirely or bump it up.
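The limit in question is the standard Kubernetes container memory limit. As a sketch of the fix (the actual CCO deployment manifest is not shown in this report, so the names and values here are illustrative), dropping the limit means deleting the `resources.limits.memory` entry so the pod is only bounded by node memory:

```yaml
# Illustrative container spec, not the actual CCO manifest.
containers:
- name: cloud-credential-operator
  resources:
    requests:
      memory: 150Mi   # scheduling hint, kept
    limits:
      memory: 500Mi   # exceeding this triggers the OOM kill;
                      # the fix removes this limit entirely
```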

Is this a 4.1 blocker?

Comment 3 Mike Fiedler 2019-05-17 17:24:34 UTC
Correction: the project spec in the description was for the first 2000 projects. The remaining 300 projects have 1 deployment with 1 pod. Looking at the attached Grafana graph of CCO memory usage, the OOM killing started while these final 300 projects were being added.

Comment 4 Mike Fiedler 2019-05-17 17:25:34 UTC
Created attachment 1570208 [details]
cloud-credential-operator memory usage

Comment 7 Devan Goodwin 2019-05-21 17:33:52 UTC
https://github.com/openshift/cloud-credential-operator/pull/68 has merged.

Comment 8 Mike Fiedler 2019-05-22 14:16:06 UTC
This was fixed on master/4.2.0

Comment 13 Mike Fiedler 2019-06-07 02:26:38 UTC
This was never backported to 4.1.z. Changing the target for this to 4.2. If we do a 4.1.z backport, please clone.

Comment 14 Devan Goodwin 2019-06-20 17:29:38 UTC
Can we get this verified against 4.2? The problem has surfaced in the starter clusters, so we need to get it into a .z release.

Comment 15 Mike Fiedler 2019-06-20 21:08:48 UTC
Test running now. Normally verification is done on nightly builds, which was the hold-up, but I will report back the result on a CI build here in a few hours.

Comment 16 Mike Fiedler 2019-06-21 01:45:33 UTC
Verified on CI build 4.2.0-0.ci-2019-06-20-163251. Created 4000 projects with 72K+ secrets and drove CCO memory over 650MB with no OOM kills or restarts of the operator pod.

Comment 18 errata-xmlrpc 2019-10-16 06:29:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

