Bug 1711402

Summary: CCO repeatedly OOMKilled in cluster with 2350 projects
Product: OpenShift Container Platform Reporter: Mike Fiedler <mifiedle>
Component: Cloud Credential Operator Assignee: Devan Goodwin <dgoodwin>
Status: CLOSED ERRATA QA Contact: Mike Fiedler <mifiedle>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.1.0 CC: decarr, dgoodwin, sponnaga
Target Milestone: ---   
Target Release: 4.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: aos-scalability-41
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The operator pod had a memory limit. Consequence: The Cloud Credential Operator could crash on clusters with large numbers of projects/namespaces. Fix: The memory limit was removed. Result: The operator no longer crashes, and its memory is handled by the cluster itself.
Story Points: ---
Clone Of:
: 1723892 (view as bug list) Environment:
Last Closed: 2019-10-16 06:29:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1723892    
Attachments:
Description Flags
cloud-credential-operator memory usage none

Description Mike Fiedler 2019-05-17 17:14:16 UTC
Description of problem:

While creating 2000+ projects for a large data upgrade test, cloud-credential-operator started crash looping due to OOMKills:

openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   1/1   Running   7     4h50m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   OOMKilled   7     4h51m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   CrashLoopBackOff   7     4h51m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   1/1   Running   8     4h56m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   OOMKilled   8     4h56m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   CrashLoopBackOff   8     4h56m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   1/1   Running   9     5h1m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   OOMKilled   9     5h1m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   CrashLoopBackOff   9     5h1m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   1/1   Running   10    5h6m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   Error   10    5h6m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   CrashLoopBackOff   10    5h6m
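
The watch output above can be summarized programmatically. The following is a minimal sketch, assuming the rows are whitespace-separated with the columns namespace, pod name, ready, status, restarts, and age (the column names are not shown in the report). The helper names are made up for illustration.

```python
from collections import Counter

# Rows copied from the watch output above. Assumed columns:
# namespace, pod name, ready, status, restarts, age.
WATCH_OUTPUT = """\
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   1/1   Running   7     4h50m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   OOMKilled   7     4h51m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   CrashLoopBackOff   7     4h51m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   1/1   Running   8     4h56m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   OOMKilled   8     4h56m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   CrashLoopBackOff   8     4h56m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   1/1   Running   9     5h1m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   OOMKilled   9     5h1m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   CrashLoopBackOff   9     5h1m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   1/1   Running   10    5h6m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   Error   10    5h6m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   CrashLoopBackOff   10    5h6m
"""


def parse_rows(watch_output):
    """Return (status, restarts) for each well-formed row; skip anything else."""
    rows = []
    for line in watch_output.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # blank or malformed row
        rows.append((fields[3], int(fields[4])))
    return rows


def status_counts(watch_output):
    """Count how many rows report each pod status."""
    return Counter(status for status, _ in parse_rows(watch_output))


def restarts_with_status(watch_output, status):
    """Return the sorted restart counts at which the given status was seen."""
    return sorted({r for s, r in parse_rows(watch_output) if s == status})


if __name__ == "__main__":
    print(status_counts(WATCH_OUTPUT))
    print(restarts_with_status(WATCH_OUTPUT, "OOMKilled"))
```

For the rows above, OOMKilled is reported at restart counts 7, 8, and 9, while the failure at restart count 10 is reported as Error.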


Version-Release number of selected component (if applicable): 4.1.0-0.nightly-2019-05-16-223922


How reproducible: Unknown; seen once so far. I can try deleting and recreating the projects if needed.


Steps to Reproduce:
1. Install AWS IPI cluster with xlarge workers (instead of large)
2. Create 2350 projects with 2 deployments, 0 pods, 6 builds, 1 image stream, 3 routes, and 20 secrets in each
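
The report does not say which tool was used to create the projects. As a rough sketch of step 2, the snippet below only builds the `oc` command lines for the projects and their secrets, without running them. The `scale-test` name prefix, the secret names, and the secret contents are hypothetical placeholders. Deployments, builds, image streams, and routes are left out because the report does not give their definitions.

```python
def project_commands(index, secrets_per_project=20, prefix="scale-test"):
    """Build the oc command lines for one project and its secrets.

    The name prefix and the secret key/value are hypothetical placeholders.
    """
    namespace = f"{prefix}-{index}"
    commands = [["oc", "new-project", namespace]]
    for n in range(secrets_per_project):
        commands.append([
            "oc", "create", "secret", "generic", f"secret-{n}",
            "-n", namespace, "--from-literal=key=value",
        ])
    return commands


def all_commands(project_count, secrets_per_project=20):
    """Concatenate the command lines for project_count projects."""
    commands = []
    for i in range(project_count):
        commands.extend(project_commands(i, secrets_per_project))
    return commands


if __name__ == "__main__":
    # Print a small sample instead of the full 2350-project run.
    for argv in all_commands(project_count=2, secrets_per_project=3):
        print(" ".join(argv))
```

Each argument list could be passed to `subprocess.run` against a test cluster. Printing them first makes it easy to review the volume of objects before creating anything.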

Actual results:

cloud-credential-operator starts crash looping due to OOMKills


Additional info:

The attached tarball contains:
1. CCO pod log
2. oc describe of the master node where CCO is running
3. oc adm must-gather output

Comment 2 Devan Goodwin 2019-05-17 17:24:24 UTC
We do monitor namespaces because of problems with the order in which things are created: credentials requests live in our namespace but target secrets in component namespaces. We watch namespaces so that we can act as soon as one is created and get the credentials secret ready for that component ASAP.
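
The ordering problem described here can be illustrated with a toy model. This is a simplified sketch in Python, not the operator's actual code: the class, method names, and the `component-a` / `cloud-creds` names are all hypothetical. A request whose target namespace does not exist yet is held until the namespace-created event arrives.

```python
class CredentialReconciler:
    """Toy model: requests target a namespace that may not exist yet."""

    def __init__(self):
        self.namespaces = set()  # every namespace seen so far
        self.pending = {}        # target namespace -> secret names waiting
        self.secrets = set()     # (namespace, secret name) pairs created

    def add_request(self, target_namespace, secret_name):
        """Create the secret now if the namespace exists, else defer it."""
        if target_namespace in self.namespaces:
            self.secrets.add((target_namespace, secret_name))
        else:
            self.pending.setdefault(target_namespace, []).append(secret_name)

    def on_namespace_created(self, namespace):
        """Handle a namespace event and create any secrets that were waiting."""
        self.namespaces.add(namespace)
        created = [(namespace, name) for name in self.pending.pop(namespace, [])]
        self.secrets.update(created)
        return created


if __name__ == "__main__":
    reconciler = CredentialReconciler()
    reconciler.add_request("component-a", "cloud-creds")  # namespace missing
    print(reconciler.secrets)                             # nothing created yet
    print(reconciler.on_namespace_created("component-a"))
```

In this model every namespace in the cluster is tracked, whether or not a request targets it. That is one plausible reason for memory to grow with the number of projects, but the report does not confirm what actually consumed the memory.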

Very likely this is just running into the current 500MB limit. We can drop the limit entirely or bump it up.
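
Dropping the limit amounts to removing the memory entry under the container's resource limits in the operator's Deployment. The sketch below shows that transformation on a manifest loaded as a Python dict. The container name, the `500Mi` spelling of the limit, and the request value are assumptions for illustration; the actual manifest is not quoted in this report.

```python
import copy


def drop_memory_limit(deployment):
    """Return a copy of the manifest with every container's memory limit removed."""
    result = copy.deepcopy(deployment)
    containers = result["spec"]["template"]["spec"]["containers"]
    for container in containers:
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        limits.pop("memory", None)
        if not limits:
            # Remove an empty limits mapping so the manifest stays tidy.
            resources.pop("limits", None)
    return result


# Hypothetical manifest fragment; names and values are illustrative only.
EXAMPLE = {
    "spec": {"template": {"spec": {"containers": [{
        "name": "manager",
        "resources": {
            "limits": {"memory": "500Mi"},
            "requests": {"memory": "150Mi"},
        },
    }]}}}
}

if __name__ == "__main__":
    print(drop_memory_limit(EXAMPLE))
```

Memory requests are left untouched, so the scheduler still has a baseline for placing the pod while the container is no longer killed at a fixed ceiling.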

Is this a 4.1 blocker?

Comment 3 Mike Fiedler 2019-05-17 17:24:34 UTC
Correction: the project spec in the description was for the first 2000 projects. The remaining 300 projects have 1 deployment with 1 pod. Looking at the attached Grafana graph of CCO memory usage, the OOMKilling started while these final 300 projects were being added.
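
To put the corrected project mix in numbers, here is a small tally of the explicitly created objects. It uses only the two batch sizes given in this comment and the per-project counts from the description. Objects that the cluster generates automatically in each namespace are not counted.

```python
# Per-project object counts as stated in the description and this correction.
FIRST_BATCH = {
    "projects": 2000,
    "per_project": {
        "deployments": 2, "pods": 0, "builds": 6,
        "image_streams": 1, "routes": 3, "secrets": 20,
    },
}
SECOND_BATCH = {
    "projects": 300,
    "per_project": {"deployments": 1, "pods": 1},
}


def totals(batches):
    """Sum the object counts per kind across all batches."""
    result = {}
    for batch in batches:
        for kind, count in batch["per_project"].items():
            result[kind] = result.get(kind, 0) + count * batch["projects"]
    return result


if __name__ == "__main__":
    summary = totals([FIRST_BATCH, SECOND_BATCH])
    for kind, count in sorted(summary.items()):
        print(f"{kind}: {count}")
    print("all objects:", sum(summary.values()))
```

With these figures the tally comes to 40,000 secrets and 64,600 explicitly created objects in total across 2,300 projects.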

Comment 4 Mike Fiedler 2019-05-17 17:25:34 UTC
Created attachment 1570208
cloud-credential-operator memory usage

Comment 7 Devan Goodwin 2019-05-21 17:33:52 UTC
https://github.com/openshift/cloud-credential-operator/pull/68 has merged.

Comment 8 Mike Fiedler 2019-05-22 14:16:06 UTC
This was fixed on master/4.2.0

Comment 13 Mike Fiedler 2019-06-07 02:26:38 UTC
This was never backported to 4.1.z, so I am changing the target for this to 4.2. If we do a 4.1.z backport, please clone this bug.

Comment 14 Devan Goodwin 2019-06-20 17:29:38 UTC
Can we get this verified against 4.2? The problem has surfaced in starter so we need to get it into a .z release.

Comment 15 Mike Fiedler 2019-06-20 21:08:48 UTC
Test running now. Verification is normally done on nightly builds, which was the holdup, but I will report back the result on a CI build here in a few hours.

Comment 16 Mike Fiedler 2019-06-21 01:45:33 UTC
Verified on CI build 4.2.0-0.ci-2019-06-20-163251. Created 4000 projects with 72K+ secrets and drove CCO memory over 650MB with no OOM kills or restarts of the operator pod.

Comment 18 errata-xmlrpc 2019-10-16 06:29:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922