Description of problem:
While creating 2000+ projects for a large data upgrade test, cloud-credential-operator started crash looping due to OOMKills:

openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   1/1   Running            7    4h50m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   OOMKilled          7    4h51m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   CrashLoopBackOff   7    4h51m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   1/1   Running            8    4h56m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   OOMKilled          8    4h56m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   CrashLoopBackOff   8    4h56m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   1/1   Running            9    5h1m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   OOMKilled          9    5h1m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   CrashLoopBackOff   9    5h1m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   1/1   Running            10   5h6m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   Error              10   5h6m
openshift-cloud-credential-operator   cloud-credential-operator-56587487fd-ql5l6   0/1   CrashLoopBackOff   10   5h6m

Version-Release number of selected component (if applicable):
4.1.0-0.nightly-2019-05-16-223922

How reproducible:
Unknown; once so far. I can try deleting projects and recreating if needed.

Steps to Reproduce:
1. Install an AWS IPI cluster with xlarge workers (instead of large)
2. Create 2350 projects, each containing 2 deployments, 0 pods, 6 builds, 1 image stream, 3 routes, and 20 secrets

Actual results:
cloud-credential-operator starts crash looping due to OOMKills

Additional info:
The attached tarball contains:
1. CCO pod log
2. oc describe of the master node where CCO is running
3. oc adm must-gather output
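For anyone retrying the steps to reproduce, a sketch like the following could drive the project creation. The `gen_project_cmds` helper, project names, and secret names are all illustrative (not from the original test scripts); it prints the oc commands rather than executing them, since running them requires a live cluster.

```shell
# Hypothetical reproduction sketch; pipe the output to sh on a real cluster.
gen_project_cmds() {
  n=$1
  i=1
  while [ "$i" -le "$n" ]; do
    echo "oc new-project scale-test-$i"
    # 20 secrets per project, as described in the steps to reproduce
    s=1
    while [ "$s" -le 20 ]; do
      echo "oc -n scale-test-$i create secret generic dummy-secret-$s --from-literal=key=value"
      s=$((s + 1))
    done
    i=$((i + 1))
  done
}

# Print the commands for 2 projects; bump to 2350 for the full-scale run.
gen_project_cmds 2
```

The deployments, builds, image streams, and routes from the original workload mix would be added to the loop the same way.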
We do monitor namespaces, due to problems with the order in which things are created: CredentialsRequests live in our namespace, but their target secrets live in component namespaces. We watch namespaces so we can act as soon as one is created and get the credentials secret ready for that component as soon as possible. Very likely this is just running into the current 500MB limit. We can drop the limit entirely or bump it up. Is this a 4.1 blocker?
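To confirm the 500MB-limit theory on a live cluster, commands like the following would show the configured memory limit and any OOMKilled last-state. This is a hedged sketch: the helper name is made up, the commands are printed as a dry run, and the jsonpath is an assumption about the deployment's shape rather than something verified against the actual CCO manifest.

```shell
# Dry-run helper: prints diagnostic commands instead of running them.
cco_oom_check_cmds() {
  ns=openshift-cloud-credential-operator
  # Show the container memory limit on the operator deployment (path assumed)
  echo "oc -n $ns get deployment cloud-credential-operator -o jsonpath='{.spec.template.spec.containers[*].resources.limits.memory}'"
  # Pod status; OOMKilled shows up in the container's Last State on describe
  echo "oc -n $ns get pods"
  echo "oc -n $ns describe pods | grep -B1 -A3 'Last State'"
}
cco_oom_check_cmds
```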
Correction: the project spec in the description applied only to the first 2000 projects. The remaining 300 projects each have 1 deployment with 1 pod. Looking at the attached Grafana graph of CCO memory usage, the OOMKilling started while these final 300 projects were being added.
Created attachment 1570208 [details] cloud-credential-operator memory usage
https://github.com/openshift/cloud-credential-operator/pull/68 has merged.
This was fixed on master/4.2.0
This was never backported to 4.1.z. Changing the target for this to 4.2. If we do a 4.1.z backport, please clone this bug.
Can we get this verified against 4.2? The problem has surfaced in the starter clusters, so we need to get the fix into a .z release.
Test running now. Verification is normally done on nightly builds, which was the hold-up, but I will report back here with the result on a CI build in a few hours.
Verified on CI build 4.2.0-0.ci-2019-06-20-163251. Created 4000 projects with 72K+ secrets and drove CCO memory over 650MB with no OOM kills or restarts of the operator pod.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922