Bug 1711402
| Summary: | CCO repeatedly OOMKilled in cluster with 2350 projects |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Cloud Credential Operator |
| Version: | 4.1.0 |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | unspecified |
| Reporter: | Mike Fiedler <mifiedle> |
| Assignee: | Devan Goodwin <dgoodwin> |
| QA Contact: | Mike Fiedler <mifiedle> |
| CC: | decarr, dgoodwin, sponnaga |
| Target Milestone: | --- |
| Target Release: | 4.2.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | aos-scalability-41 |
| Type: | Bug |
| Doc Type: | Bug Fix |
| Bug Blocks: | 1723892 |
| Last Closed: | 2019-10-16 06:29:06 UTC |

Doc Text:
Cause: Memory limit on pod.
Consequence: Credential Operator could crash on clusters with large numbers of projects/namespaces.
Fix: Remove the memory limit.
Result: Operator no longer crashes and memory is handled by the cluster itself.

Description
Mike Fiedler
2019-05-17 17:14:16 UTC
We do monitor namespaces because of problems with the order in which things are created: credentials requests live in our namespace but target secrets in component namespaces. We watch namespaces so that we can act as soon as one is created and get the credential secret ready for that component as soon as possible. Very likely this is just running into the current 500 MB limit. We can drop the limit entirely or bump it up. Is this a 4.1 blocker?

Correction: the project spec in the description was for the first 2000 projects. The remaining 300 projects have 1 deployment with 1 pod. Looking at the attached Grafana graph of CCO memory usage, it was while adding these final 300 projects that the OOMKilling started.

Created attachment 1570208: cloud-credential-operator memory usage
This was fixed on master/4.2.0.

This was never backported to 4.1.z. Changing the target for this to 4.2. If we do a 4.1.z backport, please clone.

Can we get this verified against 4.2? The problem has surfaced in starter, so we need to get it into a .z release.

Test running now. Normally verification is done on nightly builds, which was the holdup, but I will report back the result on a CI build here in a few hours.

Verified on CI build 4.2.0-0.ci-2019-06-20-163251. Created 4000 projects with 72K+ secrets and drove CCO memory over 650 MB with no OOM kills or restarts of the operator pod.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
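
For readers who want to see what "remove the memory limit" amounts to at the manifest level, here is a minimal Python sketch. It is not the actual CCO change: the manifest below is a hypothetical, stripped-down Deployment, the container name and the `75Mi` request are made up for illustration, and `500Mi` only stands in for the roughly 500 MB limit mentioned in the comments. The helper `drop_memory_limit` is likewise a name invented for this example.

```python
import copy


def drop_memory_limit(deployment: dict) -> dict:
    """Return a copy of a Deployment-style manifest with container memory limits removed.

    Memory requests are left untouched, so the scheduler still has a sizing hint,
    but the container is no longer OOM-killed for crossing a fixed per-container limit.
    """
    result = copy.deepcopy(deployment)
    containers = result["spec"]["template"]["spec"]["containers"]
    for container in containers:
        limits = container.get("resources", {}).get("limits")
        if limits is None:
            continue
        limits.pop("memory", None)
        # Drop the now-empty "limits" mapping so the manifest stays tidy.
        if not limits:
            del container["resources"]["limits"]
    return result


# Hypothetical manifest; all names and values are illustrative assumptions.
manifest = {
    "kind": "Deployment",
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "example-operator",
                        "resources": {
                            "requests": {"memory": "75Mi"},
                            "limits": {"memory": "500Mi"},
                        },
                    }
                ]
            }
        }
    },
}

patched = drop_memory_limit(manifest)
print(patched["spec"]["template"]["spec"]["containers"][0]["resources"])
```

Keeping the request while dropping the limit matches the Doc Text wording that memory is then "handled by the cluster itself": the pod can grow with the number of namespaces and secrets it watches instead of being killed at a fixed threshold.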