Bug 1734606
| Summary: | Cloud credential operator keeps crashing | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Miheer Salunke <misalunk> |
| Component: | Cloud Credential Operator | Assignee: | Devan Goodwin <dgoodwin> |
| Status: | CLOSED DUPLICATE | QA Contact: | Oleg Nesterov <olnester> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.1.z | CC: | jrigsbee |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-08-02 15:02:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Miheer Salunke
2019-07-31 04:19:20 UTC
The deployment/pod originally had the new limits (150Mi/500Mi limit) on it, so the patch was applied to the operator deployment before I did anything. I increased the hard limit on the deployment to 1Gi and it has been running since with 0 restarts. Any idea why this operator needs so much memory?

How many namespaces and secrets are in this cluster?

```
oc get secrets -A | wc -l
oc get namespaces -A | wc -l
```

We have seen this in both situations, hence the removal of the memory limit. The operator watches both resource types so we can react immediately if credentials are deleted or namespaces are created that need credentials. The kube client code caches everything being watched, so if you have thousands of one of these resources it can cause excessive memory usage. We saw this in one cluster where an operator had gone rogue and there were 30k secrets created.

```
jrmini:aws jrigsbee$ oc4 get namespaces -A | wc -l
96
jrmini:aws jrigsbee$ oc4 get secrets -A | wc -l
27265
```
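As an aside on the caching behaviour described above: the operator's actual memory consumption can be checked against the secret count via the metrics API. This is not from the original thread, and it assumes the metrics API is available and that the operator runs in the openshift-cloud-credential-operator namespace:

```
# Show current memory usage of the cloud-credential-operator pod(s);
# with ~27k cached secrets this would be expected to sit well above the old 500Mi limit.
oc adm top pod -n openshift-cloud-credential-operator
```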
Wow! That's a lot of secrets.
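Narrowing down where those secrets live is just a matter of counting them per namespace. This wasn't part of the original exchange; it is a generic oc/awk sketch:

```
# Count secrets per namespace and list the biggest offenders first.
oc get secrets -A --no-headers | awk '{print $1}' | sort | uniq -c | sort -rn | head
```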
Looks like namespace openshift-cluster-node-tuning-operator, secret name = tuned-dockercfg-XXXXX, is the culprit:

```
oc4 get secrets -A -n openshift-cluster-node-tuning-operator | wc -l
27268
```

Looks like the root of your problem is https://bugzilla.redhat.com/show_bug.cgi?id=1723569. If it's OK, I'm going to close this as a duplicate, since we have already shipped a fix for CCO that removes the memory limits and should avoid this where possible. That fix will not take effect on existing clusters during upgrade, because the CVO does not reconcile memory limits; any existing cluster hitting this should be able to remove the memory limits from the cloud credential operator deployment by hand (a sketch of one way to do that follows below).

*** This bug has been marked as a duplicate of bug 1723569 ***
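For anyone landing here on an affected cluster: a minimal sketch of removing (or raising) the limit by hand, assuming the deployment is named cloud-credential-operator in the openshift-cloud-credential-operator namespace (names not confirmed in this bug; check your cluster):

```
# Drop the limits block from the operator's first container entirely
# (assumes the limit sits on containers[0]; adjust the index if not)...
oc -n openshift-cloud-credential-operator patch deployment cloud-credential-operator \
  --type=json -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits"}]'

# ...or just raise the memory limit to 1Gi, as was done in this report.
oc -n openshift-cloud-credential-operator set resources deployment/cloud-credential-operator \
  --limits=memory=1Gi
```

Since, per the comment above, the CVO does not reconcile memory limits, a manual change like this should persist across upgrades.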