Description of problem:
The pod keeps crash looping: it restarts for a while, then crashes again. I restarted it an hour ago and it has restarted 27 times. Prior to that, I saw it had restarted 1178 times!
OOMKiller killed the pod, so I increased the pod's memory limit from 500Mi to 1Gi (via the operator deployment). The system seems happy with this for now. But why would this pod need so much memory? Metrics show it pegged at 1Gi as well, so I'm not sure OOM won't get it at some point.
The pod is crashing again even at 1Gi! Here are the events:
# oc4 get events
LAST SEEN TYPE REASON OBJECT MESSAGE
58s Normal Pulled pod/cloud-credential-operator-768d9c6f46-fw758 Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ba803653c4af0089feca91b065d1d96ad4660ae061cf40793894d24585891f3c" already present on machine
58s Warning Failed pod/cloud-credential-operator-768d9c6f46-fw758 Error: container create failed: container_linux.go:329: creating new parent process caused "container_linux.go:1762: running lstat on namespace path \"/proc/47771/ns/ipc\" caused \"lstat /proc/47771/ns/ipc: no such file or directory\""
1s Warning BackOff pod/cloud-credential-operator-768d9c6f46-fw758 Back-off restarting failed container
The original container was OOMKilled.
Version-Release number of selected component (if applicable):
Happens consistently. It runs for a while before crashing again.
Steps to Reproduce:
The deployment/pod originally had the new limits on it (150Mi request / 500Mi limit) - so that patch was applied to the operator deployment before I touched anything. I increased the hard limit on the deployment to 1Gi and it has been running since with 0 restarts. Any idea why this operator needs so much memory?
How many namespaces and secrets are in this cluster?
oc get secrets -A | wc -l
oc get namespaces | wc -l
We have seen this in both situations, hence the removal of the memory limit.
The operator watches both resource types so it can react immediately when credentials are deleted or when namespaces that need credentials are created. The kube client code caches every object being watched, so having thousands of either resource can cause excessive memory usage. We saw this in one cluster where an operator had gone rogue and created 30k secrets.
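To see why tens of thousands of cached secrets translate into hundreds of megabytes, here is a back-of-the-envelope sketch. The per-secret size below is an illustrative assumption (dockercfg secrets with metadata can easily run tens of KiB once deserialized), not a measured value:

```python
# Rough estimate of informer cache memory: the client-go watch cache
# keeps a full copy of every watched object, so resident memory grows
# roughly linearly with object count.

def cache_estimate_mib(n_objects, avg_object_bytes):
    """Approximate memory (MiB) for a watch cache holding n_objects."""
    return n_objects * avg_object_bytes / (1024 * 1024)

# Assume ~25 KiB per deserialized secret (an illustrative guess).
secrets = 30_000
print(round(cache_estimate_mib(secrets, 25 * 1024)))  # ~732 MiB
```

With 30k secrets that alone approaches the original 1Gi limit, before counting the operator's other allocations, which is consistent with the OOM kills observed here.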
jrmini:aws jrigsbee$ oc4 get namespaces -A | wc -l
jrmini:aws jrigsbee$ oc4 get secrets -A | wc -l
Wow! That's a lot of secrets.
Looks like namespace openshift-cluster-node-tuning-operator, secret name = tuned-dockercfg-XXXXX is the culprit:
oc4 get secrets -n openshift-cluster-node-tuning-operator | wc -l
Looks like the root of your problem is https://bugzilla.redhat.com/show_bug.cgi?id=1723569. If it's OK, I'm going to close this as a duplicate, as we have already shipped a fix for CCO that removes the memory limits and avoids this where possible.
That fix will not take effect on upgraded clusters because the CVO does not reconcile memory limits, so any existing cluster hitting this should remove the memory limits from the cloud-credential-operator deployment by hand.
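A minimal sketch of removing the limit by hand. The deployment name, namespace, and container index below are assumptions based on a default CCO install; verify all three on your cluster before patching:

```shell
# Remove the memory limit from the CCO deployment's first container.
# The JSON-patch path assumes the container of interest is at index 0.
oc patch deployment cloud-credential-operator \
  -n openshift-cloud-credential-operator \
  --type=json \
  -p '[{"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits/memory"}]'
```

This edits only the running deployment; since the CVO does not reconcile memory limits, the change should persist.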
*** This bug has been marked as a duplicate of bug 1723569 ***