Bug 1734606 - Cloud credential operator keeps crashing
Summary: Cloud credential operator keeps crashing
Status: CLOSED DUPLICATE of bug 1723569
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Credential Operator
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.2.0
Assignee: Devan Goodwin
QA Contact: Oleg Nesterov
Depends On:
Reported: 2019-07-31 04:19 UTC by Miheer Salunke
Modified: 2019-08-02 15:02 UTC (History)
1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2019-08-02 15:02:02 UTC
Target Upstream Version:


Description Miheer Salunke 2019-07-31 04:19:20 UTC
Description of problem:

The pod keeps crash looping: it restarts for a while, then crashes. I restarted it an hour ago and it has restarted 27 times. Prior to that I saw that it had restarted 1178 times!

The OOM killer killed the pod, so I increased the pod's memory limit from 500Mi to 1Gi (via the operator deployment). The system seems to like this for now, but why would this pod need so much memory? Metrics show it pegged at 1Gi as well, so I'm not sure the OOM killer won't get it at some point.

The pod is crashing again even at 1Gi!  Here are the events:
# oc4 get events
LAST SEEN   TYPE      REASON    OBJECT                                           MESSAGE
58s         Normal    Pulled    pod/cloud-credential-operator-768d9c6f46-fw758   Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ba803653c4af0089feca91b065d1d96ad4660ae061cf40793894d24585891f3c" already present on machine
58s         Warning   Failed    pod/cloud-credential-operator-768d9c6f46-fw758   Error: container create failed: container_linux.go:329: creating new parent process caused "container_linux.go:1762: running lstat on namespace path \"/proc/47771/ns/ipc\" caused \"lstat /proc/47771/ns/ipc: no such file or directory\""
1s          Warning   BackOff   pod/cloud-credential-operator-768d9c6f46-fw758   Back-off restarting failed container
The original container was OOMKilled.

Version-Release number of selected component (if applicable):

How reproducible:
Happens consistently. It runs for a while before crashing again.

Steps to Reproduce:

Actual results:

Expected results:

Additional info:

Comment 2 Miheer Salunke 2019-07-31 04:21:16 UTC
The deployment/pod originally had the new limits (150Mi request / 500Mi limit) on it, so the patch had already been applied to the operator deployment before I did anything. I increased the hard limit on the deployment to 1Gi and it has been running since with 0 restarts. Any idea why this operator needs so much memory?

Comment 3 Devan Goodwin 2019-08-02 12:06:05 UTC
How many namespaces and secrets are in this cluster?

oc get secrets -A | wc -l
oc get namespaces | wc -l

We have seen this in both situations, hence the removal of the memory limit.

The operator watches both resource types so we can react immediately if credentials are deleted, or if namespaces that need credentials are created. The kube client code caches everything being watched, so if you have thousands of either resource it can cause excessive memory usage. We saw this in one cluster where an operator had gone rogue and created 30k secrets.
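Since the informer cache holds every watched object, a quick way to see which namespace is inflating the secret count is to group secrets per namespace. This pipeline is a hypothetical diagnostic sketch, not part of the operator; it assumes the default `oc get secrets -A` column order, where the namespace is the first column:

```shell
# Count secrets per namespace, largest count first.
oc get secrets -A --no-headers | awk '{print $1}' | sort | uniq -c | sort -rn | head
```

The same idea works for any namespaced resource type that the operator watches.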

Comment 4 Jim Rigsbee 2019-08-02 12:28:40 UTC
jrmini:aws jrigsbee$ oc4 get namespaces -A | wc -l
jrmini:aws jrigsbee$ oc4 get secrets -A | wc -l

Wow!  That's a lot of secrets.

Comment 5 Jim Rigsbee 2019-08-02 12:36:19 UTC
Looks like namespace openshift-cluster-node-tuning-operator, secret name tuned-dockercfg-XXXXX, is the culprit:

oc4 get secrets -n openshift-cluster-node-tuning-operator | wc -l

Comment 7 Devan Goodwin 2019-08-02 15:02:02 UTC
Looks like the root of your problem is https://bugzilla.redhat.com/show_bug.cgi?id=1723569. If it's OK, I'm going to close this as a duplicate, as we have already shipped a fix for the CCO to remove its memory limits and avoid this where possible.

That fix will not take effect on existing clusters at upgrade, because the CVO does not reconcile memory limits, so any existing cluster hitting this should remove the memory limits from the cloud credential operator deployment by hand.
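Removing the limits by hand can be done with a JSON patch against the deployment. This is a sketch, not a command from the fix itself: it assumes the operator runs in the `openshift-cloud-credential-operator` namespace and that the operator container is the first (index 0) in the pod spec; verify both with `oc get deployment` before running it.

```shell
# Drop the resources.limits block from the first container of the
# cloud-credential-operator deployment so the pod is no longer OOM-killed.
oc -n openshift-cloud-credential-operator patch deployment cloud-credential-operator \
  --type=json \
  -p='[{"op":"remove","path":"/spec/template/spec/containers/0/resources/limits"}]'
```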

*** This bug has been marked as a duplicate of bug 1723569 ***
