1734606 – Cloud credential operator keeps crashing

Summary: Cloud credential operator keeps crashing

Keywords:
Status:	CLOSED DUPLICATE of bug 1723569
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Credential Operator
Sub Component:
Version:	4.1.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.2.0
Assignee:	Devan Goodwin
QA Contact:	Oleg Nesterov
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-07-31 04:19 UTC by Miheer Salunke
Modified:	2019-08-02 15:02 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-08-02 15:02:02 UTC
Target Upstream Version:
Embargoed:

Description Miheer Salunke 2019-07-31 04:19:20 UTC

Description of problem:

The pod keeps crash looping: it restarts, runs for a while, then crashes again. I restarted it an hour ago and it has restarted 27 times since. Prior to that I saw that it had restarted 1178 times!

The OOM killer killed the pod, so I increased the pod's maximum memory limit from 500Mi to 1Gi (via the operator deployment). The system seems to like this for now. But why would this pod need so much memory? Metrics show that it is pegged at 1Gi as well, so I'm not sure that the OOM killer won't get it at some point.

The pod is crashing again even at 1Gi! Here are the events:
# oc4 get events
LAST SEEN   TYPE      REASON    OBJECT                                           MESSAGE
58s         Normal    Pulled    pod/cloud-credential-operator-768d9c6f46-fw758   Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ba803653c4af0089feca91b065d1d96ad4660ae061cf40793894d24585891f3c" already present on machine
58s         Warning   Failed    pod/cloud-credential-operator-768d9c6f46-fw758   Error: container create failed: container_linux.go:329: creating new parent process caused "container_linux.go:1762: running lstat on namespace path \"/proc/47771/ns/ipc\" caused \"lstat /proc/47771/ns/ipc: no such file or directory\""
1s          Warning   BackOff   pod/cloud-credential-operator-768d9c6f46-fw758   Back-off restarting failed container
The original container was OOMKilled.

Version-Release number of selected component (if applicable):
4.1.z

How reproducible:
Happens consistently. It runs for a while before crashing again.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Comment 2 Miheer Salunke 2019-07-31 04:21:16 UTC

The deployment/pod originally had the new limits (150Mi/500Mi Limit) on it - so the patch was applied to the operator deployment before I did anything.  I increased the hard limit on the deployment to 1Gi and it has been running since with 0 restarts.  Any idea why this operator needs so much memory?

Comment 3 Devan Goodwin 2019-08-02 12:06:05 UTC

How many namespaces and secrets are in this cluster? 

oc get secrets -A | wc -l
oc get namespaces -A | wc -l

We have seen this with large numbers of either resource, hence the removal of the memory limit.

The operator watches both resource types so we can react immediately if credentials are deleted or namespaces created which need credentials. The kube client code caches everything being watched, and thus if you have thousands of one of these resources it can cause excessive memory usage. We saw this in one cluster where an operator had gone rogue and there were 30k secrets created.
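
As a rough sketch of how to see where the cached objects are concentrated, the function below tallies secrets per namespace. It assumes input in the default `oc get secrets -A --no-headers` layout, where the namespace is the first column; the function name count_secrets_by_namespace is made up for illustration.

```shell
# Reads `oc get secrets -A --no-headers` output on stdin (namespace is
# assumed to be column 1) and prints "<count> <namespace>" lines,
# largest count first.
count_secrets_by_namespace() {
  awk '{ count[$1]++ } END { for (ns in count) print count[ns], ns }' |
    sort -k1,1nr -k2,2
}

# Usage against a live cluster (requires a logged-in oc client):
#   oc get secrets -A --no-headers | count_secrets_by_namespace | head
```

A namespace holding far more secrets than the rest is a likely place to look for whatever is creating them.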

Comment 4 Jim Rigsbee 2019-08-02 12:28:40 UTC

jrmini:aws jrigsbee$ oc4 get namespaces -A | wc -l
      96
jrmini:aws jrigsbee$ oc4 get secrets -A | wc -l
   27265

Wow!  That's a lot of secrets.

Comment 5 Jim Rigsbee 2019-08-02 12:36:19 UTC

Looks like namespace openshift-cluster-node-tuning-operator, secret name = tuned-dockercfg-XXXXX is the culprit:

oc4 get secrets -A -n openshift-cluster-node-tuning-operator | wc -l
   27268
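
One possible way to spot a runaway generator such as the tuned-dockercfg-XXXXX secrets is to group secret names with their trailing generated suffix stripped. This is only a sketch: the function name count_secrets_by_prefix is hypothetical, and it assumes the default `oc get secrets -A --no-headers` column order (namespace, then name) and that generated names end in a hyphen followed by lowercase letters and digits.

```shell
# Reads `oc get secrets -A --no-headers` output on stdin and prints
# "<count> <namespace>/<name-prefix>" lines, largest count first.
# The prefix is the secret name with its last "-xxxxx" segment removed
# (assumption: that segment is the randomly generated part).
count_secrets_by_prefix() {
  awk '{
         name = $2
         sub(/-[a-z0-9]+$/, "", name)   # drop the trailing generated suffix
         count[$1 "/" name]++
       }
       END { for (key in count) print count[key], key }' |
    sort -k1,1nr -k2,2
}

# Usage against a live cluster (requires a logged-in oc client):
#   oc get secrets -A --no-headers | count_secrets_by_prefix | head
```

Grouping by prefix rather than by namespace alone separates one misbehaving secret family from the ordinary secrets that live alongside it.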

Comment 7 Devan Goodwin 2019-08-02 15:02:02 UTC

Looks like the root of your problem is https://bugzilla.redhat.com/show_bug.cgi?id=1723569. If it's OK, I'm going to close this as a duplicate, as we have already shipped a fix for CCO that removes the memory limits to avoid this where possible.

That fix will not affect clusters on upgrade because the CVO does not reconcile memory limits, so any existing cluster hitting this should be able to remove the memory limits from the cloud credential operator deployment by hand.
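
Removing the limit by hand might look like the sketch below, which builds a JSON Patch document for `oc patch`. The helper name memory_limit_removal_patch, the container index 0, and the namespace openshift-cloud-credential-operator are assumptions not stated in this bug, so check them against the actual deployment first.

```shell
# Prints a JSON Patch that removes the memory limit from one container
# of a deployment's pod template. $1 is the container index (assumed 0
# if omitted).
memory_limit_removal_patch() {
  printf '[{"op":"remove","path":"/spec/template/spec/containers/%s/resources/limits/memory"}]' "${1:-0}"
}

# Usage against a live cluster (requires a logged-in oc client; the
# deployment name matches the pod name in the events above, the
# namespace is an assumption):
#   oc patch deployment cloud-credential-operator \
#     -n openshift-cloud-credential-operator \
#     --type=json -p "$(memory_limit_removal_patch 0)"
```

A JSON Patch "remove" operation fails if the target path does not exist, so the command is expected to error out on a deployment that already has no memory limit.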

*** This bug has been marked as a duplicate of bug 1723569 ***
