https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.1/579

Failed with:

May 04 07:50:05.661 E ns/openshift-monitoring pod/prometheus-adapter-787cdbc799-ffwsf node/ip-10-0-138-142.ec2.internal container=prometheus-adapter container exited with code 2 (OOMKilled)
May 04 07:50:38.033 E ns/openshift-monitoring pod/prometheus-k8s-1 node/ip-10-0-140-157.ec2.internal container=prometheus container exited with code 1 (Error)
May 04 08:35:11.579 E ns/openshift-machine-config-operator pod/machine-config-daemon-48f8q node/ip-10-0-138-142.ec2.internal container=machine-config-daemon container exited with code 143 (Error)
May 04 08:35:41.631 E ns/openshift-image-registry pod/node-ca-jvlvh node/ip-10-0-138-142.ec2.internal container=node-ca container exited with code 137 (Error)
May 04 08:43:57.404 E ns/openshift-machine-config-operator pod/machine-config-daemon-htt22 node/ip-10-0-138-142.ec2.internal container=machine-config-daemon container exited with code 143 (Error)
May 04 08:51:23.617 E ns/openshift-machine-config-operator pod/machine-config-daemon-tfzkt node/ip-10-0-138-142.ec2.internal container=machine-config-daemon container exited with code 143 (Error)
May 04 08:51:53.667 E ns/openshift-image-registry pod/node-ca-hq8gj node/ip-10-0-138-142.ec2.internal container=node-ca container exited with code 137 (Error)
May 04 08:54:43.935 E ns/openshift-machine-config-operator pod/machine-config-daemon-fvqb6 node/ip-10-0-138-142.ec2.internal container=machine-config-daemon container exited with code 143 (Error)
May 04 08:55:13.982 E ns/openshift-image-registry pod/node-ca-cljnc node/ip-10-0-138-142.ec2.internal container=node-ca container exited with code 137 (Error)
May 04 08:57:15.544 E ns/openshift-machine-config-operator pod/machine-config-daemon-sxwv6 node/ip-10-0-138-142.ec2.internal container=machine-config-daemon container exited with code 143 (Error)

Could be a failure to set resource limits,
please work with all three impacted teams (monitoring, registry, mcd) to verify their limits are in place or debug.
Hm, I'm a little confused by this. The recent guidance we got from several Red Hatters was to remove all Pod limits and keep only resource requests, e.g.: https://github.com/openshift/cluster-monitoring-operator/pull/219. Is this no longer the case?
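For context, the "requests only, no limits" pattern referenced above looks roughly like the sketch below. This is illustrative only: the Deployment name, image, and request values are placeholders, not taken from the actual cluster-monitoring-operator manifests.

```yaml
# Illustrative container spec: resource requests set, limits deliberately omitted.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-adapter        # hypothetical name, for illustration
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-adapter
  template:
    metadata:
      labels:
        app: prometheus-adapter
    spec:
      containers:
      - name: prometheus-adapter
        image: example.invalid/prometheus-adapter:latest   # placeholder image
        resources:
          requests:               # the scheduler uses these for placement
            cpu: 10m
            memory: 25Mi
          # No "limits" stanza: the container cannot be OOM-killed by its own
          # cgroup memory limit, but it can still be evicted or killed under
          # node-level memory pressure.
```

With no limit set, an OOMKill generally points at node memory pressure rather than the container exceeding its own quota, which is what makes the adapter's exit curious here.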
Prometheus Adapter OOMing is quite strange. I've checked the cluster-monitoring-operator and it doesn't force any limits anymore. Additionally, I've looked into the Prometheus Adapter logs and don't see anything suspicious there either. Looking at the Prometheus WAL from the CI run, I can clearly see that the openshift-machine-config-operator namespace is allocating lots of memory, around 10GiB at the end. At the same time, memory consumption of the Prometheus Adapter stays around 25MiB.

Matthias
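The per-namespace memory view described above can be reproduced with a PromQL query along these lines. The metric and label names assume the standard kubelet/cAdvisor metrics scraped by the in-cluster Prometheus; exact label names (e.g. `container` vs. `container_name`) vary between releases, so treat this as a sketch.

```promql
# Working-set memory per namespace, summed over all containers.
# container!="" drops the pod-level cgroup aggregate rows so pods
# are not double-counted.
sum by (namespace) (container_memory_working_set_bytes{container!=""})
```

Running this over the CI run's time range is what surfaces openshift-machine-config-operator as the namespace climbing toward ~10GiB while the adapter stays flat.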
Created attachment 1564470 [details] Memory consumption of all namespaces in CI run
Created attachment 1564472 [details] Memory consumption of Prometheus Adapter