Bug 1706635 - [ci] OOMKill during CI serial run, possible failure to set resource limits?
Summary: [ci] OOMKill during CI serial run, possible failure to set resource limits?
Keywords:
Status: CLOSED DUPLICATE of bug 1706625
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-05 20:15 UTC by Clayton Coleman
Modified: 2019-05-06 15:39 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-06 15:39:01 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Memory consumption of all namespaces in CI run (255.17 KB, image/png)
2019-05-06 13:30 UTC, Matthias Loibl
Memory consumption of Prometheus Adapter (208.47 KB, image/png)
2019-05-06 13:35 UTC, Matthias Loibl

Description Clayton Coleman 2019-05-05 20:15:45 UTC
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.1/579

Failed with:

May 04 07:50:05.661 E ns/openshift-monitoring pod/prometheus-adapter-787cdbc799-ffwsf node/ip-10-0-138-142.ec2.internal container=prometheus-adapter container exited with code 2 (OOMKilled): 
May 04 07:50:38.033 E ns/openshift-monitoring pod/prometheus-k8s-1 node/ip-10-0-140-157.ec2.internal container=prometheus container exited with code 1 (Error): 
May 04 08:35:11.579 E ns/openshift-machine-config-operator pod/machine-config-daemon-48f8q node/ip-10-0-138-142.ec2.internal container=machine-config-daemon container exited with code 143 (Error): 
May 04 08:35:41.631 E ns/openshift-image-registry pod/node-ca-jvlvh node/ip-10-0-138-142.ec2.internal container=node-ca container exited with code 137 (Error): 
May 04 08:43:57.404 E ns/openshift-machine-config-operator pod/machine-config-daemon-htt22 node/ip-10-0-138-142.ec2.internal container=machine-config-daemon container exited with code 143 (Error): 
May 04 08:51:23.617 E ns/openshift-machine-config-operator pod/machine-config-daemon-tfzkt node/ip-10-0-138-142.ec2.internal container=machine-config-daemon container exited with code 143 (Error): 
May 04 08:51:53.667 E ns/openshift-image-registry pod/node-ca-hq8gj node/ip-10-0-138-142.ec2.internal container=node-ca container exited with code 137 (Error): 
May 04 08:54:43.935 E ns/openshift-machine-config-operator pod/machine-config-daemon-fvqb6 node/ip-10-0-138-142.ec2.internal container=machine-config-daemon container exited with code 143 (Error): 
May 04 08:55:13.982 E ns/openshift-image-registry pod/node-ca-cljnc node/ip-10-0-138-142.ec2.internal container=node-ca container exited with code 137 (Error): 
May 04 08:57:15.544 E ns/openshift-machine-config-operator pod/machine-config-daemon-sxwv6 node/ip-10-0-138-142.ec2.internal container=machine-config-daemon container exited with code 143 (Error): 

Could be a failure to set resource limits, please work with all three impacted teams (monitoring, registry, mcd) to verify their limits are in place or debug.
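For reference, a container-level resource spec of the kind being asked about would look roughly like this (a sketch only; the container name and values below are illustrative, not taken from the actual operator manifests). Note that exit code 137 is 128+9 (SIGKILL), consistent with an OOMKill, while 143 is 128+15 (SIGTERM), i.e. an ordinary termination.

```yaml
# Illustrative sketch of requests/limits on a pod's container spec.
# Values are placeholders, not the real prometheus-adapter settings.
containers:
- name: prometheus-adapter
  resources:
    requests:
      cpu: 10m
      memory: 25Mi
    limits:
      memory: 100Mi   # exceeding this gets the container OOMKilled (exit code 137)
```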

Comment 1 lserven 2019-05-06 11:13:08 UTC
Hm, I'm a little confused by this. The recent guidance we got from several Red Hatters was to remove all pod limits and keep only resource requests, e.g. https://github.com/openshift/cluster-monitoring-operator/pull/219. Is this no longer the case?

Comment 3 Matthias Loibl 2019-05-06 13:29:27 UTC
Prometheus Adapter OOMing is quite strange.
I've checked the cluster-monitoring-operator, and it doesn't force any limits anymore. I've also looked into the Prometheus Adapter logs and don't see anything suspicious there either.

Looking at the Prometheus WAL from the CI run, I can clearly see that the openshift-machine-config-operator namespace is allocating a lot of memory, around 10GiB by the end of the run.
At the same time, memory consumption for the Prometheus Adapter stays around 25MiB.
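For reference, the per-namespace view described here can be obtained with a query along these lines (a sketch; the exact metric names and label set used for the attached graphs may differ, e.g. older cAdvisor versions exposed container_name/pod_name labels instead):

```promql
# Working-set memory summed per namespace, from cAdvisor container metrics.
sum by (namespace) (
  container_memory_working_set_bytes{container!="", pod!=""}
)
```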

Matthias

Comment 4 Matthias Loibl 2019-05-06 13:30:24 UTC
Created attachment 1564470 [details]
Memory consumption of all namespaces in CI run

Comment 5 Matthias Loibl 2019-05-06 13:35:46 UTC
Created attachment 1564472 [details]
Memory consumption of Prometheus Adapter

