https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.1/579

Failed with:

May 04 07:50:05.661 E ns/openshift-monitoring pod/prometheus-adapter-787cdbc799-ffwsf node/ip-10-0-138-142.ec2.internal container=prometheus-adapter container exited with code 2 (OOMKilled)
May 04 07:50:38.033 E ns/openshift-monitoring pod/prometheus-k8s-1 node/ip-10-0-140-157.ec2.internal container=prometheus container exited with code 1 (Error)
May 04 08:35:11.579 E ns/openshift-machine-config-operator pod/machine-config-daemon-48f8q node/ip-10-0-138-142.ec2.internal container=machine-config-daemon container exited with code 143 (Error)
May 04 08:35:41.631 E ns/openshift-image-registry pod/node-ca-jvlvh node/ip-10-0-138-142.ec2.internal container=node-ca container exited with code 137 (Error)
May 04 08:43:57.404 E ns/openshift-machine-config-operator pod/machine-config-daemon-htt22 node/ip-10-0-138-142.ec2.internal container=machine-config-daemon container exited with code 143 (Error)
May 04 08:51:23.617 E ns/openshift-machine-config-operator pod/machine-config-daemon-tfzkt node/ip-10-0-138-142.ec2.internal container=machine-config-daemon container exited with code 143 (Error)
May 04 08:51:53.667 E ns/openshift-image-registry pod/node-ca-hq8gj node/ip-10-0-138-142.ec2.internal container=node-ca container exited with code 137 (Error)
May 04 08:54:43.935 E ns/openshift-machine-config-operator pod/machine-config-daemon-fvqb6 node/ip-10-0-138-142.ec2.internal container=machine-config-daemon container exited with code 143 (Error)
May 04 08:55:13.982 E ns/openshift-image-registry pod/node-ca-cljnc node/ip-10-0-138-142.ec2.internal container=node-ca container exited with code 137 (Error)
May 04 08:57:15.544 E ns/openshift-machine-config-operator pod/machine-config-daemon-sxwv6 node/ip-10-0-138-142.ec2.internal container=machine-config-daemon container exited with code 143 (Error)

Could be a failure to set resource limits,
please work with all three impacted teams (monitoring, registry, mcd) to verify their limits are in place or debug.
Hm, I'm a little confused by this. The recent guidance we got from several Red Hatters was to remove all Pod limits and keep only resource requests, e.g.: https://github.com/openshift/cluster-monitoring-operator/pull/219. Is this no longer the case?
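For context, the "requests only, no limits" pattern referenced above looks roughly like the sketch below. This is illustrative only: the Deployment name, image, and request values are placeholders, not taken from the actual cluster-monitoring-operator manifests.

```yaml
# Illustrative container spec: resource requests set, limits deliberately omitted.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-adapter        # hypothetical name, for illustration
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-adapter
  template:
    metadata:
      labels:
        app: prometheus-adapter
    spec:
      containers:
      - name: prometheus-adapter
        image: example.invalid/prometheus-adapter:latest   # placeholder image
        resources:
          requests:               # the scheduler uses these for placement
            cpu: 10m
            memory: 25Mi
          # No "limits" stanza: the container cannot be OOM-killed by its own
          # cgroup memory limit, but it can still be evicted or killed under
          # node-level memory pressure.
```

With no limit set, an OOMKill generally points at node memory pressure rather than the container exceeding its own quota, which is what makes the adapter's exit curious here.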
Prometheus Adapter OOMing is quite strange. I've checked the cluster-monitoring-operator and it doesn't force any limits anymore. Additionally, I've looked into the Prometheus Adapter logs and don't see anything suspicious there either. Looking at the Prometheus WAL from the CI run, I can clearly see that the openshift-machine-config-operator namespace is allocating lots of memory, around 10GiB at the end. At the same time, memory consumption of the Prometheus Adapter stays around 25MiB.

Matthias
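The per-namespace memory view described above can be reproduced with a PromQL query along these lines. The metric and label names assume the standard kubelet/cAdvisor metrics scraped by the in-cluster Prometheus; exact label names (e.g. `container` vs. `container_name`) vary between releases, so treat this as a sketch.

```promql
# Working-set memory per namespace, summed over all containers.
# container!="" drops the pod-level cgroup aggregate rows so pods
# are not double-counted.
sum by (namespace) (container_memory_working_set_bytes{container!=""})
```

Running this over the CI run's time range is what surfaces openshift-machine-config-operator as the namespace climbing toward ~10GiB while the adapter stays flat.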
Created attachment 1564470 [details] Memory consumption of all namespaces in CI run
Created attachment 1564472 [details] Memory consumption of Prometheus Adapter