Description of problem:
I brought up a GCP cluster with OVN networking. It seems to work. However, I got the following:

$ oc get po --all-namespaces
openshift-kube-controller-manager   installer-6-pcamer-tc6vh-m-2.c.openshift-gce-devel.internal   0/1   OOMKilled

I have not brought up many GCP clusters recently and have only seen this once.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
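(For triage, the termination reason the kubelet recorded can be read straight from the pod status; a minimal sketch using the pod above:)

$ oc -n openshift-kube-controller-manager get pod \
    installer-6-pcamer-tc6vh-m-2.c.openshift-gce-devel.internal \
    -o jsonpath='{.status.containerStatuses[0].state.terminated.reason}'
OOMKilled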
This is a cosmetic issue with no impact to the cluster: failed installer pods are retried, and later revisions simply leave the old failed installer pod in place. The node in question had no record of any OOM kill in dmesg, and Prometheus had no record of unusual usage. We did notice the pod's QoS class was Burstable. We'll add limits matching the requests to move these pods to the Guaranteed QoS class.
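For reference, the QoS class the kubelet assigned is recorded in the pod status and can be checked directly (a quick sketch, using the pod from the report):

$ oc -n openshift-kube-controller-manager get pod \
    installer-6-pcamer-tc6vh-m-2.c.openshift-gce-devel.internal \
    -o jsonpath='{.status.qosClass}'
Burstable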
It is very odd that the kubelet would report the reason as OOM while the journal/dmesg show no OOM at all. Can the kubelet mis-report the reason as OOM when there wasn't actually an OOM kill?
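(For anyone checking a node for this: a genuine kernel OOM kill leaves a record in the kernel log, which can be inspected from outside the node; a minimal sketch, with <node-name> as a placeholder:)

$ oc debug node/<node-name> -- chroot /host \
    sh -c 'journalctl -k | grep -i -e "out of memory" -e oom'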
We have experienced a very similar problem, but:
- with kube-apiserver
- on bare metal
- and for memory, not CPU.

The hosts had 256 GiB RAM, but Pods were still being killed with exit code 137 and the message "OOMKilled", and it was reproducible (~7 deployment attempts). Adding "limits" helped us, but now I'm wondering whether the problem was actually related to something different. Maybe some bug in requests/limits handling? How could such a situation occur if there is no real resource pressure?

To fix my bare-metal deployment, I waited for bootstrap to generate the assets, copied the files to the hosts, and then modified the manifests, e.g.:

sed -i 's%{"requests":{"cpu":"150m","memory":"1Gi"}}%{"requests":{"cpu":"300m","memory":"2Gi"},"limits":{"memory":"20Gi"}}%' /etc/kubernetes/manifests/kube-apiserver-pod.yaml

That got the deployment through, but I couldn't apply the change persistently afterwards (AFAIK openshift-kube-apiserver-operator has the pod definition compiled into the binary, and it then creates configmap/kube-apiserver-pod in the openshift-kube-apiserver namespace). There were no OOM kills after deployment, but maybe we could have some flexible way to apply this, like a configmap/kube-apiserver-pod override for the operator? Or is there already a way that I haven't found?
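(A quick way to confirm what the kubelet actually recorded for a kill like this; a sketch, with the failing pod's name to be substituted:)

$ oc -n openshift-kube-apiserver get pod <kube-apiserver-pod> \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode} {.status.containerStatuses[0].lastState.terminated.reason}'
137 OOMKilled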
To clarify: the container reported as "OOMKilled" wasn't actually killed by the kernel OOM killer.
The installer needs an RHCOS bump; PR here: https://github.com/openshift/installer/pull/3173
*** Bug 1792501 has been marked as a duplicate of this bug. ***
This bug went MODIFIED when library-go#707 landed, but for the change to actually make it into the release image, it needs to be vendored into the operators that are referenced from the release image. Moving back to ASSIGNED until we have links to those vendor-bump PRs.
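For anyone following along, a vendor bump in one of those operator repos usually amounts to something like the following (a sketch, assuming the repo uses Go modules; <commit-sha> stands for whichever library-go commit carries the fix):

$ go get github.com/openshift/library-go@<commit-sha>
$ go mod tidy
$ go mod vendor
# commit the resulting go.mod, go.sum, and vendor/ changes, then open the bump PR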
*** Bug 1834927 has been marked as a duplicate of this bug. ***
Client-go updated: https://github.com/openshift/origin/commit/7c09da5e0059873e32b5e9b8f209d4315c3766d5
I am trying to backport the changes from https://github.com/openshift/library-go/pull/707 (which references this BZ) to 4.3 and 4.2, but the backport process requires this dependent bug to target 4.4.z. I apologize if this is not the right bug for that, but since this BZ currently targets no release, I am setting it to 4.4.z to enable the backports to go through.
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-06-21-210301   True        False         31m     Cluster version is 4.4.0-0.nightly-2020-06-21-210301

$ ns="openshift-kube-scheduler"
$ podname=$(oc get pods -n $ns | grep installer | head -1 | cut -d " " -f1)
$ oc get pod -n $ns $podname -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "150m",
    "memory": "100M"
  },
  "requests": {
    "cpu": "150m",
    "memory": "100M"
  }
}

$ ns="openshift-kube-scheduler"
$ podname=$(oc get pods -n $ns | grep revision-pruner | head -1 | cut -d " " -f1)
$ oc get pod -n $ns $podname -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "150m",
    "memory": "100M"
  },
  "requests": {
    "cpu": "150m",
    "memory": "100M"
  }
}

$ ns="openshift-kube-apiserver"
$ podname=$(oc get pods -n $ns | grep installer | head -1 | cut -d " " -f1)
$ oc get pod -n $ns $podname -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "150m",
    "memory": "100M"
  },
  "requests": {
    "cpu": "150m",
    "memory": "100M"
  }
}

$ ns="openshift-kube-apiserver"
$ podname=$(oc get pods -n $ns | grep revision-pruner | head -1 | cut -d " " -f1)
$ oc get pod -n $ns $podname -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "150m",
    "memory": "100M"
  },
  "requests": {
    "cpu": "150m",
    "memory": "100M"
  }
}

$ ns="openshift-kube-controller-manager"
$ podname=$(oc get pods -n $ns | grep installer | head -1 | cut -d " " -f1)
$ oc get pod -n $ns $podname -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "150m",
    "memory": "100M"
  },
  "requests": {
    "cpu": "150m",
    "memory": "100M"
  }
}

$ ns="openshift-kube-controller-manager"
$ podname=$(oc get pods -n $ns | grep revision-pruner | head -1 | cut -d " " -f1)
$ oc get pod -n $ns $podname -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "150m",
    "memory": "100M"
  },
  "requests": {
    "cpu": "150m",
    "memory": "100M"
  }
}

$ oc get pod -A | grep -E -v 'Running|Completed'
NAMESPACE   NAME   READY   STATUS   RESTARTS   AGE
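For the record, the per-namespace checks above can be compressed into a single loop (equivalent to the transcript, same assumptions about pod naming):

$ for ns in openshift-kube-scheduler openshift-kube-apiserver openshift-kube-controller-manager; do
    for kind in installer revision-pruner; do
      podname=$(oc get pods -n $ns | grep $kind | head -1 | cut -d " " -f1)
      echo "== $ns / $podname =="
      oc get pod -n $ns $podname -o json | jq .spec.containers[0].resources
    done
  done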
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2713
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days