Bug 1800609 - On GCP cluster, containers getting OOMKilled
Summary: On GCP cluster, containers getting OOMKilled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.4.z
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Duplicates: 1792501 1834927
Depends On:
Blocks: 1848583
 
Reported: 2020-02-07 14:24 UTC by Phil Cameron
Modified: 2023-10-06 19:09 UTC (History)
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-29 15:33:54 UTC
Target Upstream Version:
Embargoed:




Links:
- Github openshift library-go pull 707 (closed): bug 1800609: add cpu limits to become guaranteed (last updated 2020-12-29 11:28:44 UTC)
- Red Hat Bugzilla 1714807 (CLOSED): static installer pod OOMkilled in 4.1.0-rc7 (last updated 2021-02-22 00:41:40 UTC)
- Red Hat Product Errata RHBA-2020:2713 (last updated 2020-06-29 15:34:17 UTC)

Internal Links: 1799079

Description Phil Cameron 2020-02-07 14:24:37 UTC
Description of problem: I brought up a GCP cluster with OVN networking. It seems to work. However, I got the following:

oc get po --all-namespaces
openshift-kube-controller-manager    installer-6-pcamer-tc6vh-m-2.c.openshift-gce-devel.internal    0/1    OOMKilled

I have not brought up many GCP clusters recently and have only seen this once.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 David Eads 2020-02-07 15:24:14 UTC
This is a cosmetic issue with no impact on the cluster: all failed installer pods are retried, and later revisions leave the old failed installer pod in place, which is why it still shows up in the pod list. The node in question had no record of OOM killing in dmesg, and Prometheus had no record of usage.

We did notice the pod's QoS class was Burstable. We'll add CPU limits to move it to Guaranteed QoS.
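
A minimal sketch of checking that observation directly, assuming the failed installer pod is still present (namespace and selection pattern follow the ones used later in this bug; the grep simply picks the first installer pod and is illustrative only):

$ ns=openshift-kube-controller-manager
$ podname=$(oc get pods -n $ns | grep installer | head -1 | cut -d " " -f1)
# qosClass is derived by Kubernetes: Guaranteed requires requests == limits for every
# resource in every container; otherwise the pod is Burstable or BestEffort.
$ oc get pod -n $ns $podname -o jsonpath='{.status.qosClass}{"\n"}'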

Comment 2 Eric Paris 2020-02-07 15:26:06 UTC
It is very odd that the kubelet would report the reason as OOM when the journal/dmesg don't actually show any OOM.

Can the kubelet misreport the reason as OOM when it wasn't actually an OOM?
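
For reference, a rough sketch of how one could cross-check the node for OOM-killer records (the node name is a placeholder; assumes oc debug access to the node that ran the pod):

$ node=<node-name>   # placeholder: the node that hosted the installer pod
# Kernel OOM kills leave "Out of memory" / "Killed process" entries in dmesg and the journal.
$ oc debug node/$node -- chroot /host dmesg | grep -i 'killed process'
$ oc debug node/$node -- chroot /host journalctl -k | grep -i 'out of memory'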

Comment 4 Olimp Bockowski 2020-02-17 11:53:25 UTC
We have experienced a very similar problem, but:
- with kube-apiserver
- on bare metal, and for memory, not CPU
- the hosts had 256 GiB RAM, yet pods were still killed with exit code 137 and the message "OOMKilled"
- it was reproducible (~7 deployment attempts)

Setting "limits" helped us.

But now I am wondering whether the problem was actually related to something different. Maybe some bug in requests/limits? How could such a situation occur if there is no problem with resources?

To fix my bare-metal deployment, I waited until the bootstrap assets were generated, copied the files to the hosts, and then modified the manifests, e.g.:
sed -i 's%{"requests":{"cpu":"150m","memory":"1Gi"}}%{"requests":{"cpu":"300m","memory":"2Gi"},"limits":{"memory":"20Gi"}}%' /etc/kubernetes/manifests/kube-apiserver-pod.yaml

That helped the deployment succeed, but later I couldn't apply the change persistently (AFAIK the openshift-kube-apiserver-operator has the pod definition compiled into its binary and then creates configmap/kube-apiserver-pod in the openshift-kube-apiserver namespace).
There were no OOM kills after deployment, but maybe we could still have some flexible way to apply this, like a configmap/kube-apiserver-pod for the operator? Or is there already a way that I haven't found?
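
A sketch of inspecting the rendered pod definition the operator publishes (the configmap name is taken from the paragraph above; the exact layout of the data may differ by release):

# Read-only inspection; per the comment above, manual edits are not persistent
# because the operator regenerates this configmap from its compiled-in definition.
$ oc get configmap -n openshift-kube-apiserver kube-apiserver-pod -o yaml | grep -i -A 3 requests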

Comment 7 David Eads 2020-02-24 15:26:31 UTC
The pod reported as OOMKilled wasn't actually killed by the OOM killer.

Comment 8 Ryan Phillips 2020-02-24 15:28:45 UTC
The installer needs an RHCOS bump, PR here: https://github.com/openshift/installer/pull/3173

Comment 9 Ryan Phillips 2020-02-25 15:58:57 UTC
*** Bug 1792501 has been marked as a duplicate of this bug. ***

Comment 10 W. Trevor King 2020-02-25 20:27:54 UTC
This bug went MODIFIED when library-go#707 landed, but to actually change the release image, that fix needs to be vendored into the operators that are referenced from the release image.  Moving back to ASSIGNED until we get links to those vendor-bump PRs.
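
As a rough sketch, one way to check whether a given operator repository has picked up the library-go change is to look at its vendored module revision (the repository named below is only an example, not confirmed from this bug):

# From a local checkout of an operator repo referenced by the release image,
# e.g. openshift/cluster-kube-controller-manager-operator:
$ grep 'github.com/openshift/library-go' go.mod
$ git log --oneline -1 -- vendor/github.com/openshift/library-go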

Comment 15 Ryan Phillips 2020-05-13 16:28:42 UTC
*** Bug 1834927 has been marked as a duplicate of this bug. ***

Comment 17 Mike Dame 2020-06-18 15:12:07 UTC
I am trying to backport the changes from https://github.com/openshift/library-go/pull/707 (which references this BZ) to 4.3 and 4.2, but the backports require this dependent bug to target 4.4.z.

I apologize if this is not the right bug for that, but since this BZ currently targets no release, I am setting it to 4.4.z so the backports can go through.

Comment 20 Sunil Choudhary 2020-06-22 07:34:19 UTC
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-06-21-210301   True        False         31m     Cluster version is 4.4.0-0.nightly-2020-06-21-210301

$ ns="openshift-kube-scheduler"
$ podname=$(oc get pods -n $ns | grep installer | head -1 | cut -d " " -f1)
$ oc get pod -n $ns $podname -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "150m",
    "memory": "100M"
  },
  "requests": {
    "cpu": "150m",
    "memory": "100M"
  }
}

$ ns="openshift-kube-scheduler"
$ podname=$(oc get pods -n $ns | grep revision-pruner | head -1 | cut -d " " -f1)
$ oc get pod -n $ns $podname -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "150m",
    "memory": "100M"
  },
  "requests": {
    "cpu": "150m",
    "memory": "100M"
  }
}

$ ns="openshift-kube-apiserver"
$ podname=$(oc get pods -n $ns | grep installer | head -1 | cut -d " " -f1)
$ oc get pod -n $ns $podname -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "150m",
    "memory": "100M"
  },
  "requests": {
    "cpu": "150m",
    "memory": "100M"
  }
}

$ ns="openshift-kube-apiserver"
$ podname=$(oc get pods -n $ns | grep revision-pruner | head -1 | cut -d " " -f1)
$ oc get pod -n $ns $podname -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "150m",
    "memory": "100M"
  },
  "requests": {
    "cpu": "150m",
    "memory": "100M"
  }
}

$ ns="openshift-kube-controller-manager"
$ podname=$(oc get pods -n $ns | grep installer | head -1 | cut -d " " -f1)
$ oc get pod -n $ns $podname -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "150m",
    "memory": "100M"
  },
  "requests": {
    "cpu": "150m",
    "memory": "100M"
  }
}

$ ns="openshift-kube-controller-manager"
$ podname=$(oc get pods -n $ns | grep revision-pruner | head -1 | cut -d " " -f1)
$ oc get pod -n $ns $podname -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "150m",
    "memory": "100M"
  },
  "requests": {
    "cpu": "150m",
    "memory": "100M"
  }
}

$ oc get pod -A | grep -E -v 'Running|Completed'
NAMESPACE                                               NAME                                                                  READY   STATUS      RESTARTS   AGE

Comment 22 errata-xmlrpc 2020-06-29 15:33:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2713

Comment 23 Red Hat Bugzilla 2023-09-14 05:52:05 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

