Bug 1800609 - On GCP cluster, containers getting OOMKilled
Summary: On GCP cluster, containers getting OOMKilled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.4.z
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Duplicates: 1792501 1834927
Depends On:
Blocks: 1848583
 
Reported: 2020-02-07 14:24 UTC by Phil Cameron
Modified: 2023-10-06 19:09 UTC (History)
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-29 15:33:54 UTC
Target Upstream Version:
Embargoed:




Links:
- Github openshift library-go pull 707 (closed): bug 1800609: add cpu limits to become guaranteed (last updated 2020-12-29 11:28:44 UTC)
- Red Hat Bugzilla 1714807 (CLOSED): static installer pod OOMkilled in 4.1.0-rc7 (last updated 2021-02-22 00:41:40 UTC)
- Red Hat Product Errata RHBA-2020:2713 (last updated 2020-06-29 15:34:17 UTC)

Internal Links: 1799079

Description Phil Cameron 2020-02-07 14:24:37 UTC
Description of problem: I brought up a GCP cluster with OVN networking. It seems to work. However, I got the following:

oc get po --all-namespaces
openshift-kube-controller-manager    installer-6-pcamer-tc6vh-m-2.c.openshift-gce-devel.internal    0/1    OOMKilled

I have not brought up many GCP clusters recently and have only seen this once.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 David Eads 2020-02-07 15:24:14 UTC
This is a cosmetic issue with no impact on the cluster: all failed installer pods are retried, and later revisions leave the old failed installer pod in place, which is why it still shows up in the pod list. The node in question had no record of OOM killing in dmesg, and Prometheus had no record of usage.

We did notice the pod's QoS class was Burstable. We'll add CPU limits to move it to Guaranteed QoS.
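
A minimal sketch of checking that observation directly, assuming the failed installer pod is still present (namespace and selection pattern follow the ones used later in this bug; the grep simply picks the first installer pod and is illustrative only):

$ ns=openshift-kube-controller-manager
$ podname=$(oc get pods -n $ns | grep installer | head -1 | cut -d " " -f1)
# qosClass is derived by Kubernetes: Guaranteed requires requests == limits for every
# resource in every container; otherwise the pod is Burstable or BestEffort.
$ oc get pod -n $ns $podname -o jsonpath='{.status.qosClass}{"\n"}'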

Comment 2 Eric Paris 2020-02-07 15:26:06 UTC
It is very odd that the kubelet would report the reason as OOM when the journal/dmesg don't actually show any OOM.

Can the kubelet misreport the reason as OOM when it wasn't actually an OOM?
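
For reference, a rough sketch of how one could cross-check the node for OOM-killer records (the node name is a placeholder; assumes oc debug access to the node that ran the pod):

$ node=<node-name>   # placeholder: the node that hosted the installer pod
# Kernel OOM kills leave "Out of memory" / "Killed process" entries in dmesg and the journal.
$ oc debug node/$node -- chroot /host dmesg | grep -i 'killed process'
$ oc debug node/$node -- chroot /host journalctl -k | grep -i 'out of memory'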

Comment 4 Olimp Bockowski 2020-02-17 11:53:25 UTC
We have experienced a very similar problem, but:
- with kube-apiserver
- on bare metal, and for memory, not CPU
- the hosts had 256 GiB RAM, yet pods were still killed with exit code 137 and the message "OOMKilled"
- it was reproducible (~7 deployment attempts)

Setting "limits" helped us.

But now I am wondering whether the problem was actually related to something different. Maybe some bug in requests/limits? How could such a situation occur if there is no problem with resources?

To fix my bare-metal deployment, I waited until the bootstrap assets were generated, copied the files to the hosts, and then modified the manifests, e.g.:
sed -i 's%{"requests":{"cpu":"150m","memory":"1Gi"}}%{"requests":{"cpu":"300m","memory":"2Gi"},"limits":{"memory":"20Gi"}}%' /etc/kubernetes/manifests/kube-apiserver-pod.yaml

That helped the deployment succeed, but later I couldn't apply the change persistently (AFAIK the openshift-kube-apiserver-operator has the pod definition compiled into its binary and then creates configmap/kube-apiserver-pod in the openshift-kube-apiserver namespace).
There were no OOM kills after deployment, but maybe we could still have some flexible way to apply this, like a configmap/kube-apiserver-pod for the operator? Or is there already a way that I haven't found?
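
A sketch of inspecting the rendered pod definition the operator publishes (the configmap name is taken from the paragraph above; the exact layout of the data may differ by release):

# Read-only inspection; per the comment above, manual edits are not persistent
# because the operator regenerates this configmap from its compiled-in definition.
$ oc get configmap -n openshift-kube-apiserver kube-apiserver-pod -o yaml | grep -i -A 3 requests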

Comment 7 David Eads 2020-02-24 15:26:31 UTC
The pod reported as OOMKilled wasn't actually killed by the OOM killer.

Comment 8 Ryan Phillips 2020-02-24 15:28:45 UTC
The installer needs an RHCOS bump, PR here: https://github.com/openshift/installer/pull/3173

Comment 9 Ryan Phillips 2020-02-25 15:58:57 UTC
*** Bug 1792501 has been marked as a duplicate of this bug. ***

Comment 10 W. Trevor King 2020-02-25 20:27:54 UTC
This bug went MODIFIED when library-go#707 landed, but to actually change the release image, that fix needs to be vendored into the operators that are referenced from the release image.  Moving back to ASSIGNED until we get links to those vendor-bump PRs.
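
As a rough sketch, one way to check whether a given operator repository has picked up the library-go change is to look at its vendored module revision (the repository named below is only an example, not confirmed from this bug):

# From a local checkout of an operator repo referenced by the release image,
# e.g. openshift/cluster-kube-controller-manager-operator:
$ grep 'github.com/openshift/library-go' go.mod
$ git log --oneline -1 -- vendor/github.com/openshift/library-go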

Comment 15 Ryan Phillips 2020-05-13 16:28:42 UTC
*** Bug 1834927 has been marked as a duplicate of this bug. ***

Comment 17 Mike Dame 2020-06-18 15:12:07 UTC
I am trying to backport the changes from https://github.com/openshift/library-go/pull/707 (which references this BZ) to 4.3 and 4.2, but the backports require this dependent bug to target 4.4.z.

I apologize if this is not the right bug for that, but since this BZ currently targets no release, I am setting it to 4.4.z so the backports can go through.

Comment 20 Sunil Choudhary 2020-06-22 07:34:19 UTC
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-06-21-210301   True        False         31m     Cluster version is 4.4.0-0.nightly-2020-06-21-210301

$ ns="openshift-kube-scheduler"
$ podname=$(oc get pods -n $ns | grep installer | head -1 | cut -d " " -f1)
$ oc get pod -n $ns $podname -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "150m",
    "memory": "100M"
  },
  "requests": {
    "cpu": "150m",
    "memory": "100M"
  }
}

$ ns="openshift-kube-scheduler"
$ podname=$(oc get pods -n $ns | grep revision-pruner | head -1 | cut -d " " -f1)
$ oc get pod -n $ns $podname -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "150m",
    "memory": "100M"
  },
  "requests": {
    "cpu": "150m",
    "memory": "100M"
  }
}

$ ns="openshift-kube-apiserver"
$ podname=$(oc get pods -n $ns | grep installer | head -1 | cut -d " " -f1)
$ oc get pod -n $ns $podname -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "150m",
    "memory": "100M"
  },
  "requests": {
    "cpu": "150m",
    "memory": "100M"
  }
}

$ ns="openshift-kube-apiserver"
$ podname=$(oc get pods -n $ns | grep revision-pruner | head -1 | cut -d " " -f1)
$ oc get pod -n $ns $podname -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "150m",
    "memory": "100M"
  },
  "requests": {
    "cpu": "150m",
    "memory": "100M"
  }
}

$ ns="openshift-kube-controller-manager"
$ podname=$(oc get pods -n $ns | grep installer | head -1 | cut -d " " -f1)
$ oc get pod -n $ns $podname -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "150m",
    "memory": "100M"
  },
  "requests": {
    "cpu": "150m",
    "memory": "100M"
  }
}

$ ns="openshift-kube-controller-manager"
$ podname=$(oc get pods -n $ns | grep revision-pruner | head -1 | cut -d " " -f1)
$ oc get pod -n $ns $podname -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "150m",
    "memory": "100M"
  },
  "requests": {
    "cpu": "150m",
    "memory": "100M"
  }
}

$ oc get pod -A | grep -E -v 'Running|Completed'
NAMESPACE                                               NAME                                                                  READY   STATUS      RESTARTS   AGE

Comment 22 errata-xmlrpc 2020-06-29 15:33:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2713

Comment 23 Red Hat Bugzilla 2023-09-14 05:52:05 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

