Description of problem:
When deploying 4.2.16 on the mini clusters, some of the revision-pruner containers are OOMKilled during installation. They do restart successfully.

Version-Release number of selected component (if applicable):
4.2.18/4.2.16

How reproducible:
Every time I've done an install I've had at least 1 or 2.

Steps to Reproduce:
1. Run through the regular installation approach.
2. oc get pods --all-namespaces=true | grep -i oom

Actual results:
# oc get pods --all-namespaces=true | grep -i oom
openshift-kube-apiserver   revision-pruner-3-master-2.test.example.com   0/1   OOMKilled   0

Expected results:
No pods are OOMKilled.

Additional info:
The logs from the failed pod are as follows:

# oc logs -f revision-pruner-3-master-2.test.example.com -n openshift-kube-apiserver
I0206 11:12:58.379630 1 cmd.go:38] &{<nil> true {false} prune true map[max-eligible-revision:0xc0006f5d60 protected-revisions:0xc0006f5e00 resource-dir:0xc0006f5ea0 static-pod-name:0xc0006f5f40 v:0xc0006f45a0] [0xc0006f45a0 0xc0006f5d60 0xc0006f5e00 0xc0006f5ea0 0xc0006f5f40] [] map[alsologtostderr:0xc0006f4000 help:0xc000704320 log-backtrace-at:0xc0006f40a0 log-dir:0xc0006f4140 log-file:0xc0006f41e0 log-file-max-size:0xc0006f4280 log-flush-frequency:0xc0000de280 logtostderr:0xc0006f4320 max-eligible-revision:0xc0006f5d60 protected-revisions:0xc0006f5e00 resource-dir:0xc0006f5ea0 skip-headers:0xc0006f43c0 skip-log-headers:0xc0006f4460 static-pod-name:0xc0006f5f40 stderrthreshold:0xc0006f4500 v:0xc0006f45a0 vmodule:0xc0006f4640] [0xc0006f5d60 0xc0006f5e00 0xc0006f5ea0 0xc0006f5f40 0xc0006f4000 0xc0006f40a0 0xc0006f4140 0xc0006f41e0 0xc0006f4280 0xc0000de280 0xc0006f4320 0xc0006f43c0 0xc0006f4460 0xc0006f4500 0xc0006f45a0 0xc0006f4640 0xc000704320] [0xc0006f4000 0xc000704320 0xc0006f40a0 0xc0006f4140 0xc0006f41e0 0xc0006f4280 0xc0000de280 0xc0006f4320 0xc0006f5d60 0xc0006f5e00 0xc0006f5ea0 0xc0006f43c0 0xc0006f4460 0xc0006f5f40 0xc0006f4500 0xc0006f45a0 0xc0006f4640] map[104:0xc000704320
118:0xc0006f45a0] [] -1 0 0xc0006f2cc0 true <nil> []}
I0206 11:12:58.379942 1 cmd.go:39] (*prune.PruneOptions)(0xc000097100)({ MaxEligibleRevision: (int) 3, ProtectedRevisions: ([]int) (len=3 cap=3) { (int) 1, (int) 2, (int) 3 }, ResourceDir: (string) (len=36) "/etc/kubernetes/static-pod-resources", StaticPodName: (string) (len=18) "kube-apiserver-pod" })
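For anyone reproducing step 2 without eyeballing the output, here is a small hedged sketch. The `count_oomkilled` helper and the sample lines are illustrative only (not part of `oc`); on a live cluster you would pipe `oc get pods --all-namespaces=true` into it instead of the sample text.

```shell
# Hypothetical helper: count OOMKilled entries in `oc get pods` output
# read from stdin. grep -c prints the number of matching lines.
count_oomkilled() {
  grep -c 'OOMKilled'
}

# Abridged sample modeled on the report above; on a real cluster:
#   oc get pods --all-namespaces=true | count_oomkilled
sample='openshift-kube-apiserver   revision-pruner-3-master-2.test.example.com   0/1   OOMKilled   0
openshift-etcd             etcd-master-0                                  1/1   Running     0'

printf '%s\n' "$sample" | count_oomkilled   # prints 1
```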
The installations still succeed, so setting this as low prio.
Assigning to kube-apiserver for further triage. We don't believe this is arch-specific.
*** Bug 1848584 has been marked as a duplicate of this bug. ***
- Checked whether the bug-related PR was already merged into the 4.2 branch:

$ cd library-go
$ git checkout -b 4.2 remotes/origin/release-4.2
Updating files: 100% (13716/13716), done.
Branch '4.2' set up to track remote branch 'release-4.2' from 'origin'.
Switched to a new branch '4.2'
$ git pull
$ git log -1
commit c27670c64634202911c0f9b3f946bf4658d27d71 (HEAD -> 4.2, origin/release-4.2)
Merge: d58edcb9 2cfb6b95
Author: OpenShift Merge Robot <openshift-merge-robot.github.com>
Date: Tue Jun 30 21:06:53 2020 -0400

    Merge pull request #822 from damemi/4.2-backport-pruner-requests

The bug-related PR was merged in on Jun 30.

- Checked whether PR #822 has been bumped into the kube-apiserver-operator component of the latest payload:

$ cd cluster-kube-apiserver-operator/
$ git pull
$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2020-07-12-134436 | grep kube-apiserver
cluster-kube-apiserver-operator  https://github.com/openshift/cluster-kube-apiserver-operator  925d5210ffb332e4da75459df7a66380a135bc3a
$ git log --date local --pretty="%h %an %cd - %s" 925d521 | grep '#822'
$ git log --date local --pretty="%h %an %cd - %s" 925d521 -1
925d5210 OpenShift Merge Robot Thu May 14 11:12:00 2020 - Merge pull request #724 from openshift-cherrypick-robot/cherry-pick-629-to-release-4.2

The last merged PR in the operator is dated May 14, earlier than the Jun 30 merge of PR #822, so PR #822 still needs to be bumped into the kube-apiserver-operator component.
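As an aside, the grep-the-log check above can also be expressed with `git merge-base --is-ancestor`, which asks git directly whether one commit is contained in another. A minimal sketch in a throwaway repo, assuming the technique rather than the real repos (the two commits below are fabricated stand-ins; in practice the IDs would be the #822 merge commit and the payload's operator commit 925d5210):

```shell
# Build a disposable repo with two empty commits so the ancestry check
# can run anywhere; no network access or real checkout is needed.
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'stand-in for the PR #822 merge commit'
fix=$(git rev-parse HEAD)
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'stand-in for the payload operator commit'
payload=$(git rev-parse HEAD)

# Exit status 0 means the fix commit is an ancestor of (i.e. contained
# in) the payload commit; non-zero means the fix still needs a bump.
if git merge-base --is-ancestor "$fix" "$payload"; then
  echo 'fix is contained in the payload commit'
else
  echo 'fix missing: needs a bump'
fi
```

This avoids relying on commit dates, which can mislead when cherry-picks preserve the original author date.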
@Ke Wang thank you for checking; you are right that this still needs to be bumped in the operator. I opened https://github.com/openshift/cluster-kube-apiserver-operator/pull/900 to do that, and am switching this back to ASSIGNED so the bug bot will pick it up. Thanks!