On a CI cluster I observed a revision-pruner pod being OOMKilled. After looking at it, I found it had no resource requests. All pods created by infra components must have resource requests.

name: revision-pruner-7-ip-10-0-154-247.ec2.internal
namespace: openshift-kube-scheduler
containerID: cri-o://d4cbb0c7217e37b15adf1cb317259a55efb855784d5abe9189ee66ba99b17b95
exitCode: 0
finishedAt: "2020-01-17T16:56:35Z"
reason: OOMKilled
startedAt: "2020-01-17T16:56:35Z"

Please review all places where these pods are created and ensure all of them have resource requests.
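For reference, the change being asked for amounts to giving the pruner container explicit requests in its pod spec. A minimal sketch (the container name, image, and values here are illustrative, not the actual library-go template):

apiVersion: v1
kind: Pod
metadata:
  name: revision-pruner-7-ip-10-0-154-247.ec2.internal
  namespace: openshift-kube-scheduler
spec:
  containers:
  - name: pruner                 # illustrative container name
    image: example.invalid/cli   # illustrative image
    resources:
      requests:
        memory: 100Mi            # illustrative values; with no requests at all
        cpu: 10m                 # the pod lands in the BestEffort QoS class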
Switching back to ASSIGNED; this won't actually be fixed until the library-go bumps go through.
I used the following Prometheus query on a different cluster to see the memory usage of another revision pruner that was OOMKilled:

sum(container_memory_usage_bytes{namespace="openshift-kube-scheduler", pod="revision-pruner-7-ip-10-0-148-186.ec2.internal"}) by (container_name)

That gave me only 4898816 bytes, which is about 4M, much lower than the currently requested 100M. Could it be possible that it's being OOMKilled because it's requesting too many resources?
I posted this in the current PR, but I think it's worth noting here for reference: to follow up on this, I did some investigating on a longer-running cluster. In the operator I looked at (KSO), only 1 revision pruner pod (out of 8) was OOMKilled, and that pod was only using 40M at its peak (with request & limit set to 100M). So I'm wondering if increasing memory requests will actually fix this, or if in some cases our revision pruner is just the unlucky pod that got dropped on an otherwise full cluster.

We can give it another bump, but it isn't essential that no revision pruner ever gets OOMKilled (the next one will clean up any work left by the previous one that got killed), and we don't want to request so much memory that we push out other components that are actually critical. This may just be a case where we're chasing the wrong flake.
Another follow-up: this PR (and the linked bug), https://github.com/openshift/library-go/pull/707, indicates that CPU limits are also required for the pod to be classed as Guaranteed. Because that merged, I'm going to set this back to ON_QA so we can see if it fixed our problem. If not, we can try one more resource limit bump.
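For context, my understanding of the kubelet QoS rules (not specific to that PR): a pod is only placed in the Guaranteed QoS class when every container has both CPU and memory limits set and equal to its requests, roughly:

resources:
  requests:
    cpu: 10m         # illustrative values
    memory: 100Mi
  limits:
    cpu: 10m         # limits must equal requests for both CPU and memory
    memory: 100Mi    # on every container for the pod to be Guaranteed

With memory requests alone (or requests not equal to limits), the pod is only Burstable and is more likely to be killed or evicted ahead of Guaranteed pods under memory pressure.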
Sending this bug to Node per a conversation with David and Eric (cc'd). It seems to be related to the issue in https://bugzilla.redhat.com/show_bug.cgi?id=1800609, where pods may be getting killed for a reason that is not accurately reported as OOMKill.