Bug 1792501 - Scheduler revision pruner requests no resources, can get OOMKilled
Summary: Scheduler revision pruner requests no resources, can get OOMKilled
Keywords:
Status: CLOSED DUPLICATE of bug 1800609
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks: 1848584
 
Reported: 2020-01-17 18:27 UTC by Clayton Coleman
Modified: 2020-06-18 14:53 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-25 15:58:57 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift cluster-kube-scheduler-operator pull 200 None closed Bug 1792501: bump(*): library-go to pull in revision pruner pod resources 2020-11-18 14:23:28 UTC
Github openshift library-go pull 683 None closed Bug 1792501: Add memory request and limit to revision pruner pod 2020-11-18 14:23:28 UTC
Github openshift library-go pull 703 None closed Bug 1792501: Increase revision pruner pod memory request 2020-11-18 14:23:07 UTC

Internal Links: 1799079

Description Clayton Coleman 2020-01-17 18:27:31 UTC
On a CI cluster I observed revision-pruner being OOMKilled. After looking at it, I found it had no resource requests. All pods created by infra components must have resource requests.

  name: revision-pruner-7-ip-10-0-154-247.ec2.internal
  namespace: openshift-kube-scheduler

        containerID: cri-o://d4cbb0c7217e37b15adf1cb317259a55efb855784d5abe9189ee66ba99b17b95
        exitCode: 0
        finishedAt: "2020-01-17T16:56:35Z"
        reason: OOMKilled
        startedAt: "2020-01-17T16:56:35Z"

Please review all places where these pods are created and ensure all of them have resource requests.
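
For reference, a minimal sketch (not the actual library-go change; the package, function name, and image parameter are illustrative) of what an explicit memory request and limit on the pruner container could look like, built with the corev1 and resource packages:

    package example

    import (
        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
    )

    // prunerContainer builds a container spec for the revision pruner with an
    // explicit memory request and limit, so the pod is no longer BestEffort.
    // The 100Mi value mirrors the request discussed later in this bug.
    func prunerContainer(image string) corev1.Container {
        return corev1.Container{
            Name:  "pruner",
            Image: image,
            Resources: corev1.ResourceRequirements{
                Requests: corev1.ResourceList{
                    corev1.ResourceMemory: resource.MustParse("100Mi"),
                },
                Limits: corev1.ResourceList{
                    corev1.ResourceMemory: resource.MustParse("100Mi"),
                },
            },
        }
    }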

Comment 1 Mike Dame 2020-01-27 21:32:22 UTC
Switching back to ASSIGNED; this won't actually be fixed until the library-go bumps go through.

Comment 5 Mike Dame 2020-02-05 17:16:55 UTC
I used the following Prometheus query on a different cluster to see the memory usage of another revision pruner that was OOMKilled:

sum(container_memory_usage_bytes{namespace="openshift-kube-scheduler", pod="revision-pruner-7-ip-10-0-148-186.ec2.internal"}) by (container_name)

That gave me only 4898816 bytes, which is about 4.7M, much lower than the currently requested 100M. Could it be that it's being OOMKilled because it's requesting too many resources?

Comment 6 Mike Dame 2020-02-10 16:33:33 UTC
I posted this in the current PR, but I think it's worth noting here for reference:

To follow up on this, I did some investigating on a longer-running cluster. In the operator I looked at (KSO, the kube-scheduler operator), only 1 revision pruner pod out of 8 had been OOMKilled. That pod was only using 40M at its peak (with request & limit set to 100M). So I'm wondering whether increasing memory requests will actually fix this, or whether in some cases our revision pruner is just the unlucky pod that gets dropped on an otherwise full cluster.

We can give it another bump, but it isn't essential that no revision pruner ever gets OOMKilled, because the next one will clean up any work left by the previous one that got killed. And we don't want to request excessive memory and push out other components that are actually critical. This may just be a case where we're chasing the wrong flake.

Comment 7 Mike Dame 2020-02-10 16:39:42 UTC
Another follow-up: this PR (and its linked bug), https://github.com/openshift/library-go/pull/707, indicates that CPU limits are also required for a pod to be placed in the Guaranteed QoS class. Because that merged, I'm going to set this back to ON_QA so we can see whether it fixed our problem. If not, we can try one more resource limit bump.
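
As an illustration of that point (not the actual library-go code): Kubernetes places a pod in the Guaranteed QoS class only when every container sets requests equal to limits for both cpu and memory. A sketch of what that could look like for the pruner, with placeholder quantities:

    package example

    import (
        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
    )

    // guaranteedResources returns requests and limits that are equal for both
    // cpu and memory; with these set on every container in the pod, the pod's
    // QoS class is Guaranteed. The quantities are illustrative placeholders.
    func guaranteedResources() corev1.ResourceRequirements {
        return corev1.ResourceRequirements{
            Requests: corev1.ResourceList{
                corev1.ResourceCPU:    resource.MustParse("10m"),
                corev1.ResourceMemory: resource.MustParse("100Mi"),
            },
            Limits: corev1.ResourceList{
                corev1.ResourceCPU:    resource.MustParse("10m"),
                corev1.ResourceMemory: resource.MustParse("100Mi"),
            },
        }
    }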

Comment 8 Mike Dame 2020-02-11 14:33:55 UTC
Sending this bug to Node per a conversation with David and Eric (cc'd). It seems to be related to the issue in https://bugzilla.redhat.com/show_bug.cgi?id=1800609, where pods are getting killed for a reason that may not actually be an OOM kill.

