Bug 1792501

Summary: Scheduler revision pruner requests no resources, can get OOMKilled
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Node
Assignee: Ryan Phillips <rphillips>
Status: CLOSED DUPLICATE
QA Contact: Sunil Choudhary <schoudha>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.4
CC: aos-bugs, calfonso, deads, eparis, jokerman, mdame, mfojtik, yinzhou
Target Milestone: ---
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-02-25 15:58:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1848584

Description Clayton Coleman 2020-01-17 18:27:31 UTC
On a CI cluster I observed a revision-pruner pod being OOMKilled. After looking at it, I found it had no resource requests. All pods created by infra components must have resource requests.

  name: revision-pruner-7-ip-10-0-154-247.ec2.internal
  namespace: openshift-kube-scheduler

        containerID: cri-o://d4cbb0c7217e37b15adf1cb317259a55efb855784d5abe9189ee66ba99b17b95
        exitCode: 0
        finishedAt: "2020-01-17T16:56:35Z"
        reason: OOMKilled
        startedAt: "2020-01-17T16:56:35Z"

Please review all places where these pods are created and ensure all of them have resource requests.
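
For illustration only, here is a minimal sketch of what giving the pruner container explicit resource requests could look like with the upstream corev1 types. This is not the actual library-go manifest; the container name, image, and values are made up.

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

// prunerContainer returns an illustrative container spec for a revision
// pruner with explicit CPU and memory requests, so the pod is no longer
// treated as BestEffort.
func prunerContainer() corev1.Container {
    return corev1.Container{
        Name:  "pruner",                      // hypothetical container name
        Image: "example.com/revision-pruner", // hypothetical image
        Resources: corev1.ResourceRequirements{
            Requests: corev1.ResourceList{
                corev1.ResourceCPU:    resource.MustParse("10m"),
                corev1.ResourceMemory: resource.MustParse("100Mi"),
            },
        },
    }
}

func main() {
    c := prunerContainer()
    fmt.Printf("requests: cpu=%s memory=%s\n",
        c.Resources.Requests.Cpu(), c.Resources.Requests.Memory())
}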

Comment 1 Mike Dame 2020-01-27 21:32:22 UTC
Switching back to assigned; this won't actually be fixed until the library-go bumps go through.

Comment 5 Mike Dame 2020-02-05 17:16:55 UTC
I used the following Prometheus query on a different cluster to see the memory usage of another revision pruner that was OOMKilled:

sum(container_memory_usage_bytes{namespace="openshift-kube-scheduler", pod="revision-pruner-7-ip-10-0-148-186.ec2.internal"}) by (container_name)

That gave me only 4898816 bytes, which is about 4.7M, much lower than the currently requested 100M. Could it be that it's being OOMKilled because it's requesting too many resources?

Comment 6 Mike Dame 2020-02-10 16:33:33 UTC
I posted this in the current PR, but I think it's worth noting here for reference:

To follow up on this, I did some investigating on a longer-running cluster. In the operator I looked at (KSO, the kube-scheduler operator), only 1 of the 8 revision pruner pods was OOMKilled, and that pod was only using 40M at its peak (with request & limit set to 100M). So I'm wondering whether increasing memory requests will actually fix this, or whether in some cases our revision pruner is just the unlucky pod that got dropped on an otherwise full cluster.

We can give it another bump, but it isn't essential that no revision pruner ever gets OOMKilled (the next one will clean up any work left by the one that was killed). And we don't want to request so much memory that we push out other components that are actually critical. This may just be a case where we're chasing the wrong flake.
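
For anyone repeating the per-pod check above, here is a rough sketch of how to list which revision-pruner pods in openshift-kube-scheduler last terminated with OOMKilled. It assumes a recent client-go and a kubeconfig in the default location; it is not part of the fix.

package main

import (
    "context"
    "fmt"
    "strings"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load the default kubeconfig (~/.kube/config).
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(config)

    pods, err := client.CoreV1().Pods("openshift-kube-scheduler").
        List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for _, pod := range pods.Items {
        if !strings.HasPrefix(pod.Name, "revision-pruner-") {
            continue
        }
        for _, cs := range pod.Status.ContainerStatuses {
            // Check the current state first, then the last terminated state.
            term := cs.State.Terminated
            if term == nil {
                term = cs.LastTerminationState.Terminated
            }
            if term != nil && term.Reason == "OOMKilled" {
                fmt.Printf("%s/%s OOMKilled (exit code %d)\n", pod.Name, cs.Name, term.ExitCode)
            }
        }
    }
}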

Comment 7 Mike Dame 2020-02-10 16:39:42 UTC
Another follow-up: this PR (and its linked bug), https://github.com/openshift/library-go/pull/707, indicates that CPU limits are also required for a pod to get the Guaranteed QoS class. Because that merged, I'm going to set this back to ON_QA so we can see whether it fixed our problem. If not, we can try one more resource limit bump.
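
For context, Kubernetes assigns the Guaranteed QoS class only when every container's CPU and memory limits are set and equal to its requests. A hedged sketch of that shape follows; the values are illustrative, not the ones library-go settled on.

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

// guaranteedResources sets limits equal to requests for both CPU and memory,
// which is the condition for the Guaranteed QoS class.
func guaranteedResources() corev1.ResourceRequirements {
    rl := corev1.ResourceList{
        corev1.ResourceCPU:    resource.MustParse("15m"),
        corev1.ResourceMemory: resource.MustParse("100Mi"),
    }
    return corev1.ResourceRequirements{
        Requests: rl,
        Limits:   rl.DeepCopy(),
    }
}

func main() {
    r := guaranteedResources()
    guaranteed := r.Requests.Cpu().Cmp(*r.Limits.Cpu()) == 0 &&
        r.Requests.Memory().Cmp(*r.Limits.Memory()) == 0
    fmt.Println("requests == limits:", guaranteed)
}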

Comment 8 Mike Dame 2020-02-11 14:33:55 UTC
Sending this bug to the Node component per conversation with David and Eric (cc'd). It seems to be related to the issue in https://bugzilla.redhat.com/show_bug.cgi?id=1800609, where pods may be getting killed for a reason that is not accurately reported as OOMKilled.