On a CI cluster I observed a revision-pruner pod being OOMKilled. After looking at it, I found it had no resource requests. All pods created by infra components must have resource requests.

name: revision-pruner-7-ip-10-0-154-247.ec2.internal
namespace: openshift-kube-scheduler
containerID: cri-o://d4cbb0c7217e37b15adf1cb317259a55efb855784d5abe9189ee66ba99b17b95
exitCode: 0
finishedAt: "2020-01-17T16:56:35Z"
reason: OOMKilled
startedAt: "2020-01-17T16:56:35Z"

Please review all places where these pods are created and ensure all of them have resource requests.
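For reference, the change being asked for amounts to giving the pruner container explicit requests in its pod spec. A minimal sketch (the container name, image, and values here are illustrative, not the actual library-go template):

apiVersion: v1
kind: Pod
metadata:
  name: revision-pruner-7-ip-10-0-154-247.ec2.internal
  namespace: openshift-kube-scheduler
spec:
  containers:
  - name: pruner                 # illustrative container name
    image: example.invalid/cli   # illustrative image
    resources:
      requests:
        memory: 100Mi            # illustrative values; with no requests at all
        cpu: 10m                 # the pod lands in the BestEffort QoS class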
Switching back to ASSIGNED; this won't actually be fixed until the library-go bumps go through.
I used the following Prometheus query on a different cluster to see the memory usage of another revision pruner that was OOMKilled:

sum(container_memory_usage_bytes{namespace="openshift-kube-scheduler", pod="revision-pruner-7-ip-10-0-148-186.ec2.internal"}) by (container_name)

That gave me only 4898816 bytes, which is about 4M, much lower than the currently requested 100M. Could it be possible that it's being OOMKilled because it's requesting too many resources?
I posted this in the current PR, but I think it's worth noting here for reference: to follow up on this, I did some investigating on a longer-running cluster. In the operator I looked at (KSO), only 1 revision pruner pod (out of 8) was OOMKilled, and that pod was only using 40M at its peak (with request & limit set to 100M). So I'm wondering if increasing memory requests will actually fix this, or if in some cases our revision pruner is just the unlucky pod that got dropped on an otherwise full cluster.

We can give it another bump, but it isn't essential that no revision pruner ever gets OOMKilled (the next one will clean up any work left by the previous one that got killed), and we don't want to request so much memory that we push out other components that are actually critical. This may just be a case where we're chasing the wrong flake.
Another follow-up: this PR (and the linked bug), https://github.com/openshift/library-go/pull/707, indicates that CPU limits are also required for the pod to be classed as Guaranteed. Because that merged, I'm going to set this back to ON_QA so we can see if it fixed our problem. If not, we can try one more resource limit bump.
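For context, my understanding of the kubelet QoS rules (not specific to that PR): a pod is only placed in the Guaranteed QoS class when every container has both CPU and memory limits set and equal to its requests, roughly:

resources:
  requests:
    cpu: 10m         # illustrative values
    memory: 100Mi
  limits:
    cpu: 10m         # limits must equal requests for both CPU and memory
    memory: 100Mi    # on every container for the pod to be Guaranteed

With memory requests alone (or requests not equal to limits), the pod is only Burstable and is more likely to be killed or evicted ahead of Guaranteed pods under memory pressure.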
Sending this bug to Node per a conversation with David and Eric (cc'd). It seems to be related to the issue in https://bugzilla.redhat.com/show_bug.cgi?id=1800609, where pods may be getting killed for a reason that is not accurately reported as OOMKill.