Bug 1792501
| Summary: | Scheduler revision pruner requests no resources, can get OOMKilled | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Node | Assignee: | Ryan Phillips <rphillips> |
| Status: | CLOSED DUPLICATE | QA Contact: | Sunil Choudhary <schoudha> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.4 | CC: | aos-bugs, calfonso, deads, eparis, jokerman, mdame, mfojtik, yinzhou |
| Target Milestone: | --- | | |
| Target Release: | 4.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-02-25 15:58:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1848584 | | |
Description
Clayton Coleman
2020-01-17 18:27:31 UTC
Switching back to assigned; this won't actually be fixed until the library-go bumps go through.

I used the following Prometheus query on a different cluster to see the memory usage of another revision pruner that was OOMKilled:

sum(container_memory_usage_bytes{namespace="openshift-kube-scheduler", pod="revision-pruner-7-ip-10-0-148-186.ec2.internal"}) by (container_name)

That gave me only 4898816 bytes, which is about 4.7M, much lower than the currently requested 100M. Could it be getting OOMKilled because it's requesting too many resources?

I posted this in the current PR, but I think it's worth noting here for reference. To follow up, I did some investigating on a longer-running cluster. In the operator I looked at (KSO), only 1 revision pruner pod (out of 8) was OOMKilled. That pod was only using 40M at its peak (with request and limit set to 100M). So I'm wondering whether increasing memory requests will actually fix this, or whether in some cases our revision pruner is just the unlucky pod that gets dropped on an otherwise full cluster.

We can give it another bump, but it isn't essential that no revision pruner ever gets OOMKilled, because the next one will clean up any work left by the previous one that was killed. And we don't want to request excessive memory and push out other components that are actually critical. This may just be a case where we're chasing the wrong flake.

Another follow-up: this PR (and the linked bug), https://github.com/openshift/library-go/pull/707, indicates that CPU limits are also required to guarantee a pod. Because that merged, I'm going to set this back to ON_QA so we can see whether it fixed our problem. If not, we can try one more resource limit bump.

Sending this bug to Node per conversation with David and Eric (cc'd). It seems to be related to the issue in https://bugzilla.redhat.com/show_bug.cgi?id=1800609, where pods are getting killed for a reason that may not accurately be an OOMKill.
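For context on why library-go PR 707 mattered: Kubernetes assigns a pod the Guaranteed QoS class only when every container sets requests equal to limits for both cpu and memory; with the cpu limit missing, the pruner pods were Burstable and therefore earlier candidates for eviction under node memory pressure. A hypothetical spec sketch (pod name, image, and values are illustrative, not taken from the bug):

```yaml
# Illustrative pod spec, not the actual revision-pruner manifest.
apiVersion: v1
kind: Pod
metadata:
  name: revision-pruner-example
  namespace: openshift-kube-scheduler
spec:
  containers:
  - name: pruner
    image: example/pruner:latest     # hypothetical image
    resources:
      requests:
        cpu: 10m
        memory: 100Mi
      limits:
        cpu: 10m                     # cpu limit required for Guaranteed QoS
        memory: 100Mi                # must equal the request
```

Omitting either limit (or making it differ from the request) silently drops the pod to Burstable QoS, which is consistent with the behavior described in this bug.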
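The memory figures quoted above can be sanity-checked with a little arithmetic. A minimal sketch (assuming the pruner's request and limit are both 100Mi and the metric value is in bytes, as `container_memory_usage_bytes` reports):

```python
# Sanity-check of the numbers in the comment above.
# Assumptions: request == limit == 100Mi; metric is reported in bytes.
observed_bytes = 4_898_816            # sum(container_memory_usage_bytes) for the pruner pod
limit_bytes = 100 * 1024 * 1024       # 100Mi request/limit

observed_mib = observed_bytes / (1024 * 1024)
utilization = observed_bytes / limit_bytes

print(f"{observed_mib:.2f} MiB used of 100 MiB limit ({utilization:.1%})")
# → 4.67 MiB used of 100 MiB limit (4.7%)
```

At under 5% of its limit, the pod was nowhere near its own memory cap, which supports the suspicion that the kills were node-level (a full node evicting pods) rather than the container exceeding its limit.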