Bug 1700051 - containers in prom pod are throttling triggering alert on console
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-04-15 17:04 UTC by Seth Jennings
Modified: 2019-06-04 10:47 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:47:37 UTC
Target Upstream Version:


Attachments
throttle.png (44.33 KB, image/png)
2019-04-15 17:04 UTC, Seth Jennings
prom-throttle-graph.png (70.14 KB, image/png)
2019-04-15 17:04 UTC, Seth Jennings


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 None None None 2019-06-04 10:47:43 UTC

Description Seth Jennings 2019-04-15 17:04:09 UTC
Created attachment 1555285 [details]
throttle.png

We need to remove the limits blocks from the following containers in the Prometheus pod:

    name: prometheus-config-reloader
    resources:
      limits:
        cpu: 50m
        memory: 50Mi
      requests:
        cpu: 50m
        memory: 50Mi

    name: rules-configmap-reloader
    resources:
      limits:
        cpu: 25m
        memory: 10Mi
      requests:
        cpu: 25m
        memory: 10Mi

As it is, there is needless CPU throttling and likely needless OOM killing (theorized, not observed), especially for the container with the 10Mi memory limit.
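
For a sense of scale, a CPU limit is enforced as a CFS quota per scheduling period. The following is a rough sketch of that arithmetic, assuming the default 100 ms CFS period (the kubelet default unless reconfigured); the function name is illustrative and not taken from any library.

```python
# Assumption: default CFS period of 100 ms (100000 us); clusters may override it.
CFS_PERIOD_US = 100_000


def cfs_quota_us(cpu_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """Return the CPU time (in microseconds) a container may use per period."""
    return cpu_millicores * period_us // 1000


# CPU limits quoted in the description above.
for name, limit in [
    ("prometheus-config-reloader", 50),
    ("rules-configmap-reloader", 25),
]:
    quota = cfs_quota_us(limit)
    print(f"{name}: {limit}m -> {quota} us of CPU per {CFS_PERIOD_US} us period")
```

Under these assumptions the 25m container is paused once it has used 2.5 ms of CPU within a 100 ms window, so even a brief burst of work can register as throttling.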

Comment 1 Seth Jennings 2019-04-15 17:04:43 UTC
Created attachment 1555286 [details]
prom-throttle-graph.png

Comment 2 Frederic Branczyk 2019-04-16 11:51:46 UTC
Unfortunately these are hardcoded into the prometheus-operator as of right now. We're going to go with the following strategy:

* Patch the fork we ship in OpenShift to completely remove the limits.
* Upstream has since added flags to configure these requests/limits. In addition, we will make a value of 0 remove the limit/request entirely, and this is what we'll switch to in a future OpenShift version.
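
The "0 removes the limit/request" behaviour in the second bullet could look roughly like the sketch below. This is only an illustration under stated assumptions: `build_resources` and the use of the string "0" as the sentinel are hypothetical, and this is not the prometheus-operator implementation.

```python
def build_resources(cpu: str, memory: str) -> dict:
    """Build a container resources block from configured values.

    Hypothetical helper: a value of "0" means "do not set this resource",
    so it is omitted from both limits and requests.
    """
    values = {
        name: value
        for name, value in (("cpu", cpu), ("memory", memory))
        if value != "0"
    }
    if not values:
        # Nothing configured: emit no limits/requests at all.
        return {}
    return {"limits": dict(values), "requests": dict(values)}


print(build_resources("25m", "10Mi"))  # both resources set
print(build_resources("0", "10Mi"))    # CPU omitted, memory kept
print(build_resources("0", "0"))       # empty resources block
```

With both values set to "0" the container ends up with no limits or requests, which matches the outcome this bug asks for.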

Comment 3 Frederic Branczyk 2019-04-16 12:43:39 UTC
The PR that patches our fork has been opened: https://github.com/openshift/prometheus-operator/pull/24

Comment 4 Frederic Branczyk 2019-04-16 14:24:37 UTC
And the upstream PR that allows these to be disabled through configuration: https://github.com/coreos/prometheus-operator/pull/2560

Comment 5 Frederic Branczyk 2019-04-17 16:02:53 UTC
The change that fixes this in the short term has been merged.

Comment 7 Junqi Zhao 2019-04-18 02:52:58 UTC
There is no OCP payload available yet that packages the fix, so testing is postponed until one is available.

Comment 8 Junqi Zhao 2019-04-19 09:11:24 UTC
Resource limits for the alertmanager-main and prometheus-k8s statefulsets are removed.

Payload: 4.0.0-0.nightly-2019-04-18-190537

Comment 10 errata-xmlrpc 2019-06-04 10:47:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

