Bug 1700051

Summary: containers in prom pod are throttling triggering alert on console
Product: OpenShift Container Platform
Reporter: Seth Jennings <sjenning>
Component: Monitoring
Assignee: Frederic Branczyk <fbranczy>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.1.0
CC: anpicker, erooth, mloibl, pkrupa, surbania
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-04 10:47:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
throttle.png none
prom-throttle-graph.png none

Description Seth Jennings 2019-04-15 17:04:09 UTC
Created attachment 1555285
throttle.png

We need to remove the resource limits blocks from the following containers in the Prometheus pod:

    name: prometheus-config-reloader
    resources:
      limits:
        cpu: 50m
        memory: 50Mi
      requests:
        cpu: 50m
        memory: 50Mi

    name: rules-configmap-reloader
    resources:
      limits:
        cpu: 25m
        memory: 10Mi
      requests:
        cpu: 25m
        memory: 10Mi

As it stands, these limits cause needless CPU throttling and likely needless OOM kills (theorized, not observed), especially for the container with the 10Mi memory limit.
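
To put the throttling in numbers, here is a rough sketch of how a CPU limit in millicores maps to a CFS quota. It assumes the default 100 ms CFS period; the helper name `cpu_quota_us` is made up for illustration and is not part of any Kubernetes API.

```python
# Assumption: the default CFS period of 100 ms (100,000 microseconds).
CFS_PERIOD_US = 100_000


def cpu_quota_us(millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """Return the CPU time (in microseconds) allowed per CFS period."""
    # 1000 millicores correspond to one full period of CPU time.
    return millicores * period_us // 1000


# CPU limits taken from the two containers listed in this report.
for name, limit in [
    ("prometheus-config-reloader", 50),
    ("rules-configmap-reloader", 25),
]:
    quota_ms = cpu_quota_us(limit) / 1000
    period_ms = CFS_PERIOD_US // 1000
    print(f"{name}: {limit}m -> {quota_ms:.1f} ms of CPU per {period_ms} ms period")
```

With a 25m limit the container gets about 2.5 ms of CPU time in every 100 ms window. Any short burst beyond that is throttled until the next period, even when the node is otherwise idle.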

Comment 1 Seth Jennings 2019-04-15 17:04:43 UTC
Created attachment 1555286
prom-throttle-graph.png

Comment 2 Frederic Branczyk 2019-04-16 11:51:46 UTC
Unfortunately, these are currently hardcoded in the prometheus-operator. We're going to go with the following strategy:

* Patch the fork we ship in OpenShift to remove the limits completely.
* Upstream has since added flags to configure these requests/limits. In addition, we will add that a value of 0 removes the limit/request entirely, and this is what we'll switch to in a future OpenShift version.
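
The two steps above can be sketched roughly as follows. This is only an illustration in Python, not the operator's actual Go code: `strip_limits` and `resource_block` are hypothetical helpers, and keeping the requests while dropping the limits in the first step is an assumption.

```python
import copy
from typing import Dict, Optional


def strip_limits(container: dict) -> dict:
    """Step 1 (fork patch, sketched): return a copy without the limits block.

    Assumption: requests are left untouched.
    """
    patched = copy.deepcopy(container)
    patched.get("resources", {}).pop("limits", None)
    return patched


def resource_block(cpu: str, memory: str) -> Optional[Dict[str, Dict[str, str]]]:
    """Step 2 (configurable values, sketched): "0" removes that resource.

    Returns None when both values are "0", meaning no resources block at all.
    """
    values = {k: v for k, v in (("cpu", cpu), ("memory", memory)) if v != "0"}
    if not values:
        return None
    # Mirror the layout in this report, where requests equal limits.
    return {"limits": dict(values), "requests": dict(values)}


# Container spec copied from the bug description.
reloader = {
    "name": "rules-configmap-reloader",
    "resources": {
        "limits": {"cpu": "25m", "memory": "10Mi"},
        "requests": {"cpu": "25m", "memory": "10Mi"},
    },
}

print(strip_limits(reloader))
print(resource_block("0", "0"))
```

Treating 0 as "unset" rather than as a literal zero limit avoids the need for a separate on/off switch per resource.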

Comment 3 Frederic Branczyk 2019-04-16 12:43:39 UTC
The PR patch for our fork has been opened: https://github.com/openshift/prometheus-operator/pull/24

Comment 4 Frederic Branczyk 2019-04-16 14:24:37 UTC
And the upstream PR that allows these to be disabled via configuration: https://github.com/coreos/prometheus-operator/pull/2560

Comment 5 Frederic Branczyk 2019-04-17 16:02:53 UTC
The change that fixes this for the immediate situation has been merged.

Comment 7 Junqi Zhao 2019-04-18 02:52:58 UTC
No OCP payload that packages the fix is available yet, so testing is postponed until one is.

Comment 8 Junqi Zhao 2019-04-19 09:11:24 UTC
Resource limits for the alertmanager-main/prometheus-k8s statefulsets have been removed.

Payload: 4.0.0-0.nightly-2019-04-18-190537
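
A check like the one above could be scripted along these lines. This is a minimal sketch, not the procedure QA actually used: `containers_with_limits` is a hypothetical helper, the input is assumed to be a statefulset already parsed from JSON (for example from `oc get statefulset <name> -o json`), and the sample object is made up for illustration.

```python
from typing import List


def containers_with_limits(statefulset: dict) -> List[str]:
    """Return the names of containers that still define resource limits."""
    pod_spec = statefulset.get("spec", {}).get("template", {}).get("spec", {})
    offenders = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        if resources.get("limits"):
            offenders.append(container.get("name", "<unnamed>"))
    return offenders


# Hypothetical, trimmed-down statefulset used only to exercise the helper.
sample = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "rules-configmap-reloader",
                        "resources": {"requests": {"cpu": "25m", "memory": "10Mi"}},
                    },
                    {"name": "prometheus-config-reloader", "resources": {}},
                ]
            }
        }
    }
}

remaining = containers_with_limits(sample)
print("containers with limits:", remaining or "none")
```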

Comment 10 errata-xmlrpc 2019-06-04 10:47:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758