Bug 1700051

Summary:

containers in prom pod are throttling triggering alert on console

Product:

OpenShift Container Platform

Reporter:

Seth Jennings <sjenning>

Component:

Monitoring

Assignee:

Frederic Branczyk <fbranczy>

Status:

CLOSED ERRATA

QA Contact:

Junqi Zhao <juzhao>

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

4.1.0

CC:

anpicker, erooth, mloibl, pkrupa, surbania

Target Milestone:

---

Target Release:

4.1.0

Hardware:

Unspecified

OS:

Unspecified

Last Closed:

2019-06-04 10:47:37 UTC

Type:

Bug

Attachments:

Description	Flags
throttle.png	none
prom-throttle-graph.png	none

Description Seth Jennings 2019-04-15 17:04:09 UTC

Created attachment 1555285
throttle.png

We need to remove the limits blocks from the following containers in the Prometheus pod:

    name: prometheus-config-reloader
    resources:
      limits:
        cpu: 50m
        memory: 50Mi
      requests:
        cpu: 50m
        memory: 50Mi

    name: rules-configmap-reloader
    resources:
      limits:
        cpu: 25m
        memory: 10Mi
      requests:
        cpu: 25m
        memory: 10Mi

As it stands, these limits cause needless CPU throttling and likely needless OOM kills (theorized, not observed), especially for the container with the 10Mi memory limit.
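To illustrate why limits this small throttle so readily, here is a rough sketch of the arithmetic. It assumes the default CFS period of 100 ms, which can be changed on the kubelet; the helper name `cfs_quota_us` is made up for this example and is not a Kubernetes API.

```python
# Assumption: default CFS period of 100 ms (100,000 microseconds).
CFS_PERIOD_US = 100_000


def cfs_quota_us(cpu_limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """Return the CPU time (in microseconds) a container may use per CFS period."""
    return cpu_limit_millicores * period_us // 1000


# CPU limits taken from the two containers listed above.
for name, limit in [("prometheus-config-reloader", 50),
                    ("rules-configmap-reloader", 25)]:
    quota = cfs_quota_us(limit)
    print(f"{name}: {quota} us of CPU per {CFS_PERIOD_US} us period")
```

Under that assumption, a 25m limit allows about 2.5 ms of CPU time in each 100 ms window, so even a short burst of work can exhaust the quota and be throttled until the next period.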

Comment 1 Seth Jennings 2019-04-15 17:04:43 UTC

Created attachment 1555286
prom-throttle-graph.png

Comment 2 Frederic Branczyk 2019-04-16 11:51:46 UTC

Unfortunately, these are currently hardcoded into the prometheus-operator. We're going with the following strategy:

* Patch the fork we ship in OpenShift to completely remove the limits.
* Upstream has since added flags to configure these requests/limits. In addition, we will add support for a value of 0 that removes the limit/request entirely, and this is what we'll switch to in a future OpenShift version.
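
The "0 removes the limit" idea in the second bullet could look roughly like the sketch below. This is not the prometheus-operator's implementation (which is written in Go); the function name `build_limits` and the convention of passing values as strings are assumptions made for illustration.

```python
def build_limits(cpu: str, memory: str) -> dict:
    """Build a container resources mapping, omitting any limit configured as "0".

    Hypothetical helper: a value of "0" means "do not set this limit at all".
    """
    limits = {}
    if cpu != "0":
        limits["cpu"] = cpu
    if memory != "0":
        limits["memory"] = memory
    # With no limits left, return an empty mapping so no limits block is emitted.
    return {"limits": limits} if limits else {}


print(build_limits("25m", "10Mi"))  # both limits kept
print(build_limits("0", "10Mi"))    # CPU limit dropped, memory limit kept
print(build_limits("0", "0"))       # limits block omitted entirely
```

Treating 0 as "unset" rather than as a literal zero limit keeps a single flag per resource while still allowing the limit to be switched off.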

Comment 3 Frederic Branczyk 2019-04-16 12:43:39 UTC

The PR patch for our fork has been opened: https://github.com/openshift/prometheus-operator/pull/24

Comment 4 Frederic Branczyk 2019-04-16 14:24:37 UTC

And the upstream PR that allows these to be disabled through configuration: https://github.com/coreos/prometheus-operator/pull/2560

Comment 5 Frederic Branczyk 2019-04-17 16:02:53 UTC

The change that fixes this for the immediate term has been merged.

Comment 7 Junqi Zhao 2019-04-18 02:52:58 UTC

No OCP payload containing the fix is available to test yet, so testing is postponed until one is available.

Comment 8 Junqi Zhao 2019-04-19 09:11:24 UTC

Resource limits for the alertmanager-main and prometheus-k8s statefulsets have been removed.

Payload: 4.0.0-0.nightly-2019-04-18-190537
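
A check like this one can also be scripted. The sketch below assumes the statefulset has already been obtained as parsed JSON; the function name `containers_with_limits` and both sample manifests are hypothetical and heavily trimmed.

```python
from typing import List


def containers_with_limits(workload: dict) -> List[str]:
    """Return the names of containers in a pod template that still declare limits."""
    containers = workload["spec"]["template"]["spec"]["containers"]
    return [c["name"] for c in containers
            if c.get("resources", {}).get("limits")]


def make_statefulset(containers: list) -> dict:
    """Wrap a container list in the minimal statefulset structure (illustrative only)."""
    return {"spec": {"template": {"spec": {"containers": containers}}}}


# Hypothetical "before" manifest: both reloader containers carry limits.
before = make_statefulset([
    {"name": "prometheus-config-reloader",
     "resources": {"limits": {"cpu": "50m", "memory": "50Mi"}}},
    {"name": "rules-configmap-reloader",
     "resources": {"limits": {"cpu": "25m", "memory": "10Mi"}}},
])

# Hypothetical "after" manifest: the limits blocks are gone.
after = make_statefulset([
    {"name": "prometheus-config-reloader", "resources": {}},
    {"name": "rules-configmap-reloader", "resources": {}},
])

print(containers_with_limits(before))
print(containers_with_limits(after))
```

An empty list for the fixed payload would match the verification result reported here.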

Comment 10 errata-xmlrpc 2019-06-04 10:47:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758