1700051 – containers in prom pod are throttling triggering alert on console

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Frederic Branczyk
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-04-15 17:04 UTC by Seth Jennings
Modified:	2019-06-04 10:47 UTC
CC List:	5 users
Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-06-04 10:47:37 UTC
Target Upstream Version:
Embargoed:

Attachments
throttle.png (44.33 KB, image/png) 2019-04-15 17:04 UTC, Seth Jennings	no flags
prom-throttle-graph.png (70.14 KB, image/png) 2019-04-15 17:04 UTC, Seth Jennings	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:0758	0	None	None	None	2019-06-04 10:47:43 UTC

Description Seth Jennings 2019-04-15 17:04:09 UTC

Created attachment 1555285
throttle.png

We need to remove the limits blocks from the following containers in the prom pod:

    name: prometheus-config-reloader
    resources:
      limits:
        cpu: 50m
        memory: 50Mi
      requests:
        cpu: 50m
        memory: 50Mi

    name: rules-configmap-reloader
    resources:
      limits:
        cpu: 25m
        memory: 10Mi
      requests:
        cpu: 25m
        memory: 10Mi

As it is, there is needless CPU throttling and likely needless OOM killing (theorized, not observed), especially for the container with the 10Mi memory limit.
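
For illustration, here is a sketch of how the two container specs might look once the limits blocks are dropped. It assumes the requests stay at their current values, which this report does not confirm:

```yaml
# Sketch only: limits removed, requests assumed unchanged.
containers:
- name: prometheus-config-reloader
  resources:
    requests:
      cpu: 50m
      memory: 50Mi
- name: rules-configmap-reloader
  resources:
    requests:
      cpu: 25m
      memory: 10Mi
```

With the default 100ms CFS period, a 25m cpu limit works out to roughly 2.5ms of CPU time per period, so even a short burst of work gets throttled. With no cpu limit set, no CFS quota is applied to the container at all.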

Comment 1 Seth Jennings 2019-04-15 17:04:43 UTC

Created attachment 1555286
prom-throttle-graph.png

Comment 2 Frederic Branczyk 2019-04-16 11:51:46 UTC

Unfortunately these are hardcoded into the prometheus-operator as of right now. We're going to go with the following strategy:

* Patch the fork we ship in OpenShift to completely remove the limits.
* Upstream has since added flags to configure these requests/limits. In addition, we will add support for a value of 0 that removes the limit/request entirely, and this is what we'll switch to in a future OpenShift version.
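
As a rough sketch of the second point, the operator's container args could carry the reloader resource flags set to 0. The flag names below (--config-reloader-cpu, --config-reloader-memory) are recalled from upstream prometheus-operator releases of that period and are an assumption here, not taken from this report; check them against the operator version actually in use:

```yaml
# Hypothetical fragment of a prometheus-operator Deployment spec.
# Flag names are an assumption; a value of 0 is meant to remove
# the corresponding request/limit from the reloader containers.
containers:
- name: prometheus-operator
  args:
  - --config-reloader-cpu=0
  - --config-reloader-memory=0
```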

Comment 3 Frederic Branczyk 2019-04-16 12:43:39 UTC

The PR patch for our fork has been opened: https://github.com/openshift/prometheus-operator/pull/24

Comment 4 Frederic Branczyk 2019-04-16 14:24:37 UTC

And the upstream PR that allows these to be disabled through configuration: https://github.com/coreos/prometheus-operator/pull/2560

Comment 5 Frederic Branczyk 2019-04-17 16:02:53 UTC

The change that fixes the immediate issue has been merged.

Comment 7 Junqi Zhao 2019-04-18 02:52:58 UTC

No OCP payload that packages the fix is available yet, so testing is postponed until one is available.

Comment 8 Junqi Zhao 2019-04-19 09:11:24 UTC

Resource limits for the alertmanager-main/prometheus-k8s statefulsets are removed.

Payload: 4.0.0-0.nightly-2019-04-18-190537

Comment 10 errata-xmlrpc 2019-06-04 10:47:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
