Bug 1765315 - High CPU usage in prometheus, thanos-querier while single user sits on dashboard
Summary: High CPU usage in prometheus, thanos-querier while single user sits on dashboard
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Paul Gier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-10-24 19:34 UTC by Clayton Coleman
Modified: 2020-01-23 11:09 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:09:08 UTC
Target Upstream Version:
Embargoed:


Attachments
so far i cannot unfortunately reproduce this. i get a figure as attached. (113.81 KB, image/png)
2019-10-25 07:32 UTC, Sergiusz Urbaniak
Query/CPU Metrics from Prometheus dump. (255.07 KB, image/png)
2019-10-25 10:53 UTC, Bartek Plotka
Query/CPU Metrics from 4.3.0-0.ci-2019-10-25-031503 (319.75 KB, image/png)
2019-10-25 11:12 UTC, Bartek Plotka


Links
GitHub openshift console pull 3142 (closed): Bug 1765315: update container_name and pod_name metric labels (last updated 2021-02-10 02:36:45 UTC)
Red Hat Product Errata RHBA-2020:0062 (last updated 2020-01-23 11:09:35 UTC)

Description Clayton Coleman 2019-10-24 19:34:59 UTC
Prometheus was using 2x2.5 cores and Thanos Querier was using at least 2x1 core on a cluster running e2e, with a single user logged in to the console sitting on the dashboard page (which runs certain queries relatively frequently). The kube-apiserver, which was previously our highest CPU consumer in 4.2, was only using 3x0.75 core. So during e2e we are spending roughly 5x the CPU to gather and query metrics for one person as we spend on the core control plane, which was previously by far our biggest consumer.

Needs investigation; we expect CPU use of the monitoring stack during an e2e run to be below 2 cores total (ideally below 1 core).
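
As a rough way to check that expectation directly in Prometheus, something like the following should work (a sketch only: the openshift-monitoring namespace and the pod-level selector container="", pod!="" are assumptions based on the discussion later in this bug, not values taken from any dump):

  # total CPU (cores) spent by the monitoring stack
  sum(rate(container_cpu_user_seconds_total{namespace="openshift-monitoring", container="", pod!=""}[5m]))

  # per-pod breakdown
  sum(rate(container_cpu_user_seconds_total{namespace="openshift-monitoring", container="", pod!=""}[5m])) by (pod)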

Recreate:

1. Launch 4.3 cluster
2. Browse to console dashboard as admin
3. Observe CPU use of top 5 pods (reported by prometheus!) in lower right

https://www.dropbox.com/s/8iw6wca8nh6o1ch/Screenshot%202019-10-23%2014.28.37.png?dl=0

Comment 1 Sergiusz Urbaniak 2019-10-25 07:32:31 UTC
Created attachment 1629054 [details]
So far I unfortunately cannot reproduce this; I get a figure as attached.

Comment 2 Sergiusz Urbaniak 2019-10-25 07:34:29 UTC
@bartek: do you mind looking into this? Here is a link to the Prometheus dump for a recent e2e run: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/522/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1517/artifacts/e2e-aws/metrics/

Comment 3 Bartek Plotka 2019-10-25 10:53:21 UTC
Created attachment 1629108 [details]
Query/CPU Metrics from Prometheus dump.

Comment 4 Bartek Plotka 2019-10-25 10:56:12 UTC
Hi!

So I tried the provided dump, but I think we might be missing some data (see the attachment in the comment above). The dump covers only 30m, there is no thanos-querier in it, and no query/CPU load was hitting Prometheus at that moment. Maybe you meant to copy some blocks as well? (Only WAL files are present.)

Comment 5 Bartek Plotka 2019-10-25 11:12:21 UTC
Created attachment 1629122 [details]
Query/CPU Metrics from 4.3.0-0.ci-2019-10-25-031503

We tried to reproduce this in a new ephemeral cluster, and indeed the dashboards generate a bit of traffic even for a single user (~15 requests per second from Querier to Prometheus); however, that really does not use much CPU.

I am not familiar with this console top-CPU panel; I wonder what queries we run there. Or, again, maybe the dump was not taken at the moment of the high traffic.

I looked through the configuration and we don't specify query limits, so the defaults apply:
For Prometheus: storage.remote.read-concurrent-limit=10  
For Querier: query.max-concurrent = 20

So there is definitely a limit on how much CPU can be spent on queries overall; we could tweak these settings if needed.
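
If we want to confirm whether those concurrency limits are actually being hit, one way to check is via Prometheus's own query-engine gauges (metric names taken from upstream Prometheus, not from this dump; the Thanos Querier side would need to be checked with its own metrics):

  # fraction of the Prometheus query-concurrency budget currently in use
  prometheus_engine_queries / prometheus_engine_queries_concurrent_max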

Just from our Thanos project standpoint:
We can definitely explore adding new limits to Querier for further rate limiting at some point, plus either deploying a query cache or building a simple cache inside Querier, but that is a tradeoff which costs some RAM.

Anyway, it would be good to confirm that this bug is actually caused by high query traffic. The kind of queries also matters: e.g. fetching thousands of series in a single query is doable, but it would also result in high Querier memory consumption.

NOTE: These are the ad-hoc queries I ran:
<querier-url or Prometheus UI URL>/graph?g0.range_input=6h&g0.expr=sum(rate(prometheus_http_requests_total%7B%7D%5B1m%5D))%20by(handler%2C%20code)&g0.tab=0&g1.range_input=6h&g1.expr=%20sum(rate(container_cpu_user_seconds_total%7Bpod%3D~%22prometheus.*%22%7D%5B5m%5D))%20by%20(instance%2C%20pod%2C%20container)&g1.tab=0&g2.range_input=6h&g2.expr=%20sum(rate(container_cpu_user_seconds_total%7Bpod%3D~%22thanos.*%22%7D%5B5m%5D))%20by%20(instance%2C%20pod%2C%20container)&g2.tab=0
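
Decoded, the three expressions in that URL are:

  # query traffic hitting Prometheus, by handler and response code
  sum(rate(prometheus_http_requests_total{}[1m])) by (handler, code)

  # CPU used by the prometheus-* pods
  sum(rate(container_cpu_user_seconds_total{pod=~"prometheus.*"}[5m])) by (instance, pod, container)

  # CPU used by the thanos-* pods
  sum(rate(container_cpu_user_seconds_total{pod=~"thanos.*"}[5m])) by (instance, pod, container)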

Comment 6 Bartek Plotka 2019-10-25 11:22:19 UTC
Additionally, I can see we disabled compaction on the local Prometheus instances. Is that intentional? It can cause slowdowns and issues for longer queries against both Prometheus and Querier, especially since our retention here is 15d.

Comment 7 Bartek Plotka 2019-10-28 10:24:04 UTC
Created this to track compaction option change: https://bugzilla.redhat.com/show_bug.cgi?id=1766111

Comment 8 Bartek Plotka 2019-10-29 15:06:20 UTC
After digging more, it looks like the query we use in the console, `topk(20, sort_desc(sum(rate(container_cpu_user_seconds_total{container_name="", pod!=""}[5m])) by (pod,namespace)))`, is not showing correct results. At first glance it looks like it doubles the results for each pod. This is because of the cadvisor label change (`container_name` -> `container`). We should probably update it to something like:

`topk(20, sort_desc(sum(rate(container_cpu_user_seconds_total{container="",container_name="", pod!=""}[5m])) by (pod,namespace)))`, or just remove `container_name` entirely.
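
For reference, dropping the stale label entirely (the second option) would give something like the following; this is just the suggestion written out, not necessarily the exact expression the console PR ends up using:

  topk(20, sort_desc(sum(rate(container_cpu_user_seconds_total{container="", pod!=""}[5m])) by (pod, namespace)))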

Comment 13 errata-xmlrpc 2020-01-23 11:09:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

