Bug 1765315 - High CPU usage in prometheus, thanos-querier while single user sits on dashboard
Summary: High CPU usage in prometheus, thanos-querier while single user sits on dashboard
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Paul Gier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-10-24 19:34 UTC by Clayton Coleman
Modified: 2020-01-23 11:09 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:09:08 UTC
Target Upstream Version:
Embargoed:


Attachments
so far i cannot unfortunately reproduce this. i get a figure as attached. (113.81 KB, image/png)
2019-10-25 07:32 UTC, Sergiusz Urbaniak
Query/CPU Metrics from Prometheus dump. (255.07 KB, image/png)
2019-10-25 10:53 UTC, Bartek Plotka
Query/CPU Metrics from 4.3.0-0.ci-2019-10-25-031503 (319.75 KB, image/png)
2019-10-25 11:12 UTC, Bartek Plotka


Links
GitHub openshift console pull 3142 (closed): Bug 1765315: update container_name and pod_name metric labels (last updated 2021-02-10 02:36:45 UTC)
Red Hat Product Errata RHBA-2020:0062 (last updated 2020-01-23 11:09:35 UTC)

Description Clayton Coleman 2019-10-24 19:34:59 UTC
Prometheus was using 2x2.5 cores and Thanos Querier was using at least 2x1 core on a cluster running e2e, with a single user logged in to the console sitting on the dashboard page (which runs certain queries relatively frequently). The kube-apiserver, which was previously our highest CPU consumer in 4.2, was only using 3x0.75 core. So during e2e we are spending roughly 5x the CPU to gather and query metrics for one person as we spend on the core control plane, which was previously by far our biggest consumer.

Needs investigation; we expect CPU use of the monitoring stack during an e2e run to be below 2 cores total (ideally below 1 core).
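
As a rough way to check that expectation directly in Prometheus, something like the following should work (a sketch only: the openshift-monitoring namespace and the pod-level selector container="", pod!="" are assumptions based on the discussion later in this bug, not values taken from any dump):

  # total CPU (cores) spent by the monitoring stack
  sum(rate(container_cpu_user_seconds_total{namespace="openshift-monitoring", container="", pod!=""}[5m]))

  # per-pod breakdown
  sum(rate(container_cpu_user_seconds_total{namespace="openshift-monitoring", container="", pod!=""}[5m])) by (pod)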

Recreate:

1. Launch 4.3 cluster
2. Browse to console dashboard as admin
3. Observe CPU use of top 5 pods (reported by prometheus!) in lower right

https://www.dropbox.com/s/8iw6wca8nh6o1ch/Screenshot%202019-10-23%2014.28.37.png?dl=0

Comment 1 Sergiusz Urbaniak 2019-10-25 07:32:31 UTC
Created attachment 1629054 [details]
So far I unfortunately cannot reproduce this; I get a figure as attached.

Comment 2 Sergiusz Urbaniak 2019-10-25 07:34:29 UTC
@bartek: do you mind looking into this? Here is a link to the Prometheus dump for a recent e2e run: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/522/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1517/artifacts/e2e-aws/metrics/

Comment 3 Bartek Plotka 2019-10-25 10:53:21 UTC
Created attachment 1629108 [details]
Query/CPU Metrics from Prometheus dump.

Comment 4 Bartek Plotka 2019-10-25 10:56:12 UTC
Hi!

So I tried the provided dump, but I think we might be missing some data (see the attachment in the comment above). The dump covers only 30m, there is no thanos-querier in it, and no query/CPU load was hitting Prometheus at that moment. Maybe you meant to copy some blocks as well? (Only WAL files are present.)

Comment 5 Bartek Plotka 2019-10-25 11:12:21 UTC
Created attachment 1629122 [details]
Query/CPU Metrics from 4.3.0-0.ci-2019-10-25-031503

We tried to reproduce this in a new ephemeral cluster, and indeed the dashboards generate a bit of traffic even for a single user (~15 requests per second from Querier to Prometheus); however, that really does not use much CPU.

I am not familiar with this console top-CPU panel; I wonder what queries we run there. Or, again, maybe the dump was not taken at the moment of the high traffic.

I looked through the configuration and we don't specify query limits, so the defaults apply:
For Prometheus: storage.remote.read-concurrent-limit=10  
For Querier: query.max-concurrent = 20

So there is definitely a limit on how much CPU can be spent on queries overall; we could tweak these settings if needed.
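
If we want to confirm whether those concurrency limits are actually being hit, one way to check is via Prometheus's own query-engine gauges (metric names taken from upstream Prometheus, not from this dump; the Thanos Querier side would need to be checked with its own metrics):

  # fraction of the Prometheus query-concurrency budget currently in use
  prometheus_engine_queries / prometheus_engine_queries_concurrent_max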

Just from our Thanos project standpoint:
We can definitely explore adding new limits to Querier for further rate limiting at some point, plus either deploying a query cache or building a simple cache inside Querier, but that is a tradeoff which costs some RAM.

Anyway, it would be good to confirm that this bug is actually caused by high query traffic. The kind of queries also matters: e.g. fetching thousands of series in a single query is doable, but it would also result in high Querier memory consumption.

NOTE: These are the ad-hoc queries I ran:
<querier-url or Prometheus UI URL>/graph?g0.range_input=6h&g0.expr=sum(rate(prometheus_http_requests_total%7B%7D%5B1m%5D))%20by(handler%2C%20code)&g0.tab=0&g1.range_input=6h&g1.expr=%20sum(rate(container_cpu_user_seconds_total%7Bpod%3D~%22prometheus.*%22%7D%5B5m%5D))%20by%20(instance%2C%20pod%2C%20container)&g1.tab=0&g2.range_input=6h&g2.expr=%20sum(rate(container_cpu_user_seconds_total%7Bpod%3D~%22thanos.*%22%7D%5B5m%5D))%20by%20(instance%2C%20pod%2C%20container)&g2.tab=0
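
Decoded, the three expressions in that URL are:

  # query traffic hitting Prometheus, by handler and response code
  sum(rate(prometheus_http_requests_total{}[1m])) by (handler, code)

  # CPU used by the prometheus-* pods
  sum(rate(container_cpu_user_seconds_total{pod=~"prometheus.*"}[5m])) by (instance, pod, container)

  # CPU used by the thanos-* pods
  sum(rate(container_cpu_user_seconds_total{pod=~"thanos.*"}[5m])) by (instance, pod, container)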

Comment 6 Bartek Plotka 2019-10-25 11:22:19 UTC
Additionally, I can see we disabled compaction on the local Prometheus instances. Is that intentional? It can cause slowdowns and issues for longer queries against both Prometheus and Querier, especially since our retention here is 15d.

Comment 7 Bartek Plotka 2019-10-28 10:24:04 UTC
Created this to track compaction option change: https://bugzilla.redhat.com/show_bug.cgi?id=1766111

Comment 8 Bartek Plotka 2019-10-29 15:06:20 UTC
After digging more, it looks like the query we use in the console, `topk(20, sort_desc(sum(rate(container_cpu_user_seconds_total{container_name="", pod!=""}[5m])) by (pod,namespace)))`, is not showing correct results. At first glance it looks like it doubles the results for each pod. This is because of the cadvisor label change (`container_name` -> `container`). We should probably update it to something like:

`topk(20, sort_desc(sum(rate(container_cpu_user_seconds_total{container="",container_name="", pod!=""}[5m])) by (pod,namespace)))`, or just remove `container_name` entirely.
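
For reference, dropping the stale label entirely (the second option) would give something like the following; this is just the suggestion written out, not necessarily the exact expression the console PR ends up using:

  topk(20, sort_desc(sum(rate(container_cpu_user_seconds_total{container="", pod!=""}[5m])) by (pod, namespace)))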

Comment 13 errata-xmlrpc 2020-01-23 11:09:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

