Bug 1765315
| Field | Value |
|---|---|
| Summary | High CPU usage in prometheus, thanos-querier while single user sits on dashboard |
| Product | OpenShift Container Platform |
| Component | Monitoring |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Version | 4.3.0 |
| Target Milestone | --- |
| Target Release | 4.3.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | |
| Reporter | Clayton Coleman <ccoleman> |
| Assignee | Paul Gier <pgier> |
| QA Contact | Junqi Zhao <juzhao> |
| Docs Contact | |
| CC | alegrand, anpicker, bplotka, erooth, juzhao, kakkoyun, lcosic, mloibl, pkrupa, surbania |
| Doc Type | If docs needed, set a value |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2020-01-23 11:09:08 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Attachments | |
Description
Clayton Coleman
2019-10-24 19:34:59 UTC
Created attachment 1629054 [details]
So far I unfortunately cannot reproduce this; I get a figure as attached.
@bartek: do you mind looking into this? Here is a link to the Prometheus dump for a recent e2e run: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/522/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1517/artifacts/e2e-aws/metrics/

Created attachment 1629108 [details]
Query/CPU Metrics from Prometheus dump.
Hi! So I tried the provided dump, but I think we might be missing some data (attached in the comment above). The dump covers only 30m, and there was no thanos-querier and no CPU or query traffic hitting Prometheus at that moment. Maybe you meant to copy some blocks as well? (Only WAL files are present.)

Created attachment 1629122 [details]
Query/CPU Metrics from 4.3.0-0.ci-2019-10-25-031503
We tried to reproduce it in a new ephemeral cluster, and indeed the dashboards generate a bit of traffic even for a single user (~15 requests per second from Querier to Prometheus); however, that traffic really does not use much CPU.
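For reference, the query traffic from the Querier/dashboards against Prometheus can be approximated with something along these lines (a sketch, not from the original report; the exact handler label values can differ between Prometheus versions):

```
# Query requests per second hitting Prometheus' HTTP query API, broken out by handler
sum(rate(prometheus_http_requests_total{handler=~"/api/v1/query.*"}[5m])) by (handler)
```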
I am not familiar with this console top-CPU view, and I wonder what queries we run there. Or again, maybe the dump was not taken at the moment of high traffic.
I looked through the configuration and we don't specify query limits, so the defaults apply:
For Prometheus: storage.remote.read-concurrent-limit=10
For Querier: query.max-concurrent = 20
So there is definitely a limit on concurrent queries, and thus on overall CPU spent; we could tweak it if needed.
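To check whether those limits are actually being saturated, Prometheus' own engine metrics could be graphed; a minimal sketch (not from the original report, and availability of the last metric depends on the Prometheus version):

```
# Queries currently executing or queued in the engine, vs. the configured ceiling
prometheus_engine_queries
prometheus_engine_queries_concurrent_max

# Remote read queries in flight (the path the Thanos sidecar uses), if exposed
prometheus_api_remote_read_queries
```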
Just from our Thanos project standpoint:
We can definitely explore adding new limits to Querier for further rate limiting at some point, plus either deploying a query cache or building a simple cache inside Querier, but that is a tradeoff which costs some RAM.
Anyway, it would be good to confirm that this bug is actually caused by high query traffic. The kind of queries also matters: e.g. fetching thousands of series in a single query is doable, but it would also result in high Querier memory consumption.
NOTE: These are the ad hoc queries I ran (decoded versions follow the URL below):
<querier-url or Prometheus UI URL>/graph?g0.range_input=6h&g0.expr=sum(rate(prometheus_http_requests_total%7B%7D%5B1m%5D))%20by(handler%2C%20code)&g0.tab=0&g1.range_input=6h&g1.expr=%20sum(rate(container_cpu_user_seconds_total%7Bpod%3D~%22prometheus.*%22%7D%5B5m%5D))%20by%20(instance%2C%20pod%2C%20container)&g1.tab=0&g2.range_input=6h&g2.expr=%20sum(rate(container_cpu_user_seconds_total%7Bpod%3D~%22thanos.*%22%7D%5B5m%5D))%20by%20(instance%2C%20pod%2C%20container)&g2.tab=0
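URL-decoded, the three graph expressions in that link are:

```
# g0: request rate against Prometheus' HTTP API, by handler and code
sum(rate(prometheus_http_requests_total{}[1m])) by (handler, code)

# g1: CPU used by the prometheus pods
sum(rate(container_cpu_user_seconds_total{pod=~"prometheus.*"}[5m])) by (instance, pod, container)

# g2: CPU used by the thanos pods
sum(rate(container_cpu_user_seconds_total{pod=~"thanos.*"}[5m])) by (instance, pod, container)
```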
Additionally, I can see we disabled compaction on the local Prometheus instances. Is that intentional? It can cause slowdowns and issues for longer queries against both Prometheus and Querier, especially since our retention here is 15d. I created this to track the compaction option change: https://bugzilla.redhat.com/show_bug.cgi?id=1766111

After digging more, it looks like the query we use in the console, `topk(20, sort_desc(sum(rate(container_cpu_user_seconds_total{container_name="", pod!=""}[5m])) by (pod,namespace)))`, is not showing correct results. At first glance it looks like it doubles the results for each pod. This is because of the cadvisor label changes (`container_name` -> `container`). We should probably update it to something like `topk(20, sort_desc(sum(rate(container_cpu_user_seconds_total{container="",container_name="", pod!=""}[5m])) by (pod,namespace)))`, or just remove `container_name` entirely.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062