Bug 1765315
| Field | Value |
|---|---|
| Summary | High CPU usage in prometheus, thanos-querier while single user sits on dashboard |
| Product | OpenShift Container Platform |
| Component | Monitoring |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Version | 4.3.0 |
| Target Milestone | --- |
| Target Release | 4.3.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | |
| Reporter | Clayton Coleman <ccoleman> |
| Assignee | Paul Gier <pgier> |
| QA Contact | Junqi Zhao <juzhao> |
| Docs Contact | |
| CC | alegrand, anpicker, bplotka, erooth, juzhao, kakkoyun, lcosic, mloibl, pkrupa, surbania |
| Doc Type | If docs needed, set a value |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2020-01-23 11:09:08 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Attachments | |
Description
Clayton Coleman
2019-10-24 19:34:59 UTC
Created attachment 1629054 [details]
So far I unfortunately cannot reproduce this; I get a figure as attached.
@bartek: do you mind looking into this? Here is a link to the Prometheus dump for a recent e2e run: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/522/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1517/artifacts/e2e-aws/metrics/

Created attachment 1629108 [details]
Query/CPU Metrics from Prometheus dump.
Hi! So I tried the provided dump, but I think we might be missing some data (attached in the comment above). The dump covers only 30m, and there was no thanos-querier and no CPU or query traffic hitting Prometheus at that moment. Maybe you meant to copy some blocks as well? (Only WAL files are present.)

Created attachment 1629122 [details]
Query/CPU Metrics from 4.3.0-0.ci-2019-10-25-031503
We tried to reproduce it in a new ephemeral cluster, and indeed the dashboards generate a bit of traffic even for a single user (~15 requests per second from Querier to Prometheus); however, that traffic really does not use much CPU.
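For reference, the query traffic from the Querier/dashboards against Prometheus can be approximated with something along these lines (a sketch, not from the original report; the exact handler label values can differ between Prometheus versions):

```
# Query requests per second hitting Prometheus' HTTP query API, broken out by handler
sum(rate(prometheus_http_requests_total{handler=~"/api/v1/query.*"}[5m])) by (handler)
```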
I am not familiar with this console top-CPU view, and I wonder what queries we run there. Or again, maybe the dump was not taken at the moment of high traffic.
I looked through the configuration and we don't specify query limits, so the defaults apply:
For Prometheus: storage.remote.read-concurrent-limit=10
For Querier: query.max-concurrent = 20
So there is definitely a limit on concurrent queries, and thus on overall CPU spent; we could tweak it if needed.
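To check whether those limits are actually being saturated, Prometheus' own engine metrics could be graphed; a minimal sketch (not from the original report, and availability of the last metric depends on the Prometheus version):

```
# Queries currently executing or queued in the engine, vs. the configured ceiling
prometheus_engine_queries
prometheus_engine_queries_concurrent_max

# Remote read queries in flight (the path the Thanos sidecar uses), if exposed
prometheus_api_remote_read_queries
```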
Just from our Thanos project standpoint:
We can definitely explore adding new limits to Querier for further rate limiting at some point, plus either deploying a query cache or building a simple cache inside Querier, but that is a tradeoff which costs some RAM.
Anyway, it would be good to confirm that this bug is actually caused by high query traffic. The kind of queries also matters: e.g. fetching thousands of series in a single query is doable, but it would also result in high Querier memory consumption.
NOTE: These are the ad hoc queries I ran (decoded versions follow the URL below):
<querier-url or Prometheus UI URL>/graph?g0.range_input=6h&g0.expr=sum(rate(prometheus_http_requests_total%7B%7D%5B1m%5D))%20by(handler%2C%20code)&g0.tab=0&g1.range_input=6h&g1.expr=%20sum(rate(container_cpu_user_seconds_total%7Bpod%3D~%22prometheus.*%22%7D%5B5m%5D))%20by%20(instance%2C%20pod%2C%20container)&g1.tab=0&g2.range_input=6h&g2.expr=%20sum(rate(container_cpu_user_seconds_total%7Bpod%3D~%22thanos.*%22%7D%5B5m%5D))%20by%20(instance%2C%20pod%2C%20container)&g2.tab=0
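URL-decoded, the three graph expressions in that link are:

```
# g0: request rate against Prometheus' HTTP API, by handler and code
sum(rate(prometheus_http_requests_total{}[1m])) by (handler, code)

# g1: CPU used by the prometheus pods
sum(rate(container_cpu_user_seconds_total{pod=~"prometheus.*"}[5m])) by (instance, pod, container)

# g2: CPU used by the thanos pods
sum(rate(container_cpu_user_seconds_total{pod=~"thanos.*"}[5m])) by (instance, pod, container)
```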
Additionally, I can see we disabled compaction on the local Prometheus instances. Is that intentional? It can cause slowdowns and issues for longer queries against both Prometheus and Querier, especially since our retention here is 15d. I created this to track the compaction option change: https://bugzilla.redhat.com/show_bug.cgi?id=1766111

After digging more, it looks like the query we use in the console, `topk(20, sort_desc(sum(rate(container_cpu_user_seconds_total{container_name="", pod!=""}[5m])) by (pod,namespace)))`, is not showing correct results. At first glance it looks like it doubles the results for each pod. This is because of the cadvisor label changes (`container_name` -> `container`). We should probably update it to something like `topk(20, sort_desc(sum(rate(container_cpu_user_seconds_total{container="",container_name="", pod!=""}[5m])) by (pod,namespace)))`, or just remove `container_name` entirely.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062