Bug 2063063
| Summary: | The OpenShift documentation contains a PromQL query that may timeout and cause excessive load on Prometheus | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Simon Pasquier <spasquie> |
| Component: | Documentation | Assignee: | Brian Burt <bburt> |
| Status: | CLOSED DEFERRED | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | Claire Bremble <cbremble> |
| Priority: | medium | | |
| Version: | 4.11 | CC: | anpicker, aos-bugs, bburt |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-02-10 16:05:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Closing this as deferred, to avoid a duplicate Jira issue being created. This work should be tracked here: https://issues.redhat.com/browse/RHDEVDOCS-3857
Description of problem:

In the "Troubleshooting monitoring issues" section [1], the documentation tells users that they can run the following PromQL query to identify high-cardinality metrics:

`topk(10, count by (job)({__name__=~".+"}))`

This query is expensive, which may trigger timeouts or even out-of-memory crashes. Instead we can document these queries:

* `topk(10, max by(namespace, job) (topk by(namespace, job) (1, scrape_samples_post_metric_relabeling)))` => the top 10 jobs exposing the highest number of samples.
* `topk(10, sum by(namespace, job) (sum_over_time(scrape_series_added[1h])))` => the top 10 jobs that created the most series in the last hour (helps to identify series churn).

The queries can be tuned to return data only for the Platform or UWM Prometheus (e.g. `... scrape_samples_post_metric_relabeling{prometheus="prometheus/k8s"}` or `... scrape_samples_post_metric_relabeling{prometheus="prometheus/user-workload-monitoring"}`); see the sketch under "Additional info" below.

The documentation also refers to the TSDB status page, which is a good indicator too and should probably be put first in the list.

[1] https://docs.openshift.com/container-platform/4.10/monitoring/troubleshooting-monitoring-issues.html#determining-why-prometheus-is-consuming-disk-space_troubleshooting-monitoring-issues

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Always

Additional info:
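As a hedged sketch (not text from the original report), the two suggested queries written out with the `prometheus` label selectors mentioned in the description would look roughly like this. The label values are copied from the report and may need adjusting to match the external labels actually set in a given cluster:

```promql
# Sketch only: top 10 jobs exposing the highest number of samples,
# restricted to the Platform Prometheus via the "prometheus" label.
topk(10, max by (namespace, job) (
  topk by (namespace, job) (1,
    scrape_samples_post_metric_relabeling{prometheus="prometheus/k8s"}
  )
))

# Sketch only: top 10 jobs that created the most series in the last hour
# (series churn), restricted to the user workload monitoring Prometheus.
topk(10, sum by (namespace, job) (
  sum_over_time(
    scrape_series_added{prometheus="prometheus/user-workload-monitoring"}[1h]
  )
))
```

In the first query, the inner `topk by(namespace, job) (1, ...)` presumably keeps only the single target with the most samples per job, and the outer `max by(namespace, job)` drops the remaining labels (instance, pod, and so on) so the result has one series per namespace/job pair.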