Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2063063

Summary: The OpenShift documentation contains a PromQL query that may timeout and cause excessive load on Prometheus
Product: OpenShift Container Platform Reporter: Simon Pasquier <spasquie>
Component: Documentation Assignee: Brian Burt <bburt>
Status: CLOSED DEFERRED QA Contact: Junqi Zhao <juzhao>
Severity: medium Docs Contact: Claire Bremble <cbremble>
Priority: medium    
Version: 4.11 CC: anpicker, aos-bugs, bburt
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-02-10 16:05:50 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Simon Pasquier 2022-03-11 09:38:18 UTC
Description of problem:

In the "Troubleshooting monitoring issues" section [1], the documentation tells users that they can run the following PromQL query to identify high-cardinality metrics: 

topk(10,count by (job)({__name__=~".+"}))

This query is expensive and may trigger timeouts or even out-of-memory crashes.

Instead, we can document these queries:

* "topk(10, max by(namespace, job) (topk by(namespace, job) (1, scrape_samples_post_metric_relabeling)))" => the top 10 jobs exposing the highest number of samples.
* "topk(10, sum by(namespace, job) (sum_over_time(scrape_series_added[1h])))" => the top 10 jobs that created the most series in the last hour (helps identify series churn).

The queries can be tuned to return data only for the Platform or UWM Prometheus (e.g. '... scrape_samples_post_metric_relabeling{prometheus="prometheus/k8s"}' or '... scrape_samples_post_metric_relabeling{prometheus="prometheus/user-workload-monitoring"}').
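As a minimal sketch of how the suggested queries could be issued programmatically, the snippet below builds instant-query URLs for the standard Prometheus HTTP API endpoint (/api/v1/query). The base URL "https://prometheus.example.com" is a placeholder, not an actual cluster endpoint; on OpenShift you would substitute your own route (e.g. the thanos-querier route) and supply a bearer token:

```python
import urllib.parse

# The two replacement queries suggested above.
QUERIES = {
    "top10_samples": (
        "topk(10, max by(namespace, job) "
        "(topk by(namespace, job) (1, scrape_samples_post_metric_relabeling)))"
    ),
    "top10_churn_1h": (
        "topk(10, sum by(namespace, job) "
        "(sum_over_time(scrape_series_added[1h])))"
    ),
}

def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API.

    /api/v1/query is the standard Prometheus instant-query endpoint;
    base_url is a placeholder for your own Prometheus or Thanos route.
    """
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})

if __name__ == "__main__":
    for name, promql in QUERIES.items():
        print(name, build_query_url("https://prometheus.example.com", promql))
```

To restrict a query to one Prometheus instance as described above, append a label matcher such as {prometheus="prometheus/k8s"} to the selector before building the URL.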

The documentation also refers to the TSDB status page, which is a good indicator too and should probably be moved to the top of the list.

[1] https://docs.openshift.com/container-platform/4.10/monitoring/troubleshooting-monitoring-issues.html#determining-why-prometheus-is-consuming-disk-space_troubleshooting-monitoring-issues

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Always


Additional info:

Comment 3 Claire Bremble 2023-02-10 16:05:38 UTC
Closing this as deferred, to avoid a duplicate Jira issue being created. This work should be tracked here: https://issues.redhat.com/browse/RHDEVDOCS-3857