Bug 2070596
Summary: | Grafana dashboard fails to load when the query to Prometheus takes more than 30s to return | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Simon Pasquier <spasquie> | |
Component: | Monitoring | Assignee: | Arunprasad Rajkumar <arajkuma> | |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Junqi Zhao <juzhao> | |
Severity: | low | Docs Contact: | ||
Priority: | low | |||
Version: | 4.8 | CC: | aharchin, amuller, anpicker, aos-bugs, erooth, gbernal, jfajersk, kiyyappa, kurathod, kweg, openshift-bugs-escalate, ssadhale | |
Target Milestone: | --- | Flags: | arajkuma: needinfo-, arajkuma: needinfo- | |
Target Release: | 4.11.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | No Doc Update | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 2083460 (view as bug list) | Environment: | ||
Last Closed: | 2022-05-10 07:14:06 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 2083460 |
Description
Simon Pasquier
2022-03-31 13:38:40 UTC
We tried to replicate the problem by adding an artificial delay[1] to the Prometheus query path, which simulates a slow query. In this exercise we found two key causes:

1. The OpenShift Route that exposes Grafana has a timeout of 30s[2], so it may return "Gateway timeout" for long-running or slow data proxy requests.
2. The low CPU limit enforced on Thanos Querier might lead to Console dashboard timeouts.

#1 can be fixed by increasing the default router timeout for the Grafana Route by adding the `haproxy.router.openshift.io/timeout: 5m` annotation[2]. #2 can be fixed by bumping the CPU limit[3] of Thanos Querier in case it faces frequent CPU throttling.

[1] https://github.com/openshift/prometheus/pull/125
[2] https://docs.openshift.com/container-platform/4.8/networking/routes/route-configuration.html#nw-configuring-route-timeouts_route-configuration
[3] https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html

I see that Grafana honours dataproxy.timeout, which is set in the main config file[4]. Increasing both the Grafana OpenShift Route timeout and the data proxy timeout should fix at least the Grafana timeout problem.

```
# grafana.ini
[dataproxy]
timeout = 120 #seconds
```

[4] https://github.com/openshift/cluster-monitoring-operator/blob/95f04c190f068badc1ff388d1a55a5f2dcb15af3/assets/grafana/config.yaml#L12
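To illustrate why both settings need to be raised together, here is a minimal Python sketch of the timeout chain. It is not taken from any of the components involved. It assumes that a dashboard request fails as soon as the shortest timeout on the path (Route, then Grafana data proxy) expires. The helper names `parse_duration`, `effective_timeout`, and `query_survives` are hypothetical, and the unit table follows the HAProxy-style suffixes accepted by the route timeout annotation.

```python
import re

# Seconds per unit for HAProxy-style durations such as "30s" or "5m".
_UNIT_SECONDS = {"us": 1e-6, "ms": 1e-3, "s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(value: str) -> float:
    """Convert a duration like '5m' into seconds (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)(us|ms|s|m|h|d)", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_SECONDS[unit]


def effective_timeout(route_timeout: str, dataproxy_timeout_s: int) -> float:
    """Assumption: the shortest timeout on the request path wins."""
    return min(parse_duration(route_timeout), dataproxy_timeout_s)


def query_survives(query_s: float, route_timeout: str, dataproxy_timeout_s: int) -> bool:
    """True if a query taking query_s seconds finishes before any hop gives up."""
    return query_s < effective_timeout(route_timeout, dataproxy_timeout_s)


if __name__ == "__main__":
    # A 45s query against: the 30s Route timeout, a raised Route timeout only,
    # and both the Route and the data proxy timeouts raised.
    for route, proxy in [("30s", 30), ("5m", 30), ("5m", 120)]:
        print(route, proxy, query_survives(45, route, proxy))
```

Under this model, raising only the Route annotation to `5m` still leaves a 45s query failing if the data proxy timeout stays at 30 seconds; the query only completes once `dataproxy.timeout` is raised as well.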