Bug 2070596 - Grafana dashboard fails to load when the query to Prometheus takes more than 30s to return
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.11.0
Assignee: Arunprasad Rajkumar
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 2083460
Reported: 2022-03-31 13:38 UTC by Simon Pasquier
Modified: 2022-12-30 09:08 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2083460
Environment:
Last Closed: 2022-05-10 07:14:06 UTC
Target Upstream Version:
Embargoed:
arajkuma: needinfo-



Description Simon Pasquier 2022-03-31 13:38:40 UTC
Description of problem:
On one side, Grafana has a default timeout value of 30s when querying the Prometheus datasource. On the other side, the (default) query timeout of Prometheus is 2 minutes. This means that when a dashboard query takes more than 30s to return, Grafana will fail with "no data" + "bad gateway" while the backend is still processing the request.
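The mismatch can be illustrated with a minimal sketch: on a proxied request path, the hop with the shortest timeout gives up first. The hop names and helper functions below are hypothetical; only the 30s and 2-minute values come from the description above.

```python
# Sketch: the shortest timeout on the request path decides when the request fails.
# Hop names are illustrative; the values are the defaults quoted in the description.
hops = {
    "grafana_datasource": 30,   # Grafana default when querying the Prometheus datasource
    "prometheus_query": 120,    # Prometheus default query timeout (2 minutes)
}


def effective_timeout(hops):
    """Return (name, seconds) of the hop that times out first."""
    return min(hops.items(), key=lambda item: item[1])


def request_outcome(query_seconds, hops):
    """Describe what the caller sees for a query that takes query_seconds."""
    name, limit = effective_timeout(hops)
    if query_seconds > limit:
        return f"timeout at {name} after {limit}s"
    return "ok"


print(request_outcome(45, hops))
```

With these values, any query slower than 30s fails at the Grafana side even though Prometheus would keep working on it for up to 2 minutes.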

Version-Release number of selected component (if applicable):
4.8

How reproducible:
Not always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
There's a timeout parameter in grafana.ini under the [dataproxy] section [1] as well as a timeout parameter in the datasource configuration [2]. But the initial testing on OCP 4.10 (Grafana v8.3.4) seems to indicate that increasing the parameters doesn't have any impact on the issue.
Upstream issue #34177 [3] looks similar but it's supposed to be fixed in v8.1.0 and later [4].

[1] https://grafana.com/docs/grafana/latest/administration/configuration/#dataproxy
[2] https://grafana.com/docs/grafana/latest/administration/provisioning/#json-data
[3] https://github.com/grafana/grafana/issues/34177
[4] https://github.com/grafana/grafana/commit/91657dad182127bf577b449ff9d94e5bf86e592a

Comment 3 Arunprasad Rajkumar 2022-04-07 11:44:07 UTC
We tried to replicate the problem by adding an artificial delay [1] to the Prometheus query path to simulate a slow query. This exercise surfaced 2 key causes:

1. The OpenShift Route that exposes Grafana has a timeout value of 30s [2], so the router may return "Gateway timeout" when a proxied datasource request is slow or long-running.
2. A low CPU limit enforced on Thanos Querier can lead to Console Dashboard timeouts.


#1 can be fixed by increasing the default router timeout for the Grafana Route with the `haproxy.router.openshift.io/timeout: 5m` annotation [2].
#2 can be fixed by bumping the CPU limit [3] of Thanos Querier in case it faces frequent CPU throttling.


[1] https://github.com/openshift/prometheus/pull/125
[2] https://docs.openshift.com/container-platform/4.8/networking/routes/route-configuration.html#nw-configuring-route-timeouts_route-configuration
[3] https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html
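As a rough sketch of fix #1, the annotation can be expressed as a JSON merge patch for the Route. The annotation key and the 5m value are taken from the comment above; the helper name is hypothetical, and the Route name and namespace mentioned in the usage note are assumptions about a typical cluster.

```python
import json

# Annotation key quoted in the comment above.
ANNOTATION = "haproxy.router.openshift.io/timeout"


def route_timeout_patch(timeout="5m"):
    """Build a merge patch that sets the router timeout annotation on a Route."""
    return {"metadata": {"annotations": {ANNOTATION: timeout}}}


patch_json = json.dumps(route_timeout_patch())
print(patch_json)
```

The printed JSON could be passed to something like `oc patch route grafana -n openshift-monitoring --type=merge -p '<json>'`, assuming the Route is named `grafana` in that namespace. An operator that manages the Route may overwrite manual changes, so this is only useful for experiments.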

Comment 13 Arunprasad Rajkumar 2022-04-26 15:14:38 UTC
I see that Grafana honours dataproxy.timeout, which is set as part of the main config file [1]. Increasing both the Grafana OpenShift Route timeout and the data proxy timeout should fix at least the Grafana timeout problem.


```
# grafana.ini

[dataproxy]
# timeout in seconds
timeout = 120

```

[1] https://github.com/openshift/cluster-monitoring-operator/blob/95f04c190f068badc1ff388d1a55a5f2dcb15af3/assets/grafana/config.yaml#L12
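A small sanity check for the combined change might look like the sketch below. It reads `dataproxy.timeout` from an ini string and compares it with the Route timeout and the Prometheus query timeout. This uses Python's `configparser`, which is not the parser Grafana itself uses, and the helper names are hypothetical; the 120s, 5m, and 30s values come from this bug.

```python
import configparser

# Same setting as the snippet above, inlined so the sketch is self-contained.
GRAFANA_INI = """
[dataproxy]
timeout = 120
"""

# Only a subset of duration units is handled in this sketch.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def to_seconds(value):
    """Convert a duration such as '30s' or '5m' to seconds."""
    return int(value[:-1]) * UNIT_SECONDS[value[-1]]


def dataproxy_timeout(ini_text, default=30):
    """Read [dataproxy] timeout, falling back to the 30s default from this bug."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.getint("dataproxy", "timeout", fallback=default)


route_timeout = to_seconds("5m")      # proposed Route annotation value
proxy_timeout = dataproxy_timeout(GRAFANA_INI)
prometheus_timeout = 120              # Prometheus default query timeout (2 minutes)

# The smallest value is the point where a slow dashboard query is cut off.
limit = min(route_timeout, proxy_timeout, prometheus_timeout)
print(f"slow queries are cut off after {limit}s")
```

With both changes applied, neither the Route nor the data proxy cuts the request off before the Prometheus query timeout is reached.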

