2070596 – Grafana dashboard fails to load when the query to Prometheus takes more than 30s to return

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Arunprasad Rajkumar
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2083460

Reported:	2022-03-31 13:38 UTC by Simon Pasquier
Modified:	2022-12-30 09:08 UTC
CC List:	12 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Clones:	2083460
Environment:
Last Closed:	2022-05-10 07:14:06 UTC
Target Upstream Version:
Embargoed:
Flags:	arajkuma: needinfo- arajkuma: needinfo-

Description Simon Pasquier 2022-03-31 13:38:40 UTC

Description of problem:
On one side, Grafana has a default timeout value of 30s when querying the Prometheus datasource. On the other side, the (default) query timeout of Prometheus is 2 minutes. This means that when a dashboard query takes more than 30s to return, Grafana will fail with "no data" + "bad gateway" while the backend is still processing the request.
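
The mismatch described above can be sketched as a small model: each hop in the request path has its own timeout, and the hop with the smallest timeout gives up first. This is only an illustration; the hop names are hypothetical labels, and the two values are the defaults quoted in this description.

```python
from typing import Dict, Optional

# Timeouts in seconds. The values are the defaults quoted in the description;
# the hop names are illustrative labels, not real configuration keys.
HOPS: Dict[str, float] = {
    "grafana_datasource": 30,
    "prometheus_query": 120,
}


def first_hop_to_give_up(query_seconds: float, hops: Dict[str, float]) -> Optional[str]:
    """Return the hop with the smallest timeout that the query exceeds, or None."""
    exceeded = {name: limit for name, limit in hops.items() if query_seconds > limit}
    if not exceeded:
        return None
    # The hop with the lowest limit fails first, even if others would also fail later.
    return min(exceeded, key=exceeded.get)


# A 45s query outlives Grafana's timeout while Prometheus is still working on it.
print(first_hop_to_give_up(45, HOPS))
```

Under this model, any query that runs between 30s and 2 minutes fails on the Grafana side even though Prometheus would eventually have answered.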

Version-Release number of selected component (if applicable):
4.8

How reproducible:
Not always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
There's a timeout parameter in grafana.ini under the [dataproxy] section [1] as well as a timeout parameter in the datasource configuration [2]. But the initial testing on OCP 4.10 (Grafana v8.3.4) seems to indicate that increasing the parameters doesn't have any impact on the issue.
Upstream issue #34177 [3] looks similar but it's supposed to be fixed in v8.1.0 and later [4].

[1] https://grafana.com/docs/grafana/latest/administration/configuration/#dataproxy
[2] https://grafana.com/docs/grafana/latest/administration/provisioning/#json-data
[3] https://github.com/grafana/grafana/issues/34177
[4] https://github.com/grafana/grafana/commit/91657dad182127bf577b449ff9d94e5bf86e592a

Comment 3 Arunprasad Rajkumar 2022-04-07 11:44:07 UTC

We tried to replicate the problem by adding an artificial delay[1] to the Prometheus query path (which simulates a slow query). In this exercise we found 2 key causes:

1. The OpenShift Route which exposes Grafana has a timeout value of 30s[2], so the router might return "Gateway timeout" for long-running / slow data proxy requests.
2. The low CPU limit enforced on Thanos Querier might lead to Console Dashboard timeouts.


#1 will be fixed by increasing the default router timeout for the Grafana Route by adding the `haproxy.router.openshift.io/timeout: 5m` annotation[2].
#2 will be fixed by bumping the CPU limit[3] of Thanos Querier in case it faces frequent CPU throttling.
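
As a rough illustration of the Route annotation mentioned for #1, the sketch below builds a JSON merge patch that sets it. Only the annotation key and the `5m` value come from this comment; the helper name is hypothetical, and how the patch gets applied is left open.

```python
import json

# Annotation key quoted in the comment above.
ANNOTATION_KEY = "haproxy.router.openshift.io/timeout"


def route_timeout_patch(timeout: str = "5m") -> str:
    """Build a JSON merge patch body that sets the router timeout annotation."""
    patch = {"metadata": {"annotations": {ANNOTATION_KEY: timeout}}}
    return json.dumps(patch)


print(route_timeout_patch())
```

The resulting string could be passed to something like `oc patch route <name> -n <namespace> --type=merge -p '<patch>'`, with the route name and namespace filled in for the cluster at hand. A manual change may be overwritten if an operator manages the Route.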


[1] https://github.com/openshift/prometheus/pull/125
[2] https://docs.openshift.com/container-platform/4.8/networking/routes/route-configuration.html#nw-configuring-route-timeouts_route-configuration
[3] https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html

Comment 13 Arunprasad Rajkumar 2022-04-26 15:14:38 UTC

I see that Grafana honours dataproxy.timeout, which is set as part of the main config file[1]. Increasing both the Grafana OpenShift Route timeout and the data proxy timeout should fix at least the Grafana timeout problem.


```
# grafana.ini

[dataproxy]
timeout = 120 #seconds

```
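
Since both timeouts have to be raised together, a quick consistency check could look like the sketch below. It assumes the Route timeout is written as an HAProxy-style duration (for example `30s` or `5m`) and that the data proxy timeout is in seconds, as in the snippet above. The function names are hypothetical. The sample ini omits the inline comment because Python's `configparser` does not strip inline comments by default.

```python
import configparser
import re

GRAFANA_INI = """
[dataproxy]
timeout = 120
"""

# Seconds per unit for HAProxy-style durations such as "30s" or "5m".
_UNIT_SECONDS = {"us": 1e-6, "ms": 1e-3, "s": 1, "m": 60, "h": 3600, "d": 86400}


def duration_to_seconds(value: str) -> float:
    """Convert a duration like "5m" into seconds."""
    match = re.fullmatch(r"(\d+)(us|ms|s|m|h|d)", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_SECONDS[unit]


def dataproxy_timeout_seconds(ini_text: str) -> int:
    """Read [dataproxy] timeout (in seconds) from grafana.ini text."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.getint("dataproxy", "timeout")


def route_covers_dataproxy(route_timeout: str, ini_text: str) -> bool:
    """True if the Route waits at least as long as Grafana's data proxy."""
    return duration_to_seconds(route_timeout) >= dataproxy_timeout_seconds(ini_text)


print(route_covers_dataproxy("5m", GRAFANA_INI))   # 300s route vs 120s data proxy
print(route_covers_dataproxy("30s", GRAFANA_INI))  # 30s route vs 120s data proxy
```

With a 30s Route timeout the check fails, which matches the observation that raising the data proxy timeout alone is not enough.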

[1] https://github.com/openshift/cluster-monitoring-operator/blob/95f04c190f068badc1ff388d1a55a5f2dcb15af3/assets/grafana/config.yaml#L12
