Bug 2070596

Summary: Grafana dashboard fails to load when the query to Prometheus takes more than 30s to return

Product: OpenShift Container Platform
Component: Monitoring
Version: 4.8
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Severity: low
Priority: low
Status: CLOSED CURRENTRELEASE
Type: Bug
Reporter: Simon Pasquier <spasquie>
Assignee: Arunprasad Rajkumar <arajkuma>
QA Contact: Junqi Zhao <juzhao>
CC: aharchin, amuller, anpicker, aos-bugs, erooth, gbernal, jfajersk, kiyyappa, kurathod, kweg, openshift-bugs-escalate, ssadhale
Flags: arajkuma: needinfo-
Doc Type: No Doc Update
Clones: 2083460 (view as bug list)
Bug Blocks: 2083460
Last Closed: 2022-05-10 07:14:06 UTC

Description Simon Pasquier 2022-03-31 13:38:40 UTC
Description of problem:
On one side, Grafana has a default timeout of 30s when querying the Prometheus datasource. On the other side, the (default) query timeout of Prometheus is 2 minutes. This means that when a dashboard query takes more than 30s to return, Grafana fails with "no data" and a "bad gateway" error while the backend is still processing the request.
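
For reference, a sketch of the two defaults in play (the Grafana value is the upstream `[dataproxy]` default; the Prometheus value corresponds to its `--query.timeout` flag):

```
# grafana.ini: upstream default data proxy timeout (seconds)
[dataproxy]
timeout = 30

# Prometheus: the default query timeout corresponds to --query.timeout=2m
```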

Version-Release number of selected component (if applicable):
4.8

How reproducible:
Not always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
There's a timeout parameter in grafana.ini under the [dataproxy] section [1], as well as a timeout parameter in the datasource configuration [2]. But initial testing on OCP 4.10 (Grafana v8.3.4) suggests that increasing these parameters has no effect on the issue.
Upstream issue #34177 [3] looks similar, but it is supposed to be fixed in v8.1.0 and later [4].

[1] https://grafana.com/docs/grafana/latest/administration/configuration/#dataproxy
[2] https://grafana.com/docs/grafana/latest/administration/provisioning/#json-data
[3] https://github.com/grafana/grafana/issues/34177
[4] https://github.com/grafana/grafana/commit/91657dad182127bf577b449ff9d94e5bf86e592a
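
For reference, the datasource-level knob from [2] looks like this in a provisioning file (a sketch; the datasource name and URL are illustrative, not the exact in-cluster values):

```
# Grafana datasource provisioning: jsonData.timeout overrides dataproxy.timeout
apiVersion: 1
datasources:
  - name: prometheus
    type: prometheus
    url: https://thanos-querier.openshift-monitoring.svc:9091
    jsonData:
      timeout: 120   # request timeout in seconds
```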

Comment 3 Arunprasad Rajkumar 2022-04-07 11:44:07 UTC
We tried to replicate the problem by adding an artificial delay [1] to the Prometheus query path (simulating a slow query). This exercise revealed two key causes:

1. The OpenShift Route which exposes Grafana has a default timeout of 30s [2], which can return "Gateway Timeout" for long-running / slow proxied queries.
2. A low CPU limit enforced on Thanos Querier can lead to console dashboard timeouts.


#1 shall be fixed by increasing the default router timeout for the Grafana Route, adding the `haproxy.router.openshift.io/timeout: 5m` annotation [2].
#2 shall be fixed by bumping the CPU limit of Thanos Querier [3] in case it faces frequent CPU throttling. Both fixes are sketched below.
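
A sketch of fix #1 (the route name and namespace assume the default in-cluster monitoring stack):

```
# Fix #1: raise the router timeout on the Grafana route (annotation per [2])
oc -n openshift-monitoring annotate route grafana \
    haproxy.router.openshift.io/timeout=5m --overwrite
```

And a sketch of fix #2 (the CPU value is illustrative, and whether the `thanosQuerier.resources` field is supported depends on the installed version; see [3] for the supported options):

```
# Fix #2: raise the Thanos Querier CPU limit via the cluster monitoring ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    thanosQuerier:
      resources:
        limits:
          cpu: "1"   # illustrative value
```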


[1] https://github.com/openshift/prometheus/pull/125
[2] https://docs.openshift.com/container-platform/4.8/networking/routes/route-configuration.html#nw-configuring-route-timeouts_route-configuration
[3] https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html

Comment 13 Arunprasad Rajkumar 2022-04-26 15:14:38 UTC
I see that Grafana honours `dataproxy.timeout`, which is set as part of the main config file [1]. Increasing both the Grafana OpenShift Route timeout and the data proxy timeout should fix at least the Grafana timeout problem.


```
# grafana.ini

[dataproxy]
timeout = 120  # seconds
```

[1] https://github.com/openshift/cluster-monitoring-operator/blob/95f04c190f068badc1ff388d1a55a5f2dcb15af3/assets/grafana/config.yaml#L12
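
Assuming the above, the matching route-side change would look something like this (the value is illustrative; it only needs to cover the 120s data proxy timeout):

```
# route timeout must cover the 120s data proxy timeout set in grafana.ini
oc -n openshift-monitoring annotate route grafana \
    haproxy.router.openshift.io/timeout=3m --overwrite
```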