Bug 2070596

Summary: Grafana dashboard fails to load when the query to Prometheus takes more than 30s to return

Product: OpenShift Container Platform
Component: Monitoring
Version: 4.8
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Severity: low
Priority: low
Status: CLOSED CURRENTRELEASE
Type: Bug
Reporter: Simon Pasquier <spasquie>
Assignee: Arunprasad Rajkumar <arajkuma>
QA Contact: Junqi Zhao <juzhao>
CC: aharchin, amuller, anpicker, aos-bugs, erooth, gbernal, jfajersk, kiyyappa, kurathod, kweg, openshift-bugs-escalate, ssadhale
Flags: arajkuma: needinfo-
Doc Type: No Doc Update
Clones: 2083460 (view as bug list)
Bug Blocks: 2083460
Last Closed: 2022-05-10 07:14:06 UTC

Description Simon Pasquier 2022-03-31 13:38:40 UTC
Description of problem:
On one side, Grafana has a default timeout of 30s when querying the Prometheus datasource. On the other side, the (default) query timeout of Prometheus is 2 minutes. This means that when a dashboard query takes more than 30s to return, Grafana fails with "no data" and a "bad gateway" error while the backend is still processing the request.
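
For reference, a sketch of the two defaults in play (the Grafana value is the upstream `[dataproxy]` default; the Prometheus value corresponds to its `--query.timeout` flag):

```
# grafana.ini: upstream default data proxy timeout (seconds)
[dataproxy]
timeout = 30

# Prometheus: the default query timeout corresponds to --query.timeout=2m
```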

Version-Release number of selected component (if applicable):
4.8

How reproducible:
Not always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
There's a timeout parameter in grafana.ini under the [dataproxy] section [1], as well as a timeout parameter in the datasource configuration [2]. But initial testing on OCP 4.10 (Grafana v8.3.4) suggests that increasing these parameters has no effect on the issue.
Upstream issue #34177 [3] looks similar, but it is supposed to be fixed in v8.1.0 and later [4].

[1] https://grafana.com/docs/grafana/latest/administration/configuration/#dataproxy
[2] https://grafana.com/docs/grafana/latest/administration/provisioning/#json-data
[3] https://github.com/grafana/grafana/issues/34177
[4] https://github.com/grafana/grafana/commit/91657dad182127bf577b449ff9d94e5bf86e592a
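
For reference, the datasource-level knob from [2] looks like this in a provisioning file (a sketch; the datasource name and URL are illustrative, not the exact in-cluster values):

```
# Grafana datasource provisioning: jsonData.timeout overrides dataproxy.timeout
apiVersion: 1
datasources:
  - name: prometheus
    type: prometheus
    url: https://thanos-querier.openshift-monitoring.svc:9091
    jsonData:
      timeout: 120   # request timeout in seconds
```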

Comment 3 Arunprasad Rajkumar 2022-04-07 11:44:07 UTC
We tried to replicate the problem by adding an artificial delay [1] to the Prometheus query path (simulating a slow query). This exercise revealed two key causes:

1. The OpenShift Route which exposes Grafana has a default timeout of 30s [2], which can return "Gateway Timeout" for long-running / slow proxied queries.
2. A low CPU limit enforced on Thanos Querier can lead to console dashboard timeouts.


#1 shall be fixed by increasing the default router timeout for the Grafana Route, adding the `haproxy.router.openshift.io/timeout: 5m` annotation [2].
#2 shall be fixed by bumping the CPU limit of Thanos Querier [3] in case it faces frequent CPU throttling. Both fixes are sketched below.
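
A sketch of fix #1 (the route name and namespace assume the default in-cluster monitoring stack):

```
# Fix #1: raise the router timeout on the Grafana route (annotation per [2])
oc -n openshift-monitoring annotate route grafana \
    haproxy.router.openshift.io/timeout=5m --overwrite
```

And a sketch of fix #2 (the CPU value is illustrative, and whether the `thanosQuerier.resources` field is supported depends on the installed version; see [3] for the supported options):

```
# Fix #2: raise the Thanos Querier CPU limit via the cluster monitoring ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    thanosQuerier:
      resources:
        limits:
          cpu: "1"   # illustrative value
```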


[1] https://github.com/openshift/prometheus/pull/125
[2] https://docs.openshift.com/container-platform/4.8/networking/routes/route-configuration.html#nw-configuring-route-timeouts_route-configuration
[3] https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html

Comment 13 Arunprasad Rajkumar 2022-04-26 15:14:38 UTC
I see that Grafana honours `dataproxy.timeout`, which is set as part of the main config file [1]. Increasing both the Grafana OpenShift Route timeout and the data proxy timeout should fix at least the Grafana timeout problem.


```
# grafana.ini

[dataproxy]
timeout = 120  # seconds
```

[1] https://github.com/openshift/cluster-monitoring-operator/blob/95f04c190f068badc1ff388d1a55a5f2dcb15af3/assets/grafana/config.yaml#L12
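
Assuming the above, the matching route-side change would look something like this (the value is illustrative; it only needs to cover the 120s data proxy timeout):

```
# route timeout must cover the 120s data proxy timeout set in grafana.ini
oc -n openshift-monitoring annotate route grafana \
    haproxy.router.openshift.io/timeout=3m --overwrite
```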