Description of problem:
On one side, Grafana has a default timeout of 30s when querying the Prometheus datasource. On the other side, the (default) query timeout of Prometheus is 2 minutes. This means that when a dashboard query takes more than 30s to return, Grafana fails with "no data" + "bad gateway" while the backend is still processing the request.

Version-Release number of selected component (if applicable): 4.8

How reproducible: Not always

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
There's a timeout parameter in grafana.ini under the [dataproxy] section [1] as well as a timeout parameter in the datasource configuration [2]. However, initial testing on OCP 4.10 (Grafana v8.3.4) seems to indicate that increasing these parameters doesn't have any impact on the issue. Upstream issue #34177 [3] looks similar, but it's supposed to be fixed in v8.1.0 and later [4].

[1] https://grafana.com/docs/grafana/latest/administration/configuration/#dataproxy
[2] https://grafana.com/docs/grafana/latest/administration/provisioning/#json-data
[3] https://github.com/grafana/grafana/issues/34177
[4] https://github.com/grafana/grafana/commit/91657dad182127bf577b449ff9d94e5bf86e592a
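For reference, the datasource-level timeout mentioned in [2] is set via `jsonData` when the datasource is provisioned. A minimal sketch (the name, namespace URL, and 120s value below are illustrative assumptions, not the actual cluster configuration):

```yaml
# provisioning/datasources/prometheus.yaml (illustrative)
apiVersion: 1
datasources:
  - name: prometheus
    type: prometheus
    access: proxy
    # assumed in-cluster Thanos Querier endpoint; replace with the real URL
    url: https://thanos-querier.openshift-monitoring.svc:9091
    jsonData:
      timeout: 120  # per-datasource HTTP request timeout, in seconds
```

Note this only raises the datasource-side limit; as observed above, it did not resolve the issue on its own.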
We tried to replicate the problem by adding an artificial delay [1] to the Prometheus query path (which simulates a slow query). In this exercise we found two key causes:

1. The OpenShift Route that exposes Grafana has a default timeout of 30s [2], which can return "Gateway timeout" for long-running / slow data proxy requests.
2. A low CPU limit enforced on Thanos Querier may lead to Console Dashboard timeouts.

#1 shall be fixed by increasing the default router timeout for the Grafana Route by adding the `haproxy.router.openshift.io/timeout: 5m` annotation [2].
#2 shall be fixed by bumping the CPU limit of Thanos Querier [3] in case it faces frequent CPU throttling.

[1] https://github.com/openshift/prometheus/pull/125
[2] https://docs.openshift.com/container-platform/4.8/networking/routes/route-configuration.html#nw-configuring-route-timeouts_route-configuration
[3] https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html
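The Route annotation for fix #1 can be sketched as follows (the route name and namespace are assumptions; only the annotation key/value comes from the route-timeout docs cited above):

```yaml
# Route manifest fragment (illustrative names)
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: grafana                      # assumed route name
  namespace: openshift-monitoring    # assumed namespace
  annotations:
    # raise the HAProxy router timeout from the 30s default to 5 minutes
    haproxy.router.openshift.io/timeout: 5m
```

Equivalently, the annotation can be applied in place with `oc annotate route/<name> haproxy.router.openshift.io/timeout=5m`.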
I see that Grafana honours dataproxy.timeout, which is set in the main config file [1]. Increasing both the Grafana OpenShift Route timeout and the data proxy timeout should fix at least the Grafana timeout problem.

```
# grafana.ini
[dataproxy]
timeout = 120  # seconds
```

[1] https://github.com/openshift/cluster-monitoring-operator/blob/95f04c190f068badc1ff388d1a55a5f2dcb15af3/assets/grafana/config.yaml#L12