Bug 1812412 - Monitoring Dashboard: on restricted cluster, query timed out in expression evaluation
Summary: Monitoring Dashboard: on restricted cluster, query timed out in expression evaluation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Management Console
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Andrew Pickering
QA Contact: Yadan Pei
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-11 09:40 UTC by Yadan Pei
Modified: 2023-12-15 17:29 UTC (History)
11 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:10:58 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
query timeout 503 (508.79 KB, image/png)
2020-03-11 09:47 UTC, Yadan Pei


Links
System ID Private Priority Status Summary Last Updated
Github openshift console pull 7004 0 None closed Bug 1812412: Monitoring: Increase Prometheus query_range timeouts to 30s 2021-01-05 19:23:57 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:11:53 UTC

Description Yadan Pei 2020-03-11 09:40:45 UTC
Description of problem:
On a restricted cluster I see the error `query timed out in expression evaluation` very often when viewing the grafana-dashboard-cluster-total and grafana-dashboard-k8s-resources-cluster dashboards. This error is not seen on a connected cluster.

Version-Release number of selected component (if applicable):
    4.4.0-0.nightly-2020-03-10-002851

How reproducible:
Very often on a restricted cluster

Steps to Reproduce:
1. As cluster admin, view the grafana-dashboard-cluster-total and grafana-dashboard-k8s-resources-cluster dashboards on a restricted cluster

Actual results:
1. Many charts show the error:
An error occurred
query timed out in expression evaluation

The query GET request returns 503 Service Unavailable


Expected results:
1. data should be loaded successfully and shown correctly

Additional info:

Comment 2 Yadan Pei 2020-03-11 09:47:20 UTC
Created attachment 1669199 [details]
query timeout 503

Comment 3 Yadan Pei 2020-03-11 09:49:02 UTC
The charts do eventually load, though. Lowering the severity until we can determine whether or not this is a networking issue.

Comment 4 David Taylor 2020-05-07 19:01:41 UTC
Can you please elaborate on what a 'restricted cluster' is and how to set one up?
Thanks, I plan on looking into this next sprint.

Comment 6 David Taylor 2020-05-29 15:00:09 UTC
I hope to work on this in the next sprint.

Comment 7 David Taylor 2020-07-23 18:05:26 UTC
Hi Yadan, can you please elaborate on what a 'restricted cluster' is and the steps you go through to set one up? I see that in both your example clusters you were logged in as 'kubeadmin', so I don't believe you mean restricted from a user-credentials point of view.

Comment 8 Yadan Pei 2020-07-27 06:17:37 UTC
(In reply to David Taylor from comment #7)
> Hi Yadan, can you please elaborate on what a 'restricted cluster' is and the
> steps you go through to set one up?  I see both your example clusters you
> were logged in as 'kubeadmin', so I don't believe you are talking about
> restricted from a user credentials point of view.

Hi David, the `restricted` cluster here means https://docs.openshift.com/container-platform/4.5/installing/installing_aws/installing-restricted-networks-aws.html#installation-about-restricted-networks_installing-restricted-networks-aws. Sorry, I don't know more details about the installation process for a `restricted` cluster.

I don't have a restricted cluster at hand right now; if we create one I will be happy to share it.

Comment 9 Samuel Padgett 2020-10-02 12:45:53 UTC
If the 503 is coming from the monitoring backend, this probably needs to be looked at by the monitoring team.

cc Andy

Comment 10 Nick Curry 2020-10-26 19:25:47 UTC
The query string that the UI generates includes a timeout parameter, which seems to be set to 5s, so any query that takes longer than 5 seconds fails.

https://console-openshift-console.apps.ocp-prd02-azeastus2.ecm-p.eu2.azure.tsc/api/prometheus/api/v1/query_range?...&timeout=5s

I tested changing the value by hand to something like 20s and the data was properly returned.
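Nick's observation can be sketched as a small shell snippet that just shows where the parameter sits in the query_range URL (the console host and the PromQL query below are hypothetical placeholders; 30s is the value the linked PR 7004 eventually moved to):

```shell
# Hypothetical console route and query; only the timeout parameter matters here.
CONSOLE_HOST="console-openshift-console.apps.example.com"
QUERY='sum(rate(container_network_receive_bytes_total[5m]))'

# Build the query_range URL the console proxies to Prometheus,
# with the given timeout value appended.
build_url() {
  echo "https://${CONSOLE_HOST}/api/prometheus/api/v1/query_range?query=${QUERY}&timeout=$1"
}

build_url 5s    # the original, too-strict timeout
build_url 30s   # the value the fix raised it to
```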

Comment 11 Andrew Pickering 2020-10-27 08:51:06 UTC
The current timeout of 5 seconds is probably much too strict.

Prometheus has a maximum concurrent queries limit, which is currently set to 20. This should protect against overloading Prometheus, even if we increase the timeout significantly.

Also, if there are more than 20 queries, some will be queued and the time they spend in the queue will count towards the 5 second limit, which is another reason to make the timeout less strict.
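For reference, the two server-side knobs mentioned above correspond to upstream Prometheus flags; a sketch of the defaults (illustrative only; on OpenShift the cluster-monitoring operator manages these, they are not set by hand):

```shell
# Upstream Prometheus server flags relevant to the comment above:
#
#   prometheus \
#     --query.max-concurrency=20 \
#     --query.timeout=2m
#
# --query.max-concurrency caps the number of concurrently executing queries.
# --query.timeout is the server-side ceiling per query; the client-supplied
# `timeout` URL parameter can only lower it, and time spent queued behind the
# concurrency limit counts against that budget.
```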

Comment 13 Yadan Pei 2020-11-10 05:21:55 UTC
1. Launched a 4.7 restricted cluster
2. View the grafana-dashboard-cluster-total and grafana-dashboard-k8s-resources-cluster dashboards; the error 'query timed out in expression evaluation' does not appear

Verified on 4.7.0-0.nightly-2020-11-09-235738

Comment 16 errata-xmlrpc 2021-02-24 15:10:58 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

