Description of problem:
On a restricted cluster I very often see the error `query timed out in expression evaluation` when viewing the grafana-dashboard-cluster-total and grafana-dashboard-k8s-resources-cluster dashboards. This error is not seen on a connected cluster.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-03-10-002851

How reproducible:
Very often on a restricted cluster

Steps to Reproduce:
1. As cluster admin, view the grafana-dashboard-cluster-total and grafana-dashboard-k8s-resources-cluster dashboards on a restricted cluster

Actual results:
1. An error is shown on many charts:
An error occurred
query timed out in expression evaluation
The query GET request returns 503 Service Unavailable

Expected results:
1. Data should be loaded successfully and shown correctly

Additional info:
Created attachment 1669199 [details] query timeout 503
The charts do eventually load, so I am lowering the severity until we can tell whether this is a networking issue or not.
Can you please elaborate on what a 'restricted cluster' is and how to set one up? Thanks, I plan on looking into this next sprint.
I hope to work on this in the next sprint.
Hi Yadan, can you please elaborate on what a 'restricted cluster' is and the steps you go through to set one up? I see both your example clusters you were logged in as 'kubeadmin', so I don't believe you are talking about restricted from a user credentials point of view.
(In reply to David Taylor from comment #7)
> Hi Yadan, can you please elaborate on what a 'restricted cluster' is and the
> steps you go through to set one up? I see both your example clusters you
> were logged in as 'kubeadmin', so I don't believe you are talking about
> restricted from a user credentials point of view.

Hi David, the `restricted` cluster here means https://docs.openshift.com/container-platform/4.5/installing/installing_aws/installing-restricted-networks-aws.html#installation-about-restricted-networks_installing-restricted-networks-aws.
Sorry, I don't know more details about the installation process for a `restricted` cluster. I don't have a restricted cluster at hand right now; if we create one I will be happy to share it.
If the 503 is coming from the monitoring backend, this probably needs to be looked at by the monitoring team. cc Andy
The query string that the UI is generating has a parameter for timeout that seems to be set to 5s, so if the query takes longer than 5 seconds it'll fail.

https://console-openshift-console.apps.ocp-prd02-azeastus2.ecm-p.eu2.azure.tsc/api/prometheus/api/v1/query_range?...&timeout=5s

I tested changing the value by hand to something like 20s and the data was properly returned.
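For anyone who wants to repeat the manual test, the following is a minimal Python sketch of rewriting the `timeout` query parameter on a `query_range` URL. The host name and the query parameters in the example URL are hypothetical placeholders, not the values elided from the URL above; only the `/api/prometheus/api/v1/query_range` path and the `timeout` parameter come from this report.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_timeout(url: str, timeout: str) -> str:
    """Return `url` with its `timeout` query parameter set to `timeout`."""
    parts = urlsplit(url)
    # Keep every existing parameter except any previous timeout value.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "timeout"
    ]
    params.append(("timeout", timeout))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical host and query values, for illustration only.
example = (
    "https://console.example.com/api/prometheus/api/v1/query_range"
    "?query=up&start=0&end=60&step=15&timeout=5s"
)
print(with_timeout(example, "20s"))
```

The rewritten URL can then be requested with the same session credentials the console uses, to check whether the 503 goes away with the longer timeout.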
The current timeout of 5 seconds is probably much too strict. Prometheus has a maximum concurrent queries limit, which is currently set to 20. This should protect against overloading Prometheus, even if we increase the timeout significantly. Also, if there are more than 20 queries, some will be queued and the time they spend in the queue will count towards the 5 second limit, which is another reason to make the timeout less strict.
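The queueing effect can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not how Prometheus actually schedules queries: all queries arrive at the same moment, each needs the same run time, and a query that hits its timeout releases its slot at that point. The function name and parameters are hypothetical; the limit of 20 and the 5 second timeout are the values mentioned above.

```python
import heapq


def count_timeouts(n_queries, duration, timeout, max_concurrency=20):
    """Count queries whose queue wait plus run time exceeds `timeout`.

    Toy model: every query arrives at t=0 and needs `duration` seconds.
    """
    # Each heap entry is the time at which one execution slot becomes free.
    slots = [0.0] * max_concurrency
    heapq.heapify(slots)
    timed_out = 0
    for _ in range(n_queries):
        start = heapq.heappop(slots)
        if start >= timeout:
            # Timed out while still waiting in the queue; the slot stays free.
            timed_out += 1
            heapq.heappush(slots, start)
        elif start + duration > timeout:
            # Started running but was cut off when the timeout expired.
            timed_out += 1
            heapq.heappush(slots, timeout)
        else:
            heapq.heappush(slots, start + duration)
    return timed_out


# 60 queries of 2 seconds each: the third batch of 20 starts at t=4
# and cannot finish before the 5 second limit.
print(count_timeouts(60, 2.0, 5.0))   # 20
print(count_timeouts(60, 2.0, 20.0))  # 0
```

In this model no single query is slow, yet a third of them fail with the 5 second timeout purely because of time spent waiting for a slot.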
1. Launched a 4.7 restricted cluster
2. Viewed the grafana-dashboard-cluster-total and grafana-dashboard-k8s-resources-cluster dashboards; the error 'query timed out in expression evaluation' doesn't appear

Verified on 4.7.0-0.nightly-2020-11-09-235738
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633