Bug 1903423 - thanos-querier API is not stable in UI and there are "operation was canceled" errors in thanos-querier logs
Summary: thanos-querier API is not stable in UI and there are "operation was canceled" errors in thanos-querier logs
Keywords:
Status: CLOSED DUPLICATE of bug 1897252
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-02 03:22 UTC by Junqi Zhao
Modified: 2020-12-02 09:14 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-02 09:14:27 UTC
Target Upstream Version:
Embargoed:


Attachments
thanos-querier logs (2.81 MB, text/plain)
2020-12-02 03:22 UTC, Junqi Zhao
thanos-querier deployment file (19.57 KB, text/plain)
2020-12-02 03:24 UTC, Junqi Zhao
metrics are shown sometimes (130.49 KB, image/png)
2020-12-02 03:25 UTC, Junqi Zhao
no metrics after a while (100.45 KB, image/png)
2020-12-02 03:25 UTC, Junqi Zhao

Description Junqi Zhao 2020-12-02 03:22:17 UTC
Created attachment 1735441 [details]
thanos-querier logs

Description of problem:
Log in to the cluster console and check the "Cluster utilization" section on the "Home -> Overview" page. During the first few hours, the metrics are sometimes shown and sometimes the graphs report "No datapoints found."
Take CPU as an example; the request issued by the console is: /api/prometheus/api/v1/query?query=sum(cluster:cpu_usage_cores:sum)
Debugging from the console, when the graph shows "No datapoints found.", the HTTP response is:
{"status":"success","data":{"resultType":"vector","result":[]},"warnings":["No StoreAPIs matched for this query"]}

We find the same warnings in the thanos-querier logs:
# oc -n openshift-monitoring logs thanos-querier-6d6b76b6b4-qn5k6 -c thanos-query
...
level=error ts=2020-12-02T01:23:32.963475756Z caller=query.go:397 msg="failed to resolve addresses for rulesAPIs" err="lookup SRV records \"_grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local\": lookup _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local on 172.30.0.10:53: dial udp 172.30.0.10:53: operation was canceled"
...
level=warn ts=2020-12-02T01:31:03.254960034Z caller=storeset.go:456 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.129.2.22:10901: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 10.129.2.22:10901: connect: no route to host\"" address=10.129.2.22:10901
level=warn ts=2020-12-02T01:31:08.255238189Z caller=storeset.go:456 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.129.2.22:10901: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 10.129.2.22:10901: connect: no route to host\"" address=10.129.2.22:10901
level=warn ts=2020-12-02T01:31:09.870925079Z caller=proxy.go:287 err="No StoreAPIs matched for this query" stores=
level=warn ts=2020-12-02T01:31:09.870925731Z caller=proxy.go:287 err="No StoreAPIs matched for this query" stores=
....
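The two failure modes in the log excerpt (canceled SRV lookups against 172.30.0.10 and a store endpoint unreachable with "no route to host") can be checked by hand. The commands below are only a sketch: they reuse the prometheus-k8s-0 pod because it is already used for curl later in this report, dig is not necessarily present in that image, and 10.129.2.22:10901 is simply the address copied from the log.

Resolve the SRV record that thanos-query fails to look up:
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- dig SRV _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local @172.30.0.10

Check plain TCP/TLS reachability of the store endpoint from the same pod:
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -kv --max-time 5 https://10.129.2.22:10901/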

After a few hours, the dashboard shows "No datapoints found." for all metrics, and checking the API directly confirms that it is not stable:

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`;
# for i in {1..10}; do echo $i; str=`oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep "cluster:cpu_usage_cores:sum"`; if [ -z "$str" ]; then echo "don't find cluster:cpu_usage_cores:sum"; else echo "find cluster:cpu_usage_cores:sum"; fi; sleep 10s; done
1
find cluster:cpu_usage_cores:sum
2
find cluster:cpu_usage_cores:sum
3
don't find cluster:cpu_usage_cores:sum
4
don't find cluster:cpu_usage_cores:sum
5
don't find cluster:cpu_usage_cores:sum
6
don't find cluster:cpu_usage_cores:sum
7
find cluster:cpu_usage_cores:sum
8
don't find cluster:cpu_usage_cores:sum
9
find cluster:cpu_usage_cores:sum
10
don't find cluster:cpu_usage_cores:sum
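The same intermittency is visible on the query endpoint that the console uses, not just on the label values endpoint. This is a sketch (not part of the original reproduction) reusing the token defined above; when it fails, the response carries the same "No StoreAPIs matched for this query" warning:

# for i in {1..10}; do echo $i; oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -sk -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=sum(cluster:cpu_usage_cores:sum)' | jq -c '.data.result, .warnings'; sleep 10s; done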


Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-11-30-172451

How reproducible:
always

Steps to Reproduce:
1. See the description.

Actual results:
The "Cluster utilization" graphs intermittently show "No datapoints found.", the thanos-querier API intermittently fails to return metrics, and the thanos-querier logs contain "operation was canceled" and "No StoreAPIs matched for this query" errors.

Expected results:
The thanos-querier API is stable and the console consistently shows the cluster utilization metrics.

Additional info:

Comment 1 Junqi Zhao 2020-12-02 03:24:14 UTC
Created attachment 1735442 [details]
thanos-querier deployment file

Comment 2 Junqi Zhao 2020-12-02 03:25:05 UTC
Created attachment 1735443 [details]
metrics are shown sometimes

Comment 3 Junqi Zhao 2020-12-02 03:25:48 UTC
Created attachment 1735444 [details]
no metrics after a while

Comment 4 Junqi Zhao 2020-12-02 03:31:58 UTC
We also see that these 2 alerts are triggered:
"alertname": "ThanosQueryGrpcClientErrorRate", "description": "Thanos Query thanos-querier is failing to send 45.76% of requests."
"alertname": "ThanosQueryHighDNSFailures", "description": "Thanos Query thanos-querier have 100% of failing DNS queries for store endpoints."

Comment 5 Simon Pasquier 2020-12-02 09:14:27 UTC
AFAICT it's similar to bug 1897252, which is actively being worked on.

*** This bug has been marked as a duplicate of bug 1897252 ***

