Created attachment 1735441 [details]
thanos-querier logs

Description of problem:
Log in to the cluster console and check the "Home -> Overview" page, "Cluster utilization" section. In the first few hours, the metrics are sometimes shown, but sometimes the graphs report "No datapoints found."

Take CPU as an example; its expression is:
/api/prometheus/api/v1/query?query=sum(cluster:cpu_usage_cores:sum)

Debugging from the console, when the graph shows "No datapoints found.", the HTTP response is:
{"status":"success","data":{"resultType":"vector","result":[]},"warnings":["No StoreAPIs matched for this query"]}

We find the same warnings in the thanos-querier logs:
# oc -n openshift-monitoring logs thanos-querier-6d6b76b6b4-qn5k6 -c thanos-query
...
level=error ts=2020-12-02T01:23:32.963475756Z caller=query.go:397 msg="failed to resolve addresses for rulesAPIs" err="lookup SRV records \"_grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local\": lookup _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local on 172.30.0.10:53: dial udp 172.30.0.10:53: operation was canceled"
...
level=warn ts=2020-12-02T01:31:03.254960034Z caller=storeset.go:456 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.129.2.22:10901: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 10.129.2.22:10901: connect: no route to host\"" address=10.129.2.22:10901
level=warn ts=2020-12-02T01:31:08.255238189Z caller=storeset.go:456 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.129.2.22:10901: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 10.129.2.22:10901: connect: no route to host\"" address=10.129.2.22:10901
level=warn ts=2020-12-02T01:31:09.870925079Z caller=proxy.go:287 err="No StoreAPIs matched for this query" stores=
level=warn ts=2020-12-02T01:31:09.870925731Z caller=proxy.go:287 err="No StoreAPIs matched for this query" stores=
...

After a few hours, the dashboard shows "No datapoints found." for all metrics, and the Thanos Querier API is not stable:
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# for i in {1..10}; do
    echo $i
    str=`oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep "cluster:cpu_usage_cores:sum"`
    if [ -z "$str" ]; then echo "don't find cluster:cpu_usage_cores:sum"; else echo "find cluster:cpu_usage_cores:sum"; fi
    sleep 10s
  done
1 find cluster:cpu_usage_cores:sum
2 find cluster:cpu_usage_cores:sum
3 don't find cluster:cpu_usage_cores:sum
4 don't find cluster:cpu_usage_cores:sum
5 don't find cluster:cpu_usage_cores:sum
6 don't find cluster:cpu_usage_cores:sum
7 find cluster:cpu_usage_cores:sum
8 don't find cluster:cpu_usage_cores:sum
9 find cluster:cpu_usage_cores:sum
10 don't find cluster:cpu_usage_cores:sum

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-11-30-172451

How reproducible:
Always

Steps to Reproduce:
1. See the "Description of problem" section above.

Actual results:
The "Cluster utilization" graphs and the Thanos Querier API intermittently return no data ("No datapoints found." / "No StoreAPIs matched for this query").

Expected results:
Metrics are returned consistently.

Additional info:
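As a quick way to tell this failure mode apart from genuinely missing data, the query response itself can be inspected: a "success" status with an empty result vector plus the "No StoreAPIs matched for this query" warning indicates thanos-querier currently sees no store endpoints (matching the DNS/store discovery errors in the logs). A minimal jq sketch, using the sample response captured above (the endpoint URL and pod names are cluster-specific):

```shell
# Sample response captured from the console when the graph is empty
# (taken from the description above).
resp='{"status":"success","data":{"resultType":"vector","result":[]},"warnings":["No StoreAPIs matched for this query"]}'

# Empty result vector + "No StoreAPIs matched" warning => store discovery
# failure in thanos-querier, not an actually empty metric.
echo "$resp" | jq -r '
  if (.data.result | length) == 0
     and ((.warnings // []) | any(contains("No StoreAPIs matched")))
  then "store discovery failure"
  else "data returned"
  end'
# prints: store discovery failure
```

Piping the output of the curl call from the loop above into this filter (instead of grep) would distinguish "no store endpoints" from a normal empty answer on each iteration.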
Created attachment 1735442 [details] thanos-querier deployment file
Created attachment 1735443 [details] metrics are shown sometimes
Created attachment 1735444 [details] no metrics after a while
We also see that these 2 alerts are triggered:
"alertname": "ThanosQueryGrpcClientErrorRate",
"description": "Thanos Query thanos-querier is failing to send 45.76% of requests."
"alertname": "ThanosQueryHighDNSFailures",
"description": "Thanos Query thanos-querier have 100% of failing DNS queries for store endpoints."
AFAICT it's similar to bug 1897252, which is actively being worked on. *** This bug has been marked as a duplicate of bug 1897252 ***