Description of problem:
thanos-querier is restarting occasionally and the following alert fires:

Labels
  alertname = ThanosQueryHttpRequestQueryRangeErrorRateHigh
  prometheus = openshift-monitoring/k8s
  severity = critical
Annotations
  message = Thanos Query is failing to handle 8.889% of "query_range" requests.
Source

Version-Release number of selected component (if applicable):
4.6

Actual results:
The logs of the thanos-querier before the restart:

2021-03-02T09:57:45.411906622Z level=info ts=2021-03-02T09:57:45.411839847Z caller=main.go:223 msg="caught signal. Exiting." signal=terminated
2021-03-02T09:57:45.411951542Z level=warn ts=2021-03-02T09:57:45.411934211Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason=null
2021-03-02T09:57:45.411974369Z level=info ts=2021-03-02T09:57:45.411953055Z caller=http.go:64 service=http/server component=query msg="internal server is shutting down" err=null
2021-03-02T09:57:45.543962653Z level=warn ts=2021-03-02T09:57:45.543891219Z caller=storeset.go:488 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.129.16.118:10901: rpc error: code = Canceled desc = context canceled" address=10.129.16.118:10901
2021-03-02T09:57:45.544014380Z level=warn ts=2021-03-02T09:57:45.54396198Z caller=storeset.go:488 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.130.8.29:10901: rpc error: code = Canceled desc = context canceled" address=10.130.8.29:10901
2021-03-02T09:57:45.544059905Z level=warn ts=2021-03-02T09:57:45.544008906Z caller=runutil.go:123 component=storeset msg="detected close error" err="store 10.129.16.118:10901 connection close: rpc error: code = Canceled desc = grpc: the client connection is closing"
2021-03-02T09:57:45.544086547Z level=info ts=2021-03-02T09:57:45.544064502Z caller=storeset.go:389 component=storeset msg="removing store because it's unhealthy or does not exist" address=10.129.16.118:10901 extLset="{prometheus=\"openshift-monitoring/k8s\", prometheus_replica=\"prometheus-k8s-1\"}"
2021-03-02T09:57:45.544108471Z level=warn ts=2021-03-02T09:57:45.544087378Z caller=runutil.go:123 component=storeset msg="detected close error" err="store 10.130.8.29:10901 connection close: rpc error: code = Canceled desc = grpc: the client connection is closing"
2021-03-02T09:57:45.544130510Z level=info ts=2021-03-02T09:57:45.544109834Z caller=storeset.go:389 component=storeset msg="removing store because it's unhealthy or does not exist" address=10.130.8.29:10901 extLset="{prometheus=\"openshift-monitoring/k8s\", prometheus_replica=\"prometheus-k8s-0\"}"

Expected results:

Additional info:
I will post a link to the must-gather.
The issue began occurring on an AWS cluster after changing machine types from m5.xlarge (16 GB RAM) to r5.xlarge (32 GB RAM).

Looking at the Thanos sidecar in the Prometheus pod, I see that for a while we were getting 503 messages:

2021-03-01T10:41:30.726676148Z level=warn ts=2021-03-01T10:41:30.726601548Z caller=sidecar.go:156 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
2021-03-01T10:41:30.726772389Z level=warn ts=2021-03-01T10:41:30.726747564Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"

After some time, these changed to "context deadline exceeded" messages, which I believe indicates timeout/performance issues:

2021-03-01T10:41:33.368760942Z level=info ts=2021-03-01T10:41:33.36873604Z caller=intrumentation.go:48 msg="changing probe status" status=ready
2021-03-01T10:42:08.391325504Z level=warn ts=2021-03-01T10:42:08.381614768Z caller=sidecar.go:189 msg="heartbeat failed" err="perform GET request against http://localhost:9090/api/v1/status/config: Get \"http://localhost:9090/api/v1/status/config\": context deadline exceeded"
2021-03-01T10:48:09.018580750Z level=warn ts=2021-03-01T10:48:09.001576599Z caller=sidecar.go:189 msg="heartbeat failed" err="perform GET request against http://localhost:9090/api/v1/status/config: Get \"http://localhost:9090/api/v1/status/config\": context deadline exceeded"

I also see some errors in Prometheus itself which might be of interest.

In prometheus-k8s-0:

2021-03-01T11:19:19.388701979Z level=warn ts=2021-03-01T11:19:19.388Z caller=manager.go:641 component="rule manager" group=k8s.rules msg="Error on ingesting out-of-order result from rule evaluation" numDropped=78
2021-03-01T11:19:49.199923767Z level=warn ts=2021-03-01T11:19:49.199Z caller=manager.go:644 component="rule manager" group=k8s.rules msg="Error on ingesting results from rule evaluation with different value but same timestamp" numDropped=30

In prometheus-k8s-1:

2021-03-01T12:15:35.856362422Z 2021/03/01 12:15:35 oauthproxy.go:785: basicauth: 10.128.16.8:58446 Authorization header does not start with 'Basic', skipping basic authentication
2021-03-01T12:20:07.628059556Z 2021/03/01 12:20:07 oauthproxy.go:785: basicauth: 10.128.16.8:32806 Authorization header does not start with 'Basic', skipping basic authentication
Checked with 4.8.0-0.nightly-2021-03-30-023016: the duration of each of these alert rules is now 1h, and the severity is warning.
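For reference, a minimal sketch of what the adjusted rule could look like in a PrometheusRule object, reflecting the 1h duration and warning severity verified above. The expression is only an approximation of the upstream Thanos mixin query; the job/handler label values and the 5% threshold are assumptions, not the exact rule shipped by the cluster monitoring operator:

  - alert: ThanosQueryHttpRequestQueryRangeErrorRateHigh
    annotations:
      message: Thanos Query is failing to handle {{ $value | humanize }}% of "query_range" requests.
    # expr below is an approximation; label matchers and threshold are assumptions
    expr: |
      (
        sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-querier.*", handler="query_range"}[5m]))
      /
        sum by (job) (rate(http_requests_total{job=~".*thanos-querier.*", handler="query_range"}[5m]))
      ) * 100 > 5
    for: 1h
    labels:
      severity: warning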
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438