Description of problem:
thanos-querier is restarting occasionally and the following alert fires:

Labels
  alertname = ThanosQueryHttpRequestQueryRangeErrorRateHigh
  prometheus = openshift-monitoring/k8s
  severity = critical
Annotations
  message = Thanos Query is failing to handle 8.889% of "query_range" requests.
Source

Version-Release number of selected component (if applicable):
4.6

Actual results:
The logs of the thanos-querier before the restart:

2021-03-02T09:57:45.411906622Z level=info ts=2021-03-02T09:57:45.411839847Z caller=main.go:223 msg="caught signal. Exiting." signal=terminated
2021-03-02T09:57:45.411951542Z level=warn ts=2021-03-02T09:57:45.411934211Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason=null
2021-03-02T09:57:45.411974369Z level=info ts=2021-03-02T09:57:45.411953055Z caller=http.go:64 service=http/server component=query msg="internal server is shutting down" err=null
2021-03-02T09:57:45.543962653Z level=warn ts=2021-03-02T09:57:45.543891219Z caller=storeset.go:488 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.129.16.118:10901: rpc error: code = Canceled desc = context canceled" address=10.129.16.118:10901
2021-03-02T09:57:45.544014380Z level=warn ts=2021-03-02T09:57:45.54396198Z caller=storeset.go:488 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.130.8.29:10901: rpc error: code = Canceled desc = context canceled" address=10.130.8.29:10901
2021-03-02T09:57:45.544059905Z level=warn ts=2021-03-02T09:57:45.544008906Z caller=runutil.go:123 component=storeset msg="detected close error" err="store 10.129.16.118:10901 connection close: rpc error: code = Canceled desc = grpc: the client connection is closing"
2021-03-02T09:57:45.544086547Z level=info ts=2021-03-02T09:57:45.544064502Z caller=storeset.go:389 component=storeset msg="removing store because it's unhealthy or does not exist" address=10.129.16.118:10901 extLset="{prometheus=\"openshift-monitoring/k8s\", prometheus_replica=\"prometheus-k8s-1\"}"
2021-03-02T09:57:45.544108471Z level=warn ts=2021-03-02T09:57:45.544087378Z caller=runutil.go:123 component=storeset msg="detected close error" err="store 10.130.8.29:10901 connection close: rpc error: code = Canceled desc = grpc: the client connection is closing"
2021-03-02T09:57:45.544130510Z level=info ts=2021-03-02T09:57:45.544109834Z caller=storeset.go:389 component=storeset msg="removing store because it's unhealthy or does not exist" address=10.130.8.29:10901 extLset="{prometheus=\"openshift-monitoring/k8s\", prometheus_replica=\"prometheus-k8s-0\"}"

Expected results:

Additional info:
I will post a link to the must-gather.
The issue began occurring on an AWS cluster after changing machine types from m5.xlarge (16 GB RAM) to r5.xlarge (32 GB RAM).

Looking at the Thanos sidecar in the Prometheus pod, I see that for a while we were getting 503 messages:

2021-03-01T10:41:30.726676148Z level=warn ts=2021-03-01T10:41:30.726601548Z caller=sidecar.go:156 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
2021-03-01T10:41:30.726772389Z level=warn ts=2021-03-01T10:41:30.726747564Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"

After some time, these changed to "context deadline exceeded" messages, which I believe indicates timeout/performance issues:

2021-03-01T10:41:33.368760942Z level=info ts=2021-03-01T10:41:33.36873604Z caller=intrumentation.go:48 msg="changing probe status" status=ready
2021-03-01T10:42:08.391325504Z level=warn ts=2021-03-01T10:42:08.381614768Z caller=sidecar.go:189 msg="heartbeat failed" err="perform GET request against http://localhost:9090/api/v1/status/config: Get \"http://localhost:9090/api/v1/status/config\": context deadline exceeded"
2021-03-01T10:48:09.018580750Z level=warn ts=2021-03-01T10:48:09.001576599Z caller=sidecar.go:189 msg="heartbeat failed" err="perform GET request against http://localhost:9090/api/v1/status/config: Get \"http://localhost:9090/api/v1/status/config\": context deadline exceeded"

I also see some errors in Prometheus itself which might be of interest.

In prometheus-k8s-0:

2021-03-01T11:19:19.388701979Z level=warn ts=2021-03-01T11:19:19.388Z caller=manager.go:641 component="rule manager" group=k8s.rules msg="Error on ingesting out-of-order result from rule evaluation" numDropped=78
2021-03-01T11:19:49.199923767Z level=warn ts=2021-03-01T11:19:49.199Z caller=manager.go:644 component="rule manager" group=k8s.rules msg="Error on ingesting results from rule evaluation with different value but same timestamp" numDropped=30

In prometheus-k8s-1:

2021-03-01T12:15:35.856362422Z 2021/03/01 12:15:35 oauthproxy.go:785: basicauth: 10.128.16.8:58446 Authorization header does not start with 'Basic', skipping basic authentication
2021-03-01T12:20:07.628059556Z 2021/03/01 12:20:07 oauthproxy.go:785: basicauth: 10.128.16.8:32806 Authorization header does not start with 'Basic', skipping basic authentication
Checked with 4.8.0-0.nightly-2021-03-30-023016: the duration of each of these alert rules is now 1h, and the severity is warning.
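For reference, a minimal sketch of what the adjusted rule could look like in a PrometheusRule object, reflecting the 1h duration and warning severity verified above. The expression is only an approximation of the upstream Thanos mixin query; the job/handler label values and the 5% threshold are assumptions, not the exact rule shipped by the cluster monitoring operator:

  - alert: ThanosQueryHttpRequestQueryRangeErrorRateHigh
    annotations:
      message: Thanos Query is failing to handle {{ $value | humanize }}% of "query_range" requests.
    # expr below is an approximation; label matchers and threshold are assumptions
    expr: |
      (
        sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-querier.*", handler="query_range"}[5m]))
      /
        sum by (job) (rate(http_requests_total{job=~".*thanos-querier.*", handler="query_range"}[5m]))
      ) * 100 > 5
    for: 1h
    labels:
      severity: warning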
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438