Bug 1934163
Summary: | Thanos Querier restarting and getting alert ThanosQueryHttpRequestQueryRangeErrorRateHigh | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Steven Walter <stwalter> |
Component: | Monitoring | Assignee: | Simon Pasquier <spasquie> |
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.6 | CC: | alegrand, anpicker, bshirren, cedric.girard, cruhm, erooth, kakkoyun, lcosic, mmohan, mrobson, pkrupa, rugouvei, spasquie |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | 4.8.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | No Doc Update | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2021-07-27 22:49:00 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Steven Walter
2021-03-02 15:57:10 UTC
Issue began occurring on an AWS cluster after changing machine types from m5.xlarge (16 GB RAM) to r5.xlarge (32 GB RAM).
Looking at the Thanos sidecar in the Prometheus pod, I see that for a while we were getting 503 responses:
2021-03-01T10:41:30.726676148Z level=warn ts=2021-03-01T10:41:30.726601548Z caller=sidecar.go:156 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
2021-03-01T10:41:30.726772389Z level=warn ts=2021-03-01T10:41:30.726747564Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"
After some time, these changed to "context deadline exceeded" messages, which I believe indicate timeouts or performance issues:
2021-03-01T10:41:33.368760942Z level=info ts=2021-03-01T10:41:33.36873604Z caller=intrumentation.go:48 msg="changing probe status" status=ready
2021-03-01T10:42:08.391325504Z level=warn ts=2021-03-01T10:42:08.381614768Z caller=sidecar.go:189 msg="heartbeat failed" err="perform GET request against http://localhost:9090/api/v1/status/config: Get \"http://localhost:9090/api/v1/status/config\": context deadline exceeded"
2021-03-01T10:48:09.018580750Z level=warn ts=2021-03-01T10:48:09.001576599Z caller=sidecar.go:189 msg="heartbeat failed" err="perform GET request against http://localhost:9090/api/v1/status/config: Get \"http://localhost:9090/api/v1/status/config\": context deadline exceeded"
I see some errors in Prometheus itself which might be of some interest.
In prometheus-k8s-0:
2021-03-01T11:19:19.388701979Z level=warn ts=2021-03-01T11:19:19.388Z caller=manager.go:641 component="rule manager" group=k8s.rules msg="Error on ingesting out-of-order result from rule evaluation" numDropped=78
2021-03-01T11:19:49.199923767Z level=warn ts=2021-03-01T11:19:49.199Z caller=manager.go:644 component="rule manager" group=k8s.rules msg="Error on ingesting results from rule evaluation with different value but same timestamp" numDropped=30
In prometheus-k8s-1:
2021-03-01T12:15:35.856362422Z 2021/03/01 12:15:35 oauthproxy.go:785: basicauth: 10.128.16.8:58446 Authorization header does not start with 'Basic', skipping basic authentication
2021-03-01T12:20:07.628059556Z 2021/03/01 12:20:07 oauthproxy.go:785: basicauth: 10.128.16.8:32806 Authorization header does not start with 'Basic', skipping basic authentication

Checked with 4.8.0-0.nightly-2021-03-30-023016: the "for" duration of the alert rules is 1h in all cases, and the severity is warning.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:2438
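For anyone who wants to repeat the verification step (alert "for" duration of 1h and severity warning), here is a rough sketch of how it could be scripted. It assumes you have already saved the JSON returned by the Prometheus rules endpoint (/api/v1/rules), where alerting rules carry a "duration" in seconds and a "labels" map. The function name, the "ThanosQuery" name prefix, and the sample data are my own assumptions for illustration; the comment above does not say exactly which rules were checked.

```python
import json

# Expected values taken from the verification comment: for=1h, severity=warning.
EXPECTED_FOR_SECONDS = 3600.0
EXPECTED_SEVERITY = "warning"


def find_mismatched_alerts(rules_response, name_prefix="ThanosQuery"):
    """Return (name, duration_seconds, severity) for alerting rules whose
    duration or severity differs from the expected values.

    rules_response is the parsed JSON body of /api/v1/rules.
    name_prefix is an assumed filter; adjust it to the rules you care about.
    """
    mismatches = []
    for group in rules_response.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            # Recording rules have no duration or severity; skip them.
            if rule.get("type") != "alerting":
                continue
            name = rule.get("name", "")
            if not name.startswith(name_prefix):
                continue
            duration = float(rule.get("duration", 0))
            severity = rule.get("labels", {}).get("severity")
            if duration != EXPECTED_FOR_SECONDS or severity != EXPECTED_SEVERITY:
                mismatches.append((name, duration, severity))
    return mismatches


if __name__ == "__main__":
    # Made-up sample, not output from a real cluster.
    sample = json.loads("""
    {"status": "success", "data": {"groups": [{"name": "example", "rules": [
      {"type": "alerting", "name": "ThanosQueryHttpRequestQueryRangeErrorRateHigh",
       "duration": 3600, "labels": {"severity": "warning"}},
      {"type": "alerting", "name": "ThanosQueryHypotheticalAlert",
       "duration": 300, "labels": {"severity": "critical"}}
    ]}]}}
    """)
    for name, duration, severity in find_mismatched_alerts(sample):
        print(f"{name}: for={duration}s severity={severity}")
```

An empty result would mean every matching alerting rule has the 1h duration and warning severity described above.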