Bug 1982757
| Summary: | thanos querier container restarting | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Steven Walter <stwalter> |
| Component: | Monitoring | Assignee: | Simon Pasquier <spasquie> |
| Status: | CLOSED DUPLICATE | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.7 | CC: | alegrand, amuller, anpicker, aos-bugs, erooth, kakkoyun, pgough, pkrupa |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-07-15 16:45:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
*** This bug has been marked as a duplicate of bug 1980888 *** |
Description of problem: Thanos Querier container is restarting periodically. When it occurs, the livenessprobe times out causing the restart: Warning BackOff 66m (x3403 over 2d20h) kubelet Back-off restarting failed container Warning Unhealthy 5m50s (x4243 over 2d21h) kubelet Liveness probe failed: command timed out Warning Unhealthy 107s (x4198 over 2d21h) kubelet Readiness probe failed: command timed out Version-Release number of selected component (if applicable): 4.7 How reproducible: Unconfirmed Actual results: Thanos-querier logs appear healthy up to the point of failure: 2021-07-14T19:39:31.614681877Z level=info ts=2021-07-14T19:39:31.614648829Z caller=storeset.go:384 component=storeset msg="adding new rulesAPI to query storeset" address=10.130.2.29:10901 2021-07-14T19:39:31.614711403Z level=info ts=2021-07-14T19:39:31.614671353Z caller=storeset.go:387 component=storeset msg="adding new storeAPI to query storeset" address=10.130.2.29:10901 extLset="{prometheus=\"openshift-user-workload-monitoring/user-workload\", prometheus_replica=\"prometheus-user-workload-1\"}" 2021-07-14T19:44:51.516582248Z level=info ts=2021-07-14T19:44:51.516452481Z caller=main.go:168 msg="caught signal. Exiting." signal=terminated prometheus-k8s-0 and -k8s-1 logs seem normal. The only suspicious thing is, a few hours previously, they show heartbeat failures: 2021-07-14T16:11:32.405879906Z level=warn ts=2021-07-14T16:11:32.405614686Z caller=sidecar.go:179 msg="heartbeat failed" err="perform GET request against http://localhost:9090/api/v1/status/config: Get \"http://localhost:9090/api/v1/status/config\": context deadline exceeded" But these do not continue. Let us know if you need any data as I am not sure how else to troubleshoot thanos querier.