Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1982757

Summary: thanos querier container restarting
Product: OpenShift Container Platform
Reporter: Steven Walter <stwalter>
Component: Monitoring
Assignee: Simon Pasquier <spasquie>
Status: CLOSED DUPLICATE
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Priority: unspecified
Version: 4.7
CC: alegrand, amuller, anpicker, aos-bugs, erooth, kakkoyun, pgough, pkrupa
Last Closed: 2021-07-15 16:45:51 UTC
Type: Bug

Description Steven Walter 2021-07-15 15:55:51 UTC
Description of problem:
The Thanos Querier container is restarting periodically. When this occurs, the liveness probe times out, causing the kubelet to restart the container:


  Warning  BackOff    66m (x3403 over 2d20h)    kubelet  Back-off restarting failed container
  Warning  Unhealthy  5m50s (x4243 over 2d21h)  kubelet  Liveness probe failed: command timed out
  Warning  Unhealthy  107s (x4198 over 2d21h)   kubelet  Readiness probe failed: command timed out
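
For reference, the "command timed out" wording in the probe failures indicates exec-style probes (a command run inside the container) rather than HTTP probes, so the configured command and timeoutSeconds can be checked against the deployment. A minimal sketch using standard oc commands, assuming the default openshift-monitoring namespace and deployment name:

# Dump the liveness/readiness probe definitions (command, timeoutSeconds, periodSeconds):
oc -n openshift-monitoring get deployment thanos-querier -o yaml | grep -B2 -A8 'Probe:'

# Cross-check restart counts on the querier pods:
oc -n openshift-monitoring get pods | grep thanos-querier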


Version-Release number of selected component (if applicable):
4.7

How reproducible:
Unconfirmed


Actual results:

Thanos-querier logs appear healthy up to the point of failure:

2021-07-14T19:39:31.614681877Z level=info ts=2021-07-14T19:39:31.614648829Z caller=storeset.go:384 component=storeset msg="adding new rulesAPI to query storeset" address=10.130.2.29:10901
2021-07-14T19:39:31.614711403Z level=info ts=2021-07-14T19:39:31.614671353Z caller=storeset.go:387 component=storeset msg="adding new storeAPI to query storeset" address=10.130.2.29:10901 extLset="{prometheus=\"openshift-user-workload-monitoring/user-workload\", prometheus_replica=\"prometheus-user-workload-1\"}"
2021-07-14T19:44:51.516582248Z level=info ts=2021-07-14T19:44:51.516452481Z caller=main.go:168 msg="caught signal. Exiting." signal=terminated
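
The final line ("caught signal. Exiting." signal=terminated) is the querier shutting down on the SIGTERM the kubelet sends after the liveness failure, i.e. the process logged nothing abnormal before it was killed. After a restart, the previous container instance's output can still be pulled; a sketch, with the pod suffix left as a placeholder:

# Fetch logs from the container instance that was just restarted.
# <pod-suffix> is a placeholder; the thanos-query container name is assumed.
oc -n openshift-monitoring logs thanos-querier-<pod-suffix> -c thanos-query --previous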

The prometheus-k8s-0 and prometheus-k8s-1 logs seem normal. The only suspicious thing is that, a few hours earlier, they show heartbeat failures:

2021-07-14T16:11:32.405879906Z level=warn ts=2021-07-14T16:11:32.405614686Z caller=sidecar.go:179 msg="heartbeat failed" err="perform GET request against http://localhost:9090/api/v1/status/config: Get \"http://localhost:9090/api/v1/status/config\": context deadline exceeded"

However, these heartbeat failures do not continue afterwards.
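
A "context deadline exceeded" on a GET against http://localhost:9090 from the sidecar in the same pod usually means Prometheus itself was too slow to answer (e.g. CPU throttling or heavy query load) rather than a network fault, which would also be consistent with exec probes timing out. A sketch for checking resource pressure around that time, assuming the cluster metrics API is available:

# Live CPU/memory usage of the Prometheus pods (requires the metrics API):
oc -n openshift-monitoring adm top pods prometheus-k8s-0 prometheus-k8s-1

# Allocation pressure on the node hosting the querier (<node-name> is a placeholder):
oc describe node <node-name> | grep -A5 'Allocated resources'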


Let us know if you need any further data, as I am not sure how else to troubleshoot Thanos Querier.

Comment 1 Philip Gough 2021-07-15 16:45:51 UTC

*** This bug has been marked as a duplicate of bug 1980888 ***