Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1982757

Summary: thanos querier container restarting
Product: OpenShift Container Platform
Reporter: Steven Walter <stwalter>
Component: Monitoring
Assignee: Simon Pasquier <spasquie>
Status: CLOSED DUPLICATE
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Priority: unspecified
Version: 4.7
CC: alegrand, amuller, anpicker, aos-bugs, erooth, kakkoyun, pgough, pkrupa
Last Closed: 2021-07-15 16:45:51 UTC
Type: Bug

Description Steven Walter 2021-07-15 15:55:51 UTC
Description of problem:
The Thanos Querier container is restarting periodically. When this occurs, the liveness probe times out, causing the kubelet to restart the container:


  Warning  BackOff    66m (x3403 over 2d20h)    kubelet  Back-off restarting failed container
  Warning  Unhealthy  5m50s (x4243 over 2d21h)  kubelet  Liveness probe failed: command timed out
  Warning  Unhealthy  107s (x4198 over 2d21h)   kubelet  Readiness probe failed: command timed out
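
For reference, the "command timed out" wording in the probe failures indicates exec-style probes (a command run inside the container) rather than HTTP probes, so the configured command and timeoutSeconds can be checked against the deployment. A minimal sketch using standard oc commands, assuming the default openshift-monitoring namespace and deployment name:

# Dump the liveness/readiness probe definitions (command, timeoutSeconds, periodSeconds):
oc -n openshift-monitoring get deployment thanos-querier -o yaml | grep -B2 -A8 'Probe:'

# Cross-check restart counts on the querier pods:
oc -n openshift-monitoring get pods | grep thanos-querier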


Version-Release number of selected component (if applicable):
4.7

How reproducible:
Unconfirmed


Actual results:

Thanos-querier logs appear healthy up to the point of failure:

2021-07-14T19:39:31.614681877Z level=info ts=2021-07-14T19:39:31.614648829Z caller=storeset.go:384 component=storeset msg="adding new rulesAPI to query storeset" address=10.130.2.29:10901
2021-07-14T19:39:31.614711403Z level=info ts=2021-07-14T19:39:31.614671353Z caller=storeset.go:387 component=storeset msg="adding new storeAPI to query storeset" address=10.130.2.29:10901 extLset="{prometheus=\"openshift-user-workload-monitoring/user-workload\", prometheus_replica=\"prometheus-user-workload-1\"}"
2021-07-14T19:44:51.516582248Z level=info ts=2021-07-14T19:44:51.516452481Z caller=main.go:168 msg="caught signal. Exiting." signal=terminated
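
The final line ("caught signal. Exiting." signal=terminated) is the querier shutting down on the SIGTERM the kubelet sends after the liveness failure, i.e. the process logged nothing abnormal before it was killed. After a restart, the previous container instance's output can still be pulled; a sketch, with the pod suffix left as a placeholder:

# Fetch logs from the container instance that was just restarted.
# <pod-suffix> is a placeholder; the thanos-query container name is assumed.
oc -n openshift-monitoring logs thanos-querier-<pod-suffix> -c thanos-query --previous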

The prometheus-k8s-0 and prometheus-k8s-1 logs seem normal. The only suspicious thing is that, a few hours earlier, they show heartbeat failures:

2021-07-14T16:11:32.405879906Z level=warn ts=2021-07-14T16:11:32.405614686Z caller=sidecar.go:179 msg="heartbeat failed" err="perform GET request against http://localhost:9090/api/v1/status/config: Get \"http://localhost:9090/api/v1/status/config\": context deadline exceeded"

However, these heartbeat failures do not continue afterwards.
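
A "context deadline exceeded" on a GET against http://localhost:9090 from the sidecar in the same pod usually means Prometheus itself was too slow to answer (e.g. CPU throttling or heavy query load) rather than a network fault, which would also be consistent with exec probes timing out. A sketch for checking resource pressure around that time, assuming the cluster metrics API is available:

# Live CPU/memory usage of the Prometheus pods (requires the metrics API):
oc -n openshift-monitoring adm top pods prometheus-k8s-0 prometheus-k8s-1

# Allocation pressure on the node hosting the querier (<node-name> is a placeholder):
oc describe node <node-name> | grep -A5 'Allocated resources'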


Let us know if you need any further data, as I am not sure how else to troubleshoot Thanos Querier.

Comment 1 Philip Gough 2021-07-15 16:45:51 UTC

*** This bug has been marked as a duplicate of bug 1980888 ***