1942913 – ThanosSidecarUnhealthy isn't resilient to WAL replays.

Summary: ThanosSidecarUnhealthy isn't resilient to WAL replays.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Arunprasad Rajkumar
QA Contact:	hongyan li
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-03-25 10:29 UTC by Damien Grisonnet
Modified:	2022-03-12 04:35 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: When Prometheus restarts, it replays the WAL to avoid data loss. If the replay takes too long to complete, ThanosSidecarUnhealthy fires, which is a false positive. Consequence: The ThanosSidecarUnhealthy alert fires as a false positive. Fix: Use the `prometheus_tsdb_data_replay_duration_seconds` metric from the Prometheus TSDB and fire the alert only once that metric is non-zero, that is, after the WAL replay has completed. Result: False-positive firing of the `ThanosSidecarUnhealthy` alert is avoided.
Clone Of:
Environment:
Last Closed:	2022-03-12 04:34:58 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-monitoring-operator pull 1399	None	open	Bug 1942913: Make ThanosSidecarNoConnectionToStartedPrometheus resilient to WAL replays	2021-09-27 10:51:34 UTC
Github	prometheus-operator kube-prometheus pull 1399	None	open	thanos: bump to latest and add `thanosPrometheusCommonDimensions`	2021-09-27 06:38:32 UTC
Github	thanos-io thanos issues 3915	None	open	ThanosSidecarUnhealthy and ThanosSidecarPrometheusDown alerts fire during Prometheus WAL replay	2021-03-25 10:29:40 UTC
Github	thanos-io thanos pull 4508	None	None	None	2021-08-02 11:52:02 UTC
Red Hat Product Errata	RHSA-2022:0056	None	None	None	2022-03-12 04:35:19 UTC

Description Damien Grisonnet 2021-03-25 10:29:40 UTC

This bug was initially created as a copy of Bug #1921335

I am copying this bug because: 

The ThanosSidecarUnhealthy and ThanosSidecarPrometheusDown alerts aren't resilient to WAL replays, which means that they might fire during OCP upgrades, and the administrator can do nothing to resolve the alert other than wait.

Description of problem:
After upgrading to 4.7, the ThanosSidecarUnhealthy alert fires occasionally.

https://coreos.slack.com/archives/CHY2E1BL4/p1611783508050900
[FIRING:1] ThanosSidecarUnhealthy prometheus-k8s-thanos-sidecar (prometheus-k8s-0 openshift-monitoring/k8s critical)



Version-Release number of selected component (if applicable):

oc --context build01 get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-fc.3   True        False         8d      Cluster version is 4.7.0-fc.3

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
 

name: ThanosSidecarUnhealthy
expr: time() - max by(job, pod) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) >= 600
labels:
  severity: critical
annotations:
  description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value }} seconds.
  summary: Thanos Sidecar is unhealthy.
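
To make the rule above easier to reason about, here is a minimal Python sketch of what its expression computes. It is only an illustration, not the Prometheus implementation: the function name `stale_heartbeats`, the tuple layout, and the sample timestamps are hypothetical. It also assumes that the sidecar's heartbeat cannot succeed while Prometheus replays its WAL, which is the behaviour this bug describes.

```python
import re

# PromQL's =~ matcher is fully anchored, so fullmatch mirrors it.
JOB_RE = re.compile(r"prometheus-(k8s|user-workload)-thanos-sidecar")
THRESHOLD_SECONDS = 600  # the ">= 600" in the rule expression


def stale_heartbeats(now, samples):
    """Return {(job, pod): age_seconds} for series the rule would flag.

    samples is an iterable of (job, pod, last_success_timestamp) tuples,
    a stand-in for thanos_sidecar_last_heartbeat_success_time_seconds.
    """
    latest = {}
    for job, pod, ts in samples:
        if JOB_RE.fullmatch(job):
            key = (job, pod)
            latest[key] = max(ts, latest.get(key, ts))  # max by(job, pod)
    return {k: now - ts for k, ts in latest.items()
            if now - ts >= THRESHOLD_SECONDS}


# Hypothetical data: prometheus-k8s-0 has been replaying its WAL for
# 15 minutes, so its last successful heartbeat is 900 seconds old.
NOW = 10_000
samples = [
    ("prometheus-k8s-thanos-sidecar", "prometheus-k8s-0", NOW - 900),
    ("prometheus-k8s-thanos-sidecar", "prometheus-k8s-1", NOW - 30),
    ("some-other-job", "pod-x", NOW - 5_000),  # filtered out by the job matcher
]
flagged = stale_heartbeats(NOW, samples)
```

Under these assumptions only `prometheus-k8s-0` is flagged, even though nothing is wrong with it beyond a slow replay. The expression has no input that distinguishes "still replaying" from "broken".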

Expected results:
A document explaining what a cluster admin needs to do when seeing this alert (since it is critical).

Additional info:
Sergiusz helped me determine the cause last time. The logs and a metric screenshot are also in the Slack conversation.
https://coreos.slack.com/archives/C01K2ULGE1H/p1611073858038200

Comment 7 Arunprasad Rajkumar 2021-09-03 10:43:45 UTC

Upstream fix is in review state.

Comment 11 Junqi Zhao 2021-10-08 08:00:44 UTC

Checked with 4.10.0-0.nightly-2021-10-07-212540: ThanosSidecarUnhealthy is renamed to ThanosSidecarNoConnectionToStartedPrometheus, and no such alert is reported by CI jobs.
https://search.ci.openshift.org/?search=ThanosSidecarNoConnectionToStartedPrometheus&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

  - alert: ThanosSidecarNoConnectionToStartedPrometheus
    annotations:
      description: Thanos Sidecar {{$labels.instance}} is unhealthy.
      summary: Thanos Sidecar cannot access Prometheus, even though Prometheus seems
        healthy and has reloaded WAL.
    expr: |
      thanos_sidecar_prometheus_up{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"} == 0
      AND on (namespace, pod)
      prometheus_tsdb_data_replay_duration_seconds != 0
    for: 1h
    labels:
      severity: warning
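
The join in the rule above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not how Prometheus evaluates the rule: the function name `sidecars_without_connection`, the dictionary layout, and the sample values are hypothetical. The sketch assumes that the replay gauge reads 0 (or is absent) until the WAL replay has finished, and it does not model the `for: 1h` clause.

```python
def sidecars_without_connection(sidecar_up, replay_duration):
    """Return the (namespace, pod) keys the rule expression would select.

    sidecar_up      -- {(namespace, pod): value} for thanos_sidecar_prometheus_up
    replay_duration -- {(namespace, pod): seconds} for
                       prometheus_tsdb_data_replay_duration_seconds
    """
    return sorted(
        key
        for key, up in sidecar_up.items()
        # Left side: sidecar reports Prometheus as down.
        # "AND on (namespace, pod)": keep it only if a replay sample with a
        # non-zero value exists for the same namespace and pod.
        if up == 0 and replay_duration.get(key, 0) != 0
    )


# Hypothetical samples for three Prometheus pods.
k8s_0 = ("openshift-monitoring", "prometheus-k8s-0")
k8s_1 = ("openshift-monitoring", "prometheus-k8s-1")
uwm_0 = ("openshift-user-workload-monitoring", "prometheus-user-workload-0")

sidecar_up = {k8s_0: 0, k8s_1: 0, uwm_0: 1}
replay_duration = {k8s_0: 0.0, k8s_1: 42.5, uwm_0: 12.0}

selected = sidecars_without_connection(sidecar_up, replay_duration)
```

With these values only `prometheus-k8s-1` is selected: `prometheus-k8s-0` is still replaying (gauge at 0), and the user-workload sidecar is connected. In the real rule the condition must additionally hold for one hour before the alert fires.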

Comment 16 errata-xmlrpc 2022-03-12 04:34:58 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
