Bug 1942913 - ThanosSidecarUnhealthy isn't resilient to WAL replays.
Summary: ThanosSidecarUnhealthy isn't resilient to WAL replays.
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.10.0
Assignee: Arunprasad Rajkumar
QA Contact: hongyan li
Depends On:
Reported: 2021-03-25 10:29 UTC by Damien Grisonnet
Modified: 2022-03-12 04:35 UTC
CC: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When Prometheus restarts, it replays its write-ahead log (WAL) to avoid data loss. When the replay takes too long to complete, the ThanosSidecarUnhealthy alert fires as a false positive.
Consequence: The ThanosSidecarUnhealthy alert fires even though nothing is wrong.
Fix: Use the `prometheus_tsdb_data_replay_duration_seconds` metric from the Prometheus TSDB and allow ThanosSidecarUnhealthy to fire only once that metric has been set, i.e. once the WAL replay has completed.
Result: False positive firing of the `ThanosSidecarUnhealthy` alert is avoided.
Clone Of:
Last Closed: 2022-03-12 04:34:58 UTC
Target Upstream Version:

Attachments (Terms of Use)

System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1399 0 None open Bug 1942913: Make ThanosSidecarNoConnectionToStartedPrometheus resilient to WAL replays 2021-09-27 10:51:34 UTC
Github prometheus-operator kube-prometheus pull 1399 0 None open thanos: bump to latest and add `thanosPrometheusCommonDimensions` 2021-09-27 06:38:32 UTC
Github thanos-io thanos issues 3915 0 None open ThanosSidecarUnhealthy and ThanosSidecarPrometheusDown alerts fire during Prometheus WAL replay 2021-03-25 10:29:40 UTC
Github thanos-io thanos pull 4508 0 None None None 2021-08-02 11:52:02 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-12 04:35:19 UTC

Description Damien Grisonnet 2021-03-25 10:29:40 UTC
This bug was initially created as a copy of Bug #1921335

I am copying this bug because: 

The ThanosSidecarUnhealthy and ThanosSidecarPrometheusDown alerts aren't resilient to WAL replays, which means they might fire during OCP upgrades, and there is nothing the administrator can do to resolve the alert aside from waiting.

Description of problem:
After upgrading to 4.7, the ThanosSidecarUnhealthy alert fires occasionally.

[FIRING:1] ThanosSidecarUnhealthy prometheus-k8s-thanos-sidecar (prometheus-k8s-0 openshift-monitoring/k8s critical)

Version-Release number of selected component (if applicable):

oc --context build01 get clusterversion
version   4.7.0-fc.3   True        False         8d      Cluster version is 4.7.0-fc.3

How reproducible:

Steps to Reproduce:

Actual results:

name: ThanosSidecarUnhealthy
expr: time() - max by(job, pod) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) >= 600
severity: critical
description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value }} seconds.
summary: Thanos Sidecar is unhealthy.
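During a WAL replay the sidecar cannot reach Prometheus, so `thanos_sidecar_last_heartbeat_success_time_seconds` stops advancing and the expression above fires even though nothing is actually wrong. Whether a given Prometheus pod is still replaying can be checked with a query along these lines (illustrative; per the fix, `prometheus_tsdb_data_replay_duration_seconds` stays at 0 until the replay completes):

```promql
# Per-pod WAL replay duration; a value of 0 means the replay
# has not finished yet on that Prometheus instance.
prometheus_tsdb_data_replay_duration_seconds{namespace=~"openshift-(monitoring|user-workload-monitoring)"}
```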

Expected results:
A document explaining what a cluster admin needs to do when seeing this alert (since it is critical).

Additional info:
Sergiusz helped me determine the cause last time. The logs and metric screenshots are also in the Slack conversation.

Comment 7 Arunprasad Rajkumar 2021-09-03 10:43:45 UTC
Upstream fix is in review state.
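The gist of the fix is to AND the sidecar health check with the replay-duration metric, so the alert can only fire once Prometheus has finished replaying its WAL. A minimal sketch of the gated expression (illustrative; the merged rule as shipped is shown in comment 11):

```promql
# Fire only when the sidecar cannot reach Prometheus AND the WAL
# replay has already completed (a non-zero replay duration was recorded).
thanos_sidecar_prometheus_up == 0
AND on (namespace, pod)
prometheus_tsdb_data_replay_duration_seconds != 0
```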

Comment 11 Junqi Zhao 2021-10-08 08:00:44 UTC
Checked with 4.10.0-0.nightly-2021-10-07-212540: ThanosSidecarUnhealthy has been renamed to ThanosSidecarNoConnectionToStartedPrometheus, and no such alert fires in CI jobs.

    - alert: ThanosSidecarNoConnectionToStartedPrometheus
      annotations:
        description: Thanos Sidecar {{$labels.instance}} is unhealthy.
        summary: Thanos Sidecar cannot access Prometheus, even though Prometheus seems
          healthy and has reloaded WAL.
      expr: |
        thanos_sidecar_prometheus_up{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"} == 0
        AND on (namespace, pod)
        prometheus_tsdb_data_replay_duration_seconds != 0
      for: 1h
      labels:
        severity: warning

Comment 16 errata-xmlrpc 2022-03-12 04:34:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

