Bug 1942913

Summary: ThanosSidecarUnhealthy isn't resilient to WAL replays.
Product: OpenShift Container Platform
Reporter: Damien Grisonnet <dgrisonn>
Component: Monitoring
Assignee: Arunprasad Rajkumar <arajkuma>
Status: CLOSED ERRATA
QA Contact: hongyan li <hongyli>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.7
CC: anpicker, erooth, hongyli, juzhao, spasquie, wking
Target Milestone: ---
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When Prometheus restarts, it replays the WAL to avoid data loss. When the WAL replay takes too long to complete, the ThanosSidecarUnhealthy alert fires even though nothing is wrong.
Consequence: ThanosSidecarUnhealthy fires as a false positive.
Fix: Use the `prometheus_tsdb_data_replay_duration_seconds` metric from the Prometheus TSDB and fire ThanosSidecarUnhealthy only once that metric has been set, i.e. once the WAL replay has completed.
Result: False-positive firing of the `ThanosSidecarUnhealthy` alert is avoided.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-03-12 04:34:58 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Embargoed:

Description Damien Grisonnet 2021-03-25 10:29:40 UTC
This bug was initially created as a copy of Bug #1921335

I am copying this bug because: 

The ThanosSidecarUnhealthy and ThanosSidecarPrometheusDown alerts aren't resilient to WAL replays, which means they might fire during OCP upgrades, and there will be nothing for the administrator to do to resolve the alert aside from waiting.

Description of problem:
After upgrading to 4.7, the ThanosSidecarUnhealthy alert has fired occasionally.

https://coreos.slack.com/archives/CHY2E1BL4/p1611783508050900
[FIRING:1] ThanosSidecarUnhealthy prometheus-k8s-thanos-sidecar (prometheus-k8s-0 openshift-monitoring/k8s critical)



Version-Release number of selected component (if applicable):

oc --context build01 get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-fc.3   True        False         8d      Cluster version is 4.7.0-fc.3

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
 

name: ThanosSidecarUnhealthy
expr: |
  time() - max by(job, pod) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) >= 600
labels:
  severity: critical
annotations:
  description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value }} seconds.
  summary: Thanos Sidecar is unhealthy.
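The expression above fires whenever the sidecar's last successful heartbeat is more than 600 seconds old, with no regard for why. A minimal Python sketch of that logic (the function name and sample timestamps are illustrative, not part of the rule) shows how a long WAL replay trips it:

```python
def old_alert_fires(last_heartbeat_success_time: float, now: float) -> bool:
    """Simplified model of the original ThanosSidecarUnhealthy expression:
    time() - thanos_sidecar_last_heartbeat_success_time_seconds >= 600."""
    return now - last_heartbeat_success_time >= 600

# During a WAL replay the sidecar cannot reach Prometheus, so the last
# successful heartbeat timestamp stops advancing. If the replay takes
# longer than 10 minutes, the alert fires even though nothing is broken.
now = 1_700_000_000.0
last_heartbeat = now - 900  # replay has been running for 15 minutes
print(old_alert_fires(last_heartbeat, now))  # prints True (900 >= 600)
```

This is why the alert is a false positive during upgrades: the heartbeat gap measures replay time, not sidecar health.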

Expected results:
a document explaining what a cluster admin needs to do when seeing this alert (since it is critical)

Additional info:
Sergiusz helped me determine the cause last time. The logs and metric screenshots are also in the Slack conversation.
https://coreos.slack.com/archives/C01K2ULGE1H/p1611073858038200

Comment 7 Arunprasad Rajkumar 2021-09-03 10:43:45 UTC
The upstream fix is in review.

Comment 11 Junqi Zhao 2021-10-08 08:00:44 UTC
Checked with 4.10.0-0.nightly-2021-10-07-212540: ThanosSidecarUnhealthy has been renamed to ThanosSidecarNoConnectionToStartedPrometheus, and no such alert fires in CI jobs:
https://search.ci.openshift.org/?search=ThanosSidecarNoConnectionToStartedPrometheus&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

   - alert: ThanosSidecarNoConnectionToStartedPrometheus
      annotations:
        description: Thanos Sidecar {{$labels.instance}} is unhealthy.
        summary: Thanos Sidecar cannot access Prometheus, even though Prometheus seems
          healthy and has reloaded WAL.
      expr: |
        thanos_sidecar_prometheus_up{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"} == 0
        AND on (namespace, pod)
        prometheus_tsdb_data_replay_duration_seconds != 0
      for: 1h
      labels:
        severity: warning
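The renamed rule gates on `prometheus_tsdb_data_replay_duration_seconds`, which stays at 0 until Prometheus finishes replaying its WAL. A minimal Python sketch of the new condition (the function name and sample values are illustrative):

```python
def new_alert_fires(sidecar_sees_prometheus_up: int,
                    replay_duration_seconds: float) -> bool:
    """Simplified model of ThanosSidecarNoConnectionToStartedPrometheus:
    thanos_sidecar_prometheus_up == 0
    AND prometheus_tsdb_data_replay_duration_seconds != 0."""
    return sidecar_sees_prometheus_up == 0 and replay_duration_seconds != 0

# Mid WAL replay: the sidecar cannot reach Prometheus, but the replay
# duration metric is still 0, so the alert stays silent.
print(new_alert_fires(0, 0.0))   # prints False
# After the replay completes the metric is set, so a lost connection
# now indicates a real problem.
print(new_alert_fires(0, 42.5))  # prints True
```

Combined with the `for: 1h` clause and the downgrade from critical to warning severity, this keeps the alert quiet through restarts and upgrades while still catching genuine sidecar connectivity failures.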

Comment 16 errata-xmlrpc 2022-03-12 04:34:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056