Bug 1921335 - ThanosSidecarUnhealthy
Summary: ThanosSidecarUnhealthy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Damien Grisonnet
QA Contact: hongyan li
URL:
Whiteboard:
Duplicates: 1940262
Depends On:
Blocks:
 
Reported: 2021-01-27 22:06 UTC by Hongkai Liu
Modified: 2021-07-27 22:37 UTC
CC: 12 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1943565
Environment:
Last Closed: 2021-07-27 22:37:10 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1090 0 None open Bug 1921335: Fix and adjust ThanosSidecarUnhealthy alert 2021-03-24 17:08:55 UTC
Github thanos-io thanos issues 3915 0 None open ThanosSidecarUnhealthy and ThanosSidecarPrometheusDown alerts fire during Prometheus WAL replay 2021-03-17 13:35:06 UTC
Github thanos-io thanos pull 3204 0 None closed mixin: Use sidecar's metric timestamp for healthcheck 2021-03-17 13:35:06 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:37:37 UTC

Description Hongkai Liu 2021-01-27 22:06:18 UTC
Description of problem:
After upgrading to 4.7, the ThanosSidecarUnhealthy alert has fired occasionally.

https://coreos.slack.com/archives/CHY2E1BL4/p1611783508050900
[FIRING:1] ThanosSidecarUnhealthy prometheus-k8s-thanos-sidecar (prometheus-k8s-0 openshift-monitoring/k8s critical)



Version-Release number of selected component (if applicable):

oc --context build01 get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-fc.3   True        False         8d      Cluster version is 4.7.0-fc.3

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
 

name: ThanosSidecarUnhealthy
expr: time() - max by(job, pod) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) >= 600
labels:
  severity: critical
annotations:
  description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value }} seconds.
  summary: Thanos Sidecar is unhealthy.
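
As a sketch of what this expression measures, the heartbeat age can be inspected with an ad-hoc query (assuming access to the cluster's metrics UI; the selector is taken from the rule above):

# Age, in seconds, of the last successful sidecar heartbeat, per pod
time() - max by(job, pod) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})

Note that the rule as quoted has no `for` clause, so the alert fires as soon as the expression holds on a single evaluation.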

Expected results:
A document explaining what a cluster admin needs to do when seeing this alert (since it is critical).

Additional info:
Sergiusz helped me determine the cause last time. The logs and metric screenshot are in the Slack conversation as well.
https://coreos.slack.com/archives/C01K2ULGE1H/p1611073858038200

Comment 4 Hongkai Liu 2021-03-03 16:36:27 UTC
The alert fired while upgrading build01 from 4.7.0-rc.1 to 4.7.0.
https://coreos.slack.com/archives/CHY2E1BL4/p1614788698069200

Comment 5 Damien Grisonnet 2021-03-04 09:21:36 UTC
Considering the criticality and frequency of this alert, I'm elevating it to medium/medium and prioritizing this BZ in the current sprint.

Comment 6 Damien Grisonnet 2021-03-17 13:35:08 UTC
I added a link to a discussion we started upstream to make the `ThanosSidecarUnhealthy` and `ThanosSidecarPrometheusDown` alerts resilient to WAL replays.

I also linked a recent fix that prevents the `ThanosSidecarUnhealthy` alert from firing instantly during upgrades. Once brought downstream, this should fix the failures we often see in CI.
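
For reference, the upstream fix (thanos-io/thanos pull 3204, linked above) keys the health check on the scrape timestamp of the heartbeat metric rather than on its value. A sketch of the resulting expression, matching what later shipped downstream (see comment 14):

time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})) by (job,pod) >= 240

Because `timestamp()` returns the time of the most recent sample rather than the heartbeat value itself, a freshly restarted sidecar no longer reports a huge heartbeat age on its first evaluations.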

Comment 7 Simon Pasquier 2021-03-18 09:01:54 UTC
*** Bug 1940262 has been marked as a duplicate of this bug. ***

Comment 10 Clayton Coleman 2021-03-24 16:10:27 UTC
10% of CI runs fail on this alert

Comment 11 Damien Grisonnet 2021-03-25 10:38:13 UTC
The PR attached to this BZ should fix the issue we've seen in CI where the alert fires straight away instead of after 10 minutes. It also readjusts the Thanos sidecar alerts by decreasing their severity to `warning` and increasing their duration to 1 hour, as per recent discussions around alerting in OCP.

However, given the urgency of the CI failures, it does not make the Thanos sidecar alerts more resilient to WAL replays, as that work is still in progress. Thus, I created https://bugzilla.redhat.com/show_bug.cgi?id=1942913 to track this effort.
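
In rule-file terms, the adjustment described above looks roughly like this (a sketch only; the exact rules that shipped are quoted in comment 14):

- alert: ThanosSidecarUnhealthy
  expr: ...            # see comment 14 for the final expression
  for: 1h              # previously absent, so the alert could fire on a single evaluation
  labels:
    severity: warning  # previously: critical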

Comment 13 Steve Kuznetsov 2021-03-25 17:57:53 UTC
Will this be backported to 4.7? We are still seeing this on every 4.7 z-stream upgrade.

Comment 14 hongyan li 2021-03-26 01:32:29 UTC
Tested with payload 4.8.0-0.nightly-2021-03-25-160359.
The issue is fixed by the thanos-sidecar rules below; the related alerts now have severity `warning` and a `for` duration of 1h.

oc get prometheusrules prometheus-k8s-prometheus-rules -n openshift-monitoring -oyaml|grep -A 20 thanos-sidecar
  - name: thanos-sidecar
    rules:
    - alert: ThanosSidecarPrometheusDown
      annotations:
        description: Thanos Sidecar {{$labels.job}} {{$labels.instance}} cannot connect to Prometheus.
        summary: Thanos Sidecar cannot connect to Prometheus
      expr: |
        sum by (job, instance) (thanos_sidecar_prometheus_up{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"} == 0)
      for: 1h
      labels:
        severity: warning
    - alert: ThanosSidecarBucketOperationsFailed
      annotations:
        description: Thanos Sidecar {{$labels.job}} {{$labels.instance}} bucket operations are failing
        summary: Thanos Sidecar bucket operations are failing
      expr: |
        rate(thanos_objstore_bucket_operation_failures_total{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}[5m]) > 0
      for: 1h
      labels:
        severity: warning
    - alert: ThanosSidecarUnhealthy
      annotations:
        description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for more than {{ $value }} seconds.
        summary: Thanos Sidecar is unhealthy.
      expr: |
        time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})) by (job,pod) >= 240
      for: 1h
      labels:
        severity: warning
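
To confirm the reworked alerts stay quiet on a healthy cluster, Prometheus's built-in ALERTS series can also be queried (a sketch, assuming access to the metrics UI; an empty result means nothing is pending or firing):

# Returns one series per pending/firing alert, with an `alertstate` label
ALERTS{alertname=~"ThanosSidecar.*"}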

Comment 17 errata-xmlrpc 2021-07-27 22:37:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

