Bug 1921335

Summary: ThanosSidecarUnhealthy
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.7
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Hongkai Liu <hongkliu>
Assignee: Damien Grisonnet <dgrisonn>
QA Contact: hongyan li <hongyli>
CC: alegrand, anpicker, ccoleman, dgrisonn, erooth, hongyli, kakkoyun, lcosic, pkrupa, skuznets, spasquie, wking
Doc Type: No Doc Update
Clones: 1943565 (view as bug list)
Type: Bug
Last Closed: 2021-07-27 22:37:10 UTC

Description Hongkai Liu 2021-01-27 22:06:18 UTC
Description of problem:
After upgrading to 4.7, the ThanosSidecarUnhealthy alert has fired occasionally.

https://coreos.slack.com/archives/CHY2E1BL4/p1611783508050900
[FIRING:1] ThanosSidecarUnhealthy prometheus-k8s-thanos-sidecar (prometheus-k8s-0 openshift-monitoring/k8s critical)



Version-Release number of selected component (if applicable):

oc --context build01 get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-fc.3   True        False         8d      Cluster version is 4.7.0-fc.3

Actual results:
 

name: ThanosSidecarUnhealthy
expr: time() - max by(job, pod) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) >= 600
labels:
  severity: critical
annotations:
  description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value }} seconds.
  summary: Thanos Sidecar is unhealthy.
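
For context, the expression fires once the last successful sidecar heartbeat is more than 600 seconds old. A minimal way to inspect this by hand (a sketch, assuming the default openshift-monitoring layout in which the sidecar runs as a container named thanos-sidecar inside the prometheus-k8s pods):

# Check the sidecar container logs for heartbeat/upload errors
# (pod name prometheus-k8s-0 taken from the firing alert above; adjust as needed).
oc -n openshift-monitoring logs prometheus-k8s-0 -c thanos-sidecar --tail=50

# PromQL: age, in seconds, of the last successful heartbeat per sidecar.
# This is the alert expression without the ">= 600" threshold.
time() - max by(job, pod) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})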

Expected results:
A document explaining what a cluster admin needs to do when seeing this alert (since it is critical).

Additional info:
Sergiusz helped me determine the cause last time. The logs and a metric screenshot are also in that Slack conversation:
https://coreos.slack.com/archives/C01K2ULGE1H/p1611073858038200

Comment 4 Hongkai Liu 2021-03-03 16:36:27 UTC
The alert fired while upgrading build01 from 4.7.0-rc.1 to 4.7.0.
https://coreos.slack.com/archives/CHY2E1BL4/p1614788698069200

Comment 5 Damien Grisonnet 2021-03-04 09:21:36 UTC
Considering the criticality and frequency of this alert, I'm elevating it to medium/medium and prioritizing this BZ in the current sprint.

Comment 6 Damien Grisonnet 2021-03-17 13:35:08 UTC
I added a link to a discussion we started upstream to make the `ThanosSidecarUnhealthy` and `ThanosSidecarPrometheusDown` alerts resilient to WAL replays.

I also linked a recent fix that prevents the `ThanosSidecarUnhealthy` alert from firing instantly during upgrades. Once brought downstream, this should fix the failures we are often seeing in CI.

Comment 7 Simon Pasquier 2021-03-18 09:01:54 UTC
*** Bug 1940262 has been marked as a duplicate of this bug. ***

Comment 10 Clayton Coleman 2021-03-24 16:10:27 UTC
10% of CI runs fail on this alert

Comment 11 Damien Grisonnet 2021-03-25 10:38:13 UTC
The PR attached to this BZ should fix the issue we've seen in CI where the alert fires straight away instead of after 10 minutes. It also readjusts the Thanos Sidecar alerts, decreasing their severity to `warning` and increasing their duration to 1 hour, per recent discussions around alerting in OCP.

However, given the urgency of the CI failures, it does not make the Thanos sidecar alerts more resilient to WAL replays, as that work is still in progress. I created https://bugzilla.redhat.com/show_bug.cgi?id=1942913 to track this effort.

Comment 13 Steve Kuznetsov 2021-03-25 17:57:53 UTC
Will this be backported to 4.7? We are still seeing this on every 4.7 z stream upgrade.

Comment 14 hongyan li 2021-03-26 01:32:29 UTC
Tested with payload 4.8.0-0.nightly-2021-03-25-160359.
The issue is fixed by the thanos-sidecar rules below; the related alerts now have severity `warning` and a duration of 1h.

oc get prometheusrules prometheus-k8s-prometheus-rules -n openshift-monitoring -o yaml | grep -A 20 thanos-sidecar
  - name: thanos-sidecar
    rules:
    - alert: ThanosSidecarPrometheusDown
      annotations:
        description: Thanos Sidecar {{$labels.job}} {{$labels.instance}} cannot connect to Prometheus.
        summary: Thanos Sidecar cannot connect to Prometheus
      expr: |
        sum by (job, instance) (thanos_sidecar_prometheus_up{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"} == 0)
      for: 1h
      labels:
        severity: warning
    - alert: ThanosSidecarBucketOperationsFailed
      annotations:
        description: Thanos Sidecar {{$labels.job}} {{$labels.instance}} bucket operations are failing
        summary: Thanos Sidecar bucket operations are failing
      expr: |
        rate(thanos_objstore_bucket_operation_failures_total{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}[5m]) > 0
      for: 1h
      labels:
        severity: warning
    - alert: ThanosSidecarUnhealthy
      annotations:
        description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for more than {{ $value }} seconds.
        summary: Thanos Sidecar is unhealthy.
      expr: |
        time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})) by (job,pod) >= 240
      for: 1h
      labels:
        severity: warning
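
As a cross-check of the verification, a query like the following can confirm that the alert is no longer firing; this is only a sketch, assuming the default thanos-querier route in openshift-monitoring and that the logged-in user's bearer token is accepted for queries:

# Resolve the Thanos querier route and query the ALERTS metric for this alert;
# with the new warning/1h rule, no long-lived "firing" series should remain
# (transient "pending" entries during upgrades or WAL replays are expected).
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
TOKEN=$(oc whoami -t)
curl -skG -H "Authorization: Bearer ${TOKEN}" \
  "https://${HOST}/api/v1/query" \
  --data-urlencode 'query=ALERTS{alertname="ThanosSidecarUnhealthy"}'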

Comment 17 errata-xmlrpc 2021-07-27 22:37:10 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438