Bug 1921335 - ThanosSidecarUnhealthy
Summary: ThanosSidecarUnhealthy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Damien Grisonnet
QA Contact: hongyan li
URL:
Whiteboard:
Duplicates: 1940262
Depends On:
Blocks:
 
Reported: 2021-01-27 22:06 UTC by Hongkai Liu
Modified: 2021-07-27 22:37 UTC
CC: 12 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1943565
Environment:
Last Closed: 2021-07-27 22:37:10 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1090 0 None open Bug 1921335: Fix and adjust ThanosSidecarUnhealthy alert 2021-03-24 17:08:55 UTC
Github thanos-io thanos issues 3915 0 None open ThanosSidecarUnhealthy and ThanosSidecarPrometheusDown alerts fire during Prometheus WAL replay 2021-03-17 13:35:06 UTC
Github thanos-io thanos pull 3204 0 None closed mixin: Use sidecar's metric timestamp for healthcheck 2021-03-17 13:35:06 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:37:37 UTC

Description Hongkai Liu 2021-01-27 22:06:18 UTC
Description of problem:
After upgrading to 4.7, the ThanosSidecarUnhealthy alert has fired occasionally.

https://coreos.slack.com/archives/CHY2E1BL4/p1611783508050900
[FIRING:1] ThanosSidecarUnhealthy prometheus-k8s-thanos-sidecar (prometheus-k8s-0 openshift-monitoring/k8s critical)



Version-Release number of selected component (if applicable):

oc --context build01 get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-fc.3   True        False         8d      Cluster version is 4.7.0-fc.3

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
 

name: ThanosSidecarUnhealthy
expr: time() - max by(job, pod) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) >= 600
labels:
  severity: critical
annotations:
  description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value }} seconds.
  summary: Thanos Sidecar is unhealthy.
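
As a sketch of what this expression measures, the heartbeat age can be inspected with an ad-hoc query (assuming access to the cluster's metrics UI; the selector is taken from the rule above):

# Age, in seconds, of the last successful sidecar heartbeat, per pod
time() - max by(job, pod) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})

Note that the rule as quoted has no `for` clause, so the alert fires as soon as the expression holds on a single evaluation.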

Expected results:
A document explaining what a cluster admin needs to do when seeing this alert (since it is critical).

Additional info:
Sergiusz helped me determine the cause last time. The logs and metric screenshot are in the Slack conversation as well.
https://coreos.slack.com/archives/C01K2ULGE1H/p1611073858038200

Comment 4 Hongkai Liu 2021-03-03 16:36:27 UTC
The alert fired while upgrading build01 from 4.7.0-rc.1 to 4.7.0.
https://coreos.slack.com/archives/CHY2E1BL4/p1614788698069200

Comment 5 Damien Grisonnet 2021-03-04 09:21:36 UTC
Considering the criticality and frequency of this alert, I'm elevating it to medium/medium and prioritizing this BZ in the current sprint.

Comment 6 Damien Grisonnet 2021-03-17 13:35:08 UTC
I added a link to a discussion we started upstream to make the `ThanosSidecarUnhealthy` and `ThanosSidecarPrometheusDown` alerts resilient to WAL replays.

I also linked a recent fix that prevents the `ThanosSidecarUnhealthy` alert from firing instantly during upgrades. Once brought downstream, this should fix the failures we often see in CI.
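
For reference, the upstream fix (thanos-io/thanos pull 3204, linked above) keys the health check on the scrape timestamp of the heartbeat metric rather than on its value. A sketch of the resulting expression, matching what later shipped downstream (see comment 14):

time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})) by (job,pod) >= 240

Because `timestamp()` returns the time of the most recent sample rather than the heartbeat value itself, a freshly restarted sidecar no longer reports a huge heartbeat age on its first evaluations.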

Comment 7 Simon Pasquier 2021-03-18 09:01:54 UTC
*** Bug 1940262 has been marked as a duplicate of this bug. ***

Comment 10 Clayton Coleman 2021-03-24 16:10:27 UTC
10% of CI runs fail on this alert

Comment 11 Damien Grisonnet 2021-03-25 10:38:13 UTC
The PR attached to this BZ should fix the issue we've seen in CI where the alert fires straight away instead of after 10 minutes. It also readjusts the Thanos sidecar alerts by decreasing their severity to `warning` and increasing their duration to 1 hour, as per recent discussions around alerting in OCP.

However, given the urgency of the CI failures, it does not make the Thanos sidecar alerts more resilient to WAL replays, as that work is still in progress. Thus, I created https://bugzilla.redhat.com/show_bug.cgi?id=1942913 to track this effort.
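
In rule-file terms, the adjustment described above looks roughly like this (a sketch only; the exact rules that shipped are quoted in comment 14):

- alert: ThanosSidecarUnhealthy
  expr: ...            # see comment 14 for the final expression
  for: 1h              # previously absent, so the alert could fire on a single evaluation
  labels:
    severity: warning  # previously: critical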

Comment 13 Steve Kuznetsov 2021-03-25 17:57:53 UTC
Will this be backported to 4.7? We are still seeing this on every 4.7 z-stream upgrade.

Comment 14 hongyan li 2021-03-26 01:32:29 UTC
Tested with payload 4.8.0-0.nightly-2021-03-25-160359.
The issue is fixed by the thanos-sidecar rules below; the related alerts now have severity `warning` and a `for` duration of 1h.

oc get prometheusrules prometheus-k8s-prometheus-rules -n openshift-monitoring -oyaml|grep -A 20 thanos-sidecar
  - name: thanos-sidecar
    rules:
    - alert: ThanosSidecarPrometheusDown
      annotations:
        description: Thanos Sidecar {{$labels.job}} {{$labels.instance}} cannot connect to Prometheus.
        summary: Thanos Sidecar cannot connect to Prometheus
      expr: |
        sum by (job, instance) (thanos_sidecar_prometheus_up{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"} == 0)
      for: 1h
      labels:
        severity: warning
    - alert: ThanosSidecarBucketOperationsFailed
      annotations:
        description: Thanos Sidecar {{$labels.job}} {{$labels.instance}} bucket operations are failing
        summary: Thanos Sidecar bucket operations are failing
      expr: |
        rate(thanos_objstore_bucket_operation_failures_total{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}[5m]) > 0
      for: 1h
      labels:
        severity: warning
    - alert: ThanosSidecarUnhealthy
      annotations:
        description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for more than {{ $value }} seconds.
        summary: Thanos Sidecar is unhealthy.
      expr: |
        time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})) by (job,pod) >= 240
      for: 1h
      labels:
        severity: warning
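
To confirm the reworked alerts stay quiet on a healthy cluster, Prometheus's built-in ALERTS series can also be queried (a sketch, assuming access to the metrics UI; an empty result means nothing is pending or firing):

# Returns one series per pending/firing alert, with an `alertstate` label
ALERTS{alertname=~"ThanosSidecar.*"}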

Comment 17 errata-xmlrpc 2021-07-27 22:37:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

