Bug 1943565

Summary: ThanosSidecarUnhealthy
Product: OpenShift Container Platform
Reporter: Simon Pasquier <spasquie>
Component: Monitoring
Assignee: Arunprasad Rajkumar <arajkuma>
Status: CLOSED ERRATA
QA Contact: hongyan li <hongyli>
Severity: high
Priority: high
Version: 4.7
CC: alegrand, anpicker, ccoleman, dgrisonn, erooth, hongkliu, hongyli, kakkoyun, lcosic, pkrupa, skuznets, spasquie, wking
Target Milestone: ---
Target Release: 4.8.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1921335
Environment:
Last Closed: 2021-08-10 11:27:36 UTC
Type: ---
Regression: ---
Embargoed:
Bug Depends On: 1955586    
Bug Blocks:    

Description Simon Pasquier 2021-03-26 13:20:02 UTC
+++ This bug was initially created as a clone of Bug #1921335 +++

Description of problem:
After upgrading to 4.7, the ThanosSidecarUnhealthy alert has fired occasionally.

https://coreos.slack.com/archives/CHY2E1BL4/p1611783508050900
[FIRING:1] ThanosSidecarUnhealthy prometheus-k8s-thanos-sidecar (prometheus-k8s-0 openshift-monitoring/k8s critical)



Version-Release number of selected component (if applicable):

oc --context build01 get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-fc.3   True        False         8d      Cluster version is 4.7.0-fc.3

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
 

name: ThanosSidecarUnhealthy
expr: time() - max by(job, pod) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) >= 600
labels:
  severity: critical
annotations:
  description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value }} seconds.
  summary: Thanos Sidecar is unhealthy.
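
For quick triage, one way to check whether this alert is currently pending or firing is to query the built-in ALERTS metric from inside the cluster. This is only a sketch, following the same token/curl pattern used in the QA comments further down; the pod and service names may differ per cluster:

token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query --data-urlencode query='ALERTS{alertname="ThanosSidecarUnhealthy"}' | jq '.data.result'
# An empty result ([]) means the alert is neither pending nor firing.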

Expected results:
A document explaining what a cluster admin needs to do when seeing this alert (since it is critical); see the triage sketch at the end of this description.

Additional info:
Sergiusz helped me determine the cause last time. The logs and metric screenshots are also in the Slack conversation:
https://coreos.slack.com/archives/C01K2ULGE1H/p1611073858038200
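
As a starting point for such a document, a minimal triage sketch (assuming the sidecar container in the prometheus-k8s pods is named "thanos-sidecar"; adjust the pod name to the one from the alert labels):

# Inspect the sidecar logs of the pod named in the alert:
oc -n openshift-monitoring logs prometheus-k8s-0 -c thanos-sidecar --tail=100

# In the console or via the query API, evaluate the alert expression without its threshold
# to see how many seconds each sidecar has gone without a successful heartbeat:
time() - max by(job, pod) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})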

--- Additional comment from Red Hat Bugzilla on 2021-02-04 09:57:10 UTC ---

remove performed by PnT Account Manager <pnt-expunge>

--- Additional comment from Damien Grisonnet on 2021-02-08 15:59:03 UTC ---

No time to work on this bug this sprint, because of higher priority bugs and low team capacity.

--- Additional comment from Damien Grisonnet on 2021-03-02 14:31:56 UTC ---

No time to work on this bug this sprint, because of higher priority bugs and low team capacity.

--- Additional comment from Hongkai Liu on 2021-03-03 16:36:27 UTC ---

The alert fired during upgrading build01 from 4.7.0-rc.1 to 4.7.0.
https://coreos.slack.com/archives/CHY2E1BL4/p1614788698069200

--- Additional comment from Damien Grisonnet on 2021-03-04 09:21:36 UTC ---

Considering the criticality and frequency of this alert, I'm elevating it to medium/medium and prioritizing this BZ in the current sprint.

--- Additional comment from Damien Grisonnet on 2021-03-17 13:35:08 UTC ---

I added a link to a discussion we started upstream to make the `ThanosSidecarUnhealthy` and `ThanosSidecarPrometheusDown` alerts resilient to WAL replays.

I also linked a recent fix that prevents the `ThanosSidecarUnhealthy` alert from firing instantly during upgrades. Once brought downstream, this should fix the failures we often see in CI.
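
For illustration only (not the exact upstream patch): giving the rule a `for:` duration means the expression has to stay true for the whole window before the alert fires, so a short heartbeat gap during an upgrade no longer fires it instantly. A minimal sketch based on the original expression; the rule that actually shipped is quoted in the QA comments below:

- alert: ThanosSidecarUnhealthy
  expr: |
    time() - max by(job, pod) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) >= 600
  for: 1h
  labels:
    severity: warning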

--- Additional comment from Simon Pasquier on 2021-03-18 09:01:54 UTC ---



--- Additional comment from Simon Pasquier on 2021-03-18 09:17:35 UTC ---

Moved severity to high to align with bug 1940262.

--- Additional comment from Gabe Montero on 2021-03-22 18:28:35 UTC ---

If it helps, I'm seeing this with some consistency in a master branch openshift-apiserver PR of mine (2 failures and 2 passes in my last 4 e2e-aws-serial runs as of the time of this comment).

Failures:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_openshift-apiserver/191/pull-ci-openshift-openshift-apiserver-master-e2e-aws-serial/1373992365517180928
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_openshift-apiserver/191/pull-ci-openshift-openshift-apiserver-master-e2e-aws-serial/1372968169047592960

Successes:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_openshift-apiserver/192/pull-ci-openshift-openshift-apiserver-master-e2e-aws-serial/1373264591286439936
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_openshift-apiserver/192/pull-ci-openshift-openshift-apiserver-master-e2e-aws-serial/1371487375661731840

Some of the known upgrade issues are also getting in the way of my PR, so this is not the sole blocker, but depending on how things evolve over the next week, it could become one if it continues to occur frequently enough.

--- Additional comment from Clayton Coleman on 2021-03-24 16:10:27 UTC ---

10% of CI runs fail on this alert

--- Additional comment from Damien Grisonnet on 2021-03-25 10:38:13 UTC ---

The PR attached to this BZ should fix the issue we've seen in CI where the alert fires straight away instead of after 10 minutes. It also readjusts the Thanos Sidecar alerts by decreasing their severity to `warning` and increasing their duration to 1 hour, as per recent discussions around alerting in OCP.

However, considering the urgency of the CI failures, it does not make the Thanos sidecar alerts more resilient to WAL replays, as that work is still in progress. Thus, I created https://bugzilla.redhat.com/show_bug.cgi?id=1942913 to track this effort.

--- Additional comment from OpenShift Automated Release Tooling on 2021-03-25 11:56:51 UTC ---

Elliott changed bug status from MODIFIED to ON_QA.

--- Additional comment from Steve Kuznetsov on 2021-03-25 17:57:53 UTC ---

Will this be backported to 4.7? We are still seeing this on every 4.7 z stream upgrade.

--- Additional comment from hongyan li on 2021-03-26 01:32:29 UTC ---

Tested with payload 4.8.0-0.nightly-2021-03-25-160359.
The issue is fixed by the thanos-sidecar rules below; the related alerts now have severity warning and a duration of 1h.

oc get prometheusrules prometheus-k8s-prometheus-rules -n openshift-monitoring -oyaml|grep -A 20 thanos-sidecar
  - name: thanos-sidecar
    rules:
    - alert: ThanosSidecarPrometheusDown
      annotations:
        description: Thanos Sidecar {{$labels.job}} {{$labels.instance}} cannot connect to Prometheus.
        summary: Thanos Sidecar cannot connect to Prometheus
      expr: |
        sum by (job, instance) (thanos_sidecar_prometheus_up{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"} == 0)
      for: 1h
      labels:
        severity: warning
    - alert: ThanosSidecarBucketOperationsFailed
      annotations:
        description: Thanos Sidecar {{$labels.job}} {{$labels.instance}} bucket operations are failing
        summary: Thanos Sidecar bucket operations are failing
      expr: |
        rate(thanos_objstore_bucket_operation_failures_total{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}[5m]) > 0
      for: 1h
      labels:
        severity: warning
    - alert: ThanosSidecarUnhealthy
      annotations:
        description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for more than {{ $value }} seconds.
        summary: Thanos Sidecar is unhealthy.
      expr: |
        time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})) by (job,pod) >= 240
      for: 1h
      labels:
        severity: warning

Comment 1 hongyan li 2021-03-28 02:23:10 UTC
Tested with cluster-bot and the PR; the issue is fixed.

oc get prometheusrules prometheus-k8s-rules -n openshift-monitoring -oyaml|grep -A 10 ThanosSidecarUnhealthy
    - alert: ThanosSidecarUnhealthy
      annotations:
        description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value }} seconds.
        summary: Thanos Sidecar is unhealthy.
      expr: |
        time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})) by (job,pod) >= 240
      labels:
        severity: critical

Comment 3 Damien Grisonnet 2021-04-30 12:49:24 UTC
I closed the backport PR as the fix for bug 1921335 includes a regression: https://github.com/thanos-io/thanos/issues/3990.

Comment 4 Damien Grisonnet 2021-04-30 12:56:09 UTC
The resolution of this bug depends on bug 1955586, so we can't make any progress on it until it is resolved.

Comment 9 hongyan li 2021-07-29 09:16:02 UTC
Tested with payload 4.8.0-0.nightly-2021-07-29-033031; the alert rule has changed and the query data makes sense.

$ oc get prometheusrules prometheus-k8s-prometheus-rules -n openshift-monitoring -oyaml|grep -A 10 ThanosSidecarUnhealthy
    - alert: ThanosSidecarUnhealthy
      annotations:
        description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for
          more than {{ $value }} seconds.
        summary: Thanos Sidecar is unhealthy.
      expr: |
        time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) by (job,pod) >= 240
      for: 1h
      labels:
        severity: warning
$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query --data-urlencode query='time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) by (job,pod)' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   481    0   306  100   175   9562   5468 --:--:-- --:--:-- --:--:-- 15516
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "job": "prometheus-k8s-thanos-sidecar",
          "pod": "prometheus-k8s-0"
        },
        "value": [
          1627550008.428,
          "26.631994009017944"
        ]
      },
      {
        "metric": {
          "job": "prometheus-k8s-thanos-sidecar",
          "pod": "prometheus-k8s-1"
        },
        "value": [
          1627550008.428,
          "28.255205631256104"
        ]
      }
    ]
  }
}
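
Both pods are roughly 26-28 seconds behind on their last successful heartbeat, well under the 240-second threshold, so the alert would not fire. As a quick sanity check, appending the threshold to the same query should return an empty result on a healthy cluster (same token and curl pattern as above):

oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query --data-urlencode query='time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) by (job,pod) >= 240' | jq '.data.result'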

Comment 12 errata-xmlrpc 2021-08-10 11:27:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.4 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2983