Description of problem:

After upgrading to 4.7, the ThanosSidecarUnhealthy alert has fired occasionally.
https://coreos.slack.com/archives/CHY2E1BL4/p1611783508050900

[FIRING:1] ThanosSidecarUnhealthy prometheus-k8s-thanos-sidecar (prometheus-k8s-0 openshift-monitoring/k8s critical)

Version-Release number of selected component (if applicable):

oc --context build01 get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-fc.3   True        False         8d      Cluster version is 4.7.0-fc.3

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

- name: ThanosSidecarUnhealthy
  expr: time() - max by(job, pod) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) >= 600
  labels:
    severity: critical
  annotations:
    description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value }} seconds.
    summary: Thanos Sidecar is unhealthy.

Expected results:

A document explaining what a cluster admin needs to do when seeing this alert (since it is critical).

Additional info:

Sergiusz helped me the last time to determine the cause. The logs and metric screenshot are also in the Slack conversation.
https://coreos.slack.com/archives/C01K2ULGE1H/p1611073858038200
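For reference, the alert expression compares the current time against the freshest heartbeat timestamp per (job, pod) pair. A minimal Python sketch of that check, assuming a hypothetical `heartbeats` mapping that mirrors the thanos_sidecar_last_heartbeat_success_time_seconds series (not part of any Thanos API):

```python
import time

def unhealthy_sidecars(heartbeats, threshold=600, now=None):
    """Sketch of the PromQL condition:
    time() - max by (job, pod) (heartbeat) >= threshold.

    heartbeats: dict mapping (job, pod) -> list of last-heartbeat
    timestamps in seconds since epoch (hypothetical input shape).
    Returns the set of (job, pod) pairs whose freshest heartbeat is
    at least `threshold` seconds old.
    """
    now = time.time() if now is None else now
    return {
        key for key, stamps in heartbeats.items()
        if now - max(stamps) >= threshold
    }
```

With the original rule, a sidecar whose last successful heartbeat is more than 600 seconds old is reported unhealthy, which is why long WAL replays during upgrades can trip it.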
The alert fired during upgrading build01 from 4.7.0-rc.1 to 4.7.0. https://coreos.slack.com/archives/CHY2E1BL4/p1614788698069200
Considering the criticality and frequency of this alert, I'm elevating it to medium/medium and prioritizing this BZ in the current sprint.
I added a link to a discussion we started upstream to make the `ThanosSidecarUnhealthy` and `ThanosSidecarPrometheusDown` alerts resilient to WAL replays. I also linked a recent fix that prevents the `ThanosSidecarUnhealthy` alert from firing instantly during upgrades. Once brought downstream, this should fix the failures we are often seeing in CI.
*** Bug 1940262 has been marked as a duplicate of this bug. ***
10% of CI runs fail on this alert.
The PR attached to this BZ should fix the issue we've seen in CI where the alert fires straight away instead of after 10 minutes. It also readjusts the Thanos Sidecar alerts, decreasing their severity to `warning` and increasing their duration to 1 hour, as per recent discussions around alerting in OCP. However, given the urgency of the CI failures, it does not make the Thanos sidecar alerts more resilient to WAL replays, as that work is still in progress. I created https://bugzilla.redhat.com/show_bug.cgi?id=1942913 to track this effort.
Will this be backported to 4.7? We are still seeing this on every 4.7 z stream upgrade.
Tested with payload 4.8.0-0.nightly-2021-03-25-160359. The issue is fixed: the thanos-sidecar rules are now as follows, with the related alerts at severity warning and a duration of 1h.

oc get prometheusrules prometheus-k8s-prometheus-rules -n openshift-monitoring -oyaml | grep -A 20 thanos-sidecar

- name: thanos-sidecar
  rules:
  - alert: ThanosSidecarPrometheusDown
    annotations:
      description: Thanos Sidecar {{$labels.job}} {{$labels.instance}} cannot connect to Prometheus.
      summary: Thanos Sidecar cannot connect to Prometheus
    expr: |
      sum by (job, instance) (thanos_sidecar_prometheus_up{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"} == 0)
    for: 1h
    labels:
      severity: warning
  - alert: ThanosSidecarBucketOperationsFailed
    annotations:
      description: Thanos Sidecar {{$labels.job}} {{$labels.instance}} bucket operations are failing
      summary: Thanos Sidecar bucket operations are failing
    expr: |
      rate(thanos_objstore_bucket_operation_failures_total{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}[5m]) > 0
    for: 1h
    labels:
      severity: warning
  - alert: ThanosSidecarUnhealthy
    annotations:
      description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for more than {{ $value }} seconds.
      summary: Thanos Sidecar is unhealthy.
    expr: |
      time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})) by (job,pod) >= 240
    for: 1h
    labels:
      severity: warning
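The `for: 1h` clause in the updated rules means the expression must remain true continuously for an hour before the alert fires; a transient WAL replay during an upgrade no longer trips it immediately. A rough illustration of that semantics (a simplified sketch; real Prometheus tracks pending/firing state per series):

```python
def fires(samples, duration=3600):
    """samples: list of (timestamp_seconds, condition_true) rule
    evaluations in time order. Returns True only if the condition
    has been continuously true for at least `duration` seconds by
    some evaluation, mimicking a Prometheus 'for:' clause."""
    streak_start = None
    for ts, cond in samples:
        if cond:
            if streak_start is None:
                streak_start = ts  # condition just became true
            if ts - streak_start >= duration:
                return True  # pending long enough -> firing
        else:
            streak_start = None  # condition cleared; reset the streak
    return False
```

So a condition that flaps back to false partway through the hour, as happens when a sidecar heartbeat recovers after a replay, resets the pending timer and never fires.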
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438