+++ This bug was initially created as a clone of Bug #1921335 +++

Description of problem:

After the upgrade to 4.7, the alert ThanosSidecarUnhealthy has fired occasionally.
https://coreos.slack.com/archives/CHY2E1BL4/p1611783508050900

[FIRING:1] ThanosSidecarUnhealthy prometheus-k8s-thanos-sidecar (prometheus-k8s-0 openshift-monitoring/k8s critical)

Version-Release number of selected component (if applicable):

oc --context build01 get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-fc.3   True        False         8d      Cluster version is 4.7.0-fc.3

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

name: ThanosSidecarUnhealthy
expr: time() - max by(job, pod) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) >= 600
labels:
  severity: critical
annotations:
  description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value }} seconds.
  summary: Thanos Sidecar is unhealthy.

Expected results:

A document explaining what a cluster admin needs to do when seeing this alert (since it is critical).

Additional info:

Sergiusz helped me the last time to determine the cause. The logs and metric screenshot are also in the Slack conversation.
https://coreos.slack.com/archives/C01K2ULGE1H/p1611073858038200
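Until such a document exists, here is a minimal investigation sketch for when this alert fires. It reuses the in-cluster query endpoint and service-account token pattern shown later in this bug, and assumes the sidecar container inside the prometheus-k8s pods is named thanos-sidecar:

# How stale is each sidecar's heartbeat right now? (result values are seconds)
token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
  curl -sk -H "Authorization: Bearer $token" \
  https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query \
  --data-urlencode query='time() - max by (job, pod) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})'

# Check the sidecar's own logs for connection or WAL-replay related errors
# (container name thanos-sidecar is an assumption)
oc -n openshift-monitoring logs prometheus-k8s-0 -c thanos-sidecar --tail=100

Values well below the alert threshold (600 seconds in the rule above) mean the heartbeat is healthy; a steadily growing value points at the sidecar losing contact with Prometheus, for example during a WAL replay.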
--- Additional comment from Red Hat Bugzilla on 2021-02-04 09:57:10 UTC ---

remove performed by PnT Account Manager <pnt-expunge>

--- Additional comment from Damien Grisonnet on 2021-02-08 15:59:03 UTC ---

No time to work on this bug this sprint, because of higher priority bugs and low team capacity.

--- Additional comment from Damien Grisonnet on 2021-03-02 14:31:56 UTC ---

No time to work on this bug this sprint, because of higher priority bugs and low team capacity.

--- Additional comment from Hongkai Liu on 2021-03-03 16:36:27 UTC ---

The alert fired while upgrading build01 from 4.7.0-rc.1 to 4.7.0.
https://coreos.slack.com/archives/CHY2E1BL4/p1614788698069200

--- Additional comment from Damien Grisonnet on 2021-03-04 09:21:36 UTC ---

Considering the criticality and frequency of this alert, I'm elevating it to medium/medium and will prioritize this BZ in the current sprint.

--- Additional comment from Damien Grisonnet on 2021-03-17 13:35:08 UTC ---

I added a link to a discussion we started upstream to make the `ThanosSidecarUnhealthy` and `ThanosSidecarPrometheusDown` alerts resilient to WAL replays. I also linked a recent fix that prevents the `ThanosSidecarUnhealthy` alert from firing instantly during upgrades. Once brought downstream, this should fix the failures we are often seeing in CI.

--- Additional comment from Simon Pasquier on 2021-03-18 09:01:54 UTC ---

--- Additional comment from Simon Pasquier on 2021-03-18 09:17:35 UTC ---

Moved severity to high to align with bug 1940262.

--- Additional comment from Gabe Montero on 2021-03-22 18:28:35 UTC ---

If it helps, I'm seeing this with some consistency in a master branch openshift-apiserver PR I have (2 failures and 2 passes in my last 4 e2e-aws-serial runs as of the time of this comment).

Failures:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_openshift-apiserver/191/pull-ci-openshift-openshift-apiserver-master-e2e-aws-serial/1373992365517180928
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_openshift-apiserver/191/pull-ci-openshift-openshift-apiserver-master-e2e-aws-serial/1372968169047592960

Successes:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_openshift-apiserver/192/pull-ci-openshift-openshift-apiserver-master-e2e-aws-serial/1373264591286439936
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_openshift-apiserver/192/pull-ci-openshift-openshift-apiserver-master-e2e-aws-serial/1371487375661731840

Some of the known upgrade issues are also getting in the way of my PR, so this is not the sole blocker, but depending on how things evolve over the next week, it could become one if it continues to occur frequently enough.

--- Additional comment from Clayton Coleman on 2021-03-24 16:10:27 UTC ---

10% of CI runs fail on this alert.

--- Additional comment from Damien Grisonnet on 2021-03-25 10:38:13 UTC ---

The PR attached to this BZ should fix the issue we've seen in CI where the alert fires straight away instead of after 10 minutes. It also readjusts the Thanos sidecar alerts by decreasing their severity to `warning` and increasing their duration to 1 hour, as per recent discussions around alerting in OCP.

However, considering the urgency of the CI failures, it does not make the Thanos sidecar alerts more resilient to WAL replays, as this is still in progress. Thus, I created https://bugzilla.redhat.com/show_bug.cgi?id=1942913 to track this effort.

--- Additional comment from OpenShift Automated Release Tooling on 2021-03-25 11:56:51 UTC ---

Elliott changed bug status from MODIFIED to ON_QA.

--- Additional comment from Steve Kuznetsov on 2021-03-25 17:57:53 UTC ---

Will this be backported to 4.7? We are still seeing this on every 4.7 z-stream upgrade.

--- Additional comment from hongyan li on 2021-03-26 01:32:29 UTC ---

Tested with payload 4.8.0-0.nightly-2021-03-25-160359.

The issue is fixed by the thanos-sidecar rules below; the related alerts now have severity warning and a 1h for duration.

oc get prometheusrules prometheus-k8s-prometheus-rules -n openshift-monitoring -oyaml|grep -A 20 thanos-sidecar
    - name: thanos-sidecar
      rules:
      - alert: ThanosSidecarPrometheusDown
        annotations:
          description: Thanos Sidecar {{$labels.job}} {{$labels.instance}} cannot connect to Prometheus.
          summary: Thanos Sidecar cannot connect to Prometheus
        expr: |
          sum by (job, instance) (thanos_sidecar_prometheus_up{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"} == 0)
        for: 1h
        labels:
          severity: warning
      - alert: ThanosSidecarBucketOperationsFailed
        annotations:
          description: Thanos Sidecar {{$labels.job}} {{$labels.instance}} bucket operations are failing
          summary: Thanos Sidecar bucket operations are failing
        expr: |
          rate(thanos_objstore_bucket_operation_failures_total{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}[5m]) > 0
        for: 1h
        labels:
          severity: warning
      - alert: ThanosSidecarUnhealthy
        annotations:
          description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for more than {{ $value }} seconds.
          summary: Thanos Sidecar is unhealthy.
        expr: |
          time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})) by (job,pod) >= 240
        for: 1h
        labels:
          severity: warning
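To double-check on a live cluster that the reworked alerts are neither pending nor firing, Prometheus' built-in ALERTS series can be queried through the same in-cluster endpoint. A minimal sketch; the alertname regex and jq selection are illustrative:

token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# ALERTS carries alertstate="pending" or alertstate="firing"; an empty result
# means none of the Thanos sidecar alerts are active.
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
  curl -sk -H "Authorization: Bearer $token" \
  https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query \
  --data-urlencode query='ALERTS{alertname=~"ThanosSidecar.*"}' | jq '.data.result'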
Tested with cluster-bot and the PR; the issue is fixed.

oc get prometheusrules prometheus-k8s-rules -n openshift-monitoring -oyaml|grep -A 10 ThanosSidecarUnhealthy
      - alert: ThanosSidecarUnhealthy
        annotations:
          description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value }} seconds.
          summary: Thanos Sidecar is unhealthy.
        expr: |
          time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})) by (job,pod) >= 240
        labels:
          severity: critical
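For context on the expression change above (my reading of the PromQL semantics, not an explanation quoted from the PR): the original rule compares the current time against the metric's value, while the reworked rule wraps the metric in timestamp(), which returns the time each sample was last scraped rather than its value.

# Original rule: uses the metric *value*, i.e. the last heartbeat time reported by the
# sidecar, so it tracks how long the sidecar has gone without a successful heartbeat.
time() - max by (job, pod) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) >= 600

# Reworked rule: timestamp() returns the scrape time of each sample, so this tracks how
# long ago the series itself was last collected.
time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})) by (job,pod) >= 240

Note that the final 4.8 rule shown later in this bug drops timestamp() again and relies on the 1h for duration instead.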
I closed the backport PR as the fix for bug 1921335 includes a regression: https://github.com/thanos-io/thanos/issues/3990.
The resolution of this bug depends on bug 1955586, so we can't make any progress on it until it is resolved.
Tested with payload 4.8.0-0.nightly-2021-07-29-033031; the alert rule has changed and the query data makes sense.

$ oc get prometheusrules prometheus-k8s-prometheus-rules -n openshift-monitoring -oyaml|grep -A 10 ThanosSidecarUnhealthy
      - alert: ThanosSidecarUnhealthy
        annotations:
          description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for more than {{ $value }} seconds.
          summary: Thanos Sidecar is unhealthy.
        expr: |
          time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) by (job,pod) >= 240
        for: 1h
        labels:
          severity: warning

$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query --data-urlencode query='time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) by (job,pod)' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   481    0   306  100   175   9562   5468 --:--:-- --:--:-- --:--:-- 15516
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "job": "prometheus-k8s-thanos-sidecar",
          "pod": "prometheus-k8s-0"
        },
        "value": [
          1627550008.428,
          "26.631994009017944"
        ]
      },
      {
        "metric": {
          "job": "prometheus-k8s-thanos-sidecar",
          "pod": "prometheus-k8s-1"
        },
        "value": [
          1627550008.428,
          "28.255205631256104"
        ]
      }
    ]
  }
}
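As a small follow-up to the verification above, the per-pod staleness can be pulled out of the JSON directly. The jq filter is illustrative and reuses the $token from the previous command:

oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
  curl -sk -H "Authorization: Bearer $token" \
  https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query \
  --data-urlencode query='time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) by (job,pod)' \
  | jq -r '.data.result[] | "\(.metric.pod): \(.value[1])s since the last successful heartbeat"'

Values comfortably below the 240-second threshold (roughly 27-28 seconds for both pods above) mean the alert condition is not met.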
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.4 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2983