Bug 1921335
Summary: | ThanosSidecarUnhealthy | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Hongkai Liu <hongkliu> | |
Component: | Monitoring | Assignee: | Damien Grisonnet <dgrisonn> | |
Status: | CLOSED ERRATA | QA Contact: | hongyan li <hongyli> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | 4.7 | CC: | alegrand, anpicker, ccoleman, dgrisonn, erooth, hongyli, kakkoyun, lcosic, pkrupa, skuznets, spasquie, wking | |
Target Milestone: | --- | |||
Target Release: | 4.8.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | No Doc Update | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1943565 | Environment: | ||
Last Closed: | 2021-07-27 22:37:10 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: |
Description
Hongkai Liu
2021-01-27 22:06:18 UTC
The alert fired while upgrading build01 from 4.7.0-rc.1 to 4.7.0.

https://coreos.slack.com/archives/CHY2E1BL4/p1614788698069200

Considering the criticality and frequency of this alert, I'm elevating it to medium/medium and prioritizing this BZ in the current sprint.

I added a link to a discussion we started upstream to make the `ThanosSidecarUnhealthy` and `ThanosSidecarPrometheusDown` alerts resilient to WAL replays. I also linked a recent fix that prevents the `ThanosSidecarUnhealthy` alert from firing instantly during upgrades. Once brought downstream, this should fix the failures we are often seeing in CI.

*** Bug 1940262 has been marked as a duplicate of this bug. ***

10% of CI runs fail on this alert.

The PR attached to this BZ should fix the issue we've seen in CI where the alert fires straight away instead of after 10 minutes. It also readjusts the Thanos Sidecar alerts by decreasing their severity to `warning` and increasing their duration to 1 hour, as per recent discussions around alerting in OCP. However, considering the urgency of the CI failures, it does not make the Thanos sidecar alerts more resilient to WAL replays, as that work is still in progress. Thus, I created https://bugzilla.redhat.com/show_bug.cgi?id=1942913 to track this effort.

Will this be backported to 4.7? We are still seeing this on every 4.7 z-stream upgrade.

Tested with payload 4.8.0-0.nightly-2021-03-25-160359.
The issue is fixed by the thanos-sidecar rules shown below: the related alerts now have severity warning and a duration of 1h.

oc get prometheusrules prometheus-k8s-prometheus-rules -n openshift-monitoring -oyaml|grep -A 20 thanos-sidecar
  - name: thanos-sidecar
    rules:
    - alert: ThanosSidecarPrometheusDown
      annotations:
        description: Thanos Sidecar {{$labels.job}} {{$labels.instance}} cannot connect to Prometheus.
        summary: Thanos Sidecar cannot connect to Prometheus
      expr: |
        sum by (job, instance) (thanos_sidecar_prometheus_up{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"} == 0)
      for: 1h
      labels:
        severity: warning
    - alert: ThanosSidecarBucketOperationsFailed
      annotations:
        description: Thanos Sidecar {{$labels.job}} {{$labels.instance}} bucket operations are failing
        summary: Thanos Sidecar bucket operations are failing
      expr: |
        rate(thanos_objstore_bucket_operation_failures_total{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}[5m]) > 0
      for: 1h
      labels:
        severity: warning
    - alert: ThanosSidecarUnhealthy
      annotations:
        description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for more than {{ $value }} seconds.
        summary: Thanos Sidecar is unhealthy.
      expr: |
        time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})) by (job,pod) >= 240
      for: 1h
      labels:
        severity: warning

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
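To illustrate why raising the `for:` duration from 10 minutes to 1 hour keeps `ThanosSidecarUnhealthy` quiet during upgrades, here is a minimal Python sketch of a pending/firing state machine. It is a simplified model, not Prometheus's actual rule evaluator: the `PendingAlert` class, the `simulate` helper, the 30-second evaluation interval, and the 15-minute outage are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PendingAlert:
    """Hypothetical model of one alert rule with a `for:` hold duration."""
    hold_seconds: int
    active_since: Optional[int] = None

    def evaluate(self, now: int, condition: bool) -> str:
        # A false condition resets the alert, discarding any pending time.
        if not condition:
            self.active_since = None
            return "inactive"
        # Remember when the condition first became true.
        if self.active_since is None:
            self.active_since = now
        # Fire only once the condition has held for the full hold duration.
        if now - self.active_since >= self.hold_seconds:
            return "firing"
        return "pending"


def simulate(hold_seconds: int, outage_seconds: int, step: int = 30) -> List[str]:
    """Evaluate the rule every `step` seconds across an outage.

    The condition (for example, heartbeat age >= 240s) is assumed to be
    true from t=0 until t=outage_seconds, then false afterwards.
    """
    alert = PendingAlert(hold_seconds)
    states = []
    for t in range(0, outage_seconds + 2 * step, step):
        states.append(alert.evaluate(t, condition=t < outage_seconds))
    return states


if __name__ == "__main__":
    # Assumed 15-minute disruption during an upgrade.
    for hold in (600, 3600):
        fired = "firing" in simulate(hold, outage_seconds=900)
        print(f"for={hold}s fired={fired}")
```

Under these assumptions, a 15-minute disruption outlasts a 10-minute hold but never reaches a 1-hour one, so the alert stays pending and then clears.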