Bug 1942913
Summary: | ThanosSidecarUnhealthy isn't resilient to WAL replays. | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Damien Grisonnet <dgrisonn> |
Component: | Monitoring | Assignee: | Arunprasad Rajkumar <arajkuma> |
Status: | CLOSED ERRATA | QA Contact: | hongyan li <hongyli> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4.7 | CC: | anpicker, erooth, hongyli, juzhao, spasquie, wking |
Target Milestone: | --- | ||
Target Release: | 4.10.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause:
When Prometheus restarts, it replays the WAL (write-ahead log) to avoid data loss. If the replay took too long to complete, ThanosSidecarUnhealthy fired even though nothing was wrong, which was a false positive.
Consequence:
The ThanosSidecarUnhealthy alert fired as a false positive during long WAL replays.
Fix:
Use the Prometheus TSDB metric `prometheus_tsdb_data_replay_duration_seconds` and fire ThanosSidecarUnhealthy only once that metric is non-zero, that is, only after the WAL replay has completed.
Result:
The alert `ThanosSidecarUnhealthy` no longer fires as a false positive while the WAL is being replayed.
|
Last Closed: | 2022-03-12 04:34:58 UTC | Type: | --- |
Description
Damien Grisonnet
2021-03-25 10:29:40 UTC
Upstream fix is in review state.

Checked with 4.10.0-0.nightly-2021-10-07-212540, ThanosSidecarUnhealthy is renamed to ThanosSidecarNoConnectionToStartedPrometheus, and no such alert from CI jobs:
https://search.ci.openshift.org/?search=ThanosSidecarNoConnectionToStartedPrometheus&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

    - alert: ThanosSidecarNoConnectionToStartedPrometheus
      annotations:
        description: Thanos Sidecar {{$labels.instance}} is unhealthy.
        summary: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL.
      expr: |
        thanos_sidecar_prometheus_up{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"} == 0
        AND on (namespace, pod)
        prometheus_tsdb_data_replay_duration_seconds != 0
      for: 1h
      labels:
        severity: warning

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
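
For anyone who wants to check the behaviour of the rule quoted above without a cluster, the following is a minimal sketch of a `promtool test rules` unit test. It is not part of the shipped fix. It assumes the rule has been saved, wrapped in a rule group, to a file named `rules.yaml` (a hypothetical name), and the label values (`openshift-monitoring`, `prometheus-k8s-0`, `10.0.0.1:10902`) and the replay duration of 12.5 seconds are made up for illustration. The first test models a Prometheus that is still replaying the WAL (replay metric at 0), so no alert is expected. The second models a finished replay with the sidecar still unable to reach Prometheus for more than one hour, so the alert is expected to fire.

```yaml
# Sketch of a promtool unit test; file name and label values are assumptions.
rule_files:
  - rules.yaml   # hypothetical file containing the alert rule from this report

evaluation_interval: 1m

tests:
  # Case 1: WAL replay still in progress (replay duration metric is 0).
  # The sidecar reports Prometheus as down, but the alert must stay silent.
  - interval: 1m
    input_series:
      - series: 'thanos_sidecar_prometheus_up{job="prometheus-k8s-thanos-sidecar", namespace="openshift-monitoring", pod="prometheus-k8s-0", instance="10.0.0.1:10902"}'
        values: '0+0x120'
      - series: 'prometheus_tsdb_data_replay_duration_seconds{namespace="openshift-monitoring", pod="prometheus-k8s-0"}'
        values: '0+0x120'
    alert_rule_test:
      - eval_time: 90m
        alertname: ThanosSidecarNoConnectionToStartedPrometheus
        exp_alerts: []

  # Case 2: WAL replay finished (non-zero duration) and the sidecar has been
  # unable to reach Prometheus for longer than the 1h "for" clause.
  - interval: 1m
    input_series:
      - series: 'thanos_sidecar_prometheus_up{job="prometheus-k8s-thanos-sidecar", namespace="openshift-monitoring", pod="prometheus-k8s-0", instance="10.0.0.1:10902"}'
        values: '0+0x120'
      - series: 'prometheus_tsdb_data_replay_duration_seconds{namespace="openshift-monitoring", pod="prometheus-k8s-0"}'
        values: '12.5+0x120'
    alert_rule_test:
      - eval_time: 90m
        alertname: ThanosSidecarNoConnectionToStartedPrometheus
        exp_alerts:
          - exp_labels:
              severity: warning
              job: prometheus-k8s-thanos-sidecar
              namespace: openshift-monitoring
              pod: prometheus-k8s-0
              instance: 10.0.0.1:10902
            exp_annotations:
              description: Thanos Sidecar 10.0.0.1:10902 is unhealthy.
              summary: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL.
```

Assuming the test is saved as `rules_test.yaml` (also a hypothetical name), it can be run with `promtool test rules rules_test.yaml`. The `AND on (namespace, pod)` join is what ties each sidecar series to the replay metric of the Prometheus in the same pod, which is why both input series carry matching `namespace` and `pod` labels.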