This bug was initially created as a copy of Bug #1921335.

I am copying this bug because: the ThanosSidecarUnhealthy and ThanosSidecarPrometheusDown alerts aren't resilient to WAL replays, which means that they might fire during OCP upgrades, and the administrator can do nothing to resolve the alert other than wait.

Description of problem:
After the upgrade to 4.7, the alert ThanosSidecarUnhealthy has fired occasionally.
https://coreos.slack.com/archives/CHY2E1BL4/p1611783508050900

[FIRING:1] ThanosSidecarUnhealthy prometheus-k8s-thanos-sidecar (prometheus-k8s-0 openshift-monitoring/k8s critical)

Version-Release number of selected component (if applicable):
oc --context build01 get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-fc.3   True        False         8d      Cluster version is 4.7.0-fc.3

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
name: ThanosSidecarUnhealthy
expr: time() - max by(job, pod) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) >= 600
labels:
  severity: critical
annotations:
  description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value }} seconds.
  summary: Thanos Sidecar is unhealthy.

Expected results:
A document explaining what a cluster admin needs to do when seeing this alert (since it is critical).

Additional info:
Sergiusz helped me the last time to determine the cause. The logs and metric screenshot are also in the Slack conversation.
https://coreos.slack.com/archives/C01K2ULGE1H/p1611073858038200
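
For illustration, the ThanosSidecarUnhealthy expression under "Actual results" can be mimicked outside Prometheus. This is a minimal sketch, not the actual rule evaluation. It assumes that the sidecar heartbeat cannot succeed while Prometheus replays its WAL; the function name and the 15-minute replay duration are hypothetical.

```python
THRESHOLD_SECONDS = 600  # from the rule: time() - last_heartbeat >= 600


def sidecar_unhealthy_fires(now: float, last_heartbeat_success: float) -> bool:
    """Mirror of `time() - thanos_sidecar_last_heartbeat_success_time_seconds >= 600`."""
    return now - last_heartbeat_success >= THRESHOLD_SECONDS


if __name__ == "__main__":
    # Hypothetical timeline: the last successful heartbeat is at t=0, then
    # Prometheus restarts and replays its WAL for 15 minutes (assumed value).
    last_heartbeat = 0.0
    replay_seconds = 15 * 60

    for minute in (5, 10, 15):
        now = minute * 60.0
        still_replaying = now <= replay_seconds
        fires = sidecar_unhealthy_fires(now, last_heartbeat)
        print(f"t={minute}min replaying={still_replaying} fires={fires}")
```

Under these assumptions the condition is already true at the 10-minute mark, while the replay is still in progress, which matches the report that the alert fires during upgrades with nothing for the administrator to do but wait.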
Upstream fix is in review state.
Checked with 4.10.0-0.nightly-2021-10-07-212540: ThanosSidecarUnhealthy has been renamed to ThanosSidecarNoConnectionToStartedPrometheus, and no such alert is reported by CI jobs.
https://search.ci.openshift.org/?search=ThanosSidecarNoConnectionToStartedPrometheus&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

- alert: ThanosSidecarNoConnectionToStartedPrometheus
  annotations:
    description: Thanos Sidecar {{$labels.instance}} is unhealthy.
    summary: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL.
  expr: |
    thanos_sidecar_prometheus_up{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"} == 0
    AND on (namespace, pod)
    prometheus_tsdb_data_replay_duration_seconds != 0
  for: 1h
  labels:
    severity: warning
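
The effect of the ThanosSidecarNoConnectionToStartedPrometheus rule can be sketched as follows. This is a simplified model, not how Prometheus evaluates rules: it assumes that prometheus_tsdb_data_replay_duration_seconds stays at 0 until the WAL replay has finished, it treats the `AND on (namespace, pod)` match as a plain boolean AND for a single pod, and the `Sample` type, the function names, and the sample values are hypothetical.

```python
from dataclasses import dataclass

FOR_SECONDS = 3600  # from the rule: `for: 1h`


@dataclass
class Sample:
    t: float                      # evaluation timestamp in seconds
    sidecar_prometheus_up: int    # thanos_sidecar_prometheus_up
    replay_duration: float        # prometheus_tsdb_data_replay_duration_seconds


def condition(s: Sample) -> bool:
    """Sidecar cannot reach Prometheus AND the WAL replay has completed."""
    return s.sidecar_prometheus_up == 0 and s.replay_duration != 0


def fires(samples: list[Sample]) -> bool:
    """True once the condition has held continuously for FOR_SECONDS."""
    pending_since = None
    for s in samples:
        if condition(s):
            if pending_since is None:
                pending_since = s.t
            if s.t - pending_since >= FOR_SECONDS:
                return True
        else:
            pending_since = None  # condition broke, the `for` timer resets
    return False


# Upgrade: 30 minutes of WAL replay (replay metric still 0), then healthy.
upgrade = [Sample(60.0 * i, 0, 0.0) for i in range(30)] + \
          [Sample(60.0 * i, 1, 42.0) for i in range(30, 120)]

# Real outage: replay finished, but the sidecar stays disconnected for an hour.
outage = [Sample(60.0 * i, 0, 42.0) for i in range(61)]

if __name__ == "__main__":
    print("upgrade fires:", fires(upgrade))
    print("outage fires:", fires(outage))
```

In this model the upgrade timeline never satisfies the condition, because the replay metric is still 0 while the sidecar is disconnected, whereas a sidecar that stays disconnected from a started Prometheus for a full hour does trigger the alert.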
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056