Bug 1940262

Summary: ThanosSidecarUnhealthy alert fires for 30s during 4.8 to 4.8 upgrade
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Type: Bug
Reporter: Clayton Coleman <ccoleman>
Assignee: Damien Grisonnet <dgrisonn>
QA Contact: Junqi Zhao <juzhao>
CC: alegrand, anpicker, dgrisonn, erooth, kakkoyun, lcosic, pkrupa, spasquie, surbania
Last Closed: 2021-03-18 09:01:56 UTC

Description Clayton Coleman 2021-03-18 00:21:38 UTC
As part of the effort to ensure alerts don't fire during upgrades, we see ThanosSidecarUnhealthy fire on the first Prometheus instance (prometheus-k8s-0):

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1372307576271671296

ThanosSidecarUnhealthy fired for 30 seconds with labels: {job="prometheus-k8s-thanos-sidecar", pod="prometheus-k8s-0", severity="critical"}

Grabbing the Prometheus data (the CI jobs have switched to grabbing prometheus-k8s-1), we can see that the value of this series:

thanos_sidecar_last_heartbeat_success_time_seconds{container="kube-rbac-proxy-thanos", endpoint="thanos-proxy", instance="10.129.2.24:10902", job="prometheus-k8s-thanos-sidecar", namespace="openshift-monitoring", pod="prometheus-k8s-0", service="prometheus-k8s-thanos-sidecar"}

is 0 briefly (first scrape, probably right after restart), and then takes on a timestamp.

When run through the rule expression for the alert:

        time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) by (job, pod) >= 600

The value of the prometheus-k8s-0 series above is briefly (time() - 0), which is >= 600.
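To make the arithmetic concrete (the timestamp below is only approximate, taken as roughly when the job ran on 2021-03-18):

        time() - 0  =  ~1616000000   # current Unix time in seconds, trivially >= 600

so the comparison is true on the very first evaluation for that pod.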

The alert fires because it has no "for" clause (so it fires right away); if that is intentional, then the sidecar should not report 0 when it starts.  Most likely we simply want a "for" of a reasonable minimum duration (for: 300s, >= 300s), as sketched below; another option is to not report the series until the sidecar is synced (but that wouldn't catch upfront issues).  I see the upstream threshold is now 240s, but the alert still has no "for".
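For illustration only, a minimal sketch of what the amended rule could look like — not the exact cluster-monitoring-operator rule; the threshold and "for" duration here just follow the 300s suggestion above, and the severity label is taken from the firing alert:

    - alert: ThanosSidecarUnhealthy
      # Hypothetical values: heartbeat stale for >= 300s, and the condition
      # must hold for 5 minutes before the alert fires.
      expr: |
        time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) by (job, pod) >= 300
      for: 5m
      labels:
        severity: critical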

Comment 1 Clayton Coleman 2021-03-18 00:22:11 UTC
Set to high severity because we are needlessly firing a critical alert during an upgrade and confusing users.

Comment 2 Damien Grisonnet 2021-03-18 08:54:33 UTC
A fix for this particular issue was recently merged upstream (https://github.com/thanos-io/thanos/pull/3204); we will just need to bring it downstream.

Comment 3 Simon Pasquier 2021-03-18 09:01:56 UTC

*** This bug has been marked as a duplicate of bug 1921335 ***