Bug 1940262

Summary: ThanosSidecarUnhealthy alert fires for 30s during 4.8 to 4.8 upgrade
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Type: Bug
Reporter: Clayton Coleman <ccoleman>
Assignee: Damien Grisonnet <dgrisonn>
QA Contact: Junqi Zhao <juzhao>
CC: alegrand, anpicker, dgrisonn, erooth, kakkoyun, lcosic, pkrupa, spasquie, surbania
Last Closed: 2021-03-18 09:01:56 UTC

Description Clayton Coleman 2021-03-18 00:21:38 UTC
As part of the effort to ensure alerts don't fire during upgrades, we see ThanosSidecarUnhealthy fire on the first Prometheus instance (prometheus-k8s-0):

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1372307576271671296

ThanosSidecarUnhealthy fired for 30 seconds with labels: {job="prometheus-k8s-thanos-sidecar", pod="prometheus-k8s-0", severity="critical"}

Grabbing the Prometheus data (the CI jobs have switched to grabbing prometheus-k8s-1), we can see that the value of this series:

thanos_sidecar_last_heartbeat_success_time_seconds{container="kube-rbac-proxy-thanos", endpoint="thanos-proxy", instance="10.129.2.24:10902", job="prometheus-k8s-thanos-sidecar", namespace="openshift-monitoring", pod="prometheus-k8s-0", service="prometheus-k8s-thanos-sidecar"}

is 0 briefly (first scrape, probably right after restart), and then takes on a timestamp.

When run through the rule expression for the alert:

        time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) by (job, pod) >= 600

The value of the prometheus-k8s-0 series above is briefly (time() - 0), which is >= 600.
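To make the arithmetic concrete (the timestamp below is only approximate, taken as roughly when the job ran on 2021-03-18):

        time() - 0  =  ~1616000000   # current Unix time in seconds, trivially >= 600

so the comparison is true on the very first evaluation for that pod.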

The alert fires because it has no "for" clause (so it fires right away); if that is intentional, then the sidecar should not report 0 when it starts.  Most likely we simply want a "for" of a reasonable minimum duration (for: 300s, >= 300s), as sketched below; another option is to not report the series until the sidecar is synced (but that wouldn't catch upfront issues).  I see the upstream threshold is now 240s, but the alert still has no "for".
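For illustration only, a minimal sketch of what the amended rule could look like — not the exact cluster-monitoring-operator rule; the threshold and "for" duration here just follow the 300s suggestion above, and the severity label is taken from the firing alert:

    - alert: ThanosSidecarUnhealthy
      # Hypothetical values: heartbeat stale for >= 300s, and the condition
      # must hold for 5 minutes before the alert fires.
      expr: |
        time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) by (job, pod) >= 300
      for: 5m
      labels:
        severity: critical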

Comment 1 Clayton Coleman 2021-03-18 00:22:11 UTC
Set to high severity because we are needlessly firing a critical alert during an upgrade and confusing users.

Comment 2 Damien Grisonnet 2021-03-18 08:54:33 UTC
A fix for this particular issue was recently merged upstream (https://github.com/thanos-io/thanos/pull/3204); we will just need to bring it downstream.

Comment 3 Simon Pasquier 2021-03-18 09:01:56 UTC

*** This bug has been marked as a duplicate of bug 1921335 ***