Bug 1940262 - ThanosSidecarUnhealthy alert fires for 30s during 4.8 to 4.8 upgrade
Summary: ThanosSidecarUnhealthy alert fires for 30s during 4.8 to 4.8 upgrade
Keywords:
Status: CLOSED DUPLICATE of bug 1921335
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Damien Grisonnet
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-18 00:21 UTC by Clayton Coleman
Modified: 2021-03-18 09:01 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-18 09:01:56 UTC
Target Upstream Version:
Embargoed:




Links
Github thanos-io/thanos pull 3204 (closed): "mixin: Use sidecar's metric timestamp for healthcheck" (last updated 2021-03-18 08:54:31 UTC)

Description Clayton Coleman 2021-03-18 00:21:38 UTC
As part of ensuring that alerts don't fire during upgrades, we see ThanosSidecarUnhealthy fire for the first Prometheus instance (prometheus-k8s-0):

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1372307576271671296

ThanosSidecarUnhealthy fired for 30 seconds with labels: {job="prometheus-k8s-thanos-sidecar", pod="prometheus-k8s-0", severity="critical"}

Grabbing the Prometheus data (the jobs have switched to grabbing prometheus-k8s-1), we can see that the value of this series:

thanos_sidecar_last_heartbeat_success_time_seconds{container="kube-rbac-proxy-thanos", endpoint="thanos-proxy", instance="10.129.2.24:10902", job="prometheus-k8s-thanos-sidecar", namespace="openshift-monitoring", pod="prometheus-k8s-0", service="prometheus-k8s-thanos-sidecar"}

is 0 briefly (first scrape, probably right after restart), and then takes on a timestamp.
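
For reference, a quick ad-hoc query (my own check, not part of any shipped rule) that surfaces the transient zero sample described above:

    # Returns the sidecar series whose last-heartbeat value is still 0,
    # i.e. the state observed right after the pod restarts.
    thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"} == 0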

When evaluated through the alert's rule expression

        time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) by (job, pod) >= 600

The value of the k8s-0 series above is briefly (time() - 0), which is >= 600.

The alert fires because there is no "for" on the alert (so it fires right away), but in that case the sidecar should not report 0 when it starts. It's likely we simply want a "for" of a reasonable minimum duration (for: 300s, with a >= 300s threshold), but another option is to have the series not reported until the sidecar has synced (though that wouldn't catch issues present from the start). I see that upstream the threshold is now 240s, but the alert still has no "for". A rough sketch of the first option is below.
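
A minimal sketch of what that could look like as a Prometheus rule file; the group name is made up, the 300s threshold and 5m "for" are simply the values suggested above (not what ships today), and the "> 0" filter is one way to ignore the initial zero sample on the query side:

    # Sketch only, under the assumptions stated above; not the rule that ships.
    groups:
    - name: thanos-sidecar-example.rules
      rules:
      - alert: ThanosSidecarUnhealthy
        # The "> 0" filter drops the transient zero sample from a freshly
        # restarted sidecar; the trade-off is that a sidecar which never
        # reports a heartbeat is also dropped (the caveat noted above).
        expr: |
          time()
            - max by (job, pod) (
                thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"} > 0
              )
          >= 300
        for: 5m
        labels:
          severity: critical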

Comment 1 Clayton Coleman 2021-03-18 00:22:11 UTC
Set to high because we are firing a trivial alert during an upgrade and confusing users.

Comment 2 Damien Grisonnet 2021-03-18 08:54:33 UTC
A fix for this particular issue was recently merged upstream (https://github.com/thanos-io/thanos/pull/3204); we will just need to bring it downstream.
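
Going only by the PR title ("Use sidecar's metric timestamp for healthcheck"), the upstream change presumably compares against the time the metric was last scraped rather than its raw (possibly zero) value. Roughly something like the following, though the exact upstream expression may differ:

    # Rough sketch inferred from the PR title, not the exact upstream expression:
    # timestamp() yields the time of the sample itself, so a just-restarted
    # sidecar no longer produces a huge (time() - 0) difference.
    time()
      - max by (job, pod) (
          timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})
        )
    >= 240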

Comment 3 Simon Pasquier 2021-03-18 09:01:56 UTC

*** This bug has been marked as a duplicate of bug 1921335 ***

