Bug 1940262 - ThanosSidecarUnhealthy alert fires for 30s during 4.8 to 4.8 upgrade
Summary: ThanosSidecarUnhealthy alert fires for 30s during 4.8 to 4.8 upgrade
Keywords:
Status: CLOSED DUPLICATE of bug 1921335
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Damien Grisonnet
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-18 00:21 UTC by Clayton Coleman
Modified: 2021-03-18 09:01 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-18 09:01:56 UTC
Target Upstream Version:
Embargoed:




Links
Github thanos-io/thanos pull 3204 (closed): "mixin: Use sidecar's metric timestamp for healthcheck" (last updated 2021-03-18 08:54:31 UTC)

Description Clayton Coleman 2021-03-18 00:21:38 UTC
As part of ensuring that alerts don't fire during upgrades, we see ThanosSidecarUnhealthy fire for the first Prometheus instance (prometheus-k8s-0):

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1372307576271671296

ThanosSidecarUnhealthy fired for 30 seconds with labels: {job="prometheus-k8s-thanos-sidecar", pod="prometheus-k8s-0", severity="critical"}

Grabbing the Prometheus data (the jobs have switched to grabbing prometheus-k8s-1), we can see that the value of this series:

thanos_sidecar_last_heartbeat_success_time_seconds{container="kube-rbac-proxy-thanos", endpoint="thanos-proxy", instance="10.129.2.24:10902", job="prometheus-k8s-thanos-sidecar", namespace="openshift-monitoring", pod="prometheus-k8s-0", service="prometheus-k8s-thanos-sidecar"}

is 0 briefly (first scrape, probably right after restart), and then takes on a timestamp.
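
For reference, a quick ad-hoc query (my own check, not part of any shipped rule) that surfaces the transient zero sample described above:

    # Returns the sidecar series whose last-heartbeat value is still 0,
    # i.e. the state observed right after the pod restarts.
    thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"} == 0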

When evaluated through the alert's rule expression

        time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) by (job, pod) >= 600

The value of the k8s-0 series above is briefly (time() - 0), which is >= 600.

The alert fires because there is no "for" on the alert (so it fires right away), but in that case the sidecar should not report 0 when it starts. It's likely we simply want a "for" of a reasonable minimum duration (for: 300s, with a >= 300s threshold), but another option is to have the series not reported until the sidecar has synced (though that wouldn't catch issues present from the start). I see that upstream the threshold is now 240s, but the alert still has no "for". A rough sketch of the first option is below.
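
A minimal sketch of what that could look like as a Prometheus rule file; the group name is made up, the 300s threshold and 5m "for" are simply the values suggested above (not what ships today), and the "> 0" filter is one way to ignore the initial zero sample on the query side:

    # Sketch only, under the assumptions stated above; not the rule that ships.
    groups:
    - name: thanos-sidecar-example.rules
      rules:
      - alert: ThanosSidecarUnhealthy
        # The "> 0" filter drops the transient zero sample from a freshly
        # restarted sidecar; the trade-off is that a sidecar which never
        # reports a heartbeat is also dropped (the caveat noted above).
        expr: |
          time()
            - max by (job, pod) (
                thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"} > 0
              )
          >= 300
        for: 5m
        labels:
          severity: critical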

Comment 1 Clayton Coleman 2021-03-18 00:22:11 UTC
Set to high because we are firing a trivial alert during an upgrade and confusing users.

Comment 2 Damien Grisonnet 2021-03-18 08:54:33 UTC
A fix for this particular issue was recently merged upstream (https://github.com/thanos-io/thanos/pull/3204); we will just need to bring it downstream.
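
Going only by the PR title ("Use sidecar's metric timestamp for healthcheck"), the upstream change presumably compares against the time the metric was last scraped rather than its raw (possibly zero) value. Roughly something like the following, though the exact upstream expression may differ:

    # Rough sketch inferred from the PR title, not the exact upstream expression:
    # timestamp() yields the time of the sample itself, so a just-restarted
    # sidecar no longer produces a huge (time() - 0) difference.
    time()
      - max by (job, pod) (
          timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})
        )
    >= 240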

Comment 3 Simon Pasquier 2021-03-18 09:01:56 UTC

*** This bug has been marked as a duplicate of bug 1921335 ***

