As part of ensuring alerts don't fire during upgrades, we see ThanosSidecarUnhealthy fire on the first instance: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1372307576271671296

ThanosSidecarUnhealthy fired for 30 seconds with labels:

  {job="prometheus-k8s-thanos-sidecar", pod="prometheus-k8s-0", severity="critical"}

Grabbing the Prometheus data (we have since switched to grabbing prometheus-k8s-1 in jobs), we can see that the value of this series:

  thanos_sidecar_last_heartbeat_success_time_seconds{container="kube-rbac-proxy-thanos", endpoint="thanos-proxy", instance="10.129.2.24:10902", job="prometheus-k8s-thanos-sidecar", namespace="openshift-monitoring", pod="prometheus-k8s-0", service="prometheus-k8s-thanos-sidecar"}

is briefly 0 (on the first scrape, probably right after the restart) and then takes on a timestamp.

When evaluated through the rule for the alert:

  time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) by (job, pod) >= 600

the value of the k8s-0 series above is briefly (time() - 0), which is >= 600. The alert fires because there is no "for" clause on the alert, so it fires right away; but even so, the sidecar should not report 0 when it starts. It's likely we simply want a "for" of a reasonable minimum time period (e.g. for: 300s paired with >= 300s), but another option is to not report the series until the sidecar has synced (though that wouldn't catch up-front issues). I see the upstream threshold is now 240s, but the alert still has no "for".
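To make the suggestion above concrete, the alerting rule could carry a "for" clause roughly like the sketch below. This is illustrative only: the group name is made up, the expression and severity are taken from the quoted data above, and "for: 5m" reflects the 300s suggestion rather than whatever the shipped rule actually uses.

  groups:
  - name: thanos-sidecar.rules          # illustrative group name
    rules:
    - alert: ThanosSidecarUnhealthy
      # Same expression as quoted above; the "for" clause means the
      # condition must hold continuously for 5 minutes before the alert
      # fires, so the transient (time() - 0) sample right after a
      # sidecar restart no longer trips it on its own.
      expr: |
        time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) by (job, pod) >= 600
      for: 5m
      labels:
        severity: critical

With a 5-minute "for", the brief window where the metric is 0 after restart would have to persist across several evaluation intervals before the alert fires, which the 30-second blip seen in the CI run would not.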
Setting this to high because we are firing a spurious alert during an upgrade and confusing users.
A fix for this particular issue was recently merged upstream (https://github.com/thanos-io/thanos/pull/3204); we will just need to bring it downstream.
*** This bug has been marked as a duplicate of bug 1921335 ***