In a 4.7 -> 4.8 update [1]:

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success 1h26m47s
  Dec 8 19:23:38.677: Unexpected alerts fired or pending during the upgrade:

    Watchdog alert had missing intervals during the run, which may be a sign of a Prometheus outage in violation of the prometheus query SLO of 100% uptime during upgrade

Turns out this is fairly common in jobs run under a 4.8 test suite:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Watchdog+alert+had+missing+intervals&maxAge=336h&type=junit' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 14 runs, 50% failed, 14% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 14 runs, 100% failed, 86% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 14 runs, 57% failed, 38% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 14 runs, 100% failed, 64% of failures match = 64% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 111 runs, 40% failed, 11% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 111 runs, 45% failed, 24% of failures match = 11% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 14 runs, 100% failed, 21% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 14 runs, 86% failed, 8% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 7 runs, 100% failed, 29% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 111 runs, 38% failed, 29% of failures match = 11% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 40 runs, 88% failed, 11% of failures match = 10% impact
periodic-ci-openshift-release-master-okd-4.8-upgrade-from-okd-4.7-e2e-upgrade-gcp (all) - 7 runs, 100% failed, 29% of failures match = 29% impact
pull-ci-openshift-cluster-network-operator-release-4.8-e2e-agnostic-upgrade (all) - 13 runs, 77% failed, 40% of failures match = 31% impact
pull-ci-openshift-insights-operator-release-4.8-e2e-agnostic-upgrade (all) - 8 runs, 75% failed, 17% of failures match = 13% impact
pull-ci-openshift-kubernetes-release-4.8-e2e-azure-upgrade (all) - 7 runs, 71% failed, 20% of failures match = 14% impact
pull-ci-openshift-machine-config-operator-release-4.8-e2e-azure-upgrade (all) - 12 runs, 67% failed, 13% of failures match = 8% impact
pull-ci-openshift-origin-release-4.8-e2e-aws-upgrade (all) - 9 runs, 44% failed, 25% of failures match = 11% impact
pull-ci-openshift-ovn-kubernetes-release-4.8-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 6 runs, 100% failed, 17% of failures match = 17% impact
rehearse-24171-periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-launch-gcp-modern (all) - 1114 runs, 29% failed, 0% of failures match = 0% impact

In [1], and presumably most of the others, it's because both Prom pods end up on the same node:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1468632049781837824/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/events.json | jq -r '.items | sort_by(.metadata.creationTimestamp)[] | select((.metadata.name | startswith("prometheus-k8s-")) and (.reason == "FailedScheduling" or .reason == "Scheduled" or .reason == "Killing")) | .metadata.creationTimestamp + " " + (.count | tostring) + " " + .involvedObject.name + " " + .reason + ": " + .message'
...
2021-12-08T18:36:12Z null prometheus-k8s-0 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-0 to ci-op-1zchqp42-3b3f8-7tnvx-worker-b-r8bf6
2021-12-08T18:36:12Z null prometheus-k8s-1 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ci-op-1zchqp42-3b3f8-7tnvx-worker-b-r8bf6
2021-12-08T19:05:12Z 1 prometheus-k8s-1 Killing: Stopping container prometheus
2021-12-08T19:05:12Z 1 prometheus-k8s-1 Killing: Stopping container kube-rbac-proxy-thanos
2021-12-08T19:05:12Z 1 prometheus-k8s-1 Killing: Stopping container prom-label-proxy
2021-12-08T19:05:12Z 1 prometheus-k8s-1 Killing: Stopping container kube-rbac-proxy
2021-12-08T19:05:13Z 1 prometheus-k8s-1 Killing: Stopping container prometheus-proxy
2021-12-08T19:05:15Z 1 prometheus-k8s-0 Killing: Stopping container prometheus
2021-12-08T19:05:15Z 1 prometheus-k8s-0 Killing: Stopping container kube-rbac-proxy-thanos
2021-12-08T19:05:15Z 1 prometheus-k8s-0 Killing: Stopping container prom-label-proxy
2021-12-08T19:05:16Z 1 prometheus-k8s-0 Killing: Stopping container kube-rbac-proxy
2021-12-08T19:05:16Z 1 prometheus-k8s-0 Killing: Stopping container prometheus-proxy
2021-12-08T19:05:21Z null prometheus-k8s-0 FailedScheduling: 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
2021-12-08T19:05:21Z null prometheus-k8s-1 FailedScheduling: 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
2021-12-08T19:05:22Z null prometheus-k8s-0 FailedScheduling: 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
2021-12-08T19:05:22Z null prometheus-k8s-1 FailedScheduling: 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
2021-12-08T19:10:54Z null prometheus-k8s-0 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-0 to ci-op-1zchqp42-3b3f8-7tnvx-worker-b-r8bf6
2021-12-08T19:10:54Z null prometheus-k8s-1 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ci-op-1zchqp42-3b3f8-7tnvx-worker-b-r8bf6

Bug 2021097 went out in 4.9.8 with some operator smarts for separating co-located Prometheus pods.
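For reference, the same co-location check can be run against a live cluster instead of the gather-extra artifacts. A minimal sketch, assuming oc access and the stock openshift-monitoring namespace:

  # Show which node each prometheus-k8s pod landed on; identical NODE values
  # mean a single node drain or reboot takes down both replicas at once.
  $ oc -n openshift-monitoring get pods -o wide | grep '^prometheus-k8s-'

  # And the scheduling events, analogous to the jq query above:
  $ oc -n openshift-monitoring get events | grep -E 'prometheus-k8s.*(Scheduled|FailedScheduling|Killing)'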
But it looks like the PDBs that would actually lock in that separation are still unique to master/4.10:

cluster-monitoring-operator$ git log --oneline origin/master | grep PDB
49960048 Bug 1955489: enable hard-anti affinity and PDB for Alertmanager (#1489)
b9b7644e *: enable hard anti-affinity + PDB for Prometheus and Ruler
39addeb4 pkg/{manifests,tasks}: Remove PDB for prometheus and alertmanager
f056c723 pkg/tasks: fix creation of alertmanager PDB
cddf199b pkg/{client,manifests,tasks}: apply PDB for alertmanager and promethei
6fd19a23 pkg: wire prometheus-adapter PDB
cluster-monitoring-operator$ git log --oneline origin/release-4.9 | grep PDB
39addeb4 pkg/{manifests,tasks}: Remove PDB for prometheus and alertmanager
f056c723 pkg/tasks: fix creation of alertmanager PDB
cddf199b pkg/{client,manifests,tasks}: apply PDB for alertmanager and promethei
6fd19a23 pkg: wire prometheus-adapter PDB

So it's possibly mostly luck (and the bug 2021097 fix?) that keeps Prom more HA on 4.9+.

The reason 4.8 jobs have trouble while 4.7 jobs do not is that the test is new in 4.8 [2]:

origin$ git diff origin/release-4.7..origin/release-4.8 -- test/extended/prometheus/prometheus.go | grep -2 'no gaps'
- g.It("shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured", func() {
+ g.It("shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing", func() {
    if len(os.Getenv("TEST_UNSUPPORTED_ALLOW_VERSION_SKEW")) > 0 {
      e2eskipper.Skipf("Test is disabled to allow cluster components to have different versions, and skewed versions trigger multiple other alerts")
--
-   fmt.Sprintf(`count_over_time(ALERTS{alertstate="firing",alertname="Watchdog", severity="none"}[%s])`, testDuration): true,
+   // Invariant: The watchdog alert should be firing continuously during the whole test via the thanos
+   // querier (which should have no gaps when it queries the individual stores). Allow zero or one changes
+   // to the presence of this series (zero if data is preserved over test, one if data is lost over test).
+   // This would not catch the alert stopping firing, but we catch that in other places and tests.

Possible mitigations:

a. Backport more of the HA logic so that 4.8 and updates into it are less likely to experience disruption.
b. Drop the Watchdog-continuity test from the 4.8 origin suite. We're now a good ways into 4.8.z, and folks don't seem all that mad about occasional, few-minute Prom outages when they're unlucky enough to have co-located Prom pods. When they eventually update to 4.9, the bug 2021097 fix will help them pry the Prom pods apart, after which they won't even have those short outages.
c. Some kind of openshift/release logic to pry the Prom pods apart and put them on separate nodes before launching the test suite (see the sketch at the end of this comment).
d. Other?

I'm fine with all of these, but if nobody else has opinions, I'm split between (b) and (c). (b) seems sufficient based on recent failure data, but (c) might help things like 4.7 -> 4.8 -> 4.9 jobs where (b) won't apply, and we're unlikely to want to soften 4.9 origin...

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1468632049781837824
[2]: https://github.com/openshift/origin/pull/26020
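If we go with (c), a very rough sketch of what the pre-test step could look like (hypothetical; the node name is an example, and without the hard anti-affinity/PDB bits from master the rescheduled pod is not guaranteed to land elsewhere, particularly if Prometheus is backed by zonal PVs):

  # 1. See whether both replicas share a node.
  $ oc -n openshift-monitoring get pods -o wide | grep '^prometheus-k8s-'

  # 2. If so, cordon the shared node so the evicted replica cannot come back,
  #    delete one replica, wait for it to reschedule, then uncordon.
  #    (ci-op-example-worker-b-r8bf6 is a placeholder node name.)
  $ oc adm cordon ci-op-example-worker-b-r8bf6
  $ oc -n openshift-monitoring delete pod prometheus-k8s-1
  $ sleep 30  # give the statefulset controller time to recreate the pod
  $ oc -n openshift-monitoring wait --for=condition=Ready pod/prometheus-k8s-1 --timeout=10m
  $ oc adm uncordon ci-op-example-worker-b-r8bf6

  # 3. Sanity-check whether this release ships PDBs for the monitoring pods at all:
  $ oc -n openshift-monitoring get pdb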
CI-search results from comment 0 show only 4.8 is impacted. Closing this 4.10.0 bug CURRENTRELEASE; 4.8 fix is happening via bug 2033379.
This is only happening in 4.8, so marking this 4.10 bug as verified.

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Watchdog+alert+had+missing+intervals&maxAge=336h&type=junit' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 14 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 14 runs, 79% failed, 27% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 14 runs, 100% failed, 86% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 118 runs, 35% failed, 10% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 118 runs, 67% failed, 14% of failures match = 9% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-uwm (all) - 7 runs, 43% failed, 33% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 14 runs, 100% failed, 7% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 14 runs, 93% failed, 23% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 7 runs, 100% failed, 43% of failures match = 43% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 117 runs, 65% failed, 22% of failures match = 15% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 32 runs, 100% failed, 16% of failures match = 16% impact
periodic-ci-openshift-release-master-okd-4.8-upgrade-from-okd-4.7-e2e-upgrade-gcp (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
pull-ci-openshift-cluster-network-operator-release-4.8-e2e-agnostic-upgrade (all) - 13 runs, 77% failed, 40% of failures match = 31% impact
pull-ci-openshift-kubernetes-release-4.8-e2e-azure-upgrade (all) - 11 runs, 55% failed, 50% of failures match = 27% impact
pull-ci-openshift-origin-release-4.8-e2e-aws-upgrade (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
pull-ci-openshift-router-release-4.8-e2e-upgrade (all) - 6 runs, 67% failed, 25% of failures match = 17% impact
rehearse-24212-periodic-ci-openshift-release-master-ci-4.9-e2e-azure-techpreview-serial (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-origin-installer-launch-gcp-modern (all) - 1236 runs, 23% failed, 0% of failures match = 0% impact
Going to re-open and re-use this bug; I think we are seeing it on 4.10, but the message has changed slightly:

  Watchdog alert had 2 changes during the run, which may be a sign of a Prometheus outage

See: https://search.ci.openshift.org/?search=Watchdog+alert+had&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10.*upgrade.*ovn&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Slack discussion with the monitoring team: https://coreos.slack.com/archives/C0VMT03S5/p1642430954105300
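To confirm what the test is seeing, the Watchdog series can also be queried directly through the thanos-querier route. A minimal sketch, assuming the stock openshift-monitoring setup and a token that is allowed to query cluster metrics:

  $ TOKEN="$(oc whoami -t)"
  $ HOST="$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')"
  # One sample per evaluation interval is expected; a noticeably lower count
  # over the window (or an empty result) points at the kind of gap the test flags.
  $ curl -G -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
      --data-urlencode 'query=count_over_time(ALERTS{alertname="Watchdog",alertstate="firing",severity="none"}[1h])'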
After investigating and discussing with Philip, the bug is triggered by the fact that, during the 4.9 -> 4.10 upgrade, the Prometheus operator has to delete and recreate the prometheus-k8s statefulset. The reason is that the version of the Prometheus operator shipped in 4.10 removes the "app" label selector in the statefulset spec [1]. Statefulsets don't support (on purpose) updating the selector field, so when the operator encounters a "Forbidden" error during sts updates, it deletes the statefulset [2] and restarts the reconciliation (which should recreate the sts).

Looking at the operator's logs, it appears to be stuck in a hot-loop, trying to delete the sts over and over [3]:

* At 06:10:36, the prometheus operator deletes the prometheus-k8s sts.
* From 06:10:37 to 06:12:13, the operator repeatedly deletes the sts and the kube-controller-manager logs lots of "error syncing StatefulSet openshift-monitoring/prometheus-k8s, requeuing: object is being deleted".
* At 06:12:54, the situation stabilizes, which matches the kube-controller-manager logging "StatefulSet has been deleted openshift-monitoring/prometheus-k8s".

Looking at the logs from the prometheus pods, we see that the restart is delayed by almost 2 minutes:

* At 06:10:35, k8s-0 and k8s-1 are stopped.
* At 06:13:15, k8s-1 starts.
* At 06:14:19, config loaded for k8s-1.
* At 06:15:13, k8s-0 starts.
* At 06:16:16, tsdb started + config loaded.

The time between the teardown of the pods and the pods being back exceeds the 5-minute metric-staleness interval, which explains the failure of the test. Effectively, for at least 5 minutes there is no Watchdog alert, since rule evaluation never happened.

The hot-loop can be explained: because the operator returns right after requesting the deletion, the sts resource may still be present on the following run (e.g. the finalizers haven't been processed yet), and when the operator syncs again, it sees the sts still present and tries to delete it again. The repeated requests may degrade the kube-controller-manager and the kubelet, as we can see lots of attempts to start the pods during this period. Instead, the operator should wait for the dependent resources to be cleaned up and the sts resource to be effectively deleted before proceeding. That wouldn't avoid the simultaneous unavailability of the prometheus pods, but it would hopefully remove the unnecessary load.

[1] https://github.com/prometheus-operator/prometheus-operator/pull/4350
[2] https://github.com/prometheus-operator/prometheus-operator/blob/d6707eddf10f9cddb7a26dfc90f4f69ff8053c11/pkg/prometheus/operator.go#L1298-L1313
[3] https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221642399860000%22,%221642400230000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade%2F1482940245992280064%5C%22%7D%20%7C%20unpack%20%20%7C%20container_name%3D%5C%22prometheus-operator%5C%22%22,%22refId%22:%22A%22%7D%5D
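A rough sketch for spotting the same hot-loop on a live cluster (the exact log wording differs between operator versions, so the greps are deliberately loose; in CI the same can be done against the gathered pod logs):

  # Dozens of delete/recreate mentions within a couple of minutes match the
  # behaviour described above.
  $ oc -n openshift-monitoring logs deploy/prometheus-operator -c prometheus-operator \
      | grep -Eic 'statefulset.*(delete|recreat)'

  # The kube-controller-manager side of the loop: repeated requeues while the
  # sts is being deleted. (<master-node> is a placeholder for one of the
  # control-plane node names.)
  $ oc -n openshift-kube-controller-manager logs kube-controller-manager-<master-node> -c kube-controller-manager \
      | grep -c 'error syncing StatefulSet openshift-monitoring/prometheus-k8s'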
*** Bug 2042441 has been marked as a duplicate of this bug. ***
Searched with https://search.ci.openshift.org/?search=Watchdog+alert+had&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10.*upgrade.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job and can still see "[bz-monitoring][invariant] alert/Watchdog must have no gaps or changes" failing, for example:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1490427103387062272

  [bz-monitoring][invariant] alert/Watchdog must have no gaps or changes 0s
  Watchdog alert had 15595 changes during the run, which may be a sign of a Prometheus outage in violation of the prometheus query SLO of 100% uptime

  Feb 06 21:26:46.385 - 2100s E alert/Watchdog ns/openshift-monitoring ALERTS{alertname="Watchdog", alertstate="firing", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="none"}
  Feb 06 22:03:58.385 - 2628s E alert/Watchdog ns/openshift-monitoring ALERTS{alertname="Watchdog", alertstate="firing", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="none"}
Due to a change[1] upstream in prometheus-operator, the StatefulSet cannot be patched and must be deleted and recreated, since a selector label was deleted. With this in mind, there is no way we can guarantee 100% uptime during this particular upgrade.

As mitigation against this particular problem, we have added comments to the upstream codebase[2] to warn about the impact of changing labels and thus having the operator recreate the StatefulSet. The PR linked here will simply avoid a hot-loop of re-entering the recreate flow and thus prolonging the period of downtime, which I believe is the best we can achieve given the circumstances.

Factors that will contribute to downtime include the following (a rough way to estimate the first two is sketched at the end of this comment):

1. Time taken to reschedule Pods
2. Length of time it takes to replay the WAL
3. Time it takes for initial scrapes
4. Rule evaluation

The test that is failing in this instance looks at the Watchdog alert and, taking staleness[3] into account, will only trigger if factors 1-4 listed above take longer than 5m total.

[1] - https://github.com/prometheus-operator/prometheus-operator/pull/4350
[2] - https://github.com/prometheus-operator/prometheus-operator/blob/9fb1cf61dedf209dfcad8230ee963862277313a8/pkg/prometheus/statefulset.go#L663-L666
[3] - https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness
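A minimal sketch for estimating factors 1 and 2 after the fact (event and log wording varies a bit across versions):

  # Factor 1: rescheduling time - compare the Killing / Scheduled / Started
  # event timestamps for the prometheus-k8s pods.
  $ oc -n openshift-monitoring get events --sort-by=.metadata.creationTimestamp \
      | grep -E 'prometheus-k8s.*(Killing|Scheduled|Started)'

  # Factor 2: WAL replay time - recent Prometheus releases log the replay
  # duration at startup.
  $ oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus | grep -i 'replay'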
Checked CI jobs with https://search.ci.openshift.org/?search=Watchdog+alert+had&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10.*upgrade.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job. Although we can still see "[bz-monitoring][invariant] alert/Watchdog must have no gaps or changes" failures, checking the prometheus operator's logs shows it did not get stuck in a hot-loop trying to delete the sts over and over. Setting this bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056