Bug 2030539
| Summary: | Prometheus is not highly available | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Monitoring | Assignee: | Philip Gough <pgough> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.10 | CC: | amuller, anpicker, aos-bugs, bparees, dgoodwin, erooth, hongyli, kai-uwe.rommel |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2033378 (view as bug list) | Environment: | |
| Last Closed: | 2022-03-10 16:32:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2033378, 2052201 | | |
Description
W. Trevor King
2021-12-09 04:56:55 UTC
CI-search results from comment 0 show only 4.8 is impacted. Closing this 4.10.0 bug CURRENTRELEASE; the 4.8 fix is happening via bug 2033379.

This is only happening in 4.8, so marking this 4.10 bug as verified.

```
w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Watchdog+alert+had+missing+intervals&maxAge=336h&type=junit' | grep 'failures match' | sort
bash: w3m: command not found...
Install package 'w3m' to provide command 'w3m'? [N/y] y
Proceed with changes? [N/y] y
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 14 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 14 runs, 79% failed, 27% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 14 runs, 100% failed, 86% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 118 runs, 35% failed, 10% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 118 runs, 67% failed, 14% of failures match = 9% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-uwm (all) - 7 runs, 43% failed, 33% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 14 runs, 100% failed, 7% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 14 runs, 93% failed, 23% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 7 runs, 100% failed, 43% of failures match = 43% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 117 runs, 65% failed, 22% of failures match = 15% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 32 runs, 100% failed, 16% of failures match = 16% impact
periodic-ci-openshift-release-master-okd-4.8-upgrade-from-okd-4.7-e2e-upgrade-gcp (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
pull-ci-openshift-cluster-network-operator-release-4.8-e2e-agnostic-upgrade (all) - 13 runs, 77% failed, 40% of failures match = 31% impact
pull-ci-openshift-kubernetes-release-4.8-e2e-azure-upgrade (all) - 11 runs, 55% failed, 50% of failures match = 27% impact
pull-ci-openshift-origin-release-4.8-e2e-aws-upgrade (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
pull-ci-openshift-router-release-4.8-e2e-upgrade (all) - 6 runs, 67% failed, 25% of failures match = 17% impact
rehearse-24212-periodic-ci-openshift-release-master-ci-4.9-e2e-azure-techpreview-serial (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-origin-installer-launch-gcp-modern (all) - 1236 runs, 23% failed, 0% of failures match = 0% impact
```

Going to re-open and re-use this bug; I think we are seeing it on 4.10, but the message has changed slightly.
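(Aside: the "impact" column in the output above appears to be the failure rate multiplied by the match rate, e.g. 79% failed x 27% of failures matching gives roughly 21% impact. This is an inference from the numbers shown, not documented search.ci.openshift.org behavior; a quick sketch of that assumed arithmetic:)

```go
// Assumed arithmetic for the CI-search summary lines above: "impact" looks
// like the fraction of all runs whose failure matched the search query,
// i.e. failed fraction * match fraction. Inferred from the numbers, not
// from any documented behavior of the search tool.
package main

import "fmt"

func main() {
	// periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade:
	// 14 runs, 79% failed, 27% of failures match
	failed, match := 0.79, 0.27
	fmt.Printf("impact = %.0f%%\n", failed*match*100) // prints "impact = 21%"
}
```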
"Watchdog alert had 2 changes during the run, which may be a sign of a Prometheus outage"

See: https://search.ci.openshift.org/?search=Watchdog+alert+had&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10.*upgrade.*ovn&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Slack discussion with the monitoring team: https://coreos.slack.com/archives/C0VMT03S5/p1642430954105300

After investigating and discussing with Philip, the bug is triggered by the fact that during the 4.9 > 4.10 upgrade, the Prometheus operator has to delete and recreate the prometheus-k8s statefulset. The reason is that the version of the Prometheus operator shipped in 4.10 removes the "app" label selector from the statefulset spec [1]. Statefulsets don't support (on purpose) updating the selectors field, so when the operator encounters a "Forbidden" error during sts updates, it deletes the statefulset [2] and restarts the reconciliation (which should recreate the sts).

Looking at the operator's logs, it appears to be stuck in a hot-loop trying to delete the sts over and over [3]:

* At 06:10:36, the prometheus operator deletes the prometheus-k8s sts.
* From 06:10:37 to 06:12:13, the operator repeatedly deletes the sts and the kube-controller-manager logs lots of "error syncing StatefulSet openshift-monitoring/prometheus-k8s, requeuing: object is being deleted".
* At 06:12:54, the situation stabilizes, which matches the kube-controller-manager logging "StatefulSet has been deleted openshift-monitoring/prometheus-k8s".

Looking at the logs from the prometheus pods, we see that the restart is delayed by almost 2 minutes:

* At 06:10:35, k8s-0 and k8s-1 are stopped.
* At 06:13:15, k8s-1 starts.
* At 06:14:19, the config is loaded for k8s-1.
* At 06:15:13, k8s-0 starts.
* At 06:16:16, the TSDB is started and the config is loaded.

The time between the teardown of the pods and the pods being back exceeds the 5-minute interval for metric staleness, which explains the failure of the test. Effectively, for at least 5 minutes there is no Watchdog alert since the rule evaluation never happened.

The hot-loop situation can be explained: because the operator returns right after requesting the deletion, the sts resource may still be present on the following run (e.g. the finalizers haven't been processed yet), and when the operator syncs again, it sees the sts still present and tries to delete it again. The repeated requests may degrade the kube-controller-manager and the kubelet, as we can see lots of attempts to start the pods during this period. Instead, the operator should wait for the dependent resources to be cleaned up and the sts resource to be effectively deleted before proceeding (see the sketch after this comment). It wouldn't avoid the simultaneous unavailability of the prometheus pods, but it would hopefully remove the unnecessary load.

[1] https://github.com/prometheus-operator/prometheus-operator/pull/4350
[2] https://github.com/prometheus-operator/prometheus-operator/blob/d6707eddf10f9cddb7a26dfc90f4f69ff8053c11/pkg/prometheus/operator.go#L1298-L1313
[3] https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221642399860000%22,%221642400230000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade%2F1482940245992280064%5C%22%7D%20%7C%20unpack%20%20%7C%20container_name%3D%5C%22prometheus-operator%5C%22%22,%22refId%22:%22A%22%7D%5D

*** Bug 2042441 has been marked as a duplicate of this bug. ***
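The sketch referenced above: this is not the actual upstream fix, just a minimal client-go illustration of "request the deletion once, then wait until the StatefulSet is effectively gone before reconciling again" instead of re-issuing deletes in a hot loop. The function name, polling interval, and timeout are illustrative assumptions.

```go
// Sketch only: not the prometheus-operator code, just an illustration of
// waiting for a StatefulSet deletion to complete (finalizers processed,
// object gone from the API server) before the next reconciliation recreates it.
package example

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// deleteStatefulSetAndWait issues a single delete request and then polls until
// the API server reports the object as not found, so the caller only recreates
// the StatefulSet once the old one is really gone.
func deleteStatefulSetAndWait(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	err := client.AppsV1().StatefulSets(namespace).Delete(ctx, name, metav1.DeleteOptions{})
	if err != nil && !apierrors.IsNotFound(err) {
		return fmt.Errorf("requesting deletion of statefulset %s/%s: %w", namespace, name, err)
	}

	// Poll instead of deleting again: the object may linger while its
	// dependents and finalizers are processed.
	return wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
		_, err := client.AppsV1().StatefulSets(namespace).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // fully deleted, safe to recreate
		}
		if err != nil {
			return false, err
		}
		return false, nil // still terminating, keep waiting
	})
}
```

Another option along the same lines would be to request deletion with metav1.DeletePropagationForeground, so the StatefulSet object only disappears once its dependents are cleaned up; either way, the point is to observe the deletion completing rather than re-issuing delete requests.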
Searched with https://search.ci.openshift.org/?search=Watchdog+alert+had&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10.*upgrade.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job and can still see "[bz-monitoring][invariant] alert/Watchdog must have no gaps or changes" failing, for example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1490427103387062272

```
[bz-monitoring][invariant] alert/Watchdog must have no gaps or changes  0s
Watchdog alert had 15595 changes during the run, which may be a sign of a Prometheus outage in violation of the prometheus query SLO of 100% uptime
Feb 06 21:26:46.385 - 2100s E alert/Watchdog ns/openshift-monitoring ALERTS{alertname="Watchdog", alertstate="firing", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="none"}
Feb 06 22:03:58.385 - 2628s E alert/Watchdog ns/openshift-monitoring ALERTS{alertname="Watchdog", alertstate="firing", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="none"}
```

Due to a change [1] upstream in prometheus-operator, the StatefulSet cannot be patched and must be deleted and recreated, since a selector label was deleted. With this in mind, there is no way we can guarantee 100% uptime during this particular upgrade. As mitigation against this particular problem, we have added comments to the upstream codebase [2] to warn about the impact of changing labels and thus having the operator recreate the StatefulSet. The PR linked here simply avoids a hot-loop of re-entering the recreate flow and thus prolonging the period of downtime, which I believe is the best we can achieve given the circumstances.

Factors that will contribute to downtime include the following:

1. Time taken to reschedule the Pods
2. Length of time it takes to replay the WAL
3. Time it takes for the initial scrapes
4. Rule evaluation

The test that is failing in this instance looks at the Watchdog alert and, taking staleness [3] into account, will only trigger if factors 1-4 above take longer than 5m in total.

[1] https://github.com/prometheus-operator/prometheus-operator/pull/4350
[2] https://github.com/prometheus-operator/prometheus-operator/blob/9fb1cf61dedf209dfcad8230ee963862277313a8/pkg/prometheus/statefulset.go#L663-L666
[3] https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness

Checked in CI jobs with https://search.ci.openshift.org/?search=Watchdog+alert+had&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10.*upgrade.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job. Although we can still see "[bz-monitoring][invariant] alert/Watchdog must have no gaps or changes" failures, the prometheus operator's logs show it is no longer stuck in a hot-loop trying to delete the sts over and over. Setting this bug to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
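For reference, a rough sketch of the kind of gap check the failing invariant performs: it is not the actual openshift/origin test code, just an illustration of querying the Watchdog ALERTS series over a window via the Prometheus query_range API and flagging gaps longer than the ~5m staleness window. The Prometheus URL, query window, step, and threshold are assumptions.

```go
// Illustration only: look for gaps longer than the staleness window in the
// firing Watchdog ALERTS series, using the standard Prometheus HTTP API.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

type rangeResponse struct {
	Data struct {
		Result []struct {
			Values [][]interface{} `json:"values"` // [unix_ts, "value"] pairs
		} `json:"result"`
	} `json:"data"`
}

func main() {
	promURL := "http://localhost:9090" // assumed; in-cluster access needs a route and token
	end := time.Now()
	start := end.Add(-2 * time.Hour)

	q := url.Values{}
	q.Set("query", `ALERTS{alertname="Watchdog",alertstate="firing"}`)
	q.Set("start", fmt.Sprintf("%d", start.Unix()))
	q.Set("end", fmt.Sprintf("%d", end.Unix()))
	q.Set("step", "30")

	resp, err := http.Get(promURL + "/api/v1/query_range?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var rr rangeResponse
	if err := json.NewDecoder(resp.Body).Decode(&rr); err != nil {
		panic(err)
	}

	const maxGap = 5 * time.Minute // roughly the staleness interval from [3]
	for _, series := range rr.Data.Result {
		for i := 1; i < len(series.Values); i++ {
			prev := time.Unix(int64(series.Values[i-1][0].(float64)), 0)
			cur := time.Unix(int64(series.Values[i][0].(float64)), 0)
			if gap := cur.Sub(prev); gap > maxGap {
				fmt.Printf("Watchdog gap of %s between %s and %s\n", gap, prev, cur)
			}
		}
	}
}
```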