Bug 2033378 - Prometheus is not highly available
Summary: Prometheus is not highly available
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Test Framework
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.z
Assignee: W. Trevor King
QA Contact:
URL:
Whiteboard:
Depends On: 2030539
Blocks: 2033379
 
Reported: 2021-12-16 16:08 UTC by Ben Parees
Modified: 2021-12-20 13:56 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2030539
Environment:
Last Closed: 2021-12-16 16:12:49 UTC
Target Upstream Version:
Embargoed:



Description Ben Parees 2021-12-16 16:08:57 UTC
+++ This bug was initially created as a clone of Bug #2030539 +++

In a 4.7 -> 4.8 update [1]:

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success	1h26m47s
  Dec  8 19:23:38.677: Unexpected alerts fired or pending during the upgrade:

    Watchdog alert had missing intervals during the run, which may be a sign of a Prometheus outage in violation of the prometheus query SLO of 100% uptime during upgrade

Turns out this is fairly common in jobs run under a 4.8 test suite:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Watchdog+alert+had+missing+intervals&maxAge=336h&type=junit' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 14 runs, 50% failed, 14% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 14 runs, 100% failed, 86% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 14 runs, 57% failed, 38% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 14 runs, 100% failed, 64% of failures match = 64% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 111 runs, 40% failed, 11% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 111 runs, 45% failed, 24% of failures match = 11% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 14 runs, 100% failed, 21% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 14 runs, 86% failed, 8% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 7 runs, 100% failed, 29% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 111 runs, 38% failed, 29% of failures match = 11% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 40 runs, 88% failed, 11% of failures match = 10% impact
periodic-ci-openshift-release-master-okd-4.8-upgrade-from-okd-4.7-e2e-upgrade-gcp (all) - 7 runs, 100% failed, 29% of failures match = 29% impact
pull-ci-openshift-cluster-network-operator-release-4.8-e2e-agnostic-upgrade (all) - 13 runs, 77% failed, 40% of failures match = 31% impact
pull-ci-openshift-insights-operator-release-4.8-e2e-agnostic-upgrade (all) - 8 runs, 75% failed, 17% of failures match = 13% impact
pull-ci-openshift-kubernetes-release-4.8-e2e-azure-upgrade (all) - 7 runs, 71% failed, 20% of failures match = 14% impact
pull-ci-openshift-machine-config-operator-release-4.8-e2e-azure-upgrade (all) - 12 runs, 67% failed, 13% of failures match = 8% impact
pull-ci-openshift-origin-release-4.8-e2e-aws-upgrade (all) - 9 runs, 44% failed, 25% of failures match = 11% impact
pull-ci-openshift-ovn-kubernetes-release-4.8-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 6 runs, 100% failed, 17% of failures match = 17% impact
rehearse-24171-periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-launch-gcp-modern (all) - 1114 runs, 29% failed, 0% of failures match = 0% impact

In [1], and presumably in most of the others, it's because both Prom pods end up on the same node:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1468632049781837824/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/events.json | jq -r '.items | sort_by(.metadata.creationTimestamp)[] | select((.metadata.name | startswith("prometheus-k8s-")) and (.reason == "FailedScheduling" or .reason == "Scheduled" or .reason == "Killing")) | .metadata.creationTimestamp + " " + (.count | tostring) + " " + .involvedObject.name + " " + .reason + ": " + .message'
...
2021-12-08T18:36:12Z null prometheus-k8s-0 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-0 to ci-op-1zchqp42-3b3f8-7tnvx-worker-b-r8bf6
2021-12-08T18:36:12Z null prometheus-k8s-1 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ci-op-1zchqp42-3b3f8-7tnvx-worker-b-r8bf6
2021-12-08T19:05:12Z 1 prometheus-k8s-1 Killing: Stopping container prometheus
2021-12-08T19:05:12Z 1 prometheus-k8s-1 Killing: Stopping container kube-rbac-proxy-thanos
2021-12-08T19:05:12Z 1 prometheus-k8s-1 Killing: Stopping container prom-label-proxy
2021-12-08T19:05:12Z 1 prometheus-k8s-1 Killing: Stopping container kube-rbac-proxy
2021-12-08T19:05:13Z 1 prometheus-k8s-1 Killing: Stopping container prometheus-proxy
2021-12-08T19:05:15Z 1 prometheus-k8s-0 Killing: Stopping container prometheus
2021-12-08T19:05:15Z 1 prometheus-k8s-0 Killing: Stopping container kube-rbac-proxy-thanos
2021-12-08T19:05:15Z 1 prometheus-k8s-0 Killing: Stopping container prom-label-proxy
2021-12-08T19:05:16Z 1 prometheus-k8s-0 Killing: Stopping container kube-rbac-proxy
2021-12-08T19:05:16Z 1 prometheus-k8s-0 Killing: Stopping container prometheus-proxy
2021-12-08T19:05:21Z null prometheus-k8s-0 FailedScheduling: 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
2021-12-08T19:05:21Z null prometheus-k8s-1 FailedScheduling: 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
2021-12-08T19:05:22Z null prometheus-k8s-0 FailedScheduling: 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
2021-12-08T19:05:22Z null prometheus-k8s-1 FailedScheduling: 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
2021-12-08T19:10:54Z null prometheus-k8s-0 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-0 to ci-op-1zchqp42-3b3f8-7tnvx-worker-b-r8bf6
2021-12-08T19:10:54Z null prometheus-k8s-1 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ci-op-1zchqp42-3b3f8-7tnvx-worker-b-r8bf6
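
For a live cluster (rather than CI artifacts), a quick check for the same co-location, assuming the default openshift-monitoring namespace and the prometheus-k8s-* pod names shown above:

$ oc -n openshift-monitoring get pods -o wide | grep '^prometheus-k8s-'
# Both pods reporting the same NODE column is the bad case; the events above
# show the same thing after the fact.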

Bug 2021097 went out in 4.9.8 with some operator smarts for separating co-located Prometheus pods.  But it looks like the PDBs that would actually lock in that separation are still unique to master/4.10:

cluster-monitoring-operator$ git log --oneline origin/master | grep PDB
49960048 Bug 1955489: enable hard-anti affinity and PDB for Alertmanager (#1489)
b9b7644e *: enable hard anti-affinity + PDB for Prometheus and Ruler
39addeb4 pkg/{manifests,tasks}: Remove PDB for prometheus and alertmanager
f056c723 pkg/tasks: fix creation of alertmanager PDB
cddf199b pkg/{client,manifests,tasks}: apply PDB for alertmanager and promethei
6fd19a23 pkg: wire prometheus-adapter PDB
cluster-monitoring-operator $ git log --oneline origin/release-4.9 | grep PDB
39addeb4 pkg/{manifests,tasks}: Remove PDB for prometheus and alertmanager
f056c723 pkg/tasks: fix creation of alertmanager PDB
cddf199b pkg/{client,manifests,tasks}: apply PDB for alertmanager and promethei
6fd19a23 pkg: wire prometheus-adapter PDB
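
As a quick sanity check on any given cluster (resource names here are the openshift-monitoring defaults; adjust as needed), you can see whether a Prometheus PDB and hard anti-affinity actually landed:

$ oc -n openshift-monitoring get poddisruptionbudget
# No prometheus-k8s entry means nothing is locking the replicas apart.
$ oc -n openshift-monitoring get statefulset prometheus-k8s \
    -o jsonpath='{.spec.template.spec.affinity.podAntiAffinity}{"\n"}'
# Empty output or only "preferred" (soft) terms means the scheduler is still
# free to co-locate both replicas, as it did in [1].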

So it's possibly mostly luck (and the bug 2021097 fix?) that keeps Prom more HA on 4.9+.  The reason 4.8 jobs have trouble while 4.7 jobs do not is that the test is new in 4.8 [2]:

origin$ git diff origin/release-4.7..origin/release-4.8 -- test/extended/prometheus/prometheus.go | grep -2 'no gaps'
 
-       g.It("shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured", func() {
+       g.It("shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing", func() {
                if len(os.Getenv("TEST_UNSUPPORTED_ALLOW_VERSION_SKEW")) > 0 {
                        e2eskipper.Skipf("Test is disabled to allow cluster components to have different versions, and skewed versions trigger multiple other alerts")
--
-                       fmt.Sprintf(`count_over_time(ALERTS{alertstate="firing",alertname="Watchdog", severity="none"}[%s])`, testDuration): true,
+               // Invariant: The watchdog alert should be firing continuously during the whole test via the thanos
+               // querier (which should have no gaps when it queries the individual stores). Allow zero or one changes
+               // to the presence of this series (zero if data is preserved over test, one if data is lost over test).
+               // This would not catch the alert stopping firing, but we catch that in other places and tests.
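
For context, the check boils down to watching the synthetic ALERTS{alertname="Watchdog"} series through the Thanos querier for gaps. A rough way to poke at it by hand (not the exact expression origin uses; assumes a token that is allowed to query the thanos-querier route, with the prometheus-k8s service account used only as an example, and 'oc sa get-token' replaced by 'oc create token' on newer clients):

$ TOKEN="$(oc -n openshift-monitoring sa get-token prometheus-k8s)"
$ HOST="$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')"
$ curl -skH "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
    --data-urlencode 'query=count_over_time(ALERTS{alertstate="firing",alertname="Watchdog",severity="none"}[1h])'
# Gaps in that series over the test window are what trip the new assertion.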

Possible mitigations:

a. Backport more of the HA logic so that 4.8 and updates into it are less likely to experience disruption.
b. Drop the Watchdog-continuity test from the 4.8 origin suite.  We're now a good ways into 4.8.z, and folks don't seem all that mad about occasional, few-minute Prom outages when they're unlucky enough to have colocated Prom pods.  When they eventually update to 4.9, the bug 2021097 fix will help them pry the Prom pods apart, after which they won't even have those short outages.
c. Some kind of openshift/release logic to pry the Prom pods apart and put them on separate nodes before launching the test suite (a rough sketch of what that could look like follows this list).
d. Other?
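
For (c), a rough sketch of what such a pre-test step could look like. This is not an existing openshift/release step; names are the openshift-monitoring defaults, and without hard anti-affinity it may need to cordon the original node to be reliable:

#!/bin/bash
# Hypothetical pre-test step: if both Prometheus replicas share a node,
# delete one and let the scheduler (hopefully) place it elsewhere.
set -euo pipefail
node0="$(oc -n openshift-monitoring get pod prometheus-k8s-0 -o jsonpath='{.spec.nodeName}')"
node1="$(oc -n openshift-monitoring get pod prometheus-k8s-1 -o jsonpath='{.spec.nodeName}')"
if [ -n "${node0}" ] && [ "${node0}" = "${node1}" ]; then
  # With only soft anti-affinity the replacement can land right back on the
  # same node (as the events above show), so a real step might have to
  # cordon the node first (oc adm cordon) and uncordon it afterwards.
  oc -n openshift-monitoring delete pod prometheus-k8s-1
  oc -n openshift-monitoring wait --for=condition=Ready pod/prometheus-k8s-1 --timeout=10m
fi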

I'm fine with all of these, but if nobody else has opinions, I think I'm split between (b) and (c).  (b) seems sufficient based on recent failure data, but (c) might help things like 4.7 -> 4.8 -> 4.9 jobs where (b) won't apply, and we're unlikely to want to soften 4.9 origin...

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1468632049781837824
[2]: https://github.com/openshift/origin/pull/26020

Comment 1 Ben Parees 2021-12-16 16:12:36 UTC
This is only happening in 4.8, so marking this 4.9 bug as verified.



 w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Watchdog+alert+had+missing+intervals&maxAge=336h&type=junit' | grep 'failures match' | sort
bash: w3m: command not found...
Install package 'w3m' to provide command 'w3m'? [N/y] y

Proceed with changes? [N/y] y

periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 14 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 14 runs, 79% failed, 27% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 14 runs, 100% failed, 86% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 118 runs, 35% failed, 10% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 118 runs, 67% failed, 14% of failures match = 9% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-uwm (all) - 7 runs, 43% failed, 33% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 14 runs, 100% failed, 7% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 14 runs, 93% failed, 23% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 7 runs, 100% failed, 43% of failures match = 43% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 117 runs, 65% failed, 22% of failures match = 15% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 32 runs, 100% failed, 16% of failures match = 16% impact
periodic-ci-openshift-release-master-okd-4.8-upgrade-from-okd-4.7-e2e-upgrade-gcp (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
pull-ci-openshift-cluster-network-operator-release-4.8-e2e-agnostic-upgrade (all) - 13 runs, 77% failed, 40% of failures match = 31% impact
pull-ci-openshift-kubernetes-release-4.8-e2e-azure-upgrade (all) - 11 runs, 55% failed, 50% of failures match = 27% impact
pull-ci-openshift-origin-release-4.8-e2e-aws-upgrade (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
pull-ci-openshift-router-release-4.8-e2e-upgrade (all) - 6 runs, 67% failed, 25% of failures match = 17% impact
rehearse-24212-periodic-ci-openshift-release-master-ci-4.9-e2e-azure-techpreview-serial (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-origin-installer-launch-gcp-modern (all) - 1236 runs, 23% failed, 0% of failures match = 0% impact

