In a 4.7 -> 4.8 update [1]:

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success 1h26m47s
  Dec 8 19:23:38.677: Unexpected alerts fired or pending during the upgrade:

    Watchdog alert had missing intervals during the run, which may be a sign of a Prometheus outage in violation of the prometheus query SLO of 100% uptime during upgrade

Turns out this is fairly common in jobs run under a 4.8 test suite:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Watchdog+alert+had+missing+intervals&maxAge=336h&type=junit' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 14 runs, 50% failed, 14% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 14 runs, 100% failed, 86% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 14 runs, 57% failed, 38% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 14 runs, 100% failed, 64% of failures match = 64% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 111 runs, 40% failed, 11% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 111 runs, 45% failed, 24% of failures match = 11% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 14 runs, 100% failed, 21% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 14 runs, 86% failed, 8% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 7 runs, 100% failed, 29% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 111 runs, 38% failed, 29% of failures match = 11% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 40 runs, 88% failed, 11% of failures match = 10% impact
periodic-ci-openshift-release-master-okd-4.8-upgrade-from-okd-4.7-e2e-upgrade-gcp (all) - 7 runs, 100% failed, 29% of failures match = 29% impact
pull-ci-openshift-cluster-network-operator-release-4.8-e2e-agnostic-upgrade (all) - 13 runs, 77% failed, 40% of failures match = 31% impact
pull-ci-openshift-insights-operator-release-4.8-e2e-agnostic-upgrade (all) - 8 runs, 75% failed, 17% of failures match = 13% impact
pull-ci-openshift-kubernetes-release-4.8-e2e-azure-upgrade (all) - 7 runs, 71% failed, 20% of failures match = 14% impact
pull-ci-openshift-machine-config-operator-release-4.8-e2e-azure-upgrade (all) - 12 runs, 67% failed, 13% of failures match = 8% impact
pull-ci-openshift-origin-release-4.8-e2e-aws-upgrade (all) - 9 runs, 44% failed, 25% of failures match = 11% impact
pull-ci-openshift-ovn-kubernetes-release-4.8-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 6 runs, 100% failed, 17% of failures match = 17% impact
rehearse-24171-periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-launch-gcp-modern (all) - 1114 runs, 29% failed, 0% of failures match = 0% impact

In [1], and presumably most of the others, it's because both Prom pods end up on the same node:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1468632049781837824/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/events.json | jq -r '.items | sort_by(.metadata.creationTimestamp)[] | select((.metadata.name | startswith("prometheus-k8s-")) and (.reason == "FailedScheduling" or .reason == "Scheduled" or .reason == "Killing")) | .metadata.creationTimestamp + " " + (.count | tostring) + " " + .involvedObject.name + " " + .reason + ": " + .message'
...
2021-12-08T18:36:12Z null prometheus-k8s-0 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-0 to ci-op-1zchqp42-3b3f8-7tnvx-worker-b-r8bf6
2021-12-08T18:36:12Z null prometheus-k8s-1 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ci-op-1zchqp42-3b3f8-7tnvx-worker-b-r8bf6
2021-12-08T19:05:12Z 1 prometheus-k8s-1 Killing: Stopping container prometheus
2021-12-08T19:05:12Z 1 prometheus-k8s-1 Killing: Stopping container kube-rbac-proxy-thanos
2021-12-08T19:05:12Z 1 prometheus-k8s-1 Killing: Stopping container prom-label-proxy
2021-12-08T19:05:12Z 1 prometheus-k8s-1 Killing: Stopping container kube-rbac-proxy
2021-12-08T19:05:13Z 1 prometheus-k8s-1 Killing: Stopping container prometheus-proxy
2021-12-08T19:05:15Z 1 prometheus-k8s-0 Killing: Stopping container prometheus
2021-12-08T19:05:15Z 1 prometheus-k8s-0 Killing: Stopping container kube-rbac-proxy-thanos
2021-12-08T19:05:15Z 1 prometheus-k8s-0 Killing: Stopping container prom-label-proxy
2021-12-08T19:05:16Z 1 prometheus-k8s-0 Killing: Stopping container kube-rbac-proxy
2021-12-08T19:05:16Z 1 prometheus-k8s-0 Killing: Stopping container prometheus-proxy
2021-12-08T19:05:21Z null prometheus-k8s-0 FailedScheduling: 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
2021-12-08T19:05:21Z null prometheus-k8s-1 FailedScheduling: 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
2021-12-08T19:05:22Z null prometheus-k8s-0 FailedScheduling: 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
2021-12-08T19:05:22Z null prometheus-k8s-1 FailedScheduling: 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
2021-12-08T19:10:54Z null prometheus-k8s-0 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-0 to ci-op-1zchqp42-3b3f8-7tnvx-worker-b-r8bf6
2021-12-08T19:10:54Z null prometheus-k8s-1 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ci-op-1zchqp42-3b3f8-7tnvx-worker-b-r8bf6

Bug 2021097 went out in 4.9.8 with some operator smarts for separating co-located Prometheus pods.
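For reference, the same co-location check can be run against a live cluster instead of the gather-extra artifacts. A minimal sketch, assuming oc access and the stock openshift-monitoring namespace:

  # Show which node each prometheus-k8s pod landed on; identical NODE values
  # mean a single node drain or reboot takes down both replicas at once.
  $ oc -n openshift-monitoring get pods -o wide | grep '^prometheus-k8s-'

  # And the scheduling events, analogous to the jq query above:
  $ oc -n openshift-monitoring get events | grep -E 'prometheus-k8s.*(Scheduled|FailedScheduling|Killing)'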
But it looks like the PDBs that would actually lock in that separation are still unique to master/4.10:

cluster-monitoring-operator$ git log --oneline origin/master | grep PDB
49960048 Bug 1955489: enable hard-anti affinity and PDB for Alertmanager (#1489)
b9b7644e *: enable hard anti-affinity + PDB for Prometheus and Ruler
39addeb4 pkg/{manifests,tasks}: Remove PDB for prometheus and alertmanager
f056c723 pkg/tasks: fix creation of alertmanager PDB
cddf199b pkg/{client,manifests,tasks}: apply PDB for alertmanager and promethei
6fd19a23 pkg: wire prometheus-adapter PDB
cluster-monitoring-operator$ git log --oneline origin/release-4.9 | grep PDB
39addeb4 pkg/{manifests,tasks}: Remove PDB for prometheus and alertmanager
f056c723 pkg/tasks: fix creation of alertmanager PDB
cddf199b pkg/{client,manifests,tasks}: apply PDB for alertmanager and promethei
6fd19a23 pkg: wire prometheus-adapter PDB

So it's possibly mostly luck (and the bug 2021097 fix?) that keeps Prom more HA on 4.9+.

The reason 4.8 jobs have trouble while 4.7 jobs do not is that the test is new in 4.8 [2]:

origin$ git diff origin/release-4.7..origin/release-4.8 -- test/extended/prometheus/prometheus.go | grep -2 'no gaps'
- g.It("shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured", func() {
+ g.It("shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing", func() {
    if len(os.Getenv("TEST_UNSUPPORTED_ALLOW_VERSION_SKEW")) > 0 {
      e2eskipper.Skipf("Test is disabled to allow cluster components to have different versions, and skewed versions trigger multiple other alerts")
--
-   fmt.Sprintf(`count_over_time(ALERTS{alertstate="firing",alertname="Watchdog", severity="none"}[%s])`, testDuration): true,
+   // Invariant: The watchdog alert should be firing continuously during the whole test via the thanos
+   // querier (which should have no gaps when it queries the individual stores). Allow zero or one changes
+   // to the presence of this series (zero if data is preserved over test, one if data is lost over test).
+   // This would not catch the alert stopping firing, but we catch that in other places and tests.

Possible mitigations:

a. Backport more of the HA logic so that 4.8 and updates into it are less likely to experience disruption.
b. Drop the Watchdog-continuity test from the 4.8 origin suite. We're now a good ways into 4.8.z, and folks don't seem all that mad about occasional, few-minute Prom outages when they're unlucky enough to have co-located Prom pods. When they eventually update to 4.9, the bug 2021097 fix will help them pry the Prom pods apart, after which they won't even have those short outages.
c. Some kind of openshift/release logic to pry the Prom pods apart and put them on separate nodes before launching the test suite (see the sketch at the end of this comment).
d. Other?

I'm fine with all of these, but if nobody else has opinions, I'm split between (b) and (c). (b) seems sufficient based on recent failure data, but (c) might help things like 4.7 -> 4.8 -> 4.9 jobs where (b) won't apply, and we're unlikely to want to soften 4.9 origin...

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1468632049781837824
[2]: https://github.com/openshift/origin/pull/26020
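If we go with (c), a very rough sketch of what the pre-test step could look like (hypothetical; the node name is an example, and without the hard anti-affinity/PDB bits from master the rescheduled pod is not guaranteed to land elsewhere, particularly if Prometheus is backed by zonal PVs):

  # 1. See whether both replicas share a node.
  $ oc -n openshift-monitoring get pods -o wide | grep '^prometheus-k8s-'

  # 2. If so, cordon the shared node so the evicted replica cannot come back,
  #    delete one replica, wait for it to reschedule, then uncordon.
  #    (ci-op-example-worker-b-r8bf6 is a placeholder node name.)
  $ oc adm cordon ci-op-example-worker-b-r8bf6
  $ oc -n openshift-monitoring delete pod prometheus-k8s-1
  $ sleep 30  # give the statefulset controller time to recreate the pod
  $ oc -n openshift-monitoring wait --for=condition=Ready pod/prometheus-k8s-1 --timeout=10m
  $ oc adm uncordon ci-op-example-worker-b-r8bf6

  # 3. Sanity-check whether this release ships PDBs for the monitoring pods at all:
  $ oc -n openshift-monitoring get pdb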
CI-search results from comment 0 show only 4.8 is impacted. Closing this 4.10.0 bug CURRENTRELEASE; 4.8 fix is happening via bug 2033379.
This is only happening in 4.8, so marking this 4.10 bug as verified.

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Watchdog+alert+had+missing+intervals&maxAge=336h&type=junit' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 14 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 14 runs, 79% failed, 27% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 14 runs, 100% failed, 86% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 118 runs, 35% failed, 10% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 118 runs, 67% failed, 14% of failures match = 9% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-uwm (all) - 7 runs, 43% failed, 33% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 14 runs, 100% failed, 7% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 14 runs, 93% failed, 23% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 7 runs, 100% failed, 43% of failures match = 43% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 117 runs, 65% failed, 22% of failures match = 15% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 32 runs, 100% failed, 16% of failures match = 16% impact
periodic-ci-openshift-release-master-okd-4.8-upgrade-from-okd-4.7-e2e-upgrade-gcp (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
pull-ci-openshift-cluster-network-operator-release-4.8-e2e-agnostic-upgrade (all) - 13 runs, 77% failed, 40% of failures match = 31% impact
pull-ci-openshift-kubernetes-release-4.8-e2e-azure-upgrade (all) - 11 runs, 55% failed, 50% of failures match = 27% impact
pull-ci-openshift-origin-release-4.8-e2e-aws-upgrade (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
pull-ci-openshift-router-release-4.8-e2e-upgrade (all) - 6 runs, 67% failed, 25% of failures match = 17% impact
rehearse-24212-periodic-ci-openshift-release-master-ci-4.9-e2e-azure-techpreview-serial (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-origin-installer-launch-gcp-modern (all) - 1236 runs, 23% failed, 0% of failures match = 0% impact
Going to re-open and re-use this bug; I think we are seeing it on 4.10, but the message has changed slightly:

  Watchdog alert had 2 changes during the run, which may be a sign of a Prometheus outage

See: https://search.ci.openshift.org/?search=Watchdog+alert+had&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10.*upgrade.*ovn&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Slack discussion with the monitoring team: https://coreos.slack.com/archives/C0VMT03S5/p1642430954105300
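To confirm what the test is seeing, the Watchdog series can also be queried directly through the thanos-querier route. A minimal sketch, assuming the stock openshift-monitoring setup and a token that is allowed to query cluster metrics:

  $ TOKEN="$(oc whoami -t)"
  $ HOST="$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')"
  # One sample per evaluation interval is expected; a noticeably lower count
  # over the window (or an empty result) points at the kind of gap the test flags.
  $ curl -G -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
      --data-urlencode 'query=count_over_time(ALERTS{alertname="Watchdog",alertstate="firing",severity="none"}[1h])'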
After investigating and discussing with Philip, the bug is triggered by the fact that, during the 4.9 -> 4.10 upgrade, the Prometheus operator has to delete and recreate the prometheus-k8s statefulset. The reason is that the version of the Prometheus operator shipped in 4.10 removes the "app" label selector in the statefulset spec [1]. Statefulsets don't support (on purpose) updating the selector field, so when the operator encounters a "Forbidden" error during sts updates, it deletes the statefulset [2] and restarts the reconciliation (which should recreate the sts).

Looking at the operator's logs, it appears to be stuck in a hot-loop, trying to delete the sts over and over [3]:

* At 06:10:36, the prometheus operator deletes the prometheus-k8s sts.
* From 06:10:37 to 06:12:13, the operator repeatedly deletes the sts and the kube-controller-manager logs lots of "error syncing StatefulSet openshift-monitoring/prometheus-k8s, requeuing: object is being deleted".
* At 06:12:54, the situation stabilizes, which matches the kube-controller-manager logging "StatefulSet has been deleted openshift-monitoring/prometheus-k8s".

Looking at the logs from the prometheus pods, we see that the restart is delayed by almost 2 minutes:

* At 06:10:35, k8s-0 and k8s-1 are stopped.
* At 06:13:15, k8s-1 starts.
* At 06:14:19, config loaded for k8s-1.
* At 06:15:13, k8s-0 starts.
* At 06:16:16, tsdb started + config loaded.

The time between the teardown of the pods and the pods being back exceeds the 5-minute metric-staleness interval, which explains the failure of the test. Effectively, for at least 5 minutes there is no Watchdog alert, since rule evaluation never happened.

The hot-loop can be explained: because the operator returns right after requesting the deletion, the sts resource may still be present on the following run (e.g. the finalizers haven't been processed yet), and when the operator syncs again, it sees the sts still present and tries to delete it again. The repeated requests may degrade the kube-controller-manager and the kubelet, as we can see lots of attempts to start the pods during this period. Instead, the operator should wait for the dependent resources to be cleaned up and the sts resource to be effectively deleted before proceeding. That wouldn't avoid the simultaneous unavailability of the prometheus pods, but it would hopefully remove the unnecessary load.

[1] https://github.com/prometheus-operator/prometheus-operator/pull/4350
[2] https://github.com/prometheus-operator/prometheus-operator/blob/d6707eddf10f9cddb7a26dfc90f4f69ff8053c11/pkg/prometheus/operator.go#L1298-L1313
[3] https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221642399860000%22,%221642400230000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade%2F1482940245992280064%5C%22%7D%20%7C%20unpack%20%20%7C%20container_name%3D%5C%22prometheus-operator%5C%22%22,%22refId%22:%22A%22%7D%5D
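A rough sketch for spotting the same hot-loop on a live cluster (the exact log wording differs between operator versions, so the greps are deliberately loose; in CI the same can be done against the gathered pod logs):

  # Dozens of delete/recreate mentions within a couple of minutes match the
  # behaviour described above.
  $ oc -n openshift-monitoring logs deploy/prometheus-operator -c prometheus-operator \
      | grep -Eic 'statefulset.*(delete|recreat)'

  # The kube-controller-manager side of the loop: repeated requeues while the
  # sts is being deleted. (<master-node> is a placeholder for one of the
  # control-plane node names.)
  $ oc -n openshift-kube-controller-manager logs kube-controller-manager-<master-node> -c kube-controller-manager \
      | grep -c 'error syncing StatefulSet openshift-monitoring/prometheus-k8s'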
*** Bug 2042441 has been marked as a duplicate of this bug. ***
Searched with https://search.ci.openshift.org/?search=Watchdog+alert+had&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10.*upgrade.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job and can still see "[bz-monitoring][invariant] alert/Watchdog must have no gaps or changes" failing, for example:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1490427103387062272

  [bz-monitoring][invariant] alert/Watchdog must have no gaps or changes 0s
  Watchdog alert had 15595 changes during the run, which may be a sign of a Prometheus outage in violation of the prometheus query SLO of 100% uptime

  Feb 06 21:26:46.385 - 2100s E alert/Watchdog ns/openshift-monitoring ALERTS{alertname="Watchdog", alertstate="firing", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="none"}
  Feb 06 22:03:58.385 - 2628s E alert/Watchdog ns/openshift-monitoring ALERTS{alertname="Watchdog", alertstate="firing", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="none"}
Due to a change[1] upstream in prometheus-operator, the StatefulSet cannot be patched and must be deleted and recreated, since a selector label was deleted. With this in mind, there is no way we can guarantee 100% uptime during this particular upgrade.

As mitigation against this particular problem, we have added comments to the upstream codebase[2] to warn about the impact of changing labels and thus having the operator recreate the StatefulSet. The PR linked here will simply avoid a hot-loop of re-entering the recreate flow and thus prolonging the period of downtime, which I believe is the best we can achieve given the circumstances.

Factors that will contribute to downtime include the following (a rough way to estimate the first two is sketched at the end of this comment):

1. Time taken to reschedule Pods
2. Length of time it takes to replay the WAL
3. Time it takes for initial scrapes
4. Rule evaluation

The test that is failing in this instance looks at the Watchdog alert and, taking staleness[3] into account, will only trigger if factors 1-4 listed above take longer than 5m total.

[1] - https://github.com/prometheus-operator/prometheus-operator/pull/4350
[2] - https://github.com/prometheus-operator/prometheus-operator/blob/9fb1cf61dedf209dfcad8230ee963862277313a8/pkg/prometheus/statefulset.go#L663-L666
[3] - https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness
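A minimal sketch for estimating factors 1 and 2 after the fact (event and log wording varies a bit across versions):

  # Factor 1: rescheduling time - compare the Killing / Scheduled / Started
  # event timestamps for the prometheus-k8s pods.
  $ oc -n openshift-monitoring get events --sort-by=.metadata.creationTimestamp \
      | grep -E 'prometheus-k8s.*(Killing|Scheduled|Started)'

  # Factor 2: WAL replay time - recent Prometheus releases log the replay
  # duration at startup.
  $ oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus | grep -i 'replay'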
Checked CI jobs with https://search.ci.openshift.org/?search=Watchdog+alert+had&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10.*upgrade.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job. Although we can still see "[bz-monitoring][invariant] alert/Watchdog must have no gaps or changes" failures, checking the prometheus operator's logs shows it did not get stuck in a hot-loop trying to delete the sts over and over. Setting this bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056