Bug 2030539
| Summary: | Prometheus is not highly available | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Monitoring | Assignee: | Philip Gough <pgough> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.10 | CC: | amuller, anpicker, aos-bugs, bparees, dgoodwin, erooth, hongyli, kai-uwe.rommel |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2033378 (view as bug list) | Environment: | |
| Last Closed: | 2022-03-10 16:32:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2033378, 2052201 | | |
Description
W. Trevor King
2021-12-09 04:56:55 UTC
CI-search results from comment 0 show only 4.8 is impacted. Closing this 4.10.0 bug CURRENTRELEASE; the 4.8 fix is happening via bug 2033379.

This is only happening in 4.8, so marking this 4.10 bug as verified.

```
w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Watchdog+alert+had+missing+intervals&maxAge=336h&type=junit' | grep 'failures match' | sort
bash: w3m: command not found...
Install package 'w3m' to provide command 'w3m'? [N/y] y
Proceed with changes? [N/y] y
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 14 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 14 runs, 79% failed, 27% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 14 runs, 100% failed, 86% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 118 runs, 35% failed, 10% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 118 runs, 67% failed, 14% of failures match = 9% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-uwm (all) - 7 runs, 43% failed, 33% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 14 runs, 100% failed, 7% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 14 runs, 93% failed, 23% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 7 runs, 100% failed, 43% of failures match = 43% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 117 runs, 65% failed, 22% of failures match = 15% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 32 runs, 100% failed, 16% of failures match = 16% impact
periodic-ci-openshift-release-master-okd-4.8-upgrade-from-okd-4.7-e2e-upgrade-gcp (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
pull-ci-openshift-cluster-network-operator-release-4.8-e2e-agnostic-upgrade (all) - 13 runs, 77% failed, 40% of failures match = 31% impact
pull-ci-openshift-kubernetes-release-4.8-e2e-azure-upgrade (all) - 11 runs, 55% failed, 50% of failures match = 27% impact
pull-ci-openshift-origin-release-4.8-e2e-aws-upgrade (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
pull-ci-openshift-router-release-4.8-e2e-upgrade (all) - 6 runs, 67% failed, 25% of failures match = 17% impact
rehearse-24212-periodic-ci-openshift-release-master-ci-4.9-e2e-azure-techpreview-serial (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-origin-installer-launch-gcp-modern (all) - 1236 runs, 23% failed, 0% of failures match = 0% impact
```

Going to re-open and re-use this bug; I think we are seeing it on 4.10, but the message has changed slightly.
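(Aside: the "impact" column in the output above appears to be the failure rate multiplied by the match rate, e.g. 79% failed x 27% of failures matching gives roughly 21% impact. This is an inference from the numbers shown, not documented search.ci.openshift.org behavior; a quick sketch of that assumed arithmetic:)

```go
// Assumed arithmetic for the CI-search summary lines above: "impact" looks
// like the fraction of all runs whose failure matched the search query,
// i.e. failed fraction * match fraction. Inferred from the numbers, not
// from any documented behavior of the search tool.
package main

import "fmt"

func main() {
	// periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade:
	// 14 runs, 79% failed, 27% of failures match
	failed, match := 0.79, 0.27
	fmt.Printf("impact = %.0f%%\n", failed*match*100) // prints "impact = 21%"
}
```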
"Watchdog alert had 2 changes during the run, which may be a sign of a Prometheus outage"

See: https://search.ci.openshift.org/?search=Watchdog+alert+had&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10.*upgrade.*ovn&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Slack discussion with the monitoring team: https://coreos.slack.com/archives/C0VMT03S5/p1642430954105300

After investigating and discussing with Philip, the bug is triggered by the fact that during the 4.9 > 4.10 upgrade, the Prometheus operator has to delete and recreate the prometheus-k8s statefulset. The reason is that the version of the Prometheus operator shipped in 4.10 removes the "app" label selector from the statefulset spec [1]. Statefulsets don't support (on purpose) updating the selectors field, so when the operator encounters a "Forbidden" error during sts updates, it deletes the statefulset [2] and restarts the reconciliation (which should recreate the sts).

Looking at the operator's logs, it appears to be stuck in a hot-loop trying to delete the sts over and over [3]:

* At 06:10:36, the prometheus operator deletes the prometheus-k8s sts.
* From 06:10:37 to 06:12:13, the operator repeatedly deletes the sts and the kube-controller-manager logs lots of "error syncing StatefulSet openshift-monitoring/prometheus-k8s, requeuing: object is being deleted".
* At 06:12:54, the situation stabilizes, which matches the kube-controller-manager logging "StatefulSet has been deleted openshift-monitoring/prometheus-k8s".

Looking at the logs from the prometheus pods, we see that the restart is delayed by almost 2 minutes:

* At 06:10:35, k8s-0 and k8s-1 are stopped.
* At 06:13:15, k8s-1 starts.
* At 06:14:19, the config is loaded for k8s-1.
* At 06:15:13, k8s-0 starts.
* At 06:16:16, the TSDB is started and the config is loaded.

The time between the teardown of the pods and the pods being back exceeds the 5-minute interval for metric staleness, which explains the failure of the test. Effectively, for at least 5 minutes there is no Watchdog alert since the rule evaluation never happened.

The hot-loop situation can be explained: because the operator returns right after requesting the deletion, the sts resource may still be present on the following run (e.g. the finalizers haven't been processed yet), and when the operator syncs again, it sees the sts still present and tries to delete it again. The repeated requests may degrade the kube-controller-manager and the kubelet, as we can see lots of attempts to start the pods during this period. Instead, the operator should wait for the dependent resources to be cleaned up and the sts resource to be effectively deleted before proceeding (see the sketch after this comment). It wouldn't avoid the simultaneous unavailability of the prometheus pods, but it would hopefully remove the unnecessary load.

[1] https://github.com/prometheus-operator/prometheus-operator/pull/4350
[2] https://github.com/prometheus-operator/prometheus-operator/blob/d6707eddf10f9cddb7a26dfc90f4f69ff8053c11/pkg/prometheus/operator.go#L1298-L1313
[3] https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221642399860000%22,%221642400230000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade%2F1482940245992280064%5C%22%7D%20%7C%20unpack%20%20%7C%20container_name%3D%5C%22prometheus-operator%5C%22%22,%22refId%22:%22A%22%7D%5D

*** Bug 2042441 has been marked as a duplicate of this bug. ***
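The sketch referenced above: this is not the actual upstream fix, just a minimal client-go illustration of "request the deletion once, then wait until the StatefulSet is effectively gone before reconciling again" instead of re-issuing deletes in a hot loop. The function name, polling interval, and timeout are illustrative assumptions.

```go
// Sketch only: not the prometheus-operator code, just an illustration of
// waiting for a StatefulSet deletion to complete (finalizers processed,
// object gone from the API server) before the next reconciliation recreates it.
package example

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// deleteStatefulSetAndWait issues a single delete request and then polls until
// the API server reports the object as not found, so the caller only recreates
// the StatefulSet once the old one is really gone.
func deleteStatefulSetAndWait(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	err := client.AppsV1().StatefulSets(namespace).Delete(ctx, name, metav1.DeleteOptions{})
	if err != nil && !apierrors.IsNotFound(err) {
		return fmt.Errorf("requesting deletion of statefulset %s/%s: %w", namespace, name, err)
	}

	// Poll instead of deleting again: the object may linger while its
	// dependents and finalizers are processed.
	return wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
		_, err := client.AppsV1().StatefulSets(namespace).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // fully deleted, safe to recreate
		}
		if err != nil {
			return false, err
		}
		return false, nil // still terminating, keep waiting
	})
}
```

Another option along the same lines would be to request deletion with metav1.DeletePropagationForeground, so the StatefulSet object only disappears once its dependents are cleaned up; either way, the point is to observe the deletion completing rather than re-issuing delete requests.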
Searched with https://search.ci.openshift.org/?search=Watchdog+alert+had&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10.*upgrade.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job and can still see "[bz-monitoring][invariant] alert/Watchdog must have no gaps or changes" failing, for example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1490427103387062272

```
[bz-monitoring][invariant] alert/Watchdog must have no gaps or changes  0s
Watchdog alert had 15595 changes during the run, which may be a sign of a Prometheus outage in violation of the prometheus query SLO of 100% uptime
Feb 06 21:26:46.385 - 2100s E alert/Watchdog ns/openshift-monitoring ALERTS{alertname="Watchdog", alertstate="firing", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="none"}
Feb 06 22:03:58.385 - 2628s E alert/Watchdog ns/openshift-monitoring ALERTS{alertname="Watchdog", alertstate="firing", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="none"}
```

Due to a change [1] upstream in prometheus-operator, the StatefulSet cannot be patched and must be deleted and recreated, since a selector label was deleted. With this in mind, there is no way we can guarantee 100% uptime during this particular upgrade. As mitigation against this particular problem, we have added comments to the upstream codebase [2] to warn about the impact of changing labels and thus having the operator recreate the StatefulSet. The PR linked here simply avoids a hot-loop of re-entering the recreate flow and thus prolonging the period of downtime, which I believe is the best we can achieve given the circumstances.

Factors that will contribute to downtime include the following:

1. Time taken to reschedule the Pods
2. Length of time it takes to replay the WAL
3. Time it takes for the initial scrapes
4. Rule evaluation

The test that is failing in this instance looks at the Watchdog alert and, taking staleness [3] into account, will only trigger if factors 1-4 above take longer than 5m in total.

[1] https://github.com/prometheus-operator/prometheus-operator/pull/4350
[2] https://github.com/prometheus-operator/prometheus-operator/blob/9fb1cf61dedf209dfcad8230ee963862277313a8/pkg/prometheus/statefulset.go#L663-L666
[3] https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness

Checked in CI jobs with https://search.ci.openshift.org/?search=Watchdog+alert+had&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10.*upgrade.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job. Although we can still see "[bz-monitoring][invariant] alert/Watchdog must have no gaps or changes" failures, the prometheus operator's logs show it is no longer stuck in a hot-loop trying to delete the sts over and over. Setting this bug to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
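For reference, a rough sketch of the kind of gap check the failing invariant performs: it is not the actual openshift/origin test code, just an illustration of querying the Watchdog ALERTS series over a window via the Prometheus query_range API and flagging gaps longer than the ~5m staleness window. The Prometheus URL, query window, step, and threshold are assumptions.

```go
// Illustration only: look for gaps longer than the staleness window in the
// firing Watchdog ALERTS series, using the standard Prometheus HTTP API.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

type rangeResponse struct {
	Data struct {
		Result []struct {
			Values [][]interface{} `json:"values"` // [unix_ts, "value"] pairs
		} `json:"result"`
	} `json:"data"`
}

func main() {
	promURL := "http://localhost:9090" // assumed; in-cluster access needs a route and token
	end := time.Now()
	start := end.Add(-2 * time.Hour)

	q := url.Values{}
	q.Set("query", `ALERTS{alertname="Watchdog",alertstate="firing"}`)
	q.Set("start", fmt.Sprintf("%d", start.Unix()))
	q.Set("end", fmt.Sprintf("%d", end.Unix()))
	q.Set("step", "30")

	resp, err := http.Get(promURL + "/api/v1/query_range?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var rr rangeResponse
	if err := json.NewDecoder(resp.Body).Decode(&rr); err != nil {
		panic(err)
	}

	const maxGap = 5 * time.Minute // roughly the staleness interval from [3]
	for _, series := range rr.Data.Result {
		for i := 1; i < len(series.Values); i++ {
			prev := time.Unix(int64(series.Values[i-1][0].(float64)), 0)
			cur := time.Unix(int64(series.Values[i][0].(float64)), 0)
			if gap := cur.Sub(prev); gap > maxGap {
				fmt.Printf("Watchdog gap of %s between %s and %s\n", gap, prev, cur)
			}
		}
	}
}
```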