KubeAPIErrorBudgetBurn is firing in about 15% of GCP e2e (parallel) and upgrade jobs, causing test failures. This needs to be investigated before 4.8 because it may point to a significant issue causing the alert to fire on GCP clusters. The GCP suite will mark the test "known failing" against this bug until a fix is located.
The burn rate for https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1386794875885719552 crosses 0.06 at 22:11:32Z, which is 5m into the e2e test, and peaks at ~0.10. The read rate was the primary component (write was steady around 0.02), and most of that was not errors but slow resource calls. I tried to restructure the burn query (which is written in a way that makes debugging difficult and should probably be structured differently):

  (
    label_replace(sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",code=~"5.."}[30m])), "type", "error", "_none_", "")
    or
    label_replace(
      sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET",scope=~"resource|"}[30m]))
      - (sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30m])) or vector(0)),
      "type", "slow-resource", "_none_", "")
    or
    label_replace(
      sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET",scope=~"namespace"}[30m]))
      - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="namespace",le="0.5"}[30m])),
      "type", "slow-namespace", "_none_", "")
    or
    label_replace(
      sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET",scope="cluster"}[30m]))
      - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="cluster",le="5"}[30m])),
      "type", "slow-cluster", "_none_", "")
  )
  / scalar(sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[30m])))

That showed slow single-resource reads are the most prominent contributor (at least in my digging).
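For intuition, the decomposition the query performs can be sketched in plain code. The per-second rates below are made up for illustration (not taken from the cluster above); only the structure mirrors the PromQL:

```python
# Sketch of the burn-rate decomposition performed by the PromQL above,
# using made-up per-second request rates (not real cluster data).

# Total read (LIST/GET) request rate, req/s.
total_read_rate = 100.0

# Component rates that count against the error budget:
error_5xx_rate = 0.5        # requests answered with HTTP 5xx
slow_resource_rate = 5.0    # single-resource reads slower than 0.1s
slow_namespace_rate = 1.5   # namespace-scoped reads slower than 0.5s
slow_cluster_rate = 1.0     # cluster-scoped reads slower than 5s

def burn_rate(components, total):
    """Fraction of read requests that violate the SLO."""
    return sum(components) / total

rate = burn_rate(
    [error_5xx_rate, slow_resource_rate, slow_namespace_rate, slow_cluster_rate],
    total_read_rate,
)
print(f"burn rate: {rate:.2f}")  # 0.08 with these numbers: mostly slow, not errors
```

With these illustrative numbers the "slow" components dominate the burn, matching the observation that the errors were mostly slow resource calls rather than 5xx responses.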
After looking at the queries as p90:

  histogram_quantile(0.9, sum by (le,group,resource,scope,subresource) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope=~"resource|"}[30m])))

it looks like proxy/log/exec are a significant portion of the burn budget (which e2e exercises heavily). But you can't measure latency against a 0.1s target for calls that may run arbitrarily long:

  {resource="nodes", scope="resource", subresource="proxy"}
  {resource="pods", scope="resource", subresource="log"}

The other batch looks like CRDs:

  {group="apiserver.openshift.io", resource="apirequestcounts", scope="resource"}
  {group="config.openshift.io", resource="clusterversions", scope="resource"}
  {group="operator.openshift.io", resource="csisnapshotcontrollers", scope="resource"}
  {resource="persistentvolumes", scope="resource"}

clusterversions and persistentvolumes show up very early - this may be etcd quorum issues, or some other hang in the apiserver where requests go in but don't fail over. So, three bugs:

1. Long-running requests should be excluded from error budget burn for GET - counting them is just straight up incorrect.
2. CRDs appear to have some very weird early install latency.
3. The query itself is hard to debug and should be restructured for humans (hopefully without giving up performance).
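For readers unfamiliar with histogram_quantile: it estimates a quantile by linear interpolation inside cumulative histogram buckets. A minimal re-implementation for intuition only (this is not the Prometheus source; bucket values below are invented):

```python
# Minimal sketch of a Prometheus-style histogram_quantile, for intuition
# only. Buckets are cumulative: (upper_bound, count of observations <= bound),
# with the last bucket at +Inf holding the total count.

def histogram_quantile(q, buckets):
    buckets = sorted(buckets)            # ascending upper bounds
    total = buckets[-1][1]               # +Inf bucket holds the total count
    rank = q * total                     # target observation rank
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound        # quantile falls in the +Inf bucket
            # Linear interpolation within this bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# Example: 100 requests; 60 finished under 0.1s, 90 under 0.5s.
buckets = [(0.1, 60), (0.5, 90), (float("inf"), 100)]
p90 = histogram_quantile(0.9, buckets)
print(p90)
```

Note the +Inf case: for long-running calls like proxy/log/exec, most observations pile into the top buckets, so the estimated p90 is dominated by bucket boundaries rather than real latency, which is part of why a fixed 0.1s target is meaningless for them.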
Recommend getting the monitoring team involved for the 3rd issue.
The test will be excluded in https://github.com/openshift/origin/pull/26107
A few instances on AWS too after this change was applied, though they may be unrelated: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1389053116594262016
Still firing on GCP, although I think at a lower rate: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1077/pull-ci-openshift-cluster-network-operator-master-e2e-gcp/1389149804885446656 The issues I mentioned are structural (pod logs/exec) and will only get worse as we get better at running tests.
Filling in some example job output for Sippy and other searchers [1]:

[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
Run #0: Failed 59s
fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:493]: Unexpected error:
...
ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1
...
"alertname": "KubeAPIErrorBudgetBurn",
...

[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel]
Run #0: Failed 9s
flake: Unexpected alert behavior during test:
alert KubeAPIErrorBudgetBurn fired for 2 seconds with labels: {long="6h", severity="critical", short="30m"} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1953798)
alert KubeAPIErrorBudgetBurn fired for 2233 seconds with labels: {long="1d", severity="warning", short="2h"} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1953798)
alert KubeAPIErrorBudgetBurn pending for 1 seconds with labels: {__name__="ALERTS", long="3d", severity="warning", short="6h"}
alert KubeAPIErrorBudgetBurn pending for 1 seconds with labels: {__name__="ALERTS", long="6h", severity="critical", short="30m"}
alert PodDisruptionBudgetAtLimit pending for 1 seconds with labels: {__name__="ALERTS", namespace="e2e-disruption-6265", poddisruptionbudget="foo", severity="warning"}

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success
Run #0: Failed 1h5m47s
Unexpected alert behavior during upgrade:
alert KubeAPIErrorBudgetBurn fired for 48 seconds with labels: {long="1d", severity="warning", short="2h"} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1953798)
alert KubeAPIErrorBudgetBurn pending for 1 seconds with labels: {__name__="ALERTS", long="3d", severity="warning", short="6h"}
...

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1395353573335764992
Lots of impact over the past 24h:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=alert+KubeAPIErrorBudgetBurn+fired+for' | grep 'failures match' | grep -v 'pull-ci\|rehearse-' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp (all) - 12 runs, 17% failed, 50% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 11 runs, 55% failed, 150% of failures match = 82% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 11 runs, 100% failed, 9% of failures match = 9% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.8 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-e2e-remote-libvirt-compact-s390x-4.8 (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
release-openshift-ocp-installer-e2e-remote-libvirt-s390x-4.8 (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 6 runs, 100% failed, 17% of failures match = 17% impact

Can we get a status update?
We have a PR[1] which updates these recording rules to make them less flaky. This PR is dependent on another PR[2] which removes the same upstream alerts and rules from the monitoring stack, to avoid duplication. But it looks like CI is currently stuck. [1] https://github.com/openshift/cluster-kube-apiserver-operator/pull/1126 [2] https://github.com/openshift/cluster-monitoring-operator/pull/1164
We will monitor CI over the next few days to see if https://github.com/openshift/cluster-kube-apiserver-operator/pull/1126 resolves the flakiness.
Looking a lot better over the past 48h:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=alert+KubeAPIErrorBudgetBurn.*fired+for' | grep 'failures match' | grep -v 'pull-ci-\|rehearse' | sort
release-openshift-origin-installer-e2e-aws-disruptive-4.8 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

Linking that job, and a few CI presubmits and such:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&search=alert+KubeAPIErrorBudgetBurn.*fired+for' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.8/1400577400915365888
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26200/pull-ci-openshift-origin-master-e2e-aws-disruptive/1400537065904410624
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26200/pull-ci-openshift-origin-master-e2e-aws-disruptive/1400576422270996480
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1132/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1400458370212696064
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1196/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1400484615667322880
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2423/pull-ci-openshift-machine-config-operator-master-e2e-aws-disruptive/1400400876337631232

That release job had:

INFO[2021-06-04T01:20:26Z] alert KubeAPIErrorBudgetBurn fired for 297 seconds with labels: {long="6h", severity="critical", short="30m"}
INFO[2021-06-04T01:20:26Z] alert KubeAPIErrorBudgetBurn fired for 780 seconds with labels: {long="1d", severity="warning", short="2h"}
Some updates from the past 48 hours:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=alert+KubeAPIErrorBudgetBurn.*fired+for' | grep 'failures match' | grep -v 'pull-ci-\|rehearse' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 29 runs, 52% failed, 20% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 30 runs, 80% failed, 13% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 16 runs, 94% failed, 7% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 14 runs, 71% failed, 20% of failures match = 14% impact
release-openshift-origin-installer-e2e-aws-disruptive-4.9 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

What I observed is that in the [4.8|4.9]-e2e-gcp-upgrade and 4.9-upgrade-from-stable-4.8 jobs, the alerts are triggered by slow GET requests to the `apirequestcounts` resource. All of these are SLO `warning` violations, not `critical` ones, per the description in the upstream runbook[1]. The e2e-aws-disruptive-4.9 failure was caused by a high volume of 5xx errors to the `/readyz` subresource.

[1] https://github.com/prometheus-operator/kube-prometheus/wiki/KubeAPIErrorBudgetBurn#warning
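The long/short label pairs in these alerts come from the multi-window, multi-burn-rate pattern: an alert fires only when both the long and the short window exceed the same burn threshold, so sustained burn is required and recovered incidents self-clear. A sketch under stated assumptions (the window pairs match the alert labels seen in this bug; the burn factors and error budget here are illustrative, not necessarily the shipped rule values):

```python
# Sketch of a multi-window, multi-burn-rate alert check. Window pairs match
# the long/short labels seen in the alerts above; the burn factors and the
# error budget are illustrative assumptions, not the exact shipped values.

SLO_ERROR_BUDGET = 0.01  # e.g. a 99% availability target

WINDOWS = [
    # (long window, short window, burn factor, severity)
    ("6h", "30m", 6.0, "critical"),
    ("1d", "2h", 3.0, "warning"),
    ("3d", "6h", 1.0, "warning"),
]

def firing_alerts(burn_by_window):
    """burn_by_window maps a window name to its measured burn rate."""
    alerts = []
    for long_w, short_w, factor, severity in WINDOWS:
        threshold = factor * SLO_ERROR_BUDGET
        # Both windows must exceed the threshold: the long window proves the
        # burn is sustained, the short window proves it is still happening.
        if burn_by_window[long_w] > threshold and burn_by_window[short_w] > threshold:
            alerts.append((long_w, short_w, severity))
    return alerts

burn = {"30m": 0.08, "6h": 0.07, "2h": 0.08, "1d": 0.02, "3d": 0.005}
print(firing_alerts(burn))  # only the (6h, 30m) critical pair fires here
```

With these illustrative numbers only the (long="6h", short="30m") critical pair trips, which mirrors the burn rate crossing 0.06 observed early in this bug.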
The Multi-Arch Power team tested this bug on 4.8-fc.9 (as it is similar to BZ 1970363), and they still observed the failure reported in BZ 1970363.
@danili I have documented some steps on how to debug this alert; see https://github.com/openshift/runbooks/pull/2. It boils down to identifying the sources of slow and/or error requests on your cluster.
The overall CI results (in the past 60 hours) are looking a lot better:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=60h&type=junit&search=alert+KubeAPIErrorBudgetBurn.*fired+for' | grep 'failures match' | grep -v 'pull-ci-\|rehearse' | sort
release-openshift-ocp-installer-e2e-remote-libvirt-compact-s390x-4.8 (all) - 5 runs, 60% failed, 100% of failures match = 60% impact
release-openshift-ocp-installer-e2e-remote-libvirt-image-ecosystem-s390x-4.8 (all) - 5 runs, 40% failed, 50% of failures match = 20% impact

AIUI, the slow apirequestcount GET requests issue is resolved by https://github.com/openshift/kubernetes/pull/798
Thank you, Ivan. Hi @pdsilva, please see Ivan's debugging document in Comment 18 for your team's testing on BZ 1970363, and let us know if the issue persists.
Based on Comment #19, the KubeAPIErrorBudgetBurn alert triggered only on IBM Power, and we will track that in bug 1970363. The PR fix worked fine, so moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438