Bug 1953798 - GCP e2e (parallel and upgrade) regularly trigger KubeAPIErrorBudgetBurn alert, also happens on AWS
Summary: GCP e2e (parallel and upgrade) regularly trigger KubeAPIErrorBudgetBurn alert...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Stefan Schimanski
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks: 1969552
 
Reported: 2021-04-27 00:36 UTC by Clayton Coleman
Modified: 2021-07-27 23:04 UTC
CC List: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:04:01 UTC
Target Upstream Version:
Embargoed:




Links:
  Github openshift cluster-kube-apiserver-operator pull 1120 (closed): bug 1953798: stop doing live gets for optional cert syncer files that don't exist on disk yet (last updated 2021-05-20 18:38:44 UTC)
  Github openshift cluster-kube-apiserver-operator pull 1126 (closed): Bug 1953798: Add apiserver burn rate SLO alerts (last updated 2021-06-10 14:56:24 UTC)
  Github openshift cluster-kube-apiserver-operator pull 1144 (closed): Bug 1953798: Fix incorrect 'scope' label in SLO recording rules (last updated 2021-06-10 14:56:22 UTC)
  Github openshift cluster-monitoring-operator pull 1164 (closed): Bug 1953798: Remove the upstream KubeAPIErrorBudgetBurn alerts (last updated 2021-06-10 14:56:21 UTC)
  Github openshift library-go pull 1064 (closed): bug 1953798: stop doing live gets for optional cert syncer files that don't exist on disk yet (last updated 2021-05-20 18:38:51 UTC)
  Github openshift openshift-controller-manager pull 179 (closed): bug 1953798: stop requeuing secret creation when namespaces are terminating (last updated 2021-05-20 18:38:53 UTC)
  Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 23:04:20 UTC)

Description Clayton Coleman 2021-04-27 00:36:44 UTC
KubeAPIErrorBudgetBurn is firing in about 15% of GCP e2e (parallel) and upgrade jobs, causing tests to fail.

The cause of the failure needs to be investigated before 4.8 because this may point to a significant issue causing the alert to fire on GCP clusters.

The GCP e2e suite will mark this as "known failing" and link this bug until a fix is located.

Comment 1 Clayton Coleman 2021-04-27 01:41:17 UTC
The burn rate for https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1386794875885719552 crosses 0.06 at 22:11:32Z, about 5 minutes into the e2e run, and peaks at ~0.10. The read rate was the primary component (write was steady around 0.02), and most of that contribution came not from actual errors but from slow single-resource calls.
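For context, a minimal sketch of the (long="6h", short="30m") critical condition, assuming the upstream kube-prometheus shape of the alert (the exact recording rules shipped in the payload may differ); the 0.06 threshold is 6 times the 1% error budget:

  # assumed upstream-style expression; the burnrate recording rules fold 5xx errors and slow requests together
  sum(apiserver_request:burnrate6h) > (6.00 * 0.01)
  and
  sum(apiserver_request:burnrate30m) > (6.00 * 0.01)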

I tried to restructure the burn query (which is written in a way that makes debugging difficult and should probably be structured differently)

(
  label_replace(sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",code=~"5.."}[30m])), "type", "error", "_none_", "")
or
  label_replace(
      sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET",scope=~"resource|"}[30m]))
    - (sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30m])) or vector(0))
  , "type", "slow-resource", "_none_", "")
or
  label_replace(
      sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET",scope=~"namespace"}[30m]))
    - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="namespace",le="0.5"}[30m]))
  , "type", "slow-namespace", "_none_", "")
or
  label_replace(
      sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET",scope="cluster"}[30m]))
    - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="cluster",le="5"}[30m]))
  , "type", "slow-cluster", "_none_", "")
) / scalar(sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[30m])))

which showed that slow single-resource reads are the most prominent contributor (at least from this digging).

After looking at the latencies as a p90:

histogram_quantile(0.9, sum by (le,group,resource,scope,subresource) ((rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope=~"resource|"}[30m]))))

it looks like proxy/log/exec are a significant portion of the burn budget (which e2e does heavily).  But you can't measure latency against a 0.1s target for calls which may run arbitrarily long:

{resource="nodes", scope="resource", subresource="proxy"}
{resource="pods", scope="resource", subresource="log"}

The other batch looks like CRDs:

{group="apiserver.openshift.io", resource="apirequestcounts", scope="resource"}
{group="config.openshift.io", resource="clusterversions", scope="resource"}
{group="operator.openshift.io", resource="csisnapshotcontrollers", scope="resource"}
{resource="persistentvolumes", scope="resource"}

It looks like the clusterversions and persistentvolumes slowness happens very early in the run - this may be etcd quorum issues, or some other hang in the apiserver where requests go in but don't fail over.

So three bugs:

1. Long-running requests (e.g. proxy/log/exec) should be excluded from the error budget burn for GET - measuring them against a fixed latency target is simply incorrect (see the sketch after this list).
2. CRDs appear to have some early install latency that is very weird.
3. The query itself is hard to debug and should be restructured for humans (hopefully we can get that while keeping it performant).
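For (1), a minimal sketch of what dropping long-running subresources from the slow-read numerator could look like; the filter is illustrative and may not match the change that actually lands:

  sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET",scope=~"resource|",subresource!~"proxy|log|exec"}[30m]))
  - (sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope=~"resource|",subresource!~"proxy|log|exec",le="0.1"}[30m])) or vector(0))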

Comment 2 Clayton Coleman 2021-04-27 01:43:16 UTC
Recommend getting monitoring team involved for the 3rd issue.

Comment 3 Clayton Coleman 2021-04-27 01:50:19 UTC
Test will be excluded in https://github.com/openshift/origin/pull/26107

Comment 5 Clayton Coleman 2021-05-03 13:17:10 UTC
A few instances on AWS too (https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1389053116594262016) after this change was applied, though they may be unrelated.

Comment 6 Clayton Coleman 2021-05-03 13:18:39 UTC
Still firing on GCP, although I think the rate is lower: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1077/pull-ci-openshift-cluster-network-operator-master-e2e-gcp/1389149804885446656

The issues I mentioned are structural (pod logs/exec) and would get worse the better we get at running tests.

Comment 7 W. Trevor King 2021-05-20 18:37:53 UTC
Filling in some example job output for Sippy and other searchers [1]:

  : [sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
  Run #0: Failed	59s
    fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:493]: Unexpected error:
    ...
    ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1
    ...
          "alertname": "KubeAPIErrorBudgetBurn",
    ...

  : [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel]
    Run #0: Failed	9s
   flake: Unexpected alert behavior during test:

    alert KubeAPIErrorBudgetBurn fired for 2 seconds with labels: {long="6h", severity="critical", short="30m"} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1953798)
    alert KubeAPIErrorBudgetBurn fired for 2233 seconds with labels: {long="1d", severity="warning", short="2h"} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1953798)
    alert KubeAPIErrorBudgetBurn pending for 1 seconds with labels: {__name__="ALERTS", long="3d", severity="warning", short="6h"}
    alert KubeAPIErrorBudgetBurn pending for 1 seconds with labels: {__name__="ALERTS", long="6h", severity="critical", short="30m"}
    alert PodDisruptionBudgetAtLimit pending for 1 seconds with labels: {__name__="ALERTS", namespace="e2e-disruption-6265", poddisruptionbudget="foo", severity="warning"}

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success
    Run #0: Failed	1h5m47s
    Unexpected alert behavior during upgrade:

    alert KubeAPIErrorBudgetBurn fired for 48 seconds with labels: {long="1d", severity="warning", short="2h"} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1953798)
    alert KubeAPIErrorBudgetBurn pending for 1 seconds with labels: {__name__="ALERTS", long="3d", severity="warning", short="6h"}
    ...

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1395353573335764992

Comment 8 W. Trevor King 2021-05-24 23:31:58 UTC
Lots of impact over the past 24h:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=alert+KubeAPIErrorBudgetBurn+fired+for' | grep 'failures match' | grep -v 'pull-ci\|rehearse-' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp (all) - 12 runs, 17% failed, 50% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 11 runs, 55% failed, 150% of failures match = 82% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 11 runs, 100% failed, 9% of failures match = 9% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.8 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-e2e-remote-libvirt-compact-s390x-4.8 (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
release-openshift-ocp-installer-e2e-remote-libvirt-s390x-4.8 (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 6 runs, 100% failed, 17% of failures match = 17% impact

Can we get a status update?

Comment 9 Ivan Sim 2021-05-25 03:28:54 UTC
We have a PR[1] which updates these recording rules to make them less flaky. This PR is dependent on another PR[2] which removes the same upstream alerts and rules from the monitoring stack, to avoid duplication. But it looks like CI is currently stuck.

[1] https://github.com/openshift/cluster-kube-apiserver-operator/pull/1126
[2] https://github.com/openshift/cluster-monitoring-operator/pull/1164

Comment 13 Ivan Sim 2021-06-01 02:53:51 UTC
We will monitor CI over the next few days to see if https://github.com/openshift/cluster-kube-apiserver-operator/pull/1126 resolves the flakiness.

Comment 14 W. Trevor King 2021-06-04 02:36:49 UTC
Looking a lot better over the past 48h:

  $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=alert+KubeAPIErrorBudgetBurn.*fired+for' | grep 'failures match' | grep -v 'pull-ci-\|rehearse' | sort
  release-openshift-origin-installer-e2e-aws-disruptive-4.8 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

Linking that job, and a few CI presubmits and such:

  $ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&search=alert+KubeAPIErrorBudgetBurn.*fired+for' | jq -r 'keys[]'
  https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.8/1400577400915365888
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26200/pull-ci-openshift-origin-master-e2e-aws-disruptive/1400537065904410624
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26200/pull-ci-openshift-origin-master-e2e-aws-disruptive/1400576422270996480
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1132/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1400458370212696064
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1196/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1400484615667322880
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2423/pull-ci-openshift-machine-config-operator-master-e2e-aws-disruptive/1400400876337631232

That release job had:

  INFO[2021-06-04T01:20:26Z] alert KubeAPIErrorBudgetBurn fired for 297 seconds with labels: {long="6h", severity="critical", short="30m"}
  INFO[2021-06-04T01:20:26Z] alert KubeAPIErrorBudgetBurn fired for 780 seconds with labels: {long="1d", severity="warning", short="2h"}

Comment 16 Ivan Sim 2021-06-10 21:08:06 UTC
Some updates from the past 48 hours:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=alert+KubeAPIErrorBudgetBurn.*fired+for' | grep 'failures match' | grep -v 'pull-ci-\|rehearse' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 29 runs, 52% failed, 20% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 30 runs, 80% failed, 13% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 16 runs, 94% failed, 7% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 14 runs, 71% failed, 20% of failures match = 14% impact
release-openshift-origin-installer-e2e-aws-disruptive-4.9 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

What I observed is that in the [4.8|4.9]-e2e-gcp-upgrade and 4.9-upgrade-from-stable-4.8 jobs, the alerts are triggered by slow GET requests to the `apirequestcounts` resource. All of these are SLO `warning` violations, not `critical` ones, per the description in the upstream runbook[1].

The e2e-aws-disruptive-4.9 failure was caused by a high volume of 5xx errors to the `/readyz` subresources.

[1] https://github.com/prometheus-operator/kube-prometheus/wiki/KubeAPIErrorBudgetBurn#warning
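For anyone reproducing this, a hedged sketch of a query that ranks which resources contribute the slow single-resource reads (the 0.1s bucket and 1h window are illustrative, chosen to mirror the recording rules):

  topk(10,
      sum by (group, resource) (rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET",scope=~"resource|"}[1h]))
    - sum by (group, resource) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[1h]))
  )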

Comment 17 Dan Li 2021-06-14 11:54:51 UTC
The Multi-Arch Power team tested this bug on 4.8-fc.9 (as it is similar to BZ 1970363), and they still observed the failure that was reported in BZ 1970363.

Comment 18 Ivan Sim 2021-06-14 16:12:28 UTC
@danili I have documented some steps on how to debug this alert; see https://github.com/openshift/runbooks/pull/2. It boils down to identifying the sources of slow and/or erroring requests on your cluster.
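As a rough starting point (not a substitute for the runbook), a query along these lines surfaces the error side; the grouping labels and window are illustrative:

  # top sources of 5xx responses over the last hour
  topk(10, sum by (resource, subresource, verb, code) (rate(apiserver_request_total{job="apiserver",code=~"5.."}[1h])))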

Comment 19 Ivan Sim 2021-06-14 16:15:28 UTC
The overall CI results (in the past 60 hours) are looking a lot better:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=60h&type=junit&search=alert+KubeAPIErrorBudgetBurn.*fired+for' | grep 'failures match' | grep -v 'pull-ci-\|rehearse' | sort
release-openshift-ocp-installer-e2e-remote-libvirt-compact-s390x-4.8 (all) - 5 runs, 60% failed, 100% of failures match = 60% impact
release-openshift-ocp-installer-e2e-remote-libvirt-image-ecosystem-s390x-4.8 (all) - 5 runs, 40% failed, 50% of failures match = 20% impact

AIUI, the slow apirequestcount GET requests issue is resolved by https://github.com/openshift/kubernetes/pull/798

Comment 20 Dan Li 2021-06-14 16:30:19 UTC
Thank you Ivan.

Hi @pdsilva, please see Ivan's debugging document in Comment 18 for your team's testing purposes on BZ 1970363. Let us know if the issue persists.

Comment 21 Ke Wang 2021-06-15 06:04:32 UTC
Based on Comment #19, the KubeAPIErrorBudgetBurn alert is now triggered only on PowerPC and will be tracked in bug 1970363. The PR fix worked fine, so moving the bug to VERIFIED.

Comment 25 errata-xmlrpc 2021-07-27 23:04:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

