Bug 1953798

Summary: GCP e2e (parallel and upgrade) regularly trigger KubeAPIErrorBudgetBurn alert, also happens on AWS
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: kube-apiserver
Assignee: Stefan Schimanski <sttts>
Status: CLOSED ERRATA
QA Contact: Ke Wang <kewang>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.8
CC: aos-bugs, danili, deads, mfojtik, pdsilva, pkhaire, sttts, wking, xxia
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-27 23:04:01 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1969552    

Description Clayton Coleman 2021-04-27 00:36:44 UTC
KubeAPIErrorBudgetBurn is firing in about 15% of GCP e2e (parallel) and upgrade jobs, causing tests to fail.

The cause of the failure needs to be investigated before 4.8 because this may point to a significant issue causing the alert to fire on GCP clusters.

The GCP e2e suite will mark this alert as "known failing", linked to this bug, until a fix is located.

Comment 1 Clayton Coleman 2021-04-27 01:41:17 UTC
The burn rate for https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1386794875885719552 crosses 0.06 at 22:11:32Z, which is 5m into the e2e test, and peaked at ~0.10 in this run. The read rate was the primary component (write was steady around 0.02), and most of that read burn appeared to come not from actual errors but from slow resource calls.
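
For reference, the 0.06 threshold is the 6x burn-rate multiplier applied to the 1% error budget of the 99% availability SLO. A quick way to see the read/write split is to graph the recording rules that feed the alert directly; the names below assume the upstream kube-prometheus rules and may differ slightly between releases, so treat this as a sketch:

# Read vs. write burn over the 30m window, per the recording rules behind
# KubeAPIErrorBudgetBurn (rule names assumed from upstream kube-prometheus).
sum by (verb) (apiserver_request:burnrate30m)

# The short="30m"/long="6h" critical pair fires roughly when both windows
# burn at 6x the 1% budget, i.e. both sums exceed 6.00 * 0.01000 = 0.06:
sum(apiserver_request:burnrate6h) > (6.00 * 0.01000)
and
sum(apiserver_request:burnrate30m) > (6.00 * 0.01000)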

I tried to restructure the burn query (which is written in a way that makes debugging difficult and should probably be structured differently):

(
  label_replace(
    sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",code=~"5.."}[30m])),
    "type", "error", "_none_", "")
or
  label_replace(
    sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET",scope=~"resource|"}[30m]))
    - (sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30m])) or vector(0)),
    "type", "slow-resource", "_none_", "")
or
  label_replace(
    sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET",scope=~"namespace"}[30m]))
    - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="namespace",le="0.5"}[30m])),
    "type", "slow-namespace", "_none_", "")
or
  label_replace(
    sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET",scope="cluster"}[30m]))
    - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="cluster",le="5"}[30m])),
    "type", "slow-cluster", "_none_", "")
) / scalar(sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[30m])))

which showed that the single-resource component is the most prominent (at least in my digging).

Looking at the request latencies as a p90:

histogram_quantile(0.9, sum by (le,group,resource,scope,subresource) ((rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope=~"resource|"}[30m]))))

it looks like proxy/log/exec are a significant portion of the burn budget (and the e2e suite exercises those heavily). But you can't measure latency against a 0.1s target for calls that may run arbitrarily long:

{resource="nodes", scope="resource", subresource="proxy"}
{resource="pods", scope="resource", subresource="log"}

The other batch looks like CRDs:

{group="apiserver.openshift.io", resource="apirequestcounts", scope="resource"}
{group="config.openshift.io", resource="clusterversions", scope="resource"}
{group="operator.openshift.io", resource="csisnapshotcontrollers", scope="resource"}
{resource="persistentvolumes", scope="resource"}

It looks like the slow clusterversions and persistentvolumes calls happen very early - this may be etcd quorum issues, or some other hang in the apiserver where requests go in but don't fail over.
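
One way to check the etcd hypothesis is to look at the apiserver's etcd client latencies over the same window (a sketch assuming the standard etcd_request_duration_seconds metric is scraped under job="apiserver"):

# p99 etcd request latency as seen from the kube-apiserver, by operation and
# key type; sustained spikes early in the run would point at etcd/quorum
# rather than at the individual resource handlers.
histogram_quantile(0.99,
  sum by (le, operation, type) (
    rate(etcd_request_duration_seconds_bucket{job="apiserver"}[5m])
  )
)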

So three bugs:

1. Long-running requests should be excluded from error budget burn for GET - applying a fixed latency target to them is just straight up incorrect
2. CRDs appear to have some early install latency that is very weird
3. The query itself is hard to debug and should be restructured for humans (hopefully we can get that while keeping it performant).

Comment 2 Clayton Coleman 2021-04-27 01:43:16 UTC
Recommend getting the monitoring team involved for the third issue.

Comment 3 Clayton Coleman 2021-04-27 01:50:19 UTC
Test will be excluded in https://github.com/openshift/origin/pull/26107

Comment 5 Clayton Coleman 2021-05-03 13:17:10 UTC
A few instances on AWS too (https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1389053116594262016) after this change was applied, but they may be unrelated.

Comment 6 Clayton Coleman 2021-05-03 13:18:39 UTC
Still firing on GCP, although I think the rate is lower: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1077/pull-ci-openshift-cluster-network-operator-master-e2e-gcp/1389149804885446656

The issues I mentioned are structural (pod logs/exec) and would get worse the better we get at running tests.

Comment 7 W. Trevor King 2021-05-20 18:37:53 UTC
Filling in some example job output for Sippy and other searchers [1]:

  : [sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
  Run #0: Failed	59s
    fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:493]: Unexpected error:
    ...
    ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1
    ...
          "alertname": "KubeAPIErrorBudgetBurn",
    ...

  : [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel]
    Run #0: Failed	9s
   flake: Unexpected alert behavior during test:

    alert KubeAPIErrorBudgetBurn fired for 2 seconds with labels: {long="6h", severity="critical", short="30m"} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1953798)
    alert KubeAPIErrorBudgetBurn fired for 2233 seconds with labels: {long="1d", severity="warning", short="2h"} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1953798)
    alert KubeAPIErrorBudgetBurn pending for 1 seconds with labels: {__name__="ALERTS", long="3d", severity="warning", short="6h"}
    alert KubeAPIErrorBudgetBurn pending for 1 seconds with labels: {__name__="ALERTS", long="6h", severity="critical", short="30m"}
    alert PodDisruptionBudgetAtLimit pending for 1 seconds with labels: {__name__="ALERTS", namespace="e2e-disruption-6265", poddisruptionbudget="foo", severity="warning"}

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success
    Run #0: Failed	1h5m47s
    Unexpected alert behavior during upgrade:

    alert KubeAPIErrorBudgetBurn fired for 48 seconds with labels: {long="1d", severity="warning", short="2h"} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1953798)
    alert KubeAPIErrorBudgetBurn pending for 1 seconds with labels: {__name__="ALERTS", long="3d", severity="warning", short="6h"}
    ...

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1395353573335764992

Comment 8 W. Trevor King 2021-05-24 23:31:58 UTC
Lots of impact over the past 24h:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=alert+KubeAPIErrorBudgetBurn+fired+for' | grep 'failures match' | grep -v 'pull-ci\|rehearse-' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp (all) - 12 runs, 17% failed, 50% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 11 runs, 55% failed, 150% of failures match = 82% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 11 runs, 100% failed, 9% of failures match = 9% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.8 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-e2e-remote-libvirt-compact-s390x-4.8 (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
release-openshift-ocp-installer-e2e-remote-libvirt-s390x-4.8 (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 6 runs, 100% failed, 17% of failures match = 17% impact

Can we get a status update?

Comment 9 Ivan Sim 2021-05-25 03:28:54 UTC
We have a PR[1] which updates these recording rules to make them less flaky. That PR depends on another PR[2] which removes the same upstream alerts and rules from the monitoring stack, to avoid duplication. But it looks like CI is currently stuck.

[1] https://github.com/openshift/cluster-kube-apiserver-operator/pull/1126
[2] https://github.com/openshift/cluster-monitoring-operator/pull/1164

Comment 13 Ivan Sim 2021-06-01 02:53:51 UTC
We will monitor CI over the next few days to see if https://github.com/openshift/cluster-kube-apiserver-operator/pull/1126 resolves the flakiness.

Comment 14 W. Trevor King 2021-06-04 02:36:49 UTC
Looking a lot better over the past 48h:

  $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=alert+KubeAPIErrorBudgetBurn.*fired+for' | grep 'failures match' | grep -v 'pull-ci-\|rehearse' | sort
  release-openshift-origin-installer-e2e-aws-disruptive-4.8 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

Linking that job, and a few CI presubmits and such:

  $ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&search=alert+KubeAPIErrorBudgetBurn.*fired+for' | jq -r 'keys[]'
  https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.8/1400577400915365888
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26200/pull-ci-openshift-origin-master-e2e-aws-disruptive/1400537065904410624
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26200/pull-ci-openshift-origin-master-e2e-aws-disruptive/1400576422270996480
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1132/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1400458370212696064
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1196/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1400484615667322880
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2423/pull-ci-openshift-machine-config-operator-master-e2e-aws-disruptive/1400400876337631232

That release job had:

  INFO[2021-06-04T01:20:26Z] alert KubeAPIErrorBudgetBurn fired for 297 seconds with labels: {long="6h", severity="critical", short="30m"}
  INFO[2021-06-04T01:20:26Z] alert KubeAPIErrorBudgetBurn fired for 780 seconds with labels: {long="1d", severity="warning", short="2h"}

Comment 16 Ivan Sim 2021-06-10 21:08:06 UTC
Some updates from the past 48 hours:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=alert+KubeAPIErrorBudgetBurn.*fired+for' | grep 'failures match' | grep -v 'pull-ci-\|rehearse' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 29 runs, 52% failed, 20% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 30 runs, 80% failed, 13% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 16 runs, 94% failed, 7% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 14 runs, 71% failed, 20% of failures match = 14% impact
release-openshift-origin-installer-e2e-aws-disruptive-4.9 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

What I observed is that in the [4.8|4.9]-e2e-gcp-upgrade and 4.9-upgrade-from-stable-4.8 jobs, the alerts are triggered by slow GET requests to the `apirequestcounts` resource. All of these are SLO `warning` violations, not `critical` ones, per the description in the upstream runbook[1].

The e2e-aws-disruptive-4.9 failure was caused by a high volume of 5xx errors on `/readyz` requests.

[1] https://github.com/prometheus-operator/kube-prometheus/wiki/KubeAPIErrorBudgetBurn#warning
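
For anyone debugging the error (rather than latency) component, a rough way to attribute the 5xx responses is to break them out by endpoint. This is only a sketch using the same apiserver_request_total metric the alert uses; exactly how /readyz requests are labeled can vary by release:

# Top sources of 5xx responses from the apiservers over the last 30m.
topk(10,
  sum by (resource, subresource, verb, code) (
    rate(apiserver_request_total{job="apiserver",code=~"5.."}[30m])
  )
)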

Comment 17 Dan Li 2021-06-14 11:54:51 UTC
The Multi-Arch Power team tested this bug on 4.8-fc.9 (as it is similar to BZ 1970363), and they still observed the failure that was reported in BZ 1970363.

Comment 18 Ivan Sim 2021-06-14 16:12:28 UTC
@danili I have documented some steps on how to debug this alert; see https://github.com/openshift/runbooks/pull/2. It boils down to identifying the sources of slow and/or error requests on your cluster.
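
In the same spirit, the slow-read side can usually be narrowed to a handful of resources with something like the following (a hedged sketch, not part of the runbook; it reuses the 0.1s single-resource threshold from the burn query earlier in this bug):

# Rate of LIST/GET single-resource requests slower than 0.1s, by resource;
# a single dominant series (e.g. apirequestcounts) points at that handler
# rather than at the apiserver as a whole.
sum by (group, resource) (
  rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET",scope="resource"}[30m])
)
-
sum by (group, resource) (
  rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="resource",le="0.1"}[30m])
)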

Comment 19 Ivan Sim 2021-06-14 16:15:28 UTC
The overall CI results (in the past 60 hours) are looking a lot better:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=60h&type=junit&search=alert+KubeAPIErrorBudgetBurn.*fired+for' | grep 'failures match' | grep -v 'pull-ci-\|rehearse' | sort
release-openshift-ocp-installer-e2e-remote-libvirt-compact-s390x-4.8 (all) - 5 runs, 60% failed, 100% of failures match = 60% impact
release-openshift-ocp-installer-e2e-remote-libvirt-image-ecosystem-s390x-4.8 (all) - 5 runs, 40% failed, 50% of failures match = 20% impact

AIUI, the slow apirequestcount GET requests issue is resolved by https://github.com/openshift/kubernetes/pull/798

Comment 20 Dan Li 2021-06-14 16:30:19 UTC
Thank you Ivan.

Hi @pdsilva, please see Ivan's debugging document in Comment 18 for your team's testing purposes on BZ 1970363. Let us know if the issue persists.

Comment 21 Ke Wang 2021-06-15 06:04:32 UTC
Based on Comment #19, the KubeAPIErrorBudgetBurn alert is now triggered only on PowerPC, and that will be tracked in bug 1970363. The PR fix worked fine, so moving the bug to VERIFIED.

Comment 25 errata-xmlrpc 2021-07-27 23:04:01 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438