Bug 1982704 - KubeAPIErrorBudgetBurn alert is often pending in CI with slow exec being a large contributor
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.11.0
Assignee: Antonio Ojea
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-07-15 13:45 UTC by Simon Pasquier
Modified: 2022-09-13 06:29 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Long-running requests used for streaming were taken into account in the KubeAPIErrorBudgetBurn calculation.
Consequence: The alert based on KubeAPIErrorBudgetBurn was triggered, causing false positives.
Fix: Exclude long-running requests from the KubeAPIErrorBudgetBurn calculation.
Result: Fewer false positives on the KubeAPIErrorBudgetBurn alert.
Clone Of:
Environment:
Last Closed: 2022-02-25 18:34:34 UTC
Target Upstream Version:
Embargoed:


Description Simon Pasquier 2021-07-15 13:45:19 UTC
Description of problem:
The KubeAPIErrorBudgetBurn alert is often pending in CI. The alert is active because a small percentage of requests have a latency above the 1-second threshold. Most of the requests breaking the SLO have been traced to POST requests to the pods/exec subresource.

Version-Release number of selected component (if applicable):
4.9 (probably the same for previous releases)


How reproducible:
Often seen in the CI, for instance:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1273/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1415089024816648192


Steps to Reproduce:
1. I believe it would be possible to reproduce by launching "oc exec" commands at a steady rate.
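
The reproduction idea in step 1 could be scripted roughly as follows. This is an untested sketch, not a confirmed reproducer: the pod name, namespace, request count, and interval are hypothetical placeholders, and the helper name `run_exec_at_steady_rate` is made up for illustration. It assumes `oc` is on the PATH and already logged in to the cluster.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_exec_at_steady_rate(
    pod: str,
    namespace: str,
    count: int,
    interval_seconds: float,
    run: Callable[[Sequence[str]], object] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Issue `count` `oc exec` calls, pausing `interval_seconds` between them.

    `run` and `sleep` are injectable so the loop can be exercised without a cluster.
    Returns the number of commands issued.
    """
    issued = 0
    for i in range(count):
        # `true` exits immediately; each call sends one request to the
        # pods/exec subresource of the target pod.
        run(["oc", "exec", "-n", namespace, pod, "--", "true"])
        issued += 1
        if i < count - 1:
            sleep(interval_seconds)
    return issued
```

For example, `run_exec_at_steady_rate("my-pod", "my-namespace", 600, 1.0)` would issue roughly one exec per second for ten minutes; whether that rate is enough to move the alert to pending is an assumption that would need checking on a real cluster.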

Actual results:
The alert is pending.

Expected results:
The alert shouldn't be pending.

Additional info:
The current recording rules computing burn rates for write requests [1] don't exclude the "exec|proxy|logs" subresources, unlike the recording rules for read requests [2].
The alert has been identified as a flake and is ignored by the origin e2e test suite [3].

[1] https://github.com/openshift/cluster-kube-apiserver-operator/blob/005a95607cf9f8db490e962b549811d8bc0c5eaf/bindata/assets/alerts/kube-apiserver-slos.yaml#L302-L413
[2] https://github.com/openshift/cluster-kube-apiserver-operator/blob/005a95607cf9f8db490e962b549811d8bc0c5eaf/bindata/assets/alerts/kube-apiserver-slos.yaml#L64-L301
[3] https://github.com/openshift/origin/blob/4f99a10d9b0f2f47f17e50961aac7e39af065ab4/test/extended/prometheus/prometheus.go#L82-L87
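
To illustrate why the missing exclusion matters, here is a toy model in Python. It is not the recording rules themselves (those are PromQL over histogram metrics); it only shows how a handful of slow pods/exec requests can dominate the fraction of requests above the 1-second threshold, and how filtering them out changes the result. The `Request` type, the sample latencies, and the helper `slow_fraction` are all made up for illustration; the subresource names are taken verbatim from the "exec|proxy|logs" expression quoted above and may not match the actual label values.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable

# Names as quoted in this report; real label values may differ.
LONG_RUNNING_SUBRESOURCES = frozenset({"exec", "proxy", "logs"})
SLO_THRESHOLD_SECONDS = 1.0  # latency threshold mentioned in the description


@dataclass(frozen=True)
class Request:
    verb: str
    subresource: str  # empty string when the request targets no subresource
    latency_seconds: float


def slow_fraction(
    requests: Iterable[Request],
    excluded: FrozenSet[str] = frozenset(),
) -> float:
    """Fraction of considered requests whose latency exceeds the threshold."""
    considered = [r for r in requests if r.subresource not in excluded]
    if not considered:
        return 0.0
    slow = sum(1 for r in considered if r.latency_seconds > SLO_THRESHOLD_SECONDS)
    return slow / len(considered)


# Made-up sample: 97 fast writes plus 3 slow exec sessions.
sample = [Request("POST", "", 0.05)] * 97 + [Request("POST", "exec", 2.5)] * 3

without_exclusion = slow_fraction(sample)
with_exclusion = slow_fraction(sample, LONG_RUNNING_SUBRESOURCES)
```

In this made-up sample, the three exec requests are the only ones over the threshold, so excluding them brings the slow fraction for write requests down to zero.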

Comment 1 Stefan Schimanski 2021-07-15 13:49:40 UTC
To be clear: this is about the exec (and other subresources) being missing from the alert in certain rules. It's not about hunting generic alert occurrences in CI.

Comment 3 Wally 2022-02-25 18:34:34 UTC
Closing BZ here as this appears to be solved by https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/740

