1982704 – KubeAPIErrorBudgetBurn alert is often pending in CI with slow exec being a large contributor


Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	kube-apiserver
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Antonio Ojea
QA Contact:	Ke Wang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-07-15 13:45 UTC by Simon Pasquier
Modified:	2022-09-13 06:29 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: Long-running requests used for streaming were counted in the KubeAPIErrorBudgetBurn calculation. Consequence: The KubeAPIErrorBudgetBurn alert was triggered, causing false positives. Fix: Exclude long-running requests from the KubeAPIErrorBudgetBurn calculation. Result: Fewer false positives from the KubeAPIErrorBudgetBurn alert.
Clone Of:
Environment:
Last Closed:	2022-02-25 18:34:34 UTC
Target Upstream Version:
Embargoed:


Description Simon Pasquier 2021-07-15 13:45:19 UTC

Description of problem:
The KubeAPIErrorBudgetBurn alert is often pending in CI. The alert is active because a (small) percentage of requests have a latency above the 1 second threshold. Most of the requests that break the SLO have been traced to POST requests to the pods/exec subresource.
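
To illustrate why a small share of slow requests is enough to keep the alert pending, here is a minimal sketch of the burn-rate arithmetic. The 99% SLO target and the request counts are assumptions chosen for illustration; they are not taken from the actual rules or from the CI run linked below.

```python
def slow_request_ratio(total_requests: float, requests_within_threshold: float) -> float:
    """Fraction of requests that took longer than the latency threshold."""
    if total_requests <= 0:
        return 0.0
    return (total_requests - requests_within_threshold) / total_requests


def burn_rate(error_ratio: float, slo_target: float = 0.99) -> float:
    """How many times faster than allowed the error budget is consumed.

    slo_target=0.99 is an assumed value, not read from the cluster's rules.
    """
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


# Hypothetical counts: 150 of 10,000 write requests exceed the 1 second threshold.
total = 10_000
within_threshold = 9_850
ratio = slow_request_ratio(total, within_threshold)
print(f"slow ratio: {ratio:.4f}, burn rate: {burn_rate(ratio):.2f}")
```

Under these assumed numbers, 1.5% of slow requests already consumes the budget faster than a 99% target allows, which matches the observation that only a small percentage of slow exec requests is needed.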

Version-Release number of selected component (if applicable):
4.9 (probably the same for previous releases)

How reproducible:
Often seen in the CI, for instance:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1273/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1415089024816648192

Steps to Reproduce:
1. I believe it would be possible to reproduce by launching "oc exec" commands at a steady rate.
2.
3.
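
The reproduction idea in step 1 could be sketched roughly as follows. This has not been verified against a cluster; the namespace, pod name, command, count and interval in the trailing comment are hypothetical placeholders, and the script assumes the `oc` binary is on the PATH and already logged in.

```python
import subprocess
import time
from typing import Callable, List, Sequence


def build_exec_command(namespace: str, pod: str, command: Sequence[str]) -> List[str]:
    """Build an `oc exec` invocation for the given pod."""
    return ["oc", "exec", "-n", namespace, pod, "--", *command]


def run_at_steady_rate(
    cmd: Sequence[str],
    count: int,
    interval_seconds: float,
    runner: Callable = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cmd `count` times, pausing between runs; return the number of failures."""
    failures = 0
    for i in range(count):
        result = runner(cmd, capture_output=True)
        if result.returncode != 0:
            failures += 1
        # No pause is needed after the final run.
        if i < count - 1:
            sleep(interval_seconds)
    return failures


# Example call (placeholders, requires a reachable cluster):
# run_at_steady_rate(build_exec_command("default", "example-pod", ["true"]), 100, 1.0)
```

The `runner` and `sleep` parameters are only there so the loop can be exercised without a cluster; whether a given rate is enough to make the alert pending would need to be checked in practice.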

Actual results:
The alert is pending.

Expected results:
The alert shouldn't be pending.

Additional info:
The current recording rules computing burn-rates for write requests [1] don't exclude the "exec|proxy|logs" subresources, unlike the recording rules for read requests [2].
The alert has been identified as a flake and is ignored by the origin e2e test suite [3].

[1] https://github.com/openshift/cluster-kube-apiserver-operator/blob/005a95607cf9f8db490e962b549811d8bc0c5eaf/bindata/assets/alerts/kube-apiserver-slos.yaml#L302-L413
[2] https://github.com/openshift/cluster-kube-apiserver-operator/blob/005a95607cf9f8db490e962b549811d8bc0c5eaf/bindata/assets/alerts/kube-apiserver-slos.yaml#L64-L301
[3] https://github.com/openshift/origin/blob/4f99a10d9b0f2f47f17e50961aac7e39af065ab4/test/extended/prometheus/prometheus.go#L82-L87
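
As a rough illustration of the difference described above, the sketch below builds a PromQL selector for write-request latency with and without a subresource exclusion. The verb matcher and the exact shape of the selector are assumptions; the excluded list is copied from the "exec|proxy|logs" wording above, and the label values used in the actual rules files [1][2] may differ.

```python
from typing import Sequence

# Copied from the description above; the real rules may use different label values.
LONG_RUNNING_SUBRESOURCES = ("exec", "proxy", "logs")


def write_latency_selector(threshold_seconds: str = "1", excluded: Sequence[str] = ()) -> str:
    """Return a selector for write requests that completed within the threshold.

    The verb matcher is an assumed definition of "write requests".
    """
    matchers = ['verb=~"POST|PUT|PATCH|DELETE"', f'le="{threshold_seconds}"']
    if excluded:
        # Negative regex matcher: drop series whose subresource is in the list.
        matchers.append('subresource!~"' + "|".join(excluded) + '"')
    return "apiserver_request_duration_seconds_bucket{" + ",".join(matchers) + "}"


# Without the exclusion, slow pods/exec POSTs count against the SLO.
print(write_latency_selector())
# With the exclusion, long-running subresources are left out of the calculation.
print(write_latency_selector(excluded=LONG_RUNNING_SUBRESOURCES))
```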

Comment 1 Stefan Schimanski 2021-07-15 13:49:40 UTC

To be clear: this is about the exec (and other subresources) being missing from the alert in certain rules. It's not about hunting generic alert occurrences in CI.

Comment 3 Wally 2022-02-25 18:34:34 UTC

Closing BZ here as this appears to be solved by https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/740
