Bug 2039539 - kube-apiserver burn budget alert fires more frequently with 1.23 rebase
Summary: kube-apiserver burn budget alert fires more frequently with 1.23 rebase
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Abu Kashem
QA Contact: jmekkatt
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-11 21:49 UTC by Abu Kashem
Modified: 2022-08-26 15:02 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-26 15:02:01 UTC
Target Upstream Version:
Embargoed:
wlewis: needinfo-




Links
Github openshift/kubernetes pull 1130 (Merged): Bug 2039539: UPSTREAM: <drop>: revert upstream PR 106306 (last updated 2022-01-25 19:45:11 UTC)
Github openshift/kubernetes pull 1143 (Merged): Bug 2039539: Revert "UPSTREAM: <drop>: revert upstream PR 106306" (last updated 2022-01-28 13:06:32 UTC)
Github openshift/origin pull 26748 (Merged): Bug 2039539: Allow apiserver burn rate alert to fire in CI (last updated 2022-01-28 13:06:33 UTC)

Description Abu Kashem 2022-01-11 21:49:42 UTC
The kube-apiserver burn budget alert fires more frequently with the 1.23 rebase. We need to find out the root cause and resolve the issue.


(we will add more details in this BZ as we go)
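
For reference, the alert in question is the upstream kubernetes-mixin style error-budget burn-rate alert for the apiserver (presumably KubeAPIErrorBudgetBurn in the OpenShift monitoring stack). As a rough sketch of the kind of multi-window check such an alert performs, the condition can be issued as an ad-hoc query through the Prometheus Go client; the recording-rule names, windows, 14.40 factor, and 1% error budget below follow the upstream mixin defaults and are assumptions for illustration, not values quoted in this BZ.

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed Prometheus address; in a live cluster this would be the
	// openshift-monitoring Prometheus route instead of localhost.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	prom := promv1.NewAPI(client)

	// "Fast burn" condition in the style of the upstream KubeAPIErrorBudgetBurn
	// alert: the short- and long-window burn rates must both exceed a multiple
	// of the SLO error budget. Rule names and thresholds are upstream defaults,
	// used here only as an illustration.
	query := `sum(apiserver_request:burnrate1h) > (14.40 * 0.01)
  and
sum(apiserver_request:burnrate5m) > (14.40 * 0.01)`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := prom.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// A non-empty result means the fast-burn condition currently holds,
	// i.e. the alert would be pending or firing.
	fmt.Println(result)
}

A burn rate of 1 means the error budget would be consumed exactly over the full SLO window, so a sustained rate above 14.4 exhausts a 30-day budget in roughly two days, which is why a regression that raises the apiserver error ratio makes this alert fire noticeably more often.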

Comment 1 Michal Fojtik 2022-01-11 21:52:15 UTC
** A NOTE ABOUT USING URGENT **

This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority, engineers are asked to stop whatever they are doing and put everything else on hold.
Please have a reasonable justification ready to discuss, and ensure your own management and engineering management are aware of and agree with this BZ being urgent. Keep in mind that urgent bugs are very expensive and have maximal management visibility.

NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.

** INFORMATION REQUIRED **

Please answer these questions before escalation to engineering:

1. Has a link to must-gather output been provided in this BZ? We cannot work without it. If must-gather fails to run, attach all relevant logs and provide the error message of must-gather.
2. Give the output of "oc get clusteroperators -o yaml".
3. In case of degraded/unavailable operators, have all their logs and the logs of the operands been analyzed? [yes/no]
4. List the top 5 relevant errors from the logs of the operators and operands in (3).
5. Order the list of degraded/unavailable operators according to which is likely the cause of the failure of the other, root-cause at the top.
6. Explain why (5) is likely the right order and list the information used for that assessment.
7. Explain why Engineering is necessary to make progress.

Comment 5 Damien Grisonnet 2022-01-25 18:33:02 UTC
Moving the BZ back to Assigned: https://github.com/openshift/kubernetes/pull/1130 isn't the final fix; we used that PR only as a test to confirm the cause of the regression.

Comment 6 Abu Kashem 2022-01-25 20:44:46 UTC
Debugging notes are here: https://docs.google.com/document/d/1pz45fduHMTtJNRAKBHBCXqTCVfSMTx1Vv-0nEh9n0A8/edit

Comment 7 Abu Kashem 2022-01-26 20:57:42 UTC
Setting it back to Assigned so we can actually allow the alert to fire in CI (needed for verification): https://github.com/openshift/origin/pull/26748
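
In effect, the origin change lets this alert fire during an e2e run without the alert-checking test failing the job, so its behavior can be observed in CI. A hypothetical sketch of that kind of allow-list check follows; the identifiers are illustrative only and are not the actual openshift/origin test API.

package main

import "fmt"

// allowedFiringAlerts is a hypothetical allow-list in the spirit of the
// openshift/origin change: alerts named here may fire during an e2e run
// without failing the alert invariant test.
var allowedFiringAlerts = map[string]bool{
	"KubeAPIErrorBudgetBurn": true,
}

// unexpectedFiringAlerts returns the subset of firing alerts that should
// still fail the suite, i.e. everything not explicitly allowed.
func unexpectedFiringAlerts(firing []string) []string {
	var violations []string
	for _, name := range firing {
		if !allowedFiringAlerts[name] {
			violations = append(violations, name)
		}
	}
	return violations
}

func main() {
	firing := []string{"KubeAPIErrorBudgetBurn", "SomeOtherAlert"}
	fmt.Println("violations:", unexpectedFiringAlerts(firing))
}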

Comment 14 Abu Kashem 2022-01-28 15:38:47 UTC
Please look at the findings from TRT: https://bugzilla.redhat.com/attachment.cgi?id=1857419. They clearly show improvements. The finding you have posted is a separate issue; I would recommend creating a new BZ with the new findings.

Comment 17 jmekkatt 2022-02-01 05:05:08 UTC
I agree that the TRT graph you attached shows improvements after the fix build date. Hence we will move the ticket to Verified with the attached reference. I will also create a new bug as you suggested in https://bugzilla.redhat.com/show_bug.cgi?id=2039539#c12.

