2032045 – When alert VirtControllerRESTErrorsHigh triggered it keeps in Firing state for hours (even when there are no failed api calls anymore)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Container Native Virtualization (CNV)
Classification:	Red Hat
Component:	Virtualization
Sub Component:
Version:	4.10.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Barak
QA Contact:	Denys Shchedrivyi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-12-13 22:34 UTC by Denys Shchedrivyi
Modified:	2022-07-21 11:46 UTC
CC List:	6 users
Fixed In Version:	virt-operator-container-v4.10.0-203 hco-bundle-registry-container-v4.10.0-635
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-03-16 15:57:31 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
graph with existing solution based on sum_over_time() PromQL (47.82 KB, image/png) 2021-12-13 22:34 UTC, Denys Shchedrivyi
the graph based on rate() expression (46.15 KB, image/png) 2021-12-13 22:35 UTC, Denys Shchedrivyi

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	kubevirt kubevirt pull 6996	0	None	Merged	Fix kubevirt alerts	2022-01-07 22:16:44 UTC
Github	kubevirt kubevirt pull 7145	0	None	open	[release-0.49] Consider Prometheus counters resets in our alerts	2022-01-26 22:05:39 UTC

Description Denys Shchedrivyi 2021-12-13 22:34:40 UTC

Created attachment 1846142 [details]
graph with existing solution based on sum_over_time() PromQL

Description of problem:
 According to its description, the VirtControllerRESTErrorsHigh alert should fire only when more than 5% of API calls have failed during the last hour.
 Currently, once this alert has fired, it stays in the Firing state for many hours, even after the failed API calls have stopped.

The expression used for this alert produces strange graphs (see the attached existing_solution.png). The existing code is based on a sum_over_time() expression, which looks incorrect.

 For comparison, I've added another graph, based on rate(), which gives a more realistic picture (see the attached by_rate.png). It clearly shows that I started generating failed calls at 7:30pm and finished at 12pm (with a pause between 9 and 10pm).

 Based on all of this, I suspect that the existing alert expression collects the wrong data and should be re-examined.
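
To illustrate why a sum_over_time() based ratio can stay high long after the failures stop, here is a minimal Python sketch. It assumes the underlying metric is a cumulative counter sampled once per minute. The helper names (build_counters, ratio) and the traffic numbers are hypothetical, and the increase() helper is a simplified stand-in that ignores the counter-reset handling and extrapolation that Prometheus applies. It is not the actual alert expression.

```python
def sum_over_time(samples, end, window):
    """Sum the raw sample values in the trailing window, as PromQL sum_over_time() does."""
    start = max(0, end - window + 1)
    return sum(samples[start:end + 1])


def increase(samples, end, window):
    """Growth of a cumulative counter across the trailing window (simplified: no resets)."""
    start = max(0, end - window + 1)
    return samples[end] - samples[start]


def build_counters(minutes=300, calls_per_min=100, errors_per_min=20, errors_stop_at=120):
    """Hypothetical workload: errors occur only during the first errors_stop_at minutes."""
    total, errors = [], []
    total_so_far = errors_so_far = 0
    for minute in range(minutes):
        total_so_far += calls_per_min
        if minute < errors_stop_at:
            errors_so_far += errors_per_min
        total.append(total_so_far)    # counters only ever grow
        errors.append(errors_so_far)
    return total, errors


def ratio(fn, errors, total, end, window=60):
    """Error ratio over the trailing window, computed with the given window function."""
    denominator = fn(total, end, window)
    return fn(errors, end, window) / denominator if denominator else 0.0


total, errors = build_counters()
last = len(total) - 1  # three hours after the last failed call
print("sum_over_time ratio:", ratio(sum_over_time, errors, total, last))
print("increase ratio:     ", ratio(increase, errors, total, last))
```

Because the counter never decreases, summing its raw samples effectively tracks the lifetime error ratio, which only shrinks as successful calls slowly dilute it. The windowed increase drops to zero as soon as the window no longer contains any new failures.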

Version-Release number of selected component (if applicable):
4.10

How reproducible:
Steps to Reproduce:
 Unfortunately, I don't know an easy way to reproduce it. On my cluster, several API calls fail during VM creation, so I used a script that repeatedly created and removed 50 VMs over a period of time.

Actual results:
Once triggered, the VirtControllerRESTErrorsHigh alert stays in the Firing state for hours.

Expected results:
The alert should clear as soon as fewer than 5% of API calls have failed during the last hour.
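
The expected behavior can be sketched as a sliding one-hour window over per-minute call counts. This is only an illustration under stated assumptions: the function name firing_states and the traffic numbers are hypothetical, and the alert is assumed to fire while failed calls make up 5% or more of all calls in the window.

```python
from collections import deque


def firing_states(failed_per_min, total_per_min, window=60, threshold_pct=5):
    """Return one bool per minute: True while the trailing window is at or above the threshold."""
    failed_window = deque(maxlen=window)  # old minutes fall out automatically
    total_window = deque(maxlen=window)
    states = []
    for failed, total in zip(failed_per_min, total_per_min):
        failed_window.append(failed)
        total_window.append(total)
        failed_sum, total_sum = sum(failed_window), sum(total_window)
        # Integer comparison avoids floating-point edge cases at exactly 5%.
        states.append(total_sum > 0 and failed_sum * 100 >= threshold_pct * total_sum)
    return states


# Hypothetical workload: 20 of 100 calls fail per minute for two hours, then none fail.
failed = [20] * 120 + [0] * 180
total = [100] * 300
states = firing_states(failed, total)
print("first minute not firing:", states.index(False))
```

With these numbers the alert clears well within the hour after the failures stop, rather than staying in the Firing state for many hours.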

Additional info:


The full list of VIRT alerts that share the same logic:

VirtOperatorRESTErrorsBurst
VirtOperatorRESTErrorsHigh
VirtApiRESTErrorsBurst
VirtApiRESTErrorsHigh
VirtControllerRESTErrorsHigh
VirtControllerRESTErrorsBurst
VirtHandlerRESTErrorsHigh
VirtHandlerRESTErrorsBurst

Comment 1 Denys Shchedrivyi 2021-12-13 22:35:43 UTC

Created attachment 1846143 [details]
the graph based on rate() expression

Comment 3 sgott 2021-12-15 13:20:27 UTC

Shirly, just making you aware of this. Can you ensure Barak has any support he might need?

Comment 7 Kedar Bidarkar 2022-01-25 15:30:15 UTC

Moved to ASSIGNED state. Denys suggests that we backport https://github.com/kubevirt/kubevirt/pull/7068 to 4.10 for this.
Otherwise, we move this bug to 4.11.

Comment 8 Denys Shchedrivyi 2022-02-02 15:30:12 UTC

Verified on v4.10.0-636

Comment 13 errata-xmlrpc 2022-03-16 15:57:31 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0947
