Bug 2032045 - When alert VirtControllerRESTErrorsHigh triggered it keeps in Firing state for hours (even when there are no failed api calls anymore)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.10.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Barak
QA Contact: Denys Shchedrivyi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-12-13 22:34 UTC by Denys Shchedrivyi
Modified: 2022-07-21 11:46 UTC (History)

Fixed In Version: virt-operator-container-v4.10.0-203 hco-bundle-registry-container-v4.10.0-635
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-16 15:57:31 UTC
Target Upstream Version:
Embargoed:


Attachments
graph with existing solution based on sum_over_time() PromQL (47.82 KB, image/png)
2021-12-13 22:34 UTC, Denys Shchedrivyi
the graph based on rate() expression (46.15 KB, image/png)
2021-12-13 22:35 UTC, Denys Shchedrivyi


Links
System ID Private Priority Status Summary Last Updated
Github kubevirt kubevirt pull 6996 0 None Merged Fix kubevirt alerts 2022-01-07 22:16:44 UTC
Github kubevirt kubevirt pull 7145 0 None open [release-0.49] Consider Prometheus counters resets in our alerts 2022-01-26 22:05:39 UTC

Description Denys Shchedrivyi 2021-12-13 22:34:40 UTC
Created attachment 1846142 [details]
graph with existing solution based on sum_over_time() PromQL

Description of problem:
 According to its description, the VirtControllerRESTErrorsHigh alert should fire only when more than 5% of API calls have failed during the last hour.
 Currently, once this alert is triggered, it stays in the Firing state for many hours, even when there are no failed API calls anymore.

 The expression used for this alert produces strange graphs (see attached file existing_solution.png). The existing code is based on a sum_over_time() expression, which looks incorrect.

 For comparison, I've added another graph that shows a more realistic picture (see attached by_rate.png). It clearly shows that I started generating failed calls at 7:30pm and finished at 12pm (with a pause between 9 and 10pm).

 Based on all of this, I believe the existing alert expression collects the wrong data and should be re-examined.
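
 To illustrate why a sum_over_time() based ratio can stay high long after errors stop, here is a minimal Python sketch. It is not the actual alert expression. It assumes the underlying metric is a cumulative counter sampled once per minute, and it approximates Prometheus increase() as a simple last-minus-first difference (no extrapolation, no counter-reset handling). All names and numbers in it (simulate, the 10 calls/minute load, the 5 errors/minute) are hypothetical.

```python
def simulate(minutes=300, error_stop=60):
    """Return per-minute samples of two cumulative counters.

    Hypothetical load: 10 API calls per minute for the whole run,
    5 of them failing per minute until `error_stop`, none afterwards.
    """
    err_samples, total_samples = [], []
    err = total = 0
    for minute in range(minutes):
        total += 10
        if minute < error_stop:
            err += 5
        err_samples.append(err)
        total_samples.append(total)
    return err_samples, total_samples


def sum_over_time_ratio(err, total, at, window=60):
    """Ratio of summed raw counter samples over the last `window` samples.

    Summing a cumulative counter adds up its absolute values, so old
    errors keep contributing for as long as the counter is not reset.
    """
    if at < window:
        raise ValueError("need at least one full window of samples")
    lo = at - window + 1
    return sum(err[lo:at + 1]) / sum(total[lo:at + 1])


def increase_ratio(err, total, at, window=60):
    """Ratio of counter growth over the window (simplified increase())."""
    if at < window:
        raise ValueError("need at least one full window of samples")
    lo = at - window
    return (err[at] - err[lo]) / (total[at] - total[lo])


if __name__ == "__main__":
    err, total = simulate()
    for minute in (100, 180, 240):
        print(
            minute,
            round(sum_over_time_ratio(err, total, minute), 3),
            round(increase_ratio(err, total, minute), 3),
        )
```

 In this simulation the errors stop at minute 60. Three hours later (minute 240) the growth-based ratio is 0, while the summed-samples ratio is still well above the 5% threshold, which would match the behavior seen on the attached graphs if the assumption about the metric type holds.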

Version-Release number of selected component (if applicable):
4.10

How reproducible:
Steps to Reproduce:
 Unfortunately, I don't know an easy way to reproduce it. On my cluster I get several failed API calls during VM creation, so I used a script that repeatedly created and removed 50 VMs over a period of time.

Actual results:
Once triggered, the VirtControllerRESTErrorsHigh alert stays in the Firing state for hours.

Expected results:
The alert should clear as soon as fewer than 5% of API calls have failed during the last hour.

Additional info:


The full list of VIRT alerts that use the same logic:

VirtOperatorRESTErrorsBurst
VirtOperatorRESTErrorsHigh
VirtApiRESTErrorsBurst
VirtApiRESTErrorsHigh
VirtControllerRESTErrorsHigh
VirtControllerRESTErrorsBurst
VirtHandlerRESTErrorsHigh
VirtHandlerRESTErrorsBurst

Comment 1 Denys Shchedrivyi 2021-12-13 22:35:43 UTC
Created attachment 1846143 [details]
the graph based on rate() expression

Comment 3 sgott 2021-12-15 13:20:27 UTC
Shirly, just making you aware of this. Can you ensure Barak has any support he might need?

Comment 7 Kedar Bidarkar 2022-01-25 15:30:15 UTC
Moved to ASSIGNED state. Denys suggests that we backport https://github.com/kubevirt/kubevirt/pull/7068 to 4.10.
Otherwise, we move this bug to 4.11.

Comment 8 Denys Shchedrivyi 2022-02-02 15:30:12 UTC
Verified on v4.10.0-636

Comment 13 errata-xmlrpc 2022-03-16 15:57:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0947

