Created attachment 1846142 [details] graph with existing solution based on sum_over_time() PromQL

Description of problem:
Per its description, the alert VirtControllerRESTErrorsHigh should fire only when more than 5% of API calls have failed over the last hour. Currently, once this alert fires it stays in the Firing state for many hours, even after there are no more failed API calls.

The expression we use for this alert produces strange graphs (see the attached file existing_solution.png); the existing code is based on a sum_over_time() expression, which looks incorrect. For comparison, I've attached another graph that shows a more realistic picture (see the attached by_rate.png): it clearly shows that I started generating failed calls at 7:30pm and finished at 12pm (with a pause between 9 and 10pm).

Based on all of that, I believe the existing alert expression collects the wrong data and should be re-examined.

Version-Release number of selected component (if applicable):
4.10

How reproducible:

Steps to Reproduce:
Unfortunately, I don't know the easiest way to reproduce this. I found that on my cluster I get several failed API calls during VM creation, so in my case I used a script that created/removed 50 VMs over a certain period of time.

Actual results:
Once triggered, the VirtControllerRESTErrorsHigh alert stays in the Firing state for hours.

Expected results:
The alert should clear as soon as fewer than 5% of API calls have failed over the last hour.

Additional info:
The full list of VIRT alerts that share the same logic:
VirtOperatorRESTErrorsBurst
VirtOperatorRESTErrorsHigh
VirtApiRESTErrorsBurst
VirtApiRESTErrorsHigh
VirtControllerRESTErrorsHigh
VirtControllerRESTErrorsBurst
VirtHandlerRESTErrorsHigh
VirtHandlerRESTErrorsBurst
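For illustration, a minimal sketch of the two approaches, assuming a counter metric named rest_client_requests_total with a code label (the metric and label names here are assumptions; the actual expressions in the kubevirt rules may differ). A sum_over_time() over a counter accumulates raw counter samples, so old failures keep inflating the ratio long after errors stop; a rate()-based ratio reflects only the error rate within the window:

```promql
# sum_over_time()-style expression (problematic): counter samples are
# summed directly, so the ratio stays high long after errors stop.
sum(sum_over_time(rest_client_requests_total{pod=~"virt-controller-.*", code=~"(4|5)[0-9][0-9]"}[60m]))
  /
sum(sum_over_time(rest_client_requests_total{pod=~"virt-controller-.*"}[60m]))
  > 0.05

# rate()-based expression (expected behavior): compares per-second error
# rate to total request rate over the last hour, so the alert clears once
# the failure ratio drops below 5%.
sum(rate(rest_client_requests_total{pod=~"virt-controller-.*", code=~"(4|5)[0-9][0-9]"}[60m]))
  /
sum(rate(rest_client_requests_total{pod=~"virt-controller-.*"}[60m]))
  > 0.05
```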
Created attachment 1846143 [details] the graph based on rate() expression
Shirly, just making you aware of this. Can you ensure Barak has any support he might need?
Moved to ASSIGNED state. Denys suggests we back-port https://github.com/kubevirt/kubevirt/pull/7068 to 4.10; otherwise we move this bug to 4.11.
Verified on v4.10.0-636
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0947