Bug 2179991

Summary: VirtApiRESTErrorsBurst threshold high
Product: Container Native Virtualization (CNV) Reporter: Ohad <orevah>
Component: Virtualization Assignee: ffossemo
Status: ASSIGNED --- QA Contact: Kedar Bidarkar <kbidarka>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.13.0 CC: acardace, kedar.lad, sradco, stirabos
Target Milestone: --- Flags: sradco: needinfo? (orevah), sradco: needinfo? (kedar.lad)
Target Release: 4.14.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
A Photo of the metric none

Description Ohad 2023-03-20 14:16:40 UTC
Created attachment 1952088 [details]
A Photo of the metric

Description of problem:
When trying to validate the critical VirtApiRESTErrorsBurst alert with automation, the alert does not fire and stays in the Pending state. Whenever the ratio of failed REST calls drops slightly below 0.8, the 5-minute wait starts over, so the alert cannot reach the Firing state within a reasonable time; it can take more than 20 minutes.
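
The reset behaviour described above can be illustrated with a small simulation. This is only a sketch, not the actual alert rule: the function name alert_states, its arguments and the sample ratios are hypothetical, and it assumes the condition is "ratio >= threshold" held continuously for the whole wait duration.

```shell
#!/bin/sh
# Sketch of a "must hold for N seconds" alert condition (assumed semantics):
# the alert is pending while the ratio stays at or above the threshold and
# fires only once that has been true for the full duration without a gap.
# Any sample below the threshold resets the wait.
# Usage: alert_states THRESHOLD FOR_SECONDS INTERVAL_SECONDS RATIO...
alert_states() {
    threshold=$1; for_s=$2; interval=$3; shift 3
    printf '%s\n' "$@" | awk -v t="$threshold" -v f="$for_s" -v i="$interval" '
        {
            if ($1 + 0 >= t + 0) {
                # First sample of a streak starts the clock at zero.
                if (!active) { active = 1; held = 0 } else { held += i }
                print (held >= f ? "firing" : "pending")
            } else {
                # A dip below the threshold throws away the elapsed time.
                active = 0; held = 0
                print "inactive"
            }
        }'
}
```

For example, with one sample per minute, `alert_states 0.8 300 60 0.9 0.9 0.9 0.79 0.9 0.9 0.9 0.9 0.9` never reports firing within those nine samples, because the single 0.79 sample restarts the 5-minute wait. Passing 0 as the duration makes the first sample at or above the threshold report firing.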



How reproducible:
These are the steps:
1. Scale virt-operator to zero (otherwise it will revert our changes before we can observe the alert):
oc -n openshift-cnv scale deployment virt-operator --replicas=0
2. Run a VM.
3. Back up the clusterrolebinding for kubevirt-apiserver:
oc get clusterrolebinding kubevirt-apiserver -o yaml > kubevirt-apiserver-clusterrolebinding.yaml
4. Delete the clusterrolebinding for virt-api:
oc delete clusterrolebinding kubevirt-apiserver
Now wait for a while and observe the alert.
5. Recreate the clusterrolebinding:
oc apply -f kubevirt-apiserver-clusterrolebinding.yaml
6. Stop the VM.
7. Rescale virt-operator to its original replicas:
oc -n openshift-cnv scale deployment virt-operator --replicas=2
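
For automation, the oc commands from the steps above could be grouped roughly as follows. This is a minimal sketch: the function names and the BACKUP variable are hypothetical, starting and stopping the VM (steps 2 and 6) and checking the alert state are left out, and it assumes the original replica count of virt-operator is 2, as in step 7.

```shell
#!/bin/sh
# Sketch only: wraps the reproduction commands from this report in two
# functions. Function names and the BACKUP variable are hypothetical.
BACKUP=${BACKUP:-kubevirt-apiserver-clusterrolebinding.yaml}

break_virt_api() {
    # Step 1: stop virt-operator so it does not revert the change.
    oc -n openshift-cnv scale deployment virt-operator --replicas=0 || return 1
    # Step 3: back up the binding before deleting it.
    oc get clusterrolebinding kubevirt-apiserver -o yaml > "$BACKUP" || return 1
    # Step 4: delete the binding, then wait and observe the alert.
    oc delete clusterrolebinding kubevirt-apiserver
}

restore_virt_api() {
    # Step 5: recreate the binding from the backup.
    oc apply -f "$BACKUP" || return 1
    # Step 7: bring virt-operator back to its original replica count.
    oc -n openshift-cnv scale deployment virt-operator --replicas=2
}
```

A test would call break_virt_api, wait while watching the alert, and then call restore_virt_api, ideally from a trap so the cluster is restored even if the test aborts.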

Actual results:
VirtApiRESTErrorsBurst in Pending state

Expected results:
VirtApiRESTErrorsBurst in Firing state

Additional info:
Attached a Photo of the metric

Comment 1 Shirly Radco 2023-03-20 14:26:44 UTC
We need to drop the evaluation time.

Comment 2 sgott 2023-03-22 13:42:58 UTC
I'm very confused here. Looking at the steps to reproduce the scenario, it appears that virt-api has been left in a non-running state? It's not clear to me what removing its role binding does after it's already running. But the REST API endpoint for this alert is virt-api itself, is it not?

Shirly, you mention that we should drop the evaluation time, but it's not clear that will do anything useful. Can you help us understand what needs to be done and why?

Comment 3 Shirly Radco 2023-03-29 12:47:11 UTC
I can't really comment on the steps to reproduce.
I think this is a question for Ohad. Probably what he was trying to do is get a high % of the requests to fail.

We need to drop the evaluation time, since in the expression itself we are looking back 5 minutes and checking the % of failed requests.
If the failure % is greater than 80%, then the alert should fire immediately and not wait for 5m. It is the same as VirtApiRESTErrorsHigh.

Comment 4 Kedar Bidarkar 2023-04-05 12:23:32 UTC
Targeting this to CNV 4.15, depending upon the severity and anticipated capacity at this point.

Comment 7 Kedar Bidarkar 2023-04-26 12:12:21 UTC
Discussed with Virt Devs, targeting it back to 4.14 Target Version.