Created attachment 1952088
A photo of the metric

Description of problem:
When validating the critical alert VirtApiRESTErrorsBurst with automation, the alert does not fire and stays in the Pending state. Whenever the ratio of failed REST calls drops slightly below 0.8, the 5-minute count starts over, so the alert cannot reach the Firing state within a reasonable time; it can take more than 20 minutes.

How reproducible:
These are the steps:
1. Scale virt-operator to zero (otherwise it will revert our changes before the alert can be observed):
   oc -n openshift-cnv scale deployment virt-operator --replicas=0
2. Run a VM.
3. Back up the clusterrolebinding for kubevirt-apiserver:
   oc get clusterrolebinding kubevirt-apiserver -o yaml > kubevirt-apiserver-clusterrolebinding.yaml
4. Delete the clusterrolebinding for virt-api:
   oc delete clusterrolebinding kubevirt-apiserver
   Now wait for a while and observe the alert.
5. Recreate the clusterrolebinding:
   oc apply -f kubevirt-apiserver-clusterrolebinding.yaml
6. Stop the VM.
7. Rescale virt-operator to its original replicas:
   oc -n openshift-cnv scale deployment virt-operator --replicas=2

Actual results:
VirtApiRESTErrorsBurst stays in the Pending state.

Expected results:
VirtApiRESTErrorsBurst is in the Firing state.

Additional info:
Attached a photo of the metric.
We need to drop the evaluation time.
I'm very confused here. Looking at the steps to reproduce the scenario, it appears that virt-api has been left in a non-running state? It's not clear to me what removing its role binding does after it's already running. But the REST API endpoint for this alert is virt-api itself, is it not? Shirly, you mention that we should drop the evaluation time, but it's not clear that will do anything useful. Can you help us understand what needs to be done and why?
I can't really comment on the steps to reproduce; I think that is a question for Ohad. He was probably trying to get a high percentage of requests to fail. We need to drop the evaluation time because the expression itself already looks back 5 minutes and checks the percentage of failed requests. If the failure percentage is greater than 80%, then the alert should fire immediately and not wait for 5m. It is the same as VirtApiRESTErrorsHigh.
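To illustrate the point about the evaluation time, here is a minimal sketch of how a hold period interacts with a ratio that briefly dips below the threshold. It is a simplified model, not the actual rule or Prometheus code: it assumes one evaluation per minute, and the function name `alert_states`, the parameter `for_steps`, and the sample ratios are all hypothetical.

```python
def alert_states(ratios, threshold=0.8, for_steps=5):
    """Simplified model of an alert's inactive/pending/firing lifecycle.

    ratios:    failed-request ratio seen at each evaluation
               (assumption: one evaluation per minute).
    threshold: the alert condition is `ratio > threshold`.
    for_steps: how many evaluation intervals the condition must hold
               continuously before the alert fires (0 = fire at once).
    """
    states = []
    active_since = None  # evaluation index at which the condition became true
    for i, ratio in enumerate(ratios):
        if ratio > threshold:
            if active_since is None:
                active_since = i
            held_for = i - active_since
            states.append("firing" if held_for >= for_steps else "pending")
        else:
            # Any single evaluation below the threshold resets the timer.
            active_since = None
            states.append("inactive")
    return states


if __name__ == "__main__":
    # Hypothetical data: errors stay high except for one small dip to 0.79.
    ratios = [0.9] * 4 + [0.79] + [0.9] * 6
    print(alert_states(ratios, for_steps=5))  # hold period of 5 evaluations
    print(alert_states(ratios, for_steps=0))  # no hold period
```

With a hold period of 5 evaluations, the single dip at the fifth sample throws away four minutes of pending time, and the alert only fires at the eleventh sample. With no hold period, the alert fires on the first sample above the threshold, which is the behavior being asked for here.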
Targeting this to CNV 4.15 depending upon the severity and anticipated capacity at this point.
Discussed with the Virt devs; retargeting this to Target Version 4.14.