Bug 2179991

Summary:

VirtApiRESTErrorsBurst threshold high

Product:

Container Native Virtualization (CNV)

Reporter:

Ohad <orevah>

Component:

Virtualization

Assignee:

Igor Bezukh <ibezukh>

Status:

CLOSED NOTABUG

QA Contact:

Akriti Gupta <akrgupta>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

4.13.0

CC:

acardace, kedar.lad, sradco, stirabos

Target Milestone:

---

Target Release:

4.14.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

hco-bundle-v4.14.0.rhel9-2029

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2023-10-05 07:47:47 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
A Photo of the metric	none

Description Ohad 2023-03-20 14:16:40 UTC

Created attachment 1952088 [details]
A Photo of the metric


Description of problem:
When validating the critical VirtApiRESTErrorsBurst alert with automation, the alert does not fire and stays in the Pending state. Whenever the REST call failure ratio drops slightly below 0.8, the 5-minute pending period starts counting again, so the alert cannot reach the Firing state within a reasonable time; it can take more than 20 minutes.
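The reset behaviour described above can be illustrated with a small simulation. This is only a sketch of the general pending/firing logic, not the actual rule evaluation code: the function name `alert_states`, the one-evaluation-per-minute cadence, and the strict `>` comparison against 0.8 are assumptions made for illustration.

```python
from typing import List, Optional


def alert_states(ratios: List[float], threshold: float = 0.8,
                 for_steps: int = 5) -> List[str]:
    """Return the alert state after each evaluation.

    ratios: failure ratio observed at each evaluation (assumed one per minute).
    for_steps: how many evaluation intervals the condition must hold
               continuously before the alert moves from pending to firing.
    """
    states: List[str] = []
    active_since: Optional[int] = None  # evaluation index where the condition became true
    for i, ratio in enumerate(ratios):
        if ratio > threshold:
            if active_since is None:
                active_since = i
            held_for = i - active_since
            states.append("firing" if held_for >= for_steps else "pending")
        else:
            # A single dip below the threshold discards the accumulated time.
            active_since = None
            states.append("inactive")
    return states


if __name__ == "__main__":
    # Condition holds continuously: pending first, then firing.
    print(alert_states([0.9] * 7))
    # One dip to 0.79 in the middle: the wait starts over and the alert
    # is still pending at the end of the same time span.
    print(alert_states([0.9] * 4 + [0.79] + [0.9] * 5))
```

In the second series the alert never leaves the pending state within ten evaluations, which matches the symptom reported here.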



How reproducible:
Steps to reproduce:
1. Scale virt-operator to zero (otherwise it will revert our changes before the alert can be observed):
oc -n openshift-cnv scale deployment virt-operator --replicas=0
2. Run a VM.
3. Back up the clusterrolebinding for kubevirt-apiserver:
oc get clusterrolebinding kubevirt-apiserver -o yaml > kubevirt-apiserver-clusterrolebinding.yaml
4. Delete the clusterrolebinding for virt-api:
oc delete clusterrolebinding kubevirt-apiserver
Now wait for a while and observe the alert.
5. Recreate the clusterrolebinding:
oc apply -f kubevirt-apiserver-clusterrolebinding.yaml
6. Stop the VM.
7. Rescale virt-operator to its original replicas:
oc -n openshift-cnv scale deployment virt-operator --replicas=2

Actual results:
VirtApiRESTErrorsBurst in Pending state

Expected results:
VirtApiRESTErrorsBurst in Firing state

Additional info:
Attached a Photo of the metric

Comment 1 Shirly Radco 2023-03-20 14:26:44 UTC

We need to drop the evaluation time.

Comment 2 sgott 2023-03-22 13:42:58 UTC

I'm very confused here. Looking at the steps to reproduce the scenario, it appears that virt-api has been left in a non-running state? It's not clear to me what removing its role binding does after it's already running. But the REST API endpoint for this alert is virt-api itself, is it not?

Shirly, you mention that we should drop the evaluation time, but it's not clear that will do anything useful. Can you help us understand what needs to be done and why?

Comment 3 Shirly Radco 2023-03-29 12:47:11 UTC

I can't really comment on the steps to reproduce.
I think this is a question for Ohad. He was probably trying to get a high percentage of the requests to fail.

We need to drop the evaluation time, since the expression itself already looks back 5 minutes and checks the % of failed requests.
If the failure % is greater than 80%, then the alert should fire immediately and not wait for 5m. It is the same as VirtApiRESTErrorsHigh.
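To make the "looking back 5 minutes" point concrete, here is a rough sketch of a windowed failure ratio computed from cumulative request counters. The `Sample` layout, the function name `failure_ratio`, and the numbers in the example are hypothetical; the sketch also assumes samples are sorted by time and that the counters never reset.

```python
from typing import List, Tuple

# (timestamp in seconds, cumulative total requests, cumulative failed requests)
Sample = Tuple[float, float, float]


def failure_ratio(samples: List[Sample], now: float,
                  window: float = 300.0) -> float:
    """Fraction of requests that failed during the last `window` seconds."""
    in_window = [s for s in samples if now - window <= s[0] <= now]
    if len(in_window) < 2:
        return 0.0  # not enough data to compute an increase
    first, last = in_window[0], in_window[-1]
    total = last[1] - first[1]
    failed = last[2] - first[2]
    return failed / total if total > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical traffic: 10 requests per minute, every request fails from t=300s on.
    samples = [(float(t), 10.0 * t / 60, max(0.0, 10.0 * (t - 300) / 60))
               for t in range(0, 601, 60)]
    for now in (480.0, 540.0, 600.0):
        print(now, failure_ratio(samples, now))
```

With this made-up traffic, the ratio is still only 0.6 three minutes after every request starts failing, because the window still contains earlier successful requests. The lookback window therefore already delays the alert on its own, before any additional waiting period is applied.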

Comment 4 Kedar Bidarkar 2023-04-05 12:23:32 UTC

Targeting this to CNV 4.15 depending upon the severity and anticipated capacity at this point.

Comment 7 Kedar Bidarkar 2023-04-26 12:12:21 UTC

Discussed with Virt Devs, targeting it back to 4.14 Target Version.

Comment 8 Ohad 2023-10-05 14:15:44 UTC

Closed as NOTABUG. When OLM is also disabled along with virt-operator, virt-operator does not reconcile, so the clusterrolebinding is not reconciled and the REST calls keep failing. With that setup I managed to trigger the alert without problems. I also added 5 more minutes to the wait for the alert, to account for the time the setup needs to make 80% of the REST calls fail.
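A test-side wait of the kind described above could look roughly like the following. This is a generic polling sketch, not the actual automation code: `wait_for_alert_state` and its `get_state` callback are hypothetical, and how the alert state is fetched from the cluster is deliberately left to the caller.

```python
import time
from typing import Callable


def wait_for_alert_state(get_state: Callable[[], str], wanted: str,
                         timeout: float, interval: float,
                         sleep: Callable[[float], None] = time.sleep,
                         clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll get_state() until it returns `wanted` or `timeout` seconds pass.

    get_state is a caller-supplied (hypothetical) function that returns the
    current alert state, for example "inactive", "pending" or "firing".
    sleep and clock are injectable so the helper can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        if get_state() == wanted:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The timeout passed in should cover both the time the setup needs to push the failure ratio over the threshold and the alert's own pending period.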

Comment 9 Red Hat Bugzilla 2024-02-03 04:25:09 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days