Bug 2238218

Summary: VirtHandlerRESTErrorsHigh alert in firing state during OCP upgrade z-stream 4.12.5 > 4.12.6
Product: Container Native Virtualization (CNV)
Component: Installation
Version: 4.12.5
Target Release: 4.12.9
Hardware: Unspecified
OS: Unspecified
Status: CLOSED MIGRATED
Severity: medium
Priority: unspecified
Keywords: Reopened
Reporter: Ahmad <ahafe>
Assignee: João Vilaça <jvilaca>
QA Contact: Ahmad <ahafe>
CC: dbasunag, kmajcher, stirabos
Type: Bug
Last Closed: 2023-12-05 13:42:52 UTC

Description Ahmad 2023-09-10 12:24:58 UTC
Description of problem:
During the OCP upgrade from v4.12.5 to v4.12.6 (CNV: v4.12.5-50), the alert 'VirtHandlerRESTErrorsHigh' was observed in the firing state.

Version-Release number of selected component (if applicable):


How reproducible: 1 of multiple attempts


Steps to Reproduce:
1. Upgrade OCP between z-streams, 4.12.5 > 4.12.6
2. Check which alerts fired during the CNV upgrade (example commands below)
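
A minimal CLI sketch of these steps; the exact version argument and the Alertmanager pod/container names are assumptions and may differ per cluster:

 # Trigger the z-stream upgrade (assumes the 4.12.6 payload is available in the current channel)
 oc adm upgrade --to=4.12.6

 # Watch upgrade progress
 oc get clusterversion -w

 # List currently firing alerts via amtool inside the Alertmanager pod (names assumed)
 oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- \
   amtool alert query --alertmanager.url=http://localhost:9093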

Actual results:

logs:
 [{'labels': {'alertname': 'VirtHandlerRESTErrorsHigh', 'kubernetes_operator_component': 'kubevirt', 'kubernetes_operator_part_of': 'kubevirt', 'severity': 'warning'}, 'annotations': {'runbook_url': 'https://kubevirt.io/monitoring/runbooks/VirtHandlerRESTErrorsHigh', 'summary': 'More than 5% of the rest calls failed in virt-handler for the last hour'}, 'state': 'firing', 'activeAt': '2023-09-14T12:02:04.531593741Z', 'value': '6.1842357154408945e-02'}
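
For context, the firing condition matches the alert summary: the failure ratio of virt-handler REST calls over a 60-minute window crossing 5%. A sketch of the expression shape, assuming the upstream kubevirt rule built on rest_client_requests_total (exact label selectors may differ between CNV versions):

 # Fraction of 4xx/5xx virt-handler REST responses over the last hour vs. the 5% threshold
 (
   sum(rate(rest_client_requests_total{pod=~"virt-handler-.*", code=~"(4|5)[0-9][0-9]"}[60m]))
   /
   sum(rate(rest_client_requests_total{pod=~"virt-handler-.*"}[60m]))
 ) >= 0.05

The reported value of 6.1842e-02 is a failure ratio of about 6.2%, just above that threshold.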




Expected results:
No alerts should fire during the OCP upgrade process. We are trying to capture all alerts fired during upgrades and reduce the noise they generate.

Additional info:
must-gather log attached

Comment 2 Krzysztof Majcher 2023-09-12 12:44:52 UTC
Cannot reproduce at the moment. Will be reopened if needed.

Comment 5 Krzysztof Majcher 2023-09-26 12:44:06 UTC
During fixing we should check how many calls happened during the whole hour before the alert fired.
It's possible that the cluster was not very active during that time, in which case only a few failed calls are enough to trigger the alert.
Maybe we should reconsider the alert logic to account for that?
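
To illustrate with the reported value: a ratio of about 6.2% on a quiet cluster could mean as few as 7 failures out of 113 calls in the hour, while the same 7 failures among 10,000 calls would be 0.07% and never fire. One possible adjustment, sketched here as an assumption rather than an agreed fix, is to gate the ratio on a minimum request rate:

 # Hypothetical variant: only evaluate the failure ratio when there is meaningful traffic.
 # The 1 req/s floor is an assumed, tunable value.
 (
   sum(rate(rest_client_requests_total{pod=~"virt-handler-.*", code=~"(4|5)[0-9][0-9]"}[60m]))
   /
   sum(rate(rest_client_requests_total{pod=~"virt-handler-.*"}[60m]))
 ) >= 0.05
 and
 sum(rate(rest_client_requests_total{pod=~"virt-handler-.*"}[60m])) > 1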

Comment 6 Krzysztof Majcher 2023-09-26 12:44:55 UTC
Maybe it's enough to adjust the threshold.

Comment 7 João Vilaça 2023-10-16 12:35:13 UTC
@kmajcher 

I think this might only happen in the automated tests, since the cluster is recently created.
Does it make sense to complicate the expression if this is not happening in live clusters?

Comment 8 Krzysztof Majcher 2023-10-17 09:11:01 UTC
Please sync with Debarati and Ahmad if they agree with that.

Comment 9 Krzysztof Majcher 2023-10-17 12:51:37 UTC
We had a short discussion on this bug with Debarati, Simone and Shirly, and the agreement is that it would be better to have a fix.
Please sync with them on what the simplest fix would be.

Comment 10 Simone Tiraboschi 2023-10-31 13:54:44 UTC
Properly fixing this can be problematic; we can add a mitigation note to the runbook saying that this alert can be visible just after an upgrade.
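
As a stopgap for planned upgrade windows, the alert could also be silenced temporarily. A sketch assuming amtool access inside the cluster's Alertmanager pod (pod name, author, and duration are assumptions):

 # Silence VirtHandlerRESTErrorsHigh for the expected upgrade window
 oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- \
   amtool silence add alertname=VirtHandlerRESTErrorsHigh \
     --comment="OCP z-stream upgrade in progress" \
     --author="cnv-qe" \
     --duration=2h \
     --alertmanager.url=http://localhost:9093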