Bug 2238218

Summary: VirtHandlerRESTErrorsHigh alert in firing state during OCP upgrade z-stream 4.12.5 > 4.12.6
Product: Container Native Virtualization (CNV)
Component: Installation
Version: 4.12.5
Target Release: 4.12.9
Hardware: Unspecified
OS: Unspecified
Status: CLOSED MIGRATED
Severity: medium
Priority: unspecified
Keywords: Reopened
Reporter: Ahmad <ahafe>
Assignee: João Vilaça <jvilaca>
QA Contact: Ahmad <ahafe>
CC: dbasunag, kmajcher, stirabos
Type: Bug
Last Closed: 2023-12-05 13:42:52 UTC

Description Ahmad 2023-09-10 12:24:58 UTC
Description of problem:
During the OCP upgrade from v4.12.5 to v4.12.6 (CNV: v4.12.5-50), the alert 'VirtHandlerRESTErrorsHigh' was observed in the firing state.

Version-Release number of selected component (if applicable):


How reproducible: 1 of multiple attempts


Steps to Reproduce:
1. Upgrade OCP between z-streams, 4.12.5 > 4.12.6
2. Check which alerts fired during the CNV upgrade (example commands below)
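
A minimal CLI sketch of these steps; the exact version argument and the Alertmanager pod/container names are assumptions and may differ per cluster:

 # Trigger the z-stream upgrade (assumes the 4.12.6 payload is available in the current channel)
 oc adm upgrade --to=4.12.6

 # Watch upgrade progress
 oc get clusterversion -w

 # List currently firing alerts via amtool inside the Alertmanager pod (names assumed)
 oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- \
   amtool alert query --alertmanager.url=http://localhost:9093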

Actual results:

logs:
 [{'labels': {'alertname': 'VirtHandlerRESTErrorsHigh', 'kubernetes_operator_component': 'kubevirt', 'kubernetes_operator_part_of': 'kubevirt', 'severity': 'warning'}, 'annotations': {'runbook_url': 'https://kubevirt.io/monitoring/runbooks/VirtHandlerRESTErrorsHigh', 'summary': 'More than 5% of the rest calls failed in virt-handler for the last hour'}, 'state': 'firing', 'activeAt': '2023-09-14T12:02:04.531593741Z', 'value': '6.1842357154408945e-02'}
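
For context, the firing condition matches the alert summary: the failure ratio of virt-handler REST calls over a 60-minute window crossing 5%. A sketch of the expression shape, assuming the upstream kubevirt rule built on rest_client_requests_total (exact label selectors may differ between CNV versions):

 # Fraction of 4xx/5xx virt-handler REST responses over the last hour vs. the 5% threshold
 (
   sum(rate(rest_client_requests_total{pod=~"virt-handler-.*", code=~"(4|5)[0-9][0-9]"}[60m]))
   /
   sum(rate(rest_client_requests_total{pod=~"virt-handler-.*"}[60m]))
 ) >= 0.05

The reported value of 6.1842e-02 is a failure ratio of about 6.2%, just above that threshold.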




Expected results:
No alerts should fire during the OCP upgrade process. We are trying to capture all alerts fired during upgrades and reduce the noise they generate.

Additional info:
must-gather log attached

Comment 2 Krzysztof Majcher 2023-09-12 12:44:52 UTC
Cannot reproduce at the moment. Will be reopened if needed.

Comment 5 Krzysztof Majcher 2023-09-26 12:44:06 UTC
During fixing we should check how many calls happened during the whole hour before the alert fired.
It's possible that the cluster was not very active during that time, in which case only a few failed calls are enough to trigger the alert.
Maybe we should reconsider the alert logic to account for that?
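
To illustrate with the reported value: a ratio of about 6.2% on a quiet cluster could mean as few as 7 failures out of 113 calls in the hour, while the same 7 failures among 10,000 calls would be 0.07% and never fire. One possible adjustment, sketched here as an assumption rather than an agreed fix, is to gate the ratio on a minimum request rate:

 # Hypothetical variant: only evaluate the failure ratio when there is meaningful traffic.
 # The 1 req/s floor is an assumed, tunable value.
 (
   sum(rate(rest_client_requests_total{pod=~"virt-handler-.*", code=~"(4|5)[0-9][0-9]"}[60m]))
   /
   sum(rate(rest_client_requests_total{pod=~"virt-handler-.*"}[60m]))
 ) >= 0.05
 and
 sum(rate(rest_client_requests_total{pod=~"virt-handler-.*"}[60m])) > 1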

Comment 6 Krzysztof Majcher 2023-09-26 12:44:55 UTC
Maybe it's enough to adjust the threshold.

Comment 7 João Vilaça 2023-10-16 12:35:13 UTC
@kmajcher 

I think this might only happen in the automated tests, since the cluster is recently created.
Does it make sense to complicate the expression if this is not happening in live clusters?

Comment 8 Krzysztof Majcher 2023-10-17 09:11:01 UTC
Please sync with Debarati and Ahmad if they agree with that.

Comment 9 Krzysztof Majcher 2023-10-17 12:51:37 UTC
We had a short discussion on this bug with Debarati, Simone and Shirly, and the agreement is that it would be better to have a fix.
Please sync with them on what the simplest fix would be.

Comment 10 Simone Tiraboschi 2023-10-31 13:54:44 UTC
Properly fixing this can be problematic; we can add a mitigation note to the runbook saying that this alert can be visible just after an upgrade.
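
As a stopgap for planned upgrade windows, the alert could also be silenced temporarily. A sketch assuming amtool access inside the cluster's Alertmanager pod (pod name, author, and duration are assumptions):

 # Silence VirtHandlerRESTErrorsHigh for the expected upgrade window
 oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- \
   amtool silence add alertname=VirtHandlerRESTErrorsHigh \
     --comment="OCP z-stream upgrade in progress" \
     --author="cnv-qe" \
     --duration=2h \
     --alertmanager.url=http://localhost:9093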