Bug 2217243

Summary: virt-handler memory and cpu usage are hardcoded and set too low for large scale
Product: Container Native Virtualization (CNV)
Reporter: Boaz <bbenshab>
Component: Virtualization
Assignee: sgott
Status: NEW ---
QA Contact: Kedar Bidarkar <kbidarka>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.13.1
CC: fdeutsch, iholder, jhopper, sradco
Target Milestone: ---
Target Release: 4.15.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Boaz 2023-06-25 10:24:58 UTC
I'm running a scale regression setup on:
=========================================
OpenShift 4.13.2
OpenShift Virtualization 4.13.1
OpenShift Container Storage - 4.12.4-rhodf

This is a large-scale setup with 132 nodes running 6000 RHEL VMs on an external RHCS (Red Hat Ceph Storage) cluster.
After powering up 3000 of the 6000 VMs, I opened the UI and saw thousands of the following warnings about the virt-handler pods' memory and CPU consumption:

============================================================================================================================================
KubeVirtComponentExceedsRequestedCPU
25 Jun 2023, 11:58
Pod virt-handler-hmztx cpu usage exceeds the CPU requested
KubeVirtComponentExceedsRequestedCPU
25 Jun 2023, 11:58
Pod virt-handler-b7jfm cpu usage exceeds the CPU requested
KubeVirtComponentExceedsRequestedMemory
25 Jun 2023, 11:58
Container virt-handler in pod virt-handler-66x4l memory usage exceeds the memory requested

============================================================================================================================================

From oc adm top:

============================================================================================================================================

virt-handler-x46kr                                     12m          314Mi
virt-handler-pld9d                                     21m          315Mi
virt-handler-sfqnh                                     17m          316Mi
virt-handler-dlh4w                                     26m          317Mi
virt-handler-lbfj7                                     17m          317Mi
virt-handler-tcx9l                                     24m          319Mi
virt-handler-fggsg                                     18m          321Mi
virt-handler-7gzm8                                     17m          325Mi
virt-handler-lk9bp                                     12m          325Mi
virt-handler-gcwfh                                     18m          329Mi
============================================================================================================================================
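
For comparison, the requests these pods are actually configured with can be listed with something like the following (the namespace and label selector assume a default CNV install, so adjust if your environment differs):

# List the CPU/memory requests configured on the virt-handler containers.
# Namespace openshift-cnv and label kubevirt.io=virt-handler assume a default CNV install.
oc get pods -n openshift-cnv -l kubevirt.io=virt-handler \
  -o custom-columns=NAME:.metadata.name,CPU_REQ:.spec.containers[0].resources.requests.cpu,MEM_REQ:.spec.containers[0].resources.requests.memory

This makes it easy to line the hardcoded requests up against the oc adm top numbers above.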


If you look at the screenshot I attached, you will see that because of this, the status of OpenShift Virtualization is "Degraded".
In addition, the thousands of warnings cause the console to slow down significantly and freeze.


I collected the logs, but I found it odd that I could not see the above events via the CLI. Note that this is a 44G folder when extracted:
============================================================================================================================================

http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/virthandler_mem_cpu_too_low.tar.gz

Comment 3 Fabian Deutsch 2023-06-28 12:59:42 UTC
For now I'm assuming that the PDB link is a red herring.

Boaz, how much CPU and memory consumption do you see the virt-handler using?
My take would be that we take these values and use them plus a small delta as the defaults for our handler.

Comment 4 Boaz 2023-06-28 14:55:20 UTC
Hey @fdeutsch,
I'm seeing 330-350 MB of memory allocated
and 40m-92m of CPU used.
It's fine to set those values plus some headroom as a baseline, but I would also like to have those tunable.
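
(For illustration only, one way such an override could be expressed today is the KubeVirt CR's customizeComponents patch mechanism; this is a sketch rather than a supported tunable, the CR name assumes a default CNV deployment, and the HCO operator may reconcile such a change away.)

# Sketch: raise virt-handler's requests by patching its DaemonSet through
# the KubeVirt CR's customizeComponents field. The values 100m/400Mi are
# illustrative placeholders, not recommended defaults.
oc patch kubevirt kubevirt-kubevirt-hyperconverged -n openshift-cnv --type=merge -p '
spec:
  customizeComponents:
    patches:
    - resourceType: DaemonSet
      resourceName: virt-handler
      type: strategic
      patch: |
        {"spec":{"template":{"spec":{"containers":[{"name":"virt-handler","resources":{"requests":{"cpu":"100m","memory":"400Mi"}}}]}}}}
'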

Comment 5 Fabian Deutsch 2023-06-28 15:02:40 UTC
Thanks Boaz.

Yes, it would be great if we could identify a way to derive these values from
- static cluster properties (node count, CPU count, network bandwidth, …)
- or target workload properties (expected churn, expected number of VMs, …)

We probably want to have this in a dedicated epic, just like "following the kubelet" for the API server.
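
To make that concrete, a back-of-the-envelope sketch of the workload-based option could look like the lines below; the base and per-VM constants are placeholders chosen so the result lands near the numbers Boaz reported, not measured values:

# Illustrative only: derive a virt-handler memory request from workload properties.
# BASE_MI and PER_VM_MI are assumed constants, not measurements.
TOTAL_VMS=6000
NODES=132
BASE_MI=250
PER_VM_MI=2
VMS_PER_NODE=$(( (TOTAL_VMS + NODES - 1) / NODES ))   # ~46 VMs per node on this setup
MEM_REQ_MI=$(( BASE_MI + PER_VM_MI * VMS_PER_NODE ))  # 250 + 2*46 = 342Mi, near the observed 330-350MB
echo "suggested virt-handler memory request: ${MEM_REQ_MI}Mi"

The same shape (base plus per-unit overhead) could apply to the CPU request, with node CPU count or expected churn as the scaling input.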

Comment 9 Itamar Holder 2023-07-03 10:22:22 UTC
A general note:
I think we should treat ExceedsRequestedCPU and ExceedsRequestedMemory completely differently since they are not of the same severity.

ExceedsRequestedMemory is dangerous, since memory is an incompressible resource. This means that if the node becomes stressed with respect to memory, virt-handler might get killed, which is bad.

But with ExceedsRequestedCPU things are completely different, since CPU is a compressible resource. If the node gets stressed with respect to CPU, in the worst case virt-handler will be throttled to use only the amount of CPU it requested. If virt-handler needs more CPU than it requested (which might only happen for a certain period of time), the node will allow this as long as it has spare CPU resources. This completely aligns with the Kubernetes (and cgroups) resource management model and is completely valid.

The only reason to think there's a problem with virt-handler's request is if we are sure that it permanently (or at least for long periods of time) requires more CPU to do its job. But I think we need more data to actually be sure of that.

We might even consider raising the firing time for KubeVirtComponentExceedsRequestedCPU, which is currently 5 minutes.
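
For reference, the current 5-minute window can be confirmed from the deployed PrometheusRule objects (the namespace assumes a default CNV install); raising it would mean changing the alert's "for:" duration in whatever component ships that rule:

# Show the alert definition, including its "for:" duration.
# The openshift-cnv namespace assumes a default CNV install.
oc get prometheusrules -n openshift-cnv -o yaml | grep -A15 'alert: KubeVirtComponentExceedsRequestedCPU'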

Comment 11 Kedar Bidarkar 2023-07-12 12:25:49 UTC
The plan is to address the issue described in the bug description as part of the Jira epic https://issues.redhat.com/browse/CNV-28746, which is currently targeted for CY24.