2022080 – VM Templates let user overcommit their CPU resources

Summary: VM Templates let user overcommit their CPU resources

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Container Native Virtualization (CNV)
Classification:	Red Hat
Component:	Guest Support
Sub Component:
Version:	4.8.2
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Karel Šimon
QA Contact:	Kedar Bidarkar
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-11-10 17:56 UTC by Jean-Francois Saucier
Modified:	2025-04-04 13:28 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-03-22 16:28:30 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	CNV-14892	0	None	None	None	2022-02-28 07:58:09 UTC

Description Jean-Francois Saucier 2021-11-10 17:56:14 UTC

Description of problem:

Currently, the provided VM templates let the user easily overcommit the cluster's CPU resources. For example, the RHEL 8 template asks the user how many CPUs they want and defines them in the template as follows:

~~~
          cpu:
            sockets: {{ item.cpus }}
            cores: 1
            threads: 1
~~~

However, this can easily lead to CPU overcommit and bad scheduling decisions, because this configuration does not reserve the amount of CPU requested. A user can easily create 4 VMs with 16 CPUs each on a worker node that has only 16 actual CPUs.


Version-Release number of selected component (if applicable):

Tested on CNV 4.8.2 but should affect every version.


How reproducible:

Every time a template is used.


Steps to Reproduce:
1. Create a VM using a template in the web UI
2. Specify the number of CPUs you want
3. The VM is created with the number of CPUs specified


Actual results:

The number of CPUs specified is not reserved, and multiple VMs can easily overcommit the CPUs of a worker node.


Expected results:

Maybe we can use the following in the template instead (as we already do to request memory):

~~~
        resources:
          requests:
            cpu: 8
~~~

Someone also suggested I look at the newer cpuAllocationRatio: https://github.com/kubevirt/kubevirt/pull/4162
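
For reference, a minimal sketch of where cpuAllocationRatio is set on an upstream KubeVirt custom resource. The resource name and namespace are assumptions, and on CNV the KubeVirt resource is managed by the HyperConverged operator, so a direct edit like this may be reverted. As I understand the setting, a VM that has no explicit CPU request gets a pod CPU request of roughly (number of vCPUs / cpuAllocationRatio), so a ratio of 10 means about 100m per vCPU.

```yaml
# Sketch only: name and namespace are assumptions, not taken from this bug.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      # Ratio of vCPUs to requested pod CPU when the VM sets no CPU request.
      # 10 means each vCPU requests about 100m; 1 would request a full CPU per vCPU.
      cpuAllocationRatio: 10
```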

Comment 9 Roni Kishner 2021-12-01 09:52:40 UTC

1. Created several VMs using:
~~~
        resources:
          requests:
            cpu: 4
~~~
This is also mentioned here: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu

This gave the expected result: the VMs were distributed to the nodes according to usage. Once no node could accommodate the VM's CPU request, VM creation was halted with an Error: PodDisruptionBudget.
As soon as a running VM was deleted (freeing its CPU resources), a new VM could be started.

2. Created several VMs without requesting CPU (the result was the same with CPU manager on and off).

The VMs were created without any issue even though they exceeded the node limit; as far as I could tell, they were still distributed evenly across the nodes.
Node CPU usage as reported by "oc adm top nodes" looked good (average of 45%). I presume a CPU overcommit would occur if I started running processes on the VMs and used some CPU power; since they are idle, they are not using much CPU.
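
To make the first test above easier to reproduce, here is a minimal sketch of a complete VirtualMachine manifest showing where resources.requests.cpu sits relative to the cpu topology. The VM name, the memory size, and the empty devices section are placeholders chosen for illustration; only the cpu request of 4 comes from the test.

```yaml
# Sketch only: metadata and memory values are hypothetical.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: cpu-request-example
spec:
  running: true
  template:
    spec:
      domain:
        cpu:
          # Guest-visible topology: 4 sockets x 1 core x 1 thread = 4 vCPUs.
          sockets: 4
          cores: 1
          threads: 1
        resources:
          requests:
            # The scheduler only places the VM on a node with 4 CPUs unreserved.
            cpu: 4
            memory: 4Gi
        devices: {}
```

Without the requests.cpu line, the scheduler is not asked to reserve the full 4 CPUs on the node, which matches the second test, where VMs kept being created past the node limit.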

Comment 10 Israel Pinto 2021-12-01 11:13:58 UTC

(In reply to Roni Kishner from comment #9)
> 1. Create several vm machines using 
> ~~~
>         resources:
>           requests:
>             cpu: 4
> ~~~
> This is also mentioned in here:
> https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu
> 
> This gave the expected result, the vms were distributed to the nodes
> according to the usage. once all nodes couldn't handle the amount of the vm
> cpu the vm creation was halted with an Error: PodDisruptionBudget
> The moment the a running vm was deleted (freeing the cpu resource) a new vm
> could be started.
What is the behavior with CPU manager on, and with it off?
> 
> 2. Created several vm machines without requesting cpu. (with cpu manager on
> and off had the same result)
> 
> The machines were created without an issue while going over the node limit -
> the VM were still being distributed equally as far as i could tell. 
> the nodes CPU usage stat using "oc adm top nodes" looked good (avg of 45%),
> I presume an overcommit of the cpu would occur if i start running processes
> on the vms and use some cpu power. since they're idle they aren't using much
> CPU.

From this check I understand that CPU overcommit is working, but can you give more info, such as:
1. Number of VMs
2. How many CPUs you set on each VM
3. How many CPUs are on the node? (run lscpu on the node)

Comment 11 Roni Kishner 2021-12-01 12:18:24 UTC

When putting a CPU request on the VM and turning off CPU manager, I noticed that the VMs were randomly distributed between the nodes (I could tell from the error given). VM creation was still stopped at full CPU capacity, but you had to stop and start a VM until it landed on a node with available CPU.

When I created several VMs without a CPU request, I created 12 VMs with 3 cores each. All of them ran without any issue.
The setup was running on 3 worker nodes, each with 8 CPUs.
This corresponds to the previous check, where I put a CPU request of 3 on each VM and VM creation was halted when I tried to create the 7th VM, since each node already had 6 CPUs in use.

*Note: besides VMs, other resources also request CPU, so in theory, to create a 16-CPU VM we need a 17-CPU node.

Comment 12 Ruth Netser 2021-12-09 11:15:47 UTC

Based on the above, requests.cpu will be added to high performance templates

Comment 13 Ruth Netser 2021-12-09 11:19:10 UTC

The addition needs to be synced with UI
@tnisan

Comment 14 Yaacov Zamir 2022-01-05 12:59:08 UTC

(In reply to Ruth Netser from comment #12)
> Based on the above, requests.cpu will be added to high performance templates

Added as a template parameter?

(In reply to Ruth Netser from comment #13)
> The addition needs to be synced with UI
> @tnisan

The create virtual machine wizard in the UI supports editing requests.cpu by overriding the static value, even if it's not a parameter. AFAIK no UI update will be needed.

Note: we will need to test the UI with requests.cpu as a parameter; AFAIU it should just work.

Comment 15 Roni Kishner 2022-03-15 14:14:58 UTC

The latest version of 4.10 as of this moment has this issue fixed.

When creating a VM using the high performance template, the VM will be created only if enough CPU is available; if no node has enough CPU, VM creation is halted until enough CPU is free.

This is done by applying the "dedicatedCpuPlacement: true" parameter under cpu. If one does not want to use the high performance template, adding this parameter will prevent CPU overcommit, and turning it off (false) will allow overcommit.
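
For anyone not using the high performance template, a minimal sketch of a VirtualMachine with this parameter set by hand. The VM name, the memory size, and the empty devices section are placeholders, and this assumes CPU Manager is enabled on the worker nodes, which dedicated CPU placement relies on.

```yaml
# Sketch only: metadata and memory values are hypothetical.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: dedicated-cpu-example
spec:
  running: true
  template:
    spec:
      domain:
        cpu:
          sockets: 4
          cores: 1
          threads: 1
          # Pin the 4 vCPUs to dedicated host CPUs; set to false to allow overcommit.
          dedicatedCpuPlacement: true
        resources:
          limits:
            # Memory limit only; requests default to the limit.
            memory: 4Gi
        devices: {}
```

With dedicatedCpuPlacement enabled, the CPU count can come from the cpu topology above, so no separate requests.cpu entry is shown here.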

Comment 16 Dominik Holler 2022-03-15 14:18:00 UTC

(In reply to Roni Kishner from comment #15)
> Latest version of 4.10 as of this moment have this issue fixed. 
> 
> when creating a VM using the high performance template a VM will be created
> only if enough CPU is available to use, if there isn't a node with enough
> cpu, the VM creation will be halted until enough cpu is free.
> 
> This is done by applying the "dedicatedCpuPlacement: true" parameter under
> cpu, In case one does not want to use the high performance template adding
> this parameter will prevent over commit of cpu, and turning it off (false)
> will allow over commit

@jsaucier does this solve the issue from your point of view?

Comment 17 Jean-Francois Saucier 2022-03-22 10:38:43 UTC

@dholler yes, this seems to effectively solve the issue on my end!

Comment 18 Dominik Holler 2022-03-22 16:28:30 UTC

@jsaucier Thanks for reporting and the feedback!