2070536 – The same host CPUs are assigned twice when we run dedicated&none&resize VMs on the same host


Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	ovirt-engine
Classification:	oVirt
Component:	BLL.Virt
Sub Component:
Version:	4.5.0
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	ovirt-4.5.0
Target Release:	4.5.0.1
Assignee:	Liran Rotenberg
QA Contact:	Polina
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-03-31 11:56 UTC by Polina
Modified:	2022-04-28 09:26 UTC (History)
CC List:	3 users (show)
Fixed In Version:	ovirt-engine-4.5.0.1
Clone Of:
Environment:
Last Closed:	2022-04-28 09:26:34 UTC
oVirt Team:	Virt
Embargoed:
Dependent Products:
Flags:	pm-rhel: ovirt-4.5?

Attachments
engine log dump xmls (681.16 KB, application/gzip) 2022-03-31 11:56 UTC, Polina

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	oVirt ovirt-engine pull 219	0	None	open	Filter resize and pin when CPUs taken	2022-04-03 13:36:32 UTC
Red Hat Issue Tracker	RHV-45503	0	None	None	None	2022-03-31 12:06:50 UTC

Description Polina 2022-03-31 11:56:42 UTC

Created attachment 1869690 [details]
engine log dump xmls

Description of problem:
The engine must not run a VM with the resize_and_pin policy when the host has no free CPUs for it.

Version-Release number of selected component (if applicable):


How reproducible: 100%


Steps to Reproduce:
1. On a host with topology 1:4:1 (sockets:cores:threads), run one VM with the dedicated policy (1:2:1) and one VM with the none policy (1:2:1). Both start successfully.
2. Try to run a resize_and_pin policy VM on the same host. It is expected to fail.

However, the VM started, and the same host CPUs are now assigned twice:
[root@lynx09 ~]# virsh -r dumpxml 30 |grep vcpu
  <vcpu placement='static'>3</vcpu>
    <vcpupin vcpu='0' cpuset='1'/>
    <vcpupin vcpu='1' cpuset='2'/>
    <vcpupin vcpu='2' cpuset='3'/>
[root@lynx09 ~]# virsh -r dumpxml 29 |grep vcpu
  <vcpu placement='static' current='2'>32</vcpu>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='2'/>
[root@lynx09 ~]# virsh -r dumpxml 25 |grep vcpu
  <vcpu placement='static' current='2'>32</vcpu>
    <vcpupin vcpu='0' cpuset='1,3'/>
    <vcpupin vcpu='1' cpuset='1,3'/>

30 - resize VM
29 - dedicated
25 - none
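
The double assignment in the three dumps above can be checked mechanically. The following is a minimal Python sketch, not part of the engine or of libvirt tooling: the helper names (parse_cpuset, pinned_cpus, shared_cpus) are hypothetical, the <cputune> wrapper is added only so that each fragment parses as XML, and the cpuset parser handles plain lists and ranges but not the '^' exclusion syntax.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def parse_cpuset(spec):
    """Expand a cpuset string such as '1,3' or '0-2' into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        elif part:
            cpus.add(int(part))
    return cpus


def pinned_cpus(xml_text):
    """Collect every host CPU referenced by a <vcpupin> element."""
    cpus = set()
    for pin in ET.fromstring(xml_text).iter("vcpupin"):
        cpus |= parse_cpuset(pin.get("cpuset", ""))
    return cpus


def shared_cpus(vms):
    """Map each host CPU used by more than one VM to the VMs that use it."""
    owners = defaultdict(list)
    for label, xml_text in vms.items():
        for cpu in pinned_cpus(xml_text):
            owners[cpu].append(label)
    return {cpu: sorted(v) for cpu, v in sorted(owners.items()) if len(v) > 1}


# vcpupin lines copied from the virsh output in this report.
vms = {
    "30 (resize)": "<cputune><vcpupin vcpu='0' cpuset='1'/>"
                   "<vcpupin vcpu='1' cpuset='2'/>"
                   "<vcpupin vcpu='2' cpuset='3'/></cputune>",
    "29 (dedicated)": "<cputune><vcpupin vcpu='0' cpuset='0'/>"
                      "<vcpupin vcpu='1' cpuset='2'/></cputune>",
    "25 (none)": "<cputune><vcpupin vcpu='0' cpuset='1,3'/>"
                 "<vcpupin vcpu='1' cpuset='1,3'/></cputune>",
}

conflicts = shared_cpus(vms)
for cpu, labels in conflicts.items():
    print(f"pCPU {cpu}: {', '.join(labels)}")
```

With this data, pCPU 2 is reported for both the resize VM (30) and the dedicated VM (29), which is the conflict described in this report. pCPUs 1 and 3 are reported for the resize VM and the none-policy VM.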

Actual results: The resize policy VM starts although the host has no free CPUs to accommodate it.


Expected results: The resize_and_pin VM must not start on the same host.


Additional info: 
1. attached dump.xmls for three VMs
2. the scenario could be easily reproduced on hosted-engine-04.lab.eng.tlv2.redhat.com where we have hosts with topology 1:4:1

Comment 1 Arik 2022-03-31 14:07:45 UTC

Liran, should be fixed in the beta version, no?

Comment 2 Polina 2022-04-03 08:50:21 UTC

Adding another example:

On host 2:8:2 (serval18.lab.eng.tlv2.redhat.com), I ran three dedicated VMs, each 1:8:1, so the number of cores left is 16 - 24/2 = 4.
I then ran a resize_and_pin VM, expecting it to take 16 - 12 - 1 = 3 cores (1 is left for the host).
This resize_and_pin VM runs with topology 2:7:2 under dynamic_cpu and 5 sockets:1:1 under cpu, and it is assigned pCPUs that are already taken by the dedicated VMs.
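
The expectation in this comment can be written out as a small calculation. This is only a sketch of the reporter's arithmetic, not of the engine's scheduling code: the function name is hypothetical, and it assumes that dedicated vCPUs fill whole cores (two threads per core on this host) and that a single core is reserved for the host.

```python
import math


def expected_resize_cores(sockets, cores, threads, dedicated_vcpus, reserved_cores=1):
    """Cores a resize_and_pin VM could take under the assumptions above."""
    total_cores = sockets * cores
    # Each dedicated vCPU occupies one host thread; round up to whole cores.
    used_cores = math.ceil(dedicated_vcpus / threads)
    return total_cores - used_cores - reserved_cores


# Host 2:8:2 with three dedicated VMs of 8 vCPUs each.
expected = expected_resize_cores(2, 8, 2, dedicated_vcpus=3 * 8)
print(expected)  # 16 - 12 - 1 = 3
```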

pCPU assigned:
[root@serval18 ~]# grep "cpuset="  VM1.dumpxml |awk -F\ "cpuset='" '{print $2}' | rev | cut -c 4- | rev|sort -n
0
2
4
6
16
18
20
22
[root@serval18 ~]# grep "cpuset="  VM2.dumpxml |awk -F\ "cpuset='" '{print $2}' | rev | cut -c 4- | rev |sort -n
3
5
7
9
19
21
23
25

[root@serval18 ~]# grep "cpuset="  VM3.dumpxml |awk -F\ "cpuset='" '{print $2}' | rev | cut -c 4- | rev |sort -n
8
10
12
14
24
26
28
30

[root@serval18 ~]# grep "cpuset="  VM4_resize_and_pin.dumpxml |awk -F\ "cpuset='" '{print $2}' | rev | cut -c 4- | rev |sort -n
2,18
2,18
3,19
3,19
4,20
4,20
5,21
5,21
6,22
6,22
7,23
7,23
8,24
8,24
9,25
9,25
10,26
10,26
11,27
11,27
12,28
12,28
13,29
13,29
14,30
14,30
15,31
15,31

Comment 3 Liran Rotenberg 2022-04-03 09:10:16 UTC

Resize and pin shouldn't be sized around dedicated VMs; it should consume the host's resources.
I think we will need to forbid running a dedicated VM combined with a resize and pin VM. If the resize and pin VM is already running, we should already hit a filter.
The other way around is more problematic because of the timing at which we filter and set the VM dynamic details (only when running the VM).

(In reply to Arik from comment #1)
> Liran, should be fixed in the beta version, no?

That would be the best.

Comment 4 Arik 2022-04-03 10:40:04 UTC

So we don't take the exclusively pinned resources into account when scheduling VMs with resize-and-pin. Doing that should be relatively simple.

Comment 5 Polina 2022-04-18 10:51:34 UTC

Verified on ovirt-engine-4.5.0.2-0.7.el8ev.noarch.
A resize policy VM is forbidden to run together with a dedicated VM, even if there are CPU resources. That's why I'm not sure that the error text we get now is good:

"the host host_mixed_1 did not satisfy internal filter CpuPinning because doesn't have enough CPUs for the resize and pin NUMA CPU policy that the VM is set with."

This error is returned whenever a dedicated VM and a resize VM are run simultaneously on a host, even if the dedicated VM is set with 1 CPU. Maybe the error should mention the exclusive-running rule?

Comment 6 Liran Rotenberg 2022-04-24 07:04:04 UTC

We forbid any run of `Resize and Pin NUMA` when a dedicated VM runs on that host. Basically the message is true. `Resize and Pin NUMA` consumes all the resources on that host; the ones we don't pin (one core) are left to give the host some breathing space.
Based on that logic, you don't have enough CPUs even when 1 CPU is taken. But yes, we may change the message if that seems important.
I am not sure whether in that case we need another bug or should use this one (IMO, a new one), Arik?
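
The rule described in this comment can be sketched as a simple host filter. This is an illustration in Python only; the engine itself is written in Java, and every name below (CpuPolicy, RunningVm, host_accepts) is hypothetical rather than taken from the ovirt-engine code. It models just the one direction stated here: a resize-and-pin VM is rejected when any dedicated VM already runs on the host, regardless of how many CPUs that VM holds.

```python
from dataclasses import dataclass
from enum import Enum


class CpuPolicy(Enum):
    NONE = "none"
    DEDICATED = "dedicated"
    RESIZE_AND_PIN = "resize_and_pin"


@dataclass(frozen=True)
class RunningVm:
    name: str
    policy: CpuPolicy


def host_accepts(candidate_policy, running_vms):
    """Return False if a resize-and-pin VM would share the host with a dedicated VM."""
    if candidate_policy is CpuPolicy.RESIZE_AND_PIN:
        return not any(vm.policy is CpuPolicy.DEDICATED for vm in running_vms)
    return True


# Even a dedicated VM holding a single CPU blocks resize-and-pin on that host.
host = [RunningVm("dedicated_vm", CpuPolicy.DEDICATED), RunningVm("plain_vm", CpuPolicy.NONE)]
print(host_accepts(CpuPolicy.RESIZE_AND_PIN, host))  # False
print(host_accepts(CpuPolicy.NONE, host))            # True
```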

Comment 7 Arik 2022-04-24 07:23:57 UTC

(In reply to Liran Rotenberg from comment #6)
> I am not sure in that case we need another bug or using this one (IMO, a new
> one), Arik?

Yes, a new one with lower severity.
Users won't necessarily know the exact meaning of the "resize" part and may not notice that a VM with exclusive pinning runs on the host, so it would be better to be more explicit about it (... because some of the physical CPUs on the host are exclusively pinned), even if just to ease debugging for us.

Comment 8 Polina 2022-04-24 09:39:13 UTC

Reported a low-priority BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2078189

This one is closed on the basis of https://bugzilla.redhat.com/show_bug.cgi?id=2070536#c5

Comment 9 Sandro Bonazzola 2022-04-28 09:26:34 UTC

This bugzilla is included in oVirt 4.5.0 release, published on April 20th 2022.

Since the problem described in this bug report should be resolved in oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.
