Created attachment 1888315 [details]
engine log

Description of problem:
On setups with three identical hosts supporting NUMA (the same host topology), a VM with the resize_and_pin policy can start on any of them but cannot migrate.

Version-Release number of selected component (if applicable):
ovirt-engine-4.5.1-0.62.el8ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Configure a VM with the resize_and_pin policy on a setup with three identical hosts supporting NUMA.
2. Try to start the VM on each of the three hosts. The VM starts successfully on any host in the setup.
3. Try to migrate in the UI - no host is offered that the VM could migrate to.
4. Try to migrate via the API.

Actual results:
<fault>
  <detail>[Cannot migrate VM. There is no host that satisfies current scheduling constraints. See below for details:, The host host_mixed_1 did not satisfy internal filter CPU because it has an insufficient amount of CPU cores to run the VM.]</detail>
  <reason>Operation Failed</reason>
</fault>

Expected results:
The VM must migrate.

Additional info:
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
So the hosts are filtered out in CPUPolicyUnit#filter - I have a feeling that we count threads as cores for the shared CPUs and end up with a negative number there.
Hi Arik, you're right. During the refactoring of the CPUPolicyUnit we started to count the VM's CPUs as vm.getNumOfCpus() instead of vm.getNumOfCpus(false). This means that the threads are now included in the VM's CPU count when scheduling a VM. As a result, a VM with a 1:4:5 configuration could run on a 1:4:2 host before the change, but not now.

We can fix it in two different ways:

a) Change the CPUPolicyUnit back to use vm.getNumOfCpus(false). The problem is that the pending resources and the VM manager (used for determining how many shared CPUs need to stay on the host) also count the cpuCount as vm.getNumOfCpus() (they always did, also before the refactoring). I'd suggest using the same approach in all of them.

b) Keep it as it is now and make resize and pin aware of the countThreadsAsCores setting, so that it does not allocate threads (keeps threads per core = 1) when countThreadsAsCores is false.
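The difference between the two counting modes, and why the 1:4:5 VM no longer fits on the 1:4:2 host, can be sketched with a toy class. This is an illustration only, not the actual ovirt-engine code: the class, its fields, and the assumption that N:M:K means sockets:cores-per-socket:threads-per-core are hypothetical; only the method names mirror the real vm.getNumOfCpus() / vm.getNumOfCpus(false).

```java
// Hypothetical sketch of the CPU-counting difference described above.
// Not the real ovirt-engine code; only the getNumOfCpus names are borrowed.
public class CpuCountSketch {
    final int sockets, coresPerSocket, threadsPerCore;

    CpuCountSketch(int sockets, int coresPerSocket, int threadsPerCore) {
        this.sockets = sockets;
        this.coresPerSocket = coresPerSocket;
        this.threadsPerCore = threadsPerCore;
    }

    // Mirrors vm.getNumOfCpus(): threads are counted.
    int getNumOfCpus() {
        return getNumOfCpus(true);
    }

    // Mirrors vm.getNumOfCpus(false): threads are ignored.
    int getNumOfCpus(boolean countThreads) {
        int cpus = sockets * coresPerSocket;
        return countThreads ? cpus * threadsPerCore : cpus;
    }

    public static void main(String[] args) {
        CpuCountSketch vm = new CpuCountSketch(1, 4, 5);   // 1:4:5 VM
        CpuCountSketch host = new CpuCountSketch(1, 4, 2); // 1:4:2 host
        int hostCpus = host.getNumOfCpus();                // 8 logical CPUs

        // After the refactoring: 20 > 8, so the host is filtered out.
        System.out.println("threads counted: " + vm.getNumOfCpus()
                + " vs host " + hostCpus);
        // Before the refactoring: 4 <= 8, so the VM was schedulable.
        System.out.println("threads ignored: " + vm.getNumOfCpus(false)
                + " vs host " + hostCpus);
    }
}
```

With threads counted the VM needs 20 CPUs against the host's 8, so the CPU filter rejects it; ignoring threads it needs only 4, which is the pre-refactoring behaviour.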
(In reply to Lucia Jelinkova from comment #5)
> We can fix it in two different ways
>
> a) change the CPUPolicyUnit back to use the vm.getNumOfCpus(false). The
> problem is that the pending resources and Vm manager (used for determining
> how many shared cpus need to stay on the host) count the cpuCount also as
> vm.getNumOfCpus() (they always did, also before the refactoring). I'd
> suggest to use the same approach in all of them.
> b) keep it as it is now and make the resize and pin aware of the
> countThreadsAsCores setting and not to allocate threads (kepp threads per
> core = 1) when countThreadsAsCores is false

If I understand (b) correctly, it would mean we assign fewer resources to the VM than we used to, right? In that case, we cannot do that.

(a) makes sense.
Verified on ovirt-engine-4.5.1.2-0.11.el8ev.noarch by running the automation tests in art/tests/rhevmtests/compute/sla/vm_auto_pinning/vm_auto_pinning_test.py
This bugzilla is included in oVirt 4.5.1 release, published on June 22nd 2022. Since the problem described in this bug report should be resolved in oVirt 4.5.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.