Created attachment 1888315 [details]
engine log

Description of problem:
On setups with three identical hosts supporting NUMA (the same host topology), a VM with the resize_and_pin policy can start on any of them but cannot migrate.

Version-Release number of selected component (if applicable):
ovirt-engine-4.5.1-0.62.el8ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Configure a VM with the resize_and_pin policy on a setup with three identical hosts supporting NUMA.
2. Try to start the VM on each of the three hosts. The VM starts successfully on any host in the setup.
3. Try to migrate in the UI - no host is offered that the VM could migrate to.
4. Try to migrate via the API.

Actual results:
<fault>
  <detail>[Cannot migrate VM. There is no host that satisfies current scheduling constraints. See below for details:, The host host_mixed_1 did not satisfy internal filter CPU because it has an insufficient amount of CPU cores to run the VM.]</detail>
  <reason>Operation Failed</reason>
</fault>

Expected results:
The VM must migrate.

Additional info:
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
So the hosts are filtered out in CPUPolicyUnit#filter - I have a feeling that we count threads as cores for the shared CPUs and end up with a negative number there.
Hi Arik, you're right. During the refactoring of the CPUPolicyUnit we started to count the VM's CPUs as vm.getNumOfCpus() instead of vm.getNumOfCpus(false). This means that the threads are now included in the VM's CPU count when scheduling a VM. As a result, a VM with a 1:4:5 configuration could run on a 1:4:2 host before the change, but not now.

We can fix it in two different ways:

a) Change the CPUPolicyUnit back to use vm.getNumOfCpus(false). The problem is that the pending resources and the VM manager (used for determining how many shared CPUs need to stay on the host) also count the cpuCount as vm.getNumOfCpus() (they always did, also before the refactoring). I'd suggest using the same approach in all of them.

b) Keep it as it is now and make resize and pin aware of the countThreadsAsCores setting, so that it does not allocate threads (keeps threads per core = 1) when countThreadsAsCores is false.
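The difference between the two counting modes, and why the 1:4:5 VM no longer fits on the 1:4:2 host, can be sketched with a toy class. This is an illustration only, not the actual ovirt-engine code: the class, its fields, and the assumption that N:M:K means sockets:cores-per-socket:threads-per-core are hypothetical; only the method names mirror the real vm.getNumOfCpus() / vm.getNumOfCpus(false).

```java
// Hypothetical sketch of the CPU-counting difference described above.
// Not the real ovirt-engine code; only the getNumOfCpus names are borrowed.
public class CpuCountSketch {
    final int sockets, coresPerSocket, threadsPerCore;

    CpuCountSketch(int sockets, int coresPerSocket, int threadsPerCore) {
        this.sockets = sockets;
        this.coresPerSocket = coresPerSocket;
        this.threadsPerCore = threadsPerCore;
    }

    // Mirrors vm.getNumOfCpus(): threads are counted.
    int getNumOfCpus() {
        return getNumOfCpus(true);
    }

    // Mirrors vm.getNumOfCpus(false): threads are ignored.
    int getNumOfCpus(boolean countThreads) {
        int cpus = sockets * coresPerSocket;
        return countThreads ? cpus * threadsPerCore : cpus;
    }

    public static void main(String[] args) {
        CpuCountSketch vm = new CpuCountSketch(1, 4, 5);   // 1:4:5 VM
        CpuCountSketch host = new CpuCountSketch(1, 4, 2); // 1:4:2 host
        int hostCpus = host.getNumOfCpus();                // 8 logical CPUs

        // After the refactoring: 20 > 8, so the host is filtered out.
        System.out.println("threads counted: " + vm.getNumOfCpus()
                + " vs host " + hostCpus);
        // Before the refactoring: 4 <= 8, so the VM was schedulable.
        System.out.println("threads ignored: " + vm.getNumOfCpus(false)
                + " vs host " + hostCpus);
    }
}
```

With threads counted the VM needs 20 CPUs against the host's 8, so the CPU filter rejects it; ignoring threads it needs only 4, which is the pre-refactoring behaviour.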
(In reply to Lucia Jelinkova from comment #5)
> We can fix it in two different ways
>
> a) change the CPUPolicyUnit back to use the vm.getNumOfCpus(false). The
> problem is that the pending resources and Vm manager (used for determining
> how many shared cpus need to stay on the host) count the cpuCount also as
> vm.getNumOfCpus() (they always did, also before the refactoring). I'd
> suggest to use the same approach in all of them.
> b) keep it as it is now and make the resize and pin aware of the
> countThreadsAsCores setting and not to allocate threads (kepp threads per
> core = 1) when countThreadsAsCores is false

If I understand (b) correctly, it would mean we assign fewer resources to the VM than we used to, right? In that case, we cannot do that.

(a) makes sense.
Verified on ovirt-engine-4.5.1.2-0.11.el8ev.noarch by running the automation tests in art/tests/rhevmtests/compute/sla/vm_auto_pinning/vm_auto_pinning_test.py
This bugzilla is included in oVirt 4.5.1 release, published on June 22nd 2022. Since the problem described in this bug report should be resolved in oVirt 4.5.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.