Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1948970

Summary: [RFE] Numa pinning does not work with dynamic huge page allocations
Product: Red Hat Enterprise Virtualization Manager
Reporter: Roman Hodain <rhodain>
Component: ovirt-engine
Assignee: Shmuel Melamud <smelamud>
Status: CLOSED DUPLICATE
QA Contact: meital avital <mavital>
Severity: medium
Priority: unspecified
Version: 4.4.4
CC: ahadas, smelamud
Target Milestone: ---
Target Release: ---
Keywords: FutureFeature
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2021-05-31 12:33:04 UTC
Type: Bug
Regression: ---
oVirt Team: Virt

Description Roman Hodain 2021-04-13 07:39:55 UTC
Description of problem:
The scheduler does not allow VMs to start if they are pinned to a NUMA node and use dynamic huge page allocation.

Version-Release number of selected component (if applicable):
4.4.4

How reproducible:
100%

Steps to Reproduce:
1. Create a VM
2. Set the hugepages size to 2048 in Custom Properties
3. Create a file /etc/vdsm/vdsm.conf.d/huge.conf with the content:
    [performance]
    use_dynamic_hugepages = true
4. Restart vdsm
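
Not part of the original report: a minimal read-only sketch (Python, using only the standard Linux sysfs paths, no vdsm internals assumed) to watch how the dynamically allocated 2M pages spread across the NUMA nodes while reproducing:

    import glob

    pattern = "/sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages"
    for path in sorted(glob.glob(pattern)):
        node = path.split("/")[5]          # e.g. "node0"
        with open(path) as f:
            print(node, f.read().strip())  # pages currently in that node's pool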

Actual results:
VM does not start

Expected results:
VM starts with huge pages allocated

Additional info:
The issue is related to the way the huge pages are allocated. They are allocated by vdsm during VM start, but the scheduler checks the huge pages available on a specific node before the VM is started, and the huge pages are not available at that time.
Another problem is that vdsm allocates the memory by writing the number of required pages to:

    /sys/kernel/mm/hugepages/hugepages-{}kB/nr_hugepages

That distributes the pages across all nodes, so in the case of 20 pages allocated on a system with 2 nodes, there will be 10 pages on numa0 and 10 on numa1. If the VM is pinned to just one of the nodes, it will not start, as only half of the memory is available there. The pages should be allocated based on the pinning, and the allocation should be done on specific nodes via:

/sys/devices/system/node/nodeX/hugepages/hugepages-{}kB/nr_hugepages
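
To illustrate the suggested fix, a hedged sketch (illustration only, not vdsm's actual code; needs root) that grows the hugepage pool on a specific node through the per-node sysfs file instead of the global one:

    PER_NODE = "/sys/devices/system/node/node{n}/hugepages/hugepages-{s}kB/nr_hugepages"

    def allocate_on_node(node, pages, size_kb=2048):
        # nr_hugepages holds the pool size, so read the current value and
        # write the new total; the kernel then reserves the extra pages on
        # that node only.
        path = PER_NODE.format(n=node, s=size_kb)
        with open(path) as f:
            current = int(f.read())
        with open(path, "w") as f:
            f.write(str(current + pages))

    allocate_on_node(0, 20)  # a VM pinned to node 0 that needs 20 x 2M pages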

Comment 2 Liran Rotenberg 2021-04-26 12:52:01 UTC
These are 2 RFEs from my point of view:

1. Enhance VDSM dynamic allocation - currently we set the hugepages in the OS without specifying nodes, and they are distributed between them. We don't take NUMA pinning into account, so if the VM has pinning and doesn't use all the NUMA nodes, this hurts us.
So we need to detect in VDSM whether the VM has NUMA pinning, and if it does, allocate hugepages to fit the pinning (see the first sketch below).

2. Let the engine schedule the VM with dynamic hugepages and NUMA pinning. As written, we need to treat the hugepages as "provided" and skip the hugepages check. If we fail to allocate them dynamically, QEMU will probably fail to start. For this, we might need to add a value to the VM that tells the engine we are using dynamic hugepage allocation (as we don't know it today). We may even consider passing it to VDSM and selecting dynamic hugepages for that VM without the need to change the VDSM config. Or, report it back from VDSM (i.e. host caps) to let the engine know that this host uses dynamic hugepages (see the second sketch below).
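
First sketch, for RFE 1 (hypothetical structures, not vdsm's real data model): derive per-host-node page counts from the VM's NUMA pinning so the allocation matches it:

    from collections import Counter

    def pages_by_host_node(vnode_to_host, pages_per_vnode):
        # vnode_to_host: vNUMA node -> pinned host node (hypothetical mapping)
        # pages_per_vnode: hugepages each vNUMA node needs
        counts = Counter()
        for vnode, host in vnode_to_host.items():
            counts[host] += pages_per_vnode[vnode]
        return dict(counts)

    # Two vNUMA nodes, both pinned to host node 1 -> all 20 pages on node 1:
    print(pages_by_host_node({0: 1, 1: 1}, {0: 10, 1: 10}))  # {1: 20}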
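
Second sketch, for RFE 2 (purely illustrative; the key names are invented, not vdsm's real caps schema): the host could advertise dynamic hugepage support in its capabilities report so the engine knows to skip the per-node availability check:

    # Hypothetical fragment of a host capabilities report.
    caps = {
        "hugepages": [2048, 1048576],  # supported page sizes in kB
        "dynamicHugepages": True,      # invented flag: host allocates on demand
    }

    if caps.get("dynamicHugepages"):
        # Engine side: treat hugepages as "provided" and skip the
        # per-node availability check during scheduling.
        pass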

Comment 3 Liran Rotenberg 2021-04-26 13:32:42 UTC
And another note to add - currently the dynamic allocation in VDSM only handles 2MB and 16MB (PPC) hugepages. We need to consider whether and how to add 1GB/16GB.

Comment 4 Arik 2021-05-10 11:47:05 UTC
That makes sense, thanks Liran
Let's separate the enhancements from the bug fix -
Shmuel, can you please check that it doesn't fail with numa tuning != strict?

Comment 5 Arik 2021-05-31 12:33:04 UTC
(In reply to Arik from comment #4)
> That makes sense, thanks Liran
> Let's separate the enhancements from the bug fix -
> Shmuel, can you please check that it doesn't fail with numa tuning != strict?

Looking at bz 1810853, it appears that it won't.

*** This bug has been marked as a duplicate of bug 1810853 ***