Description of problem:

When VDSM dynamically allocates huge pages, it simply writes the total number of HPs required by the VM to /sys/kernel/mm/hugepages/hugepages-{}kB/nr_hugepages. The kernel then distributes the huge pages among the NUMA nodes allowed for the calling PID (supervdsm):

~~~
On a NUMA platform, the kernel will attempt to distribute the huge page pool
over all the set of allowed nodes specified by the NUMA memory policy of the
task that modifies nr_hugepages. The default for the allowed nodes--when the
task has default memory policy--is all on-line nodes with memory. Allowed
nodes with insufficient available, contiguous memory for a huge page will be
silently skipped when allocating persistent huge pages. See the discussion
below of the interaction of task memory policy, cpusets and per node attributes
with the allocation and freeing of persistent huge pages.
~~~
Source: https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt

So, on a 4 NUMA node machine, allocating 1024 2M HPs results in 256 2M HPs per NUMA node:

2020-03-06 12:54:17,234+1000 INFO  (vm/8f54d8f6) [virt.vm] (vmId='8f54d8f6-3a16-49cd-980d-73ec684796c5') Allocating 1024 (2048) hugepages (memsize 2097152) (vm:2270)

$ for i in {0..3} ; do cat /sys/devices/system/node/node$i/hugepages/hugepages-2048kB/nr_hugepages; done
256
256
256
256

This is problematic for a few reasons:

* The VM memory will be distributed among all host NUMA nodes, even if it fits into one. This can cause performance issues.
* If the VM has NUMA pinning with a strict policy, it will fail to start, as there will not be enough HPs on the pinned node(s).

Version-Release number of selected component (if applicable):
vdsm-4.30.40-1.el7ev.x86_64

How reproducible:
Always

Steps to Reproduce:

1. 4 NUMA node RHVH, with dynamic hugepages enabled

# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0
node 0 size: 2047 MB
node 0 free: 1833 MB
node 1 cpus: 1
node 1 size: 2048 MB
node 1 free: 1806 MB
node 2 cpus: 2
node 2 size: 2048 MB
node 2 free: 1860 MB
node 3 cpus: 3
node 3 size: 2048 MB
node 3 free: 1888 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

# cat /etc/vdsm/vdsm.conf
[vars]
ssl_excludes = OP_NO_TLSv1,OP_NO_TLSv1_1
ssl = true
ssl_ciphers = HIGH:!aNULL

2. VM with 2 vNUMA nodes using 2M hugepages, pinned to the host above

<custom_properties>
  <custom_property>
    <name>hugepages</name>
    <value>2048</value>
  </custom_property>
</custom_properties>
<numa_tune_mode>strict</numa_tune_mode>
<vm_numa_nodes>
  <vm_numa_node href="/ovirt-engine/api/vms/8f54d8f6-3a16-49cd-980d-73ec684796c5/numanodes/4f3916bd-ee0b-4492-9ca9-115040597f76" id="4f3916bd-ee0b-4492-9ca9-115040597f76">
    <cpu>
      <cores>
        <core>
          <index>0</index>
        </core>
        <core>
          <index>1</index>
        </core>
      </cores>
    </cpu>
    <index>0</index>
    <memory>1024</memory>
    <numa_node_pins>
      <numa_node_pin>
        <index>1</index>
      </numa_node_pin>
    </numa_node_pins>
    <vm href="/ovirt-engine/api/vms/8f54d8f6-3a16-49cd-980d-73ec684796c5" id="8f54d8f6-3a16-49cd-980d-73ec684796c5"/>
  </vm_numa_node>
  <vm_numa_node href="/ovirt-engine/api/vms/8f54d8f6-3a16-49cd-980d-73ec684796c5/numanodes/3cd156e1-3d7c-4c0b-9dab-f66bbf8bc743" id="3cd156e1-3d7c-4c0b-9dab-f66bbf8bc743">
    <cpu>
      <cores>
        <core>
          <index>2</index>
        </core>
        <core>
          <index>3</index>
        </core>
      </cores>
    </cpu>
    <index>1</index>
    <memory>1024</memory>
    <numa_node_pins>
      <numa_node_pin>
        <index>2</index>
      </numa_node_pin>
    </numa_node_pins>
    <vm href="/ovirt-engine/api/vms/8f54d8f6-3a16-49cd-980d-73ec684796c5" id="8f54d8f6-3a16-49cd-980d-73ec684796c5"/>
  </vm_numa_node>
</vm_numa_nodes>

3. Start VM

...
<cpu match="exact">
  <model>Skylake-Client</model>
  <topology cores="1" sockets="16" threads="1" />
  <numa>
    <cell cpus="0,1" id="0" memory="1048576" />
    <cell cpus="2,3" id="1" memory="1048576" />
  </numa>
</cpu>
...

2020-03-06 12:54:20,053+1000 ERROR (vm/8f54d8f6) [virt.vm] (vmId='8f54d8f6-3a16-49cd-980d-73ec684796c5') The vm start process failed (vm:933)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 867, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2891, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1110, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)

2020-03-06T02:54:19.271151Z qemu-kvm: -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/6-RHEL7,size=1073741824,host-nodes=1,policy=bind: os_mem_prealloc: Insufficient free host memory pages available to allocate guest RAM

So what happened is that VDSM allocated:
NUMA 0: 256 2M HPs
NUMA 1: 256 2M HPs
NUMA 2: 256 2M HPs
NUMA 3: 256 2M HPs

But the VM needed:
NUMA 0: 0 2M HPs
NUMA 1: 512 2M HPs
NUMA 2: 512 2M HPs
NUMA 3: 0 2M HPs

If the NUMA policy is strict, the VM fails to start. Otherwise it starts, but with a performance penalty.

Actual results:
* VM fails to start if the NUMA policy is strict
* Suboptimal memory allocation

Expected results:
* VM always starts
* Optimal allocation from the pinned NUMA node(s)
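As a side note, a minimal sketch of what per-node allocation could look like, assuming the kernel exposes writable per-node nr_hugepages files (the same sysfs paths queried in the reproducer above). This is illustrative only, not the actual VDSM code, and the helper names are hypothetical:

~~~
# Illustrative sketch: grow the hugepage pool on specific NUMA nodes by
# writing to the per-node sysfs files instead of the global
# /sys/kernel/mm/hugepages/hugepages-<size>kB/nr_hugepages.

NODE_HP_PATH = (
    "/sys/devices/system/node/node{node}/hugepages/"
    "hugepages-{size}kB/nr_hugepages"
)


def read_nr_hugepages(node, size_kb):
    """Return the current number of hugepages of size_kb on the given node."""
    with open(NODE_HP_PATH.format(node=node, size=size_kb)) as f:
        return int(f.read())


def allocate_on_node(node, size_kb, count):
    """Grow the hugepage pool on a single NUMA node by count pages."""
    current = read_nr_hugepages(node, size_kb)
    with open(NODE_HP_PATH.format(node=node, size=size_kb), "w") as f:
        f.write(str(current + count))


# Example: the VM from the reproducer is pinned to host nodes 1 and 2
# and needs 512 x 2M hugepages on each of them.
for host_node, pages in {1: 512, 2: 512}.items():
    allocate_on_node(host_node, 2048, pages)
~~~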
We'll try to get this into 4.4.z, but targeting 4.5 in case it turns out to be tricky.
*** Bug 1948970 has been marked as a duplicate of this bug. ***
This bug/RFE is more than a year old and hasn't received enough attention so far, so it is now flagged as pending close. Please review whether it is still relevant and provide additional details/justification/patches if you believe it should get more attention for the next oVirt release.
A related question is whether we have a similar problem with static huge pages. AFAIK we don't specify NUMA nodes in <hugepages> and rely on the host OS to do the right thing. That may be good enough unless all of the following hold:

* static huge pages are combined with dynamic huge pages,
* there are not enough free static pages in the corresponding NUMA nodes,
* there is still free memory in those NUMA nodes, and
* there are free static huge pages in other NUMA nodes.

In such a case, the memory setup for the given VM will also be suboptimal, compared to allocating dynamic huge pages in the right NUMA nodes on top of what is already allocated elsewhere. But it can be argued that we needn't consider such a case (or static huge pages at all) -- if users wanted more huge pages allocated than strictly necessary, they would allocate as many static huge pages as possible in advance and avoid this problem.
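For illustration only, a check like the following could tell whether the free static pages actually live on the NUMA nodes the VM is pinned to; the per-node free_hugepages sysfs file is what it reads, and the helper names and the needed_per_node mapping are hypothetical, not existing VDSM code:

~~~
# Illustrative sketch: verify that enough *free* static hugepages exist on
# the NUMA nodes the VM is pinned to before relying on them.

FREE_HP_PATH = (
    "/sys/devices/system/node/node{node}/hugepages/"
    "hugepages-{size}kB/free_hugepages"
)


def free_hugepages(node, size_kb):
    """Return the number of currently free hugepages of size_kb on node."""
    with open(FREE_HP_PATH.format(node=node, size=size_kb)) as f:
        return int(f.read())


def nodes_can_back_vm(needed_per_node, size_kb=2048):
    """needed_per_node: {host_node: pages} derived from the VM's NUMA pinning."""
    return all(
        free_hugepages(node, size_kb) >= pages
        for node, pages in needed_per_node.items()
    )


# Example for the VM in the description: 512 x 2M pages on host nodes 1 and 2.
print(nodes_can_back_vm({1: 512, 2: 512}))
~~~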
This bug hasn't received any attention for a long time, and it's not planned for the foreseeable future. The oVirt development team has no plans to work on it. Please feel free to reopen if you have a plan for how to contribute this feature/bug fix.
We are past the 4.5.0 feature freeze; please re-target.
No updates for a long time (again), missed the release GA (again), closing (again).