Bug 1810853 - [RFE] VDSM dynamic huge pages allocation is not NUMA aware
Summary: [RFE] VDSM dynamic huge pages allocation is not NUMA aware
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.3.8
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.5.0
Assignee: Milan Zamazal
QA Contact: meital avital
URL:
Whiteboard:
Duplicates: 1948970
Depends On:
Blocks:
 
Reported: 2020-03-06 03:07 UTC by Germano Veit Michel
Modified: 2021-07-27 11:28 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team: Virt
Target Upstream Version:



Description Germano Veit Michel 2020-03-06 03:07:50 UTC
Description of problem:

When VDSM dynamically allocates huge pages, it simply writes the total number of hugepages required by the VM to /sys/kernel/mm/hugepages/hugepages-{}kB/nr_hugepages.

The kernel then distributes the hugepages among the NUMA nodes allowed for the calling PID (supervdsm):
~~~
On a NUMA platform, the kernel will attempt to distribute the huge page pool
over all the set of allowed nodes specified by the NUMA memory policy of the
task that modifies nr_hugepages.  The default for the allowed nodes--when the
task has default memory policy--is all on-line nodes with memory.  Allowed
nodes with insufficient available, contiguous memory for a huge page will be
silently skipped when allocating persistent huge pages.  See the discussion
below of the interaction of task memory policy, cpusets and per node attributes
with the allocation and freeing of persistent huge pages.
~~~
Source: https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
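
The same kernel document also describes per-node counters under /sys/devices/system/node/node<N>/hugepages/ (visible in the reproducer below), which is the interface a NUMA-aware allocation would have to use instead of the global one. A minimal sketch of the difference, for illustration only (not VDSM code; function names are made up):

~~~
# Minimal sketch contrasting the two sysfs interfaces from
# Documentation/vm/hugetlbpage.txt. Illustration only, not VDSM code.

GLOBAL = "/sys/kernel/mm/hugepages/hugepages-{size}kB/nr_hugepages"
PER_NODE = ("/sys/devices/system/node/node{node}"
            "/hugepages/hugepages-{size}kB/nr_hugepages")


def _grow_pool(path, count):
    # Both files hold an absolute pool size, so read the current
    # value and write the increased total.
    with open(path) as f:
        current = int(f.read())
    with open(path, "w") as f:
        f.write(str(current + count))


def allocate_global(size_kb, count):
    # What the report describes VDSM doing today: the kernel spreads
    # the new pages over all allowed NUMA nodes.
    _grow_pool(GLOBAL.format(size=size_kb), count)


def allocate_on_node(size_kb, count, node):
    # NUMA-aware variant: grow the pool only on the node the VM's
    # memory is pinned to.
    _grow_pool(PER_NODE.format(node=node, size=size_kb), count)
~~~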

So, on a 4 NUMA node machine, allocating 1024 2M HPs results in 256 2M HPs per NUMA node:

2020-03-06 12:54:17,234+1000 INFO  (vm/8f54d8f6) [virt.vm] (vmId='8f54d8f6-3a16-49cd-980d-73ec684796c5') Allocating 1024 (2048) hugepages (memsize 2097152) (vm:2270)

$ for i in {0..3} ; do cat /sys/devices/system/node/node$i/hugepages/hugepages-2048kB/nr_hugepages; done
256
256
256
256

This is problematic for a few reasons:
* The VM memory will be distributed across all host NUMA nodes, even if it fits into one. This can cause performance issues.
* If the VM has NUMA pinning set to strict, it will fail to start, as there will not be enough HPs on the pinned nodes.

Version-Release number of selected component (if applicable):
vdsm-4.30.40-1.el7ev.x86_64

How reproducible:
Always

Steps to Reproduce:
1. 4 NUMA node RHVH, with dynamic hugepages enabled

# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0
node 0 size: 2047 MB
node 0 free: 1833 MB
node 1 cpus: 1
node 1 size: 2048 MB
node 1 free: 1806 MB
node 2 cpus: 2
node 2 size: 2048 MB
node 2 free: 1860 MB
node 3 cpus: 3
node 3 size: 2048 MB
node 3 free: 1888 MB
node distances:
node   0   1   2   3 
  0:  10  20  20  20 
  1:  20  10  20  20 
  2:  20  20  10  20 
  3:  20  20  20  10 

# cat /etc/vdsm/vdsm.conf
[vars]
ssl_excludes = OP_NO_TLSv1,OP_NO_TLSv1_1
ssl = true
ssl_ciphers = HIGH:!aNULL

2. VM with 2 vNUMA nodes using 2M hugepages, pinned to the host above

<custom_properties>
  <custom_property>
    <name>hugepages</name>
    <value>2048</value>
  </custom_property>
</custom_properties>

<numa_tune_mode>strict</numa_tune_mode>

<vm_numa_nodes>
  <vm_numa_node href="/ovirt-engine/api/vms/8f54d8f6-3a16-49cd-980d-73ec684796c5/numanodes/4f3916bd-ee0b-4492-9ca9-115040597f76" id="4f3916bd-ee0b-4492-9ca9-115040597f76">
    <cpu>
      <cores>
        <core>
          <index>0</index>
        </core>
        <core>
          <index>1</index>
        </core>
      </cores>
    </cpu>
    <index>0</index>
    <memory>1024</memory>
    <numa_node_pins>
      <numa_node_pin>
        <index>1</index>
      </numa_node_pin>
    </numa_node_pins>
    <vm href="/ovirt-engine/api/vms/8f54d8f6-3a16-49cd-980d-73ec684796c5" id="8f54d8f6-3a16-49cd-980d-73ec684796c5"/>
  </vm_numa_node>
  <vm_numa_node href="/ovirt-engine/api/vms/8f54d8f6-3a16-49cd-980d-73ec684796c5/numanodes/3cd156e1-3d7c-4c0b-9dab-f66bbf8bc743" id="3cd156e1-3d7c-4c0b-9dab-f66bbf8bc743">
    <cpu>
      <cores>
        <core>
          <index>2</index>
        </core>
        <core>
          <index>3</index>
        </core>
      </cores>
    </cpu>
    <index>1</index>
    <memory>1024</memory>
    <numa_node_pins>
      <numa_node_pin>
        <index>2</index>
      </numa_node_pin>
    </numa_node_pins>
    <vm href="/ovirt-engine/api/vms/8f54d8f6-3a16-49cd-980d-73ec684796c5" id="8f54d8f6-3a16-49cd-980d-73ec684796c5"/>
  </vm_numa_node>
</vm_numa_nodes>

3. Start VM
...
    <cpu match="exact">
        <model>Skylake-Client</model>
        <topology cores="1" sockets="16" threads="1" />
        <numa>
            <cell cpus="0,1" id="0" memory="1048576" />
            <cell cpus="2,3" id="1" memory="1048576" />
        </numa>
    </cpu>
...

2020-03-06 12:54:20,053+1000 ERROR (vm/8f54d8f6) [virt.vm] (vmId='8f54d8f6-3a16-49cd-980d-73ec684796c5') The vm start process failed (vm:933)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 867, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2891, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1110, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
2020-03-06T02:54:19.271151Z qemu-kvm: -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/6-RHEL7,size=1073741824,host-nodes=1,policy=bind: os_mem_prealloc: Insufficient free host memory pages available to allocate guest RAM

So what happened is that VDSM allocated:
NUMA 0: 256 2M HPs
NUMA 1: 256 2M HPs
NUMA 2: 256 2M HPs
NUMA 3: 256 2M HPs

But the VM needed:
NUMA 0: 0 2M HPs
NUMA 1: 512 2M HPs
NUMA 2: 512 2M HPs
NUMA 3: 0 2M HPs

If the NUMA policy is strict, the VM fails to start. Otherwise it starts but incurs a performance penalty.
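
A NUMA-aware allocation would instead derive a per-host-node page count from the vNUMA pinning above and grow only those nodes' pools. A rough sketch of that idea (hypothetical helper names, not a proposed VDSM patch; it reuses the per-node sysfs files shown earlier):

~~~
# Rough sketch for the reproducer above: vNUMA node 0 (1024 MiB) is
# pinned to host node 1, vNUMA node 1 (1024 MiB) to host node 2.
# Hypothetical helper, not VDSM code.

PER_NODE = ("/sys/devices/system/node/node{node}"
            "/hugepages/hugepages-{size}kB/nr_hugepages")


def hugepages_per_host_node(vm_numa_nodes, page_size_kb):
    # Map host NUMA node -> hugepages needed there.
    # vm_numa_nodes is a list of (memory_mb, pinned_host_node) pairs
    # taken from the VM definition.
    needed = {}
    for memory_mb, host_node in vm_numa_nodes:
        pages = memory_mb * 1024 // page_size_kb
        needed[host_node] = needed.get(host_node, 0) + pages
    return needed


def allocate(vm_numa_nodes, page_size_kb=2048):
    for node, pages in hugepages_per_host_node(vm_numa_nodes,
                                               page_size_kb).items():
        path = PER_NODE.format(node=node, size=page_size_kb)
        with open(path) as f:
            current = int(f.read())
        with open(path, "w") as f:
            f.write(str(current + pages))


# For this VM: 512 pages on host node 1, 512 on host node 2,
# nothing on nodes 0 and 3.
allocate([(1024, 1), (1024, 2)])
~~~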

Actual results:
* VM fails to start if numa policy is strict
* Suboptimal memory allocation

Expected results:
* VM always starts
* Hugepages allocated on the NUMA nodes the VM's memory is pinned to

Comment 1 Ryan Barry 2020-03-07 00:54:16 UTC
We'll try for 4.4.z, but targeting 4.5 in case it's tricky

Comment 2 Arik 2021-05-31 12:33:04 UTC
*** Bug 1948970 has been marked as a duplicate of this bug. ***

