Description of problem:

When VDSM dynamically allocates huge pages, it simply writes the total number of HPs required by the VM to /sys/kernel/mm/hugepages/hugepages-{}kB/nr_hugepages. The kernel then distributes the huge pages among the NUMA nodes allowed for the calling PID (supervdsm):

~~~
On a NUMA platform, the kernel will attempt to distribute the huge page pool
over all the set of allowed nodes specified by the NUMA memory policy of the
task that modifies nr_hugepages. The default for the allowed nodes--when the
task has default memory policy--is all on-line nodes with memory. Allowed
nodes with insufficient available, contiguous memory for a huge page will be
silently skipped when allocating persistent huge pages. See the discussion
below of the interaction of task memory policy, cpusets and per node attributes
with the allocation and freeing of persistent huge pages.
~~~
Source: https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt

So, on a 4 NUMA node machine, allocating 1024 2M HPs results in 256 2M HPs per NUMA node:

2020-03-06 12:54:17,234+1000 INFO  (vm/8f54d8f6) [virt.vm] (vmId='8f54d8f6-3a16-49cd-980d-73ec684796c5') Allocating 1024 (2048) hugepages (memsize 2097152) (vm:2270)

$ for i in {0..3} ; do cat /sys/devices/system/node/node$i/hugepages/hugepages-2048kB/nr_hugepages; done
256
256
256
256

This is problematic for a few reasons:

* The VM memory will be distributed among all host NUMA nodes, even if it fits into one. This can cause performance issues.
* If the VM has NUMA pinning with a strict policy, it will fail to start, as there will not be enough HPs on the pinned node(s).

Version-Release number of selected component (if applicable):
vdsm-4.30.40-1.el7ev.x86_64

How reproducible:
Always

Steps to Reproduce:

1. 4 NUMA node RHVH, with dynamic hugepages enabled

# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0
node 0 size: 2047 MB
node 0 free: 1833 MB
node 1 cpus: 1
node 1 size: 2048 MB
node 1 free: 1806 MB
node 2 cpus: 2
node 2 size: 2048 MB
node 2 free: 1860 MB
node 3 cpus: 3
node 3 size: 2048 MB
node 3 free: 1888 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

# cat /etc/vdsm/vdsm.conf
[vars]
ssl_excludes = OP_NO_TLSv1,OP_NO_TLSv1_1
ssl = true
ssl_ciphers = HIGH:!aNULL

2. VM with 2 vNUMA nodes using 2M hugepages, pinned to the host above

<custom_properties>
  <custom_property>
    <name>hugepages</name>
    <value>2048</value>
  </custom_property>
</custom_properties>
<numa_tune_mode>strict</numa_tune_mode>
<vm_numa_nodes>
  <vm_numa_node href="/ovirt-engine/api/vms/8f54d8f6-3a16-49cd-980d-73ec684796c5/numanodes/4f3916bd-ee0b-4492-9ca9-115040597f76" id="4f3916bd-ee0b-4492-9ca9-115040597f76">
    <cpu>
      <cores>
        <core>
          <index>0</index>
        </core>
        <core>
          <index>1</index>
        </core>
      </cores>
    </cpu>
    <index>0</index>
    <memory>1024</memory>
    <numa_node_pins>
      <numa_node_pin>
        <index>1</index>
      </numa_node_pin>
    </numa_node_pins>
    <vm href="/ovirt-engine/api/vms/8f54d8f6-3a16-49cd-980d-73ec684796c5" id="8f54d8f6-3a16-49cd-980d-73ec684796c5"/>
  </vm_numa_node>
  <vm_numa_node href="/ovirt-engine/api/vms/8f54d8f6-3a16-49cd-980d-73ec684796c5/numanodes/3cd156e1-3d7c-4c0b-9dab-f66bbf8bc743" id="3cd156e1-3d7c-4c0b-9dab-f66bbf8bc743">
    <cpu>
      <cores>
        <core>
          <index>2</index>
        </core>
        <core>
          <index>3</index>
        </core>
      </cores>
    </cpu>
    <index>1</index>
    <memory>1024</memory>
    <numa_node_pins>
      <numa_node_pin>
        <index>2</index>
      </numa_node_pin>
    </numa_node_pins>
    <vm href="/ovirt-engine/api/vms/8f54d8f6-3a16-49cd-980d-73ec684796c5" id="8f54d8f6-3a16-49cd-980d-73ec684796c5"/>
  </vm_numa_node>
</vm_numa_nodes>

3. Start VM

...
<cpu match="exact">
  <model>Skylake-Client</model>
  <topology cores="1" sockets="16" threads="1" />
  <numa>
    <cell cpus="0,1" id="0" memory="1048576" />
    <cell cpus="2,3" id="1" memory="1048576" />
  </numa>
</cpu>
...

2020-03-06 12:54:20,053+1000 ERROR (vm/8f54d8f6) [virt.vm] (vmId='8f54d8f6-3a16-49cd-980d-73ec684796c5') The vm start process failed (vm:933)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 867, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2891, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1110, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)

2020-03-06T02:54:19.271151Z qemu-kvm: -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/6-RHEL7,size=1073741824,host-nodes=1,policy=bind: os_mem_prealloc: Insufficient free host memory pages available to allocate guest RAM

So what happened is that VDSM allocated:
NUMA 0: 256 2M HPs
NUMA 1: 256 2M HPs
NUMA 2: 256 2M HPs
NUMA 3: 256 2M HPs

But the VM needed:
NUMA 0: 0 2M HPs
NUMA 1: 512 2M HPs
NUMA 2: 512 2M HPs
NUMA 3: 0 2M HPs

If the NUMA policy is strict, the VM fails to start. Otherwise it starts, but with a performance penalty.

Actual results:
* VM fails to start if the NUMA policy is strict
* Suboptimal memory allocation

Expected results:
* VM always starts
* Optimal allocation from the pinned NUMA node(s)
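As a side note, a minimal sketch of what per-node allocation could look like, assuming the kernel exposes writable per-node nr_hugepages files (the same sysfs paths queried in the reproducer above). This is illustrative only, not the actual VDSM code, and the helper names are hypothetical:

~~~
# Illustrative sketch: grow the hugepage pool on specific NUMA nodes by
# writing to the per-node sysfs files instead of the global
# /sys/kernel/mm/hugepages/hugepages-<size>kB/nr_hugepages.

NODE_HP_PATH = (
    "/sys/devices/system/node/node{node}/hugepages/"
    "hugepages-{size}kB/nr_hugepages"
)


def read_nr_hugepages(node, size_kb):
    """Return the current number of hugepages of size_kb on the given node."""
    with open(NODE_HP_PATH.format(node=node, size=size_kb)) as f:
        return int(f.read())


def allocate_on_node(node, size_kb, count):
    """Grow the hugepage pool on a single NUMA node by count pages."""
    current = read_nr_hugepages(node, size_kb)
    with open(NODE_HP_PATH.format(node=node, size=size_kb), "w") as f:
        f.write(str(current + count))


# Example: the VM from the reproducer is pinned to host nodes 1 and 2
# and needs 512 x 2M hugepages on each of them.
for host_node, pages in {1: 512, 2: 512}.items():
    allocate_on_node(host_node, 2048, pages)
~~~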
We'll try to get this into 4.4.z, but targeting 4.5 in case it turns out to be tricky.
*** Bug 1948970 has been marked as a duplicate of this bug. ***
This bug/RFE is more than a year old and hasn't received enough attention so far, so it is now flagged as pending close. Please review whether it is still relevant and provide additional details/justification/patches if you believe it should get more attention for the next oVirt release.
A related question is whether we have a similar problem with static huge pages. AFAIK we don't specify NUMA nodes in <hugepages> and rely on the host OS to do the right thing. That may be good enough unless all of the following hold:

* static huge pages are combined with dynamic huge pages,
* there are not enough free static pages in the corresponding NUMA nodes,
* there is still free memory in those NUMA nodes, and
* there are free static huge pages in other NUMA nodes.

In such a case, the memory setup for the given VM will also be suboptimal, compared to allocating dynamic huge pages in the right NUMA nodes on top of what is already allocated elsewhere. But it can be argued that we needn't consider such a case (or static huge pages at all) -- if users wanted more huge pages allocated than strictly necessary, they would allocate as many static huge pages as possible in advance and avoid this problem.
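For illustration only, a check like the following could tell whether the free static pages actually live on the NUMA nodes the VM is pinned to; the per-node free_hugepages sysfs file is what it reads, and the helper names and the needed_per_node mapping are hypothetical, not existing VDSM code:

~~~
# Illustrative sketch: verify that enough *free* static hugepages exist on
# the NUMA nodes the VM is pinned to before relying on them.

FREE_HP_PATH = (
    "/sys/devices/system/node/node{node}/hugepages/"
    "hugepages-{size}kB/free_hugepages"
)


def free_hugepages(node, size_kb):
    """Return the number of currently free hugepages of size_kb on node."""
    with open(FREE_HP_PATH.format(node=node, size=size_kb)) as f:
        return int(f.read())


def nodes_can_back_vm(needed_per_node, size_kb=2048):
    """needed_per_node: {host_node: pages} derived from the VM's NUMA pinning."""
    return all(
        free_hugepages(node, size_kb) >= pages
        for node, pages in needed_per_node.items()
    )


# Example for the VM in the description: 512 x 2M pages on host nodes 1 and 2.
print(nodes_can_back_vm({1: 512, 2: 512}))
~~~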
This bug hasn't received any attention for a long time, and it's not planned for the foreseeable future. The oVirt development team has no plans to work on it. Please feel free to reopen if you have a plan for how to contribute this feature/bug fix.
We are past the 4.5.0 feature freeze; please re-target.
No updates for a long time (again), missed the release GA (again), closing (again).