Created attachment 1356848 [details]
engine and vdsm logs

Description of problem:
The engine fails to start a VM with 1 GB hugepages and NUMA pinning.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20171116212005.git61ffb5f.el7.centos.noarch
vdsm-4.20.7-34.gitab15536.el7.centos.x86_64
qemu-kvm-common-ev-2.9.0-16.el7_4.8.1.x86_64
qemu-kvm-ev-2.9.0-16.el7_4.8.1.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Configure a VM:
   - Memory: 3 GB
   - CPUs: 2
   - Hugepages custom property: 1048576
   - Pin it to a host with at least two NUMA nodes
   - Number of VM NUMA nodes: 2
   - Pin each VM NUMA node to a separate physical NUMA node
2. Start the VM

Actual results:
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1069, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: process exited while connecting to monitor: qemu_madvise: Invalid argument
madvise doesn't support MADV_DONTDUMP, but dump_guest_core=off specified
2017-11-21T16:00:39.054676Z qemu-kvm: -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages1G/libvirt/qemu/1-golden_env_mixed_vir,size=1610612736,host-nodes=0,policy=interleave: cannot bind memory to host NUMA nodes: Invalid argument
2017-11-21 18:00:39,569+0200 INFO (vm/fd80374a) [virt.vm] (vmId='fd80374a-cb00-4f78-9121-6e87cee581c0') Changed state to Down: internal error: process exited while connecting to monitor: qemu_madvise: Invalid argument

Expected results:
I think we have two options: either block the VM start at the scheduler level, or somehow round the per-NUMA-node hugepage usage.

Additional info:
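To make the arithmetic behind the failure explicit, here is a minimal sketch (not part of the original report): with 3 GB of guest memory split uniformly across two NUMA nodes, each node gets 1610612736 bytes (exactly the size= value in the qemu error above), which is not a multiple of the 1 GiB hugepage size, so qemu cannot bind the memory backend to the host node.

# Illustration of the failing configuration from this report.
GiB = 1024 ** 3

vm_memory = 3 * GiB          # "Memory: 3 GB" in the steps above
numa_nodes = 2               # two VM NUMA nodes, pinned 1:1 to host nodes
hugepage_size = 1 * GiB      # hugepages custom property 1048576 (KiB) = 1 GiB

per_node = vm_memory // numa_nodes
print(per_node)                       # 1610612736, matches the qemu error above
print(per_node % hugepage_size == 0)  # False -> the allocation cannot be satisfied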
Thoughts?
https://bugzilla.redhat.com/show_bug.cgi?id=1499492#c7 ?
(In reply to Yaniv Kaul from comment #2)
> https://bugzilla.redhat.com/show_bug.cgi?id=1499492#c7 ?

Exactly, that is the issue. You need to make sure that the memory of each NUMA node fits the 1 GB boundary of the hugepages.

So e.g. the following fails:
memory = 21GB
NUMA Nodes = 4
NUMA Memory Node 1 = 5.25GB
NUMA Memory Node 2 = 5.25GB
NUMA Memory Node 3 = 5.25GB
NUMA Memory Node 4 = 5.25GB

It would succeed with an asymmetric reservation, e.g.:
memory = 21GB
NUMA Nodes = 4
NUMA Memory Node 1 = 6.00GB
NUMA Memory Node 2 = 5.00GB
NUMA Memory Node 3 = 5.00GB
NUMA Memory Node 4 = 5.00GB

As you don't know in advance how many NUMA nodes are created, you cannot check the total memory, and as such I believe we need to do the asymmetric reservation and maybe print a warning ("NUMA node memory imbalanced because max memory / NUMA nodes does not fit the hugepage boundary").

Thoughts?
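A rough sketch of the asymmetric reservation described above (my illustration, not engine code): distribute whole hugepages across the nodes and let the first node(s) absorb the remainder, which reproduces the 6/5/5/5 GB split for the 21 GB example.

GiB = 1024 ** 3

def asymmetric_split(total, nodes, page_size):
    """Distribute `total` bytes across `nodes` NUMA nodes in
    hugepage-sized chunks; earlier nodes absorb the remainder."""
    if total % page_size:
        raise ValueError("total memory must itself be hugepage-aligned")
    pages = total // page_size
    base, extra = divmod(pages, nodes)
    # the first `extra` nodes get one extra hugepage each
    return [(base + (1 if i < extra else 0)) * page_size for i in range(nodes)]

print([s // GiB for s in asymmetric_split(21 * GiB, 4, GiB)])  # [6, 5, 5, 5]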
(In reply to Martin Tessun from comment #3)
> (In reply to Yaniv Kaul from comment #2)
> > https://bugzilla.redhat.com/show_bug.cgi?id=1499492#c7 ?
> 
> Exactly, that is the issue. You need to make sure that the memory of each
> NUMA node fits the 1 GB boundary of the hugepages.
> 
> So e.g. the following fails:
> memory = 21GB
> NUMA Nodes = 4
> NUMA Memory Node 1 = 5.25GB
> NUMA Memory Node 2 = 5.25GB
> NUMA Memory Node 3 = 5.25GB
> NUMA Memory Node 4 = 5.25GB
> 
> It would succeed with an asymmetric reservation, e.g.:
> memory = 21GB
> NUMA Nodes = 4
> NUMA Memory Node 1 = 6.00GB
> NUMA Memory Node 2 = 5.00GB
> NUMA Memory Node 3 = 5.00GB
> NUMA Memory Node 4 = 5.00GB
> 
> As you don't know in advance how many NUMA nodes are created, you cannot
> check the total memory, and as such I believe we need to do the asymmetric
> reservation and maybe print a warning ("NUMA node memory imbalanced because
> max memory / NUMA nodes does not fit the hugepage boundary").
> 
> Thoughts?

I'd limit our solution to uniform memory distribution across NUMA nodes, and then make sure the total memory is evenly divisible by the number of configured NUMA nodes - just to ensure we properly fail before running? When you pin, you should know the theoretical values (of course, some memory may already be taken by the time you try to run!).
(In reply to Yaniv Kaul from comment #4)
> (In reply to Martin Tessun from comment #3)
> > (In reply to Yaniv Kaul from comment #2)
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1499492#c7 ?
> > 
> > Exactly, that is the issue. You need to make sure that the memory of each
> > NUMA node fits the 1 GB boundary of the hugepages.
> > 
> > So e.g. the following fails:
> > memory = 21GB
> > NUMA Nodes = 4
> > NUMA Memory Node 1 = 5.25GB
> > NUMA Memory Node 2 = 5.25GB
> > NUMA Memory Node 3 = 5.25GB
> > NUMA Memory Node 4 = 5.25GB
> > 
> > It would succeed with an asymmetric reservation, e.g.:
> > memory = 21GB
> > NUMA Nodes = 4
> > NUMA Memory Node 1 = 6.00GB
> > NUMA Memory Node 2 = 5.00GB
> > NUMA Memory Node 3 = 5.00GB
> > NUMA Memory Node 4 = 5.00GB
> > 
> > As you don't know in advance how many NUMA nodes are created, you cannot
> > check the total memory, and as such I believe we need to do the asymmetric
> > reservation and maybe print a warning ("NUMA node memory imbalanced because
> > max memory / NUMA nodes does not fit the hugepage boundary").
> > 
> > Thoughts?
> 
> I'd limit our solution to uniform memory distribution across NUMA nodes, and
> then make sure the total memory is evenly divisible by the number of
> configured NUMA nodes - just to ensure we properly fail before running?

Why not fail on save? We know how many NUMA nodes we will create on run and how much memory we need to divide between them. We could provide a good validation message.

> When you pin, you should know the theoretical values (of course, some memory
> may already be taken by the time you try to run!).
It should be easy enough to split the memory into the right chunks on the engine side - in the original case into 2 GB and 1 GB. We do define the amount of memory in each node, I believe - Martin?
We do allow custom-size NUMA nodes when configured through the REST API (IIRC), but the UI-based flow distributes the memory uniformly.

Btw, it is hard to show a meaningful validation when hugepage sizes are set via the generic custom properties approach.
(In reply to Martin Sivák from comment #7)
> We do allow custom-size NUMA nodes when configured through the REST API
> (IIRC), but the UI-based flow distributes the memory uniformly.

So the UI-based flow could have some logic to distribute them non-uniformly :) But there will need to be a validation anyway, because there is a chance the memory cannot be split correctly at all (e.g. 2 NUMA nodes, 1G pages and 1G of memory, so one of the nodes would end up with no memory).

> Btw, it is hard to show a meaningful validation when hugepage sizes are set
> via the generic custom properties approach.

I don't see the issue here. It is just a property and we use it.
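A minimal sketch of the kind of save-time validation discussed here (assumed function name and message wording, not the actual engine code): it rejects both a misaligned per-node size and the degenerate case above where a node would get no memory.

def validate_numa_hugepages(node_sizes, page_size):
    """Return a list of validation errors (empty when the config is fine).
    node_sizes: per-NUMA-node memory in bytes; page_size: hugepage size in bytes."""
    errors = []
    for i, size in enumerate(node_sizes):
        if size == 0:
            errors.append("NUMA node %d would end up with no memory" % i)
        elif size % page_size:
            errors.append("memory of NUMA node %d (%d bytes) is not a multiple "
                          "of the hugepage size (%d bytes)" % (i, size, page_size))
    return errors

GiB = 1024 ** 3
print(validate_numa_hugepages([GiB, 0], GiB))              # the 1G/2-node edge case above
print(validate_numa_hugepages([GiB + GiB // 2] * 2, GiB))  # the 1.5 GB-per-node case
print(validate_numa_hugepages([GiB, GiB], GiB))            # OK: []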
Checked on rhvm-4.2.2.1-0.1.el7.noarch

VM configuration:

<vm>
    <name>golden_env_mixed_virtio_0</name>
    </bios>
    <cpu>
        <architecture>x86_64</architecture>
        <topology>
            <cores>2</cores>
            <sockets>1</sockets>
            <threads>1</threads>
        </topology>
    </cpu>
    <custom_properties>
        <custom_property>
            <name>hugepages</name>
            <value>1048576</value>
        </custom_property>
    </custom_properties>
    <memory>3221225472</memory>
    <memory_policy>
        <guaranteed>1073741824</guaranteed>
        <max>4294967296</max>
    </memory_policy>
    <placement_policy>
        <affinity>pinned</affinity>
        <hosts>
            <host href="/ovirt-engine/api/hosts/745204e0-f625-4577-b194-124f82a314fa" id="745204e0-f625-4577-b194-124f82a314fa"/>
        </hosts>
    </placement_policy>
    <numa_tune_mode>interleave</numa_tune_mode>
</vm>

With NUMA nodes:

<vm_numa_nodes>
    <vm_numa_node href="/ovirt-engine/api/vms/245b22b0-e711-4f48-9a47-26a9b15aa899/numanodes/cf1d0720-aef8-4bb7-b5be-f3d2e3e30b64" id="cf1d0720-aef8-4bb7-b5be-f3d2e3e30b64">
        <cpu>
            <cores>
                <core>…</core>
            </cores>
        </cpu>
        <index>0</index>
        <memory>1536</memory>
        <numa_node_pins>
            <numa_node_pin>
                <index>0</index>
            </numa_node_pin>
        </numa_node_pins>
        <vm href="/ovirt-engine/api/vms/245b22b0-e711-4f48-9a47-26a9b15aa899" id="245b22b0-e711-4f48-9a47-26a9b15aa899"/>
    </vm_numa_node>
    <vm_numa_node href="/ovirt-engine/api/vms/245b22b0-e711-4f48-9a47-26a9b15aa899/numanodes/705ec19c-6c0f-45ee-970b-7f03f5bbc5d0" id="705ec19c-6c0f-45ee-970b-7f03f5bbc5d0">
        <cpu>
            <cores>
                <core>…</core>
            </cores>
        </cpu>
        <index>1</index>
        <memory>1536</memory>
        <numa_node_pins>
            <numa_node_pin>
                <index>1</index>
            </numa_node_pin>
        </numa_node_pins>
        <vm href="/ovirt-engine/api/vms/245b22b0-e711-4f48-9a47-26a9b15aa899" id="245b22b0-e711-4f48-9a47-26a9b15aa899"/>
    </vm_numa_node>
</vm_numa_nodes>

The VM failed to start with the same error:

2018-03-04 16:09:44.304+0000: 5834: info : virObjectUnref:350 : OBJECT_UNREF: obj=0x7f3fc8111eb0
qemu_madvise: Invalid argument
madvise doesn't support MADV_DONTDUMP, but dump_guest_core=off specified
2018-03-04T16:09:44.609717Z qemu-kvm: -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages1G/libvirt/qemu/1-golden_env_mixed_vir,size=1610612736,host-nodes=0,policy=interleave: cannot bind memory to host NUMA nodes: Invalid argument
2018-03-04 16:09:44.680+0000: shutting down, reason=failed
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
The fix was released in 4.2.2.2. It is not yet in 4.2.2.1.
Verified on rhvm-4.2.2.2-0.1.el7.noarch

1) Define 2 NUMA nodes with 1.5 GB each and start the VM:

Status: 400
Reason: Bad Request
Detail: [Memory size of each numa node must be a multiple of hugepage size.]

2) Define 2 NUMA nodes with 1 GB each and start the VM:

The VM started successfully.
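For reference, the same multiple-of-hugepage-size rule from the new validation message, applied to both verification cases (illustration only, not engine code):

MiB = 1024 ** 2
hugepage_size = 1048576 * 1024   # the custom property is in KiB -> 1 GiB pages

for per_node in (1536 * MiB, 1024 * MiB):   # cases 1) and 2) above
    ok = per_node % hugepage_size == 0
    print(per_node // MiB, "MiB per node:", "starts" if ok else "400 Bad Request")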
This bugzilla is included in the oVirt 4.2.2 release, published on March 28th 2018. Since the problem described in this bug report should be resolved in the oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.