Bug 1515933 - The engine fails to start a VM with 1 GB hugepages and NUMA pinning
Summary: The engine fails to start a VM with 1 GB hugepages and NUMA pinning
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.2.0
Hardware: x86_64
OS: Linux
high
medium
Target Milestone: ovirt-4.2.2
Target Release: 4.2.2.2
Assignee: Andrej Krejcir
QA Contact: Artyom
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-21 16:18 UTC by Artyom
Modified: 2018-03-29 11:02 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When the VM uses hugepages, the size of every NUMA node has to be divisible by the hugepage size. However, when NUMA nodes are created in the UI, the VM's memory is divided equally between all nodes. Consequence: The created NUMA nodes can have a size that is not divisible by the hugepage size, so the VM fails to start. Fix: The VM's memory is now divided as equally as possible between the nodes while keeping each node size divisible by the hugepage size. Result: A VM with hugepages and NUMA nodes can be started.
Clone Of:
Environment:
Last Closed: 2018-03-29 11:02:34 UTC
oVirt Team: SLA
Embargoed:
rule-engine: ovirt-4.2+
mtessun: planning_ack+
michal.skrivanek: devel_ack+
mavital: testing_ack+


Attachments
engine and vdsm logs (1.17 MB, application/zip)
2017-11-21 16:18 UTC, Artyom


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 86963 0 'None' MERGED core: HugePageUtils returns huge page size as integer. 2020-11-10 13:28:55 UTC
oVirt gerrit 86964 0 'None' MERGED core: NumaValidator checks if node size is divisible by hugepage size 2020-11-10 13:28:55 UTC
oVirt gerrit 86965 0 'None' MERGED webadmin: Make VM numa node size divisible by hugepage size 2020-11-10 13:28:55 UTC
oVirt gerrit 87379 0 'None' MERGED core: HugePageUtils returns huge page size as integer. 2020-11-10 13:28:55 UTC
oVirt gerrit 87380 0 'None' MERGED core: NumaValidator checks if node size is divisible by hugepage size 2020-11-10 13:28:55 UTC
oVirt gerrit 87381 0 'None' MERGED webadmin: Make VM numa node size divisible by hugepage size 2020-11-10 13:28:57 UTC
oVirt gerrit 87781 0 'None' MERGED core: Fix MathUtils.greatestCommonDivisor() 2020-11-10 13:29:15 UTC

Description Artyom 2017-11-21 16:18:09 UTC
Created attachment 1356848 [details]
engine and vdsm logs

Description of problem:
The engine fails to start a VM with 1 GB hugepages and NUMA pinning

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20171116212005.git61ffb5f.el7.centos.noarch
vdsm-4.20.7-34.gitab15536.el7.centos.x86_64
qemu-kvm-common-ev-2.9.0-16.el7_4.8.1.x86_64
qemu-kvm-ev-2.9.0-16.el7_4.8.1.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Configure the VM:
Memory: 3 GB
CPUs: 2
Hugepages custom property: 1048576 (hugepage size in KiB, i.e. 1 GiB pages)
Pin the VM to a host with at least two NUMA nodes
Number of NUMA nodes: 2
Pin each VM NUMA node to a separate physical NUMA node
2. Start the VM

Actual results:
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1069, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: process exited while connecting to monitor: qemu_madvise: Invalid argument
madvise doesn't support MADV_DONTDUMP, but dump_guest_core=off specified
2017-11-21T16:00:39.054676Z qemu-kvm: -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages1G/libvirt/qemu/1-golden_env_mixed_vir,size=1610612736,host-nodes=0,policy=interleave: cannot bind memory to host NUMA nodes: Invalid argument
2017-11-21 18:00:39,569+0200 INFO  (vm/fd80374a) [virt.vm] (vmId='fd80374a-cb00-4f78-9121-6e87cee581c0') Changed state to Down: internal error: process exited while connecting to monitor: qemu_madvise: Invalid argument

Expected results:
I think we have two options: either block the VM start at the scheduler level, or somehow round the NUMA node sizes to the hugepage size.

Additional info:

Comment 1 Michal Skrivanek 2017-11-22 05:31:16 UTC
thoughts?

Comment 3 Martin Tessun 2017-11-22 07:08:29 UTC
(In reply to Yaniv Kaul from comment #2)
> https://bugzilla.redhat.com/show_bug.cgi?id=1499492#c7 ?

That is exactly the issue.
You need to make sure that each NUMA node's memory is aligned to the 1 GB hugepage boundary.

So, e.g., the following fails:
memory = 21GB
NUMA Nodes = 4
NUMA Memory Node 1 = 5.25GB
NUMA Memory Node 2 = 5.25GB
NUMA Memory Node 3 = 5.25GB
NUMA Memory Node 4 = 5.25GB

It would succeed in case of an asymmetric reservation, e.g.:
memory = 21GB
NUMA Nodes = 4
NUMA Memory Node 1 = 6.00GB
NUMA Memory Node 2 = 5.00GB
NUMA Memory Node 3 = 5.00GB
NUMA Memory Node 4 = 5.00GB

As you don't know in advance how many NUMA nodes are created, you cannot check the total memory, so I believe we need to do the asymmetric reservation and maybe print a warning (NUMA node memory is imbalanced because max memory / NUMA nodes does not fit the hugepage boundary).

Thoughts?
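
To make the arithmetic above concrete, here is a small illustrative snippet (all sizes in MiB; the class and variable names are made up, this is not ovirt-engine code): the equal split of 21 GB over 4 nodes is not hugepage-aligned, while the asymmetric 6/5/5/5 GB split is.

// Illustrative arithmetic for the example above (all sizes in MiB).
public class HugePageAlignmentExample {
    public static void main(String[] args) {
        long hugePageMb = 1024;            // 1 GB hugepages
        long totalMb = 21 * 1024;          // 21 GB of VM memory
        long equalNodeMb = totalMb / 4;    // 5376 MiB = 5.25 GB per node
        System.out.println(equalNodeMb % hugePageMb == 0);   // false -> QEMU cannot back the node with hugepages
        long[] asymmetricMb = {6 * 1024, 5 * 1024, 5 * 1024, 5 * 1024};
        boolean aligned = true;
        for (long nodeMb : asymmetricMb) {
            aligned &= nodeMb % hugePageMb == 0;
        }
        System.out.println(aligned);       // true -> every node size is a whole number of hugepages
    }
}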

Comment 4 Yaniv Kaul 2017-11-22 07:32:34 UTC
(In reply to Martin Tessun from comment #3)
> As you don't know in advance how many NUMA nodes are created, you cannot
> check the total memory, so I believe we need to do the asymmetric
> reservation and maybe print a warning (NUMA node memory is imbalanced
> because max memory / NUMA nodes does not fit the hugepage boundary).
> 
> Thoughts?

I'd limit our solution to uniform memory distribution across NUMA nodes, and then make sure the total memory is properly divisible between the configured NUMA nodes - just to ensure we properly fail before running?
When you pin, you should know the theoretical values (of course, some memory may already be taken when you try to run!)
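
A minimal sketch of the pre-start check proposed here, assuming a uniform split (all sizes in MiB; class and method names are made up for illustration, this is not the engine's scheduler code): the per-node share must be a positive whole multiple of the hugepage size.

// Illustrative pre-start check for a uniform memory split (all sizes in MiB).
public class UniformSplitCheckExample {

    static boolean canSplitUniformly(long totalMb, int nodeCount, long hugePageMb) {
        // each node gets the same share, and that share must be a positive
        // whole multiple of the hugepage size
        return totalMb % nodeCount == 0
                && (totalMb / nodeCount) % hugePageMb == 0
                && totalMb / nodeCount >= hugePageMb;
    }

    public static void main(String[] args) {
        System.out.println(canSplitUniformly(3 * 1024, 2, 1024));   // false: 1536 MiB per node
        System.out.println(canSplitUniformly(4 * 1024, 2, 1024));   // true: 2048 MiB per node
    }
}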

Comment 5 Tomas Jelinek 2017-11-22 07:45:32 UTC
(In reply to Yaniv Kaul from comment #4)
> I'd limit our solution to uniform memory distribution across NUMA nodes, and
> then make sure the total memory is properly divisible between the configured
> NUMA nodes - just to ensure we properly fail before running?

Why not fail on save? We know how many NUMA nodes we will create on run and how much memory we need to divide between them. We could provide a good validation message.

> When you pin, you should know the theoretical values (of course, some memory
> may already be taken when you try to run!)

Comment 6 Michal Skrivanek 2017-11-22 09:01:32 UTC
It should be easy enough to split the memory into the right chunks on the engine side - in the original case into 2 GB and 1 GB. We do define the amount of memory in each node, I believe - Martin?
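
A minimal sketch of such a split (all sizes in MiB; illustrative only, not the code that was eventually merged): hand out whole hugepages as evenly as possible, giving the first nodes one extra page each when the page count does not divide evenly. For the original 3 GB / 2 node / 1 GB hugepage case it yields 2048 MiB and 1024 MiB.

import java.util.Arrays;

// Illustrative only: splits a VM's memory across NUMA nodes as equally as
// possible while keeping every node size a multiple of the hugepage size.
// All sizes are in MiB; names are hypothetical, not the ovirt-engine API.
// Assumes the total memory itself is already a multiple of the hugepage size.
public class NumaMemorySplitExample {

    static long[] splitMemory(long totalMb, int nodeCount, long hugePageMb) {
        long totalPages = totalMb / hugePageMb;       // whole hugepages to distribute
        long pagesPerNode = totalPages / nodeCount;   // every node gets at least this many
        long leftoverPages = totalPages % nodeCount;  // the first nodes get one extra page
        long[] nodes = new long[nodeCount];
        for (int i = 0; i < nodeCount; i++) {
            long pages = pagesPerNode + (i < leftoverPages ? 1 : 0);
            nodes[i] = pages * hugePageMb;
        }
        return nodes;
    }

    public static void main(String[] args) {
        // 3 GB VM, 2 NUMA nodes, 1 GB hugepages -> [2048, 1024] instead of [1536, 1536]
        System.out.println(Arrays.toString(splitMemory(3 * 1024, 2, 1024)));
    }
}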

Comment 7 Martin Sivák 2017-11-22 09:27:42 UTC
We do allow custom size NUMA nodes when configured through REST API (iirc). But the UI based flow distributes the memory uniformly.

Btw, it is hard to show some meaningful validation when hugepage sizes are set using the generic custom parameters approach.

Comment 8 Tomas Jelinek 2017-11-22 10:03:11 UTC
(In reply to Martin Sivák from comment #7)
> We do allow custom size NUMA nodes when configured through REST API (iirc).
> But the UI based flow distributes the memory uniformly.

So the UI-based flow could have some logic to distribute the memory non-uniformly :)

But there will need to be a validation anyway, because there is a chance that the memory cannot be split correctly at all (e.g. 2 NUMA nodes, 1 GB pages and 1 GB of memory, so one of the nodes would end up with no memory).

> 
> Btw, it is hard to show some meaningful validation when hugepage sizes are
> set using the generic custom parameters approach.

I don't see the issue here. It is just a property and we use it.
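
A minimal sketch of the per-node validation discussed in this comment (illustrative; the real check was added to NumaValidator, see the linked gerrit patches, and the names below are made up): every explicitly configured node size must be a non-zero multiple of the hugepage size.

import java.util.List;

// Illustrative validation of explicitly configured NUMA node sizes (in MiB).
// Names are hypothetical; this is not the actual NumaValidator code.
public class NumaNodeSizeValidationExample {

    static boolean nodeSizesAreValid(List<Long> nodeSizesMb, long hugePageMb) {
        for (long nodeMb : nodeSizesMb) {
            // a node with zero memory, or one whose size is not aligned to the
            // hugepage size, cannot be backed by hugepages at all
            if (nodeMb == 0 || nodeMb % hugePageMb != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(nodeSizesAreValid(List.of(1536L, 1536L), 1024));  // false: 1.5 GB nodes
        System.out.println(nodeSizesAreValid(List.of(2048L, 1024L), 1024));  // true
        System.out.println(nodeSizesAreValid(List.of(1024L, 0L), 1024));     // false: empty node
    }
}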

Comment 9 Artyom 2018-03-04 16:19:01 UTC
Checked on rhvm-4.2.2.1-0.1.el7.noarch

VM configuration:
<vm>
<name>golden_env_mixed_virtio_0</name>
</bios>
<cpu>
<architecture>x86_64</architecture>
<topology>
<cores>2</cores>
<sockets>1</sockets>
<threads>1</threads>
</topology>
</cpu>
<custom_properties>
<custom_property>
<name>hugepages</name>
<value>1048576</value>
</custom_property>
</custom_properties>
<memory>3221225472</memory>
<memory_policy>
<guaranteed>1073741824</guaranteed>
<max>4294967296</max>
</memory_policy>
<placement_policy>
<affinity>pinned</affinity>
<hosts>
<host href="/ovirt-engine/api/hosts/745204e0-f625-4577-b194-124f82a314fa" id="745204e0-f625-4577-b194-124f82a314fa"/>
</hosts>
</placement_policy>
<numa_tune_mode>interleave</numa_tune_mode>
</vm>

With NUMA nodes:
<vm_numa_nodes>
<vm_numa_node href="/ovirt-engine/api/vms/245b22b0-e711-4f48-9a47-26a9b15aa899/numanodes/cf1d0720-aef8-4bb7-b5be-f3d2e3e30b64" id="cf1d0720-aef8-4bb7-b5be-f3d2e3e30b64">
<cpu>
<cores>
<core>…</core>
</cores>
</cpu>
<index>0</index>
<memory>1536</memory>
<numa_node_pins>
<numa_node_pin>
<index>0</index>
</numa_node_pin>
</numa_node_pins>
<vm href="/ovirt-engine/api/vms/245b22b0-e711-4f48-9a47-26a9b15aa899" id="245b22b0-e711-4f48-9a47-26a9b15aa899"/>
</vm_numa_node>
<vm_numa_node href="/ovirt-engine/api/vms/245b22b0-e711-4f48-9a47-26a9b15aa899/numanodes/705ec19c-6c0f-45ee-970b-7f03f5bbc5d0" id="705ec19c-6c0f-45ee-970b-7f03f5bbc5d0">
<cpu>
<cores>
<core>…</core>
</cores>
</cpu>
<index>1</index>
<memory>1536</memory>
<numa_node_pins>
<numa_node_pin>
<index>1</index>
</numa_node_pin>
</numa_node_pins>
<vm href="/ovirt-engine/api/vms/245b22b0-e711-4f48-9a47-26a9b15aa899" id="245b22b0-e711-4f48-9a47-26a9b15aa899"/>
</vm_numa_node>
</vm_numa_nodes>

The VM failed to start with the same error:
2018-03-04 16:09:44.304+0000: 5834: info : virObjectUnref:350 : OBJECT_UNREF: obj=0x7f3fc8111eb0
qemu_madvise: Invalid argument
madvise doesn't support MADV_DONTDUMP, but dump_guest_core=off specified
2018-03-04T16:09:44.609717Z qemu-kvm: -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages1G/libvirt/qemu/1-golden_env_mixed_vir,size=1610612736,host-nodes=0,policy=interleave: cannot bind memory to host NUMA nodes: Invalid argument
2018-03-04 16:09:44.680+0000: shutting down, reason=failed

Comment 10 Red Hat Bugzilla Rules Engine 2018-03-04 16:19:08 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 11 Andrej Krejcir 2018-03-05 09:47:17 UTC
The fix was released in 4.2.2.2. It is not yet in 4.2.2.1.

Comment 14 Artyom 2018-03-06 14:07:48 UTC
Verified on rhvm-4.2.2.2-0.1.el7.noarch

1) Define 2 NUMA nodes with 1.5 GB each and start the VM
	Status: 400
	Reason: Bad Request
	Detail: [Memory size of each numa node must be a multiple of hugepage size.]

2) Define 2 NUMA nodes with 1 GB each and start the VM
The VM started successfully

Comment 15 Sandro Bonazzola 2018-03-29 11:02:34 UTC
This bug is included in the oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

