Bug 1076989 (libvirt-complex-guest-mem)

Summary: Enable complex memory requirements for virtual machines
Product: Red Hat Enterprise Linux 7
Component: libvirt
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: medium
Priority: high
Target Milestone: rc
Keywords: TestOnly
Reporter: Stephen Gordon <sgordon>
Assignee: Michal Privoznik <mprivozn>
QA Contact: Virtualization Bugs <virt-bugs>
CC: dyuan, gsun, honzhang, jdenemar, jmiao, jsuchane, knoel, mprivozn, mzhan, rbalakri, sgordon
Fixed In Version: libvirt-1.2.8-5.el7
Doc Type: Bug Fix
Clone Of:
: qemu-complex-mem 1136151
Last Closed: 2015-03-05 07:32:39 UTC
Type: Bug
Bug Depends On: 1076990    
Bug Blocks: 1078542, 1113520, 1136151    

Description Stephen Gordon 2014-03-17 00:25:29 UTC
Description of problem:

Enable the specification of complex memory requirements for virtual machines such as:

* A virtual machine using 2 NUMA nodes, with different huge pages number for each NUMA node
* A virtual machine with a specific number of huge pages plus an additional amount of memory not backed by huge pages (the latter might be oversubscribed), guaranteeing that all memory comes from the same NUMA node (see the XML sketch below)
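
As a rough illustration of the first requirement (not a committed design), such a configuration might be expressed in libvirt domain XML along these lines; the element names follow the configuration later verified in comment 16, while the sizes and node numbers here are hypothetical:

  <memoryBacking>
    <hugepages>
      <!-- guest NUMA node 0 backed by 2 MiB pages, guest node 1 by 1 GiB pages -->
      <page size='2048' unit='KiB' nodeset='0'/>
      <page size='1048576' unit='KiB' nodeset='1'/>
    </hugepages>
  </memoryBacking>
  <numatune>
    <!-- pin each guest node's memory strictly to a specific host node -->
    <memnode cellid='0' mode='strict' nodeset='0'/>
    <memnode cellid='1' mode='strict' nodeset='1'/>
  </numatune>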

Comment 2 Stephen Gordon 2014-03-17 00:27:09 UTC
In addition (perhaps more generically stated):

It is not possible to assure through libvirt a deterministic memory allocation of huge pages in NUMA nodes.

Comment 3 Stephen Gordon 2014-03-17 00:31:58 UTC
(In reply to Stephen Gordon from comment #2)
> In addition (perhaps more generically stated):
> 
> It is not possible to assure through libvirt a deterministic memory
> allocation of huge pages in NUMA nodes.

As a result a virtual machine defined with a strict resource assignment (“strict” NUMA policy) might end up running with a different resource assignment.

Ad-hoc solution

An ad-hoc method for strict resource assignment has been provided by Red Hat. This method should initially be part of libvirt and later be part of whatever virtual machine management framework is used.

The method is the following:

* Start the VM in a paused state.
* Check hugepages backing of guest RAM. An entry like this should appear:

$ sudo cat /proc/$pid-of-qemu/maps  | grep huge
7f0dc0000000-7f11c0000000 rw-p 00000000 00:11 36596     /mnt/huge/libvirt/qemu/kvm.mNHb6G (deleted)

* Check NUMA placement of guest RAM, verifying that all pages are on the desired node. An entry like this should appear:

$ sudo cat /proc/$pid-of-qemu/numa_maps |grep huge
7f0dc0000000 bind:0 file=/mnt/huge/libvirt/qemu/kvm.mNHb6G\040(deleted) huge anon=16 dirty=16 N0=16

Where N0 is the number of 1GiB huge pages allocated in NUMA node 0.

* If any page of the guest RAM is allocated on a different node, error out (the presence of an Ny entry on that line, where y is not the desired node, indicates a page on a node other than Nx)

* Check strict assignment of vCPUs to CPUs:
  * List all running domains
  * Query vcpupin of all domains
  * Error out if there is one physical CPU pinned to two different vCPUs
  * Override the VM state from “paused” to “running” to let it run (see the combined shell sketch below).
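
A rough shell sketch of this ad-hoc procedure (DOM and NODE are hypothetical placeholders; it reuses the pidof/grep approach shown in the comments below and assumes a single qemu-kvm process on the host):

# hypothetical domain name and required host NUMA node
DOM=myguest
NODE=0

# start the VM in a paused state
virsh start --paused "$DOM"
PID=$(pidof qemu-kvm)

# 1) guest RAM must be backed by huge pages
grep -q huge /proc/$PID/maps || { echo "no hugepage backing"; exit 1; }

# 2) every Nx= counter on the hugepage lines of numa_maps must be on the desired node
if grep huge /proc/$PID/numa_maps | grep -oE 'N[0-9]+=' | grep -vq "^N$NODE=$"; then
    echo "guest RAM allocated on a node other than $NODE"; exit 1
fi

# 3) dump vcpupin of all running domains so overlapping pCPU pins can be spotted
#    (the overlap check itself is left as a manual inspection in this sketch)
for d in $(virsh list --name); do virsh vcpupin "$d"; done

# 4) let the guest run
virsh resume "$DOM"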

Comment 5 Michal Privoznik 2014-05-20 15:36:41 UTC
Can you be more specific please? What does qemu API look like? What is the usual use case? I've tried to dig out the qemu patches, but got lost in the primeval forest of qemu sources.

Comment 6 Stephen Gordon 2014-05-21 16:44:21 UTC
From my notes, the example is that the user specifies NUMA 0, but NUMA 0 does not have enough free space, so the guest instead starts using space from NUMA 1. Currently we start the virtual machine, check that it is using memory from the NUMA node it was strictly assigned to, and if not we move or kill it (very much a reactive approach/hack). The problem is that currently specifying strict doesn't enforce the NUMA assignment.

The suggestion is that qemu-kvm be modified such that strict enforces not only the use of huge pages but also the NUMA node assignment, both for the cores and for where the huge pages come from.

I believe Karen suggested/discussed the above with us and the customer and might have some more background.

Comment 7 Karen Noel 2014-05-21 17:21:07 UTC
Strict should also enforce use of huge pages and NUMA node placement. Here is the  BZ for this:

Bug 996750 - strict NUMA policy on hugetlbfs backed guests
https://bugzilla.redhat.com/show_bug.cgi?id=996750

Everything I see about "strict" is related to memory. And the strict memory placement is enforced by QEMU. I'm not sure about cores. The pinning is done by libvirt, I believe. So, the question is what happens if the vcpu pinning to a cpuset fails? Does the domain fail to start? If so, is it already considered "strict"?

Or, is there a different meaning of "strict" for vcpus?

Comment 8 Michal Privoznik 2014-05-23 08:02:16 UTC
(In reply to Karen Noel from comment #7)
> Strict should also enforce use of huge pages and NUMA node placement. Here
> is the  BZ for this:
> 
> Bug 996750 - strict NUMA policy on hugetlbfs backed guests
> https://bugzilla.redhat.com/show_bug.cgi?id=996750
> 
> Everything I see about "strict" is related to memory. And the strict memory
> placement is enforced by QEMU. I'm not sure about cores. The pinning is done
> by libvirt, I believe. So, the question is what happens if the vcpu pinning
> to a cpuset fails? Does the domain fail to start? If so, is it already
> considered "strict"?
> 
> Or, is there a different meaning of "strict" for vcpus?

Right, if vcpu pinning fails, the domain is killed (we can do pinning only after qemu is started and has spawned vcpu threads).
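
(For reference, in the domain XML that pinning is expressed roughly like this; the CPU numbers here are arbitrary, not taken from this bug:)

  <vcpu placement='static'>4</vcpu>
  <cputune>
    <!-- pin each vCPU to a physical CPU belonging to the desired host node -->
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
  </cputune>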

Comment 9 Stephen Gordon 2014-05-23 12:27:59 UTC
What currently happens in the circumstance where I request pinning to a NUMA node (0), set strict, and not enough memory (huge pages if requested) is available on that NUMA node (0)?

* Does the request fail?
* Do I instead get memory from another NUMA node (1) while my cores remain pinned to the NUMA node I specified (0)?
* Do I instead get moved to another NUMA node (1)?

Assume here that when I say "pinned to a NUMA node" I explicitly specified pinning to the cores within the node.

From my understanding of customer demands, their expectation is the first scenario: that the request fails.

Comment 10 Stephen Gordon 2014-05-23 12:28:28 UTC
Karen, does that reflect your recollection of our discussions?

Comment 11 Karen Noel 2014-05-23 17:37:54 UTC
(In reply to Stephen Gordon from comment #9)
> What currently happens in the circumstance where I request pinning to a NUMA
> node (0), set strict, and not enough memory (huge pages if requested) is
> available on that NUMA node (0).
> 
> * Does the request fail?
> * Do I instead get memory from another NUMA node (1) while my cores remain
> pinned to the NUMA node I specified (0)?
> * Do I instead get moved to another NUMA node (1)?
> 
> Assume here that when I say "pinned to a NUMA node" I explicitly specified
> pinning to the cores within the node.
> 
> From my understanding of customer demands their expectation is the first
> scenario, that the request fails.

Yes, that is my understanding too.

Let's make sure we test all these scenarios and demonstrate that when strict is specified the VM fails to start if the configured memory conditions are not met.

Comment 12 Michal Privoznik 2014-08-12 09:57:05 UTC
So after my patches, it's still unclear to me what's required here. I mean, is this a test-only bug, or is it a duplicate of another one (e.g. bug 1076725)?
The other option is that I say this bug is fixed by the patchset and follow the usual workflow.

Comment 13 Stephen Gordon 2014-08-29 03:45:01 UTC
The question from my side is still how the implementation works now: does it match one of the scenarios outlined in comment #9 (ideally the first one)?

Comment 14 Michal Privoznik 2014-09-01 14:52:25 UTC
(In reply to Stephen Gordon from comment #9)
> What currently happens in the circumstance where I request pinning to a NUMA
> node (0), set strict, and not enough memory (huge pages if requested) is
> available on that NUMA node (0).
> 
> * Does the request fail?

Yes. It's the memory allocation that will actually throw the error, even though the allocation is done in qemu once it is spawned by libvirt.

> * Do I instead get memory from another NUMA node (1) while my cores remain
> pinned to the NUMA node I specified (0)?

No, as long as you set 'strict' mode. If you set 'preferred' then qemu may find another suitable NUMA node if the preferred one doesn't have enough memory.

> * Do I instead get moved to another NUMA node (1)?

Again, in strict mode everything either works as configured (vCPUs / memory are pinned) or the domain fails to start. To get moved you'll need to relax the mode to preferred.

> 
> Assume here that when I say "pinned to a NUMA node" I explicitly specified
> pinning to the cores within the node.
> 
> From my understanding of customer demands their expectation is the first
> scenario, that the request fails.

Yep. That's how it works.
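
(In domain XML terms, the difference above boils down to the numatune mode; a sketch with an arbitrary node number:)

  <numatune>
    <!-- strict: the domain fails to start if host node 0 cannot satisfy the allocation -->
    <memory mode='strict' nodeset='0'/>
  </numatune>

versus

  <numatune>
    <!-- preferred: qemu may fall back to another host node if node 0 runs short -->
    <memory mode='preferred' nodeset='0'/>
  </numatune>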

Comment 16 Jincheng Miao 2014-11-24 09:47:59 UTC
Complex guest memory requirements can be configured in libvirt,
and an error is reported when there are not enough hugepages for the guest.

# rpm -q libvirt qemu-kvm-rhev
libvirt-1.2.8-7.el7.x86_64
qemu-kvm-rhev-2.1.2-9.el7.x86_64

# uname -r
3.10.0-205.el7.x86_64

Cross memory pinning for guest NUMA nodes.

The test scenario is:

Host node #0 only has 2 1G-hugepages
Host node #1 only has 512 2M-hugepages
gNode #0: 1G backed by 2M-hugepages, pinned to Host node #1.
gNode #1: 2G backed by 1G-hugepages, pinned to Host node #0.

# cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
0
# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
2
# cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
512
# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
0
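
(For reference, a per-node pool like this could be prepared roughly as follows; the mount points match the qemu command lines in the negative test below, and 1G pages may need to be reserved early, e.g. at boot, on fragmented hosts:)

# allocate the per-node pools (2M pages on host node 1, 1G pages on host node 0)
echo 512 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
echo 2 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages

# one hugetlbfs mount per page size, as consumed by libvirt/qemu below
mkdir -p /dev/hugepages2M /dev/hugepages1G
mount -t hugetlbfs -o pagesize=2M none /dev/hugepages2M
mount -t hugetlbfs -o pagesize=1G none /dev/hugepages1G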

1. configure guest NUMA with two nodes:
# virsh edit r71
...
  <memory unit='KiB'>3145728</memory>
  <currentMemory unit='KiB'>3145728</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size='2048' unit='KiB' nodeset='0'/>
      <page size='1048576' unit='KiB' nodeset='1'/>
    </hugepages>
  </memoryBacking>
  <vcpu placement='auto'>4</vcpu>
  <numatune>
    <memory mode='strict' nodeset='0-1'/>
    <memnode cellid='0' mode='strict' nodeset='1'/>
    <memnode cellid='1' mode='strict' nodeset='0'/>
  </numatune>
  <cpu mode='host-model'>
    <model fallback='allow'/>
    <numa>
      <cell id='0' cpus='0-1' memory='1048576'/>
      <cell id='1' cpus='2-3' memory='2097152'/>
    </numa>
  </cpu>
...

2. start guest
# virsh start r71
Domain r71 started

in guest:
<guest># numactl --hard
available: 2 nodes (0-1)
node 0 cpus: 0 1
node 0 size: 1023 MB
node 0 free: 716 MB
node 1 cpus: 2 3
node 1 size: 2047 MB
node 1 free: 899 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 

3. check hugepage binding between host and guest NUMA nodes


# grep ram-node0 /proc/`pidof qemu-kvm`/smaps
2aaaaac00000-2aaaeac00000 rw-p 00000000 00:24 19215                      /dev/hugepages2M/libvirt/qemu/qemu_back_mem._objects_ram-node0.9xHi88 (deleted)

# grep 2aaaaac00000 /proc/`pidof qemu-kvm`/numa_maps
2aaaaac00000 bind:1 file=/dev/hugepages2M/libvirt/qemu/qemu_back_mem._objects_ram-node0.9xHi88\040(deleted) huge anon=512 dirty=512 N1=512

So the memory object for guest node 0 is backed by 2M hugepages and is bound to Host Node 1.

# grep ram-node1 /proc/`pidof qemu-kvm`/smaps
2aab00000000-2aab80000000 rw-p 00000000 00:25 19216                      /dev/hugepages1G/libvirt/qemu/qemu_back_mem._objects_ram-node1.rsof2i (deleted)

# grep 2aab00000000 /proc/`pidof qemu-kvm`/numa_maps
2aab00000000 bind:0 file=/dev/hugepages1G/libvirt/qemu/qemu_back_mem._objects_ram-node1.rsof2i\040(deleted) huge anon=2 dirty=2 N0=2

So the memory object for guest node 1 is backed by 1G hugepages and is bound to Host Node 0.


Negative test for insufficient hugepages
# echo 511 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

# virsh start r71
error: Failed to start domain r71
error: internal error: early end of file from monitor: possible problem:
2014-11-24T09:36:07.912519Z qemu-kvm: -object memory-backend-file,prealloc=yes,mem-path=/dev/hugepages2M/libvirt/qemu,size=1024M,id=ram-node0,host-nodes=1,policy=bind: unable to map backing store for hugepages: Cannot allocate memory

# echo 512 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

# echo 1 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages

# virsh start r71
error: Failed to start domain r71
error: internal error: early end of file from monitor: possible problem:
2014-11-24T09:38:09.426075Z qemu-kvm: -object memory-backend-file,prealloc=yes,mem-path=/dev/hugepages1G/libvirt/qemu,size=2048M,id=ram-node1,host-nodes=0,policy=bind: unable to map backing store for hugepages: Cannot allocate memory

Comment 18 errata-xmlrpc 2015-03-05 07:32:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0323.html