Bug 1076725 (libvirt-multinode-numa-policy)
Summary: | libvirt: Multi-node NUMA policy assignment | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Marcelo Tosatti <mtosatti> | |
Component: | libvirt | Assignee: | Michal Privoznik <mprivozn> | |
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> | |
Severity: | unspecified | Docs Contact: | ||
Priority: | unspecified | |||
Version: | 7.0 | CC: | chegu_vinod, dyuan, gsun, honzhang, jmiao, jsuchane, knoel, mtosatti, mzhan, rbalakri, sgordon | |
Target Milestone: | rc | Keywords: | FutureFeature, Upstream | |
Target Release: | --- | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | libvirt-1.2.7-1.el7 | Doc Type: | Enhancement | |
Doc Text: |
Feature:
Allow guest RAM to come from multiple host NUMA nodes.
Reason:
If you have a host with multiple NUMA nodes and you want to run a guest, the guest either runs on a single NUMA node or, when it does not fit, is spread across several nodes. However, there is a price to pay if you do not pin the guest onto host NUMA nodes: guest memory can migrate between host NUMA nodes as the host kernel scheduler pleases, and copying data between NUMA nodes is costly. Therefore you need a way to pin guest memory onto host NUMA nodes to prevent it from moving around. While libvirt allowed vCPU pinning, it did not allow memory pinning.
Result:
With this feature, libvirt allows guest memory to be pinned to the host NUMA nodes the user specifies.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1134665 (view as bug list) | Environment: | ||
Last Closed: | 2015-03-05 07:31:30 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 996259, 1076723 | |||
Bug Blocks: | 1078542, 1113520, 1134665 |
Description
Marcelo Tosatti
2014-03-14 22:09:09 UTC
We need to take another look at this and determine what the driver is.

(In reply to Marcelo Tosatti from comment #0)
> A virtual machine using 2 NUMA nodes, with different huge pages number for
> each NUMA node.

More generically, this can be expressed as the need to ensure that VMs requiring huge pages get them regardless of topology.

Can you be more specific, please? What does the qemu API look like? What is the usual use case? I've tried to dig out the qemu patches, but got lost in the primeval forest of qemu sources.

Just to give anybody interested an update, I've got some patches ready:

https://gitorious.org/libvirt/michal-staging/commits/multinode_1076725

Please note that the patches are purely a proof of concept. I haven't published them upstream yet.

Patches proposed upstream:

https://www.redhat.com/archives/libvir-list/2014-July/msg00906.html

Have the patches been accepted into upstream libvirt already?

Also, is it possible to have some of the VM's memory on a given NUMA node backed by huge pages (either 2M or 1GB pages) and the remaining backed by regular pages/THPs? If there is a document that explains it, please point me to it. Thanks!

(In reply to Vinod Chegu from comment #8)
> Have the patches been accepted into upstream libvirt already?
>
> Also, is it possible to have some of the VM's memory on a given NUMA node
> backed by huge pages (either 2M or 1GB pages) and the remaining backed by
> regular pages/THPs?

This should be answered by the QEMU code. Libvirt should support however QEMU behaves in this case. You should pose this question on the upstream qemu-devel e-mail list and explain your use case.

Assuming you mean only guest-aligned huge pages, 2MB or 1GB... (To leave THP out of this, I'll assume we're talking about 1GB pages.)

I think the answer should be "no". The reason to back guest pages with host huge pages is so they can be used as huge pages by the guest OS. If some 1GB pages are backed by 1GB host pages and others are not, there is no way to tell from the guest OS which pages are "fast" and which are "slow".

It is logically simpler to guarantee that all aligned 1GB pages are backed by host 1GB pages, or fail, for the strict case. I'm also assuming "strict"; otherwise you wouldn't be specifying particular host NUMA nodes.

> If there is a document that explains it, please point me to it. Thanks!

There was a thread on qemu-devel by Marcelo that explained the algorithm for assigning host huge pages to the guest, avoiding holes in the guest address space, etc.

It was in this thread: http://marc.info/?l=qemu-devel&m=138428870610032&w=2
(In reply to Karen Noel from comment #9)
> It is logically simpler to guarantee that all aligned 1GB pages are backed
> by host 1GB pages, or fail, for the strict case.

Yeah, libvirt follows this, and currently there's no way to assign a guest NUMA node a mixture of huge pages and regular system pages as backing.

Patches pushed upstream:

commit 3517e1b2f2211f30e40f1a141f6dd1e6358e96ee
Author:     Michal Privoznik <mprivozn>
AuthorDate: Wed Jul 23 17:37:21 2014 +0200
Commit:     Daniel P. Berrange <berrange>
CommitDate: Tue Jul 29 12:14:52 2014 +0100

    qemu: Implement ./hugepages/page/[@size, @unit, @nodeset]

    Signed-off-by: Michal Privoznik <mprivozn>

commit 136ad49740f017aabcac48d02d2df6ab7b0195e9
Author:     Michal Privoznik <mprivozn>
AuthorDate: Wed Jul 23 17:37:20 2014 +0200
Commit:     Daniel P. Berrange <berrange>
CommitDate: Tue Jul 29 12:02:34 2014 +0100

    domain: Introduce ./hugepages/page/[@size, @unit, @nodeset]

      <memoryBacking>
        <hugepages>
          <page size="1" unit="G" nodeset="0-3,5"/>
          <page size="2" unit="M" nodeset="4"/>
        </hugepages>
      </memoryBacking>

    Signed-off-by: Michal Privoznik <mprivozn>

commit 49baed2b298232acbcd910948b1a058a97ff331c
Author:     Michal Privoznik <mprivozn>
AuthorDate: Wed Jul 23 17:37:19 2014 +0200
Commit:     Daniel P. Berrange <berrange>
CommitDate: Tue Jul 29 12:00:42 2014 +0100

    virbitmap: Introduce virBitmapOverlaps

    This internal API just checks whether two bitmaps intersect or not.

    Signed-off-by: Michal Privoznik <mprivozn>

commit 725a211fc0c04568acdd3737da867684ada09c03
Author:     Michal Privoznik <mprivozn>
AuthorDate: Wed Jul 23 17:37:18 2014 +0200
Commit:     Daniel P. Berrange <berrange>
CommitDate: Tue Jul 29 11:58:35 2014 +0100

    qemu: Utilize virFileFindHugeTLBFS

    Use better detection of hugetlbfs mount points. Yes, there can be
    multiple mount points, each serving a different huge page size. Since
    we already have the ability to override the mount point in the
    qemu.conf file, this crazy backward-compatibility code is brought in.
    Now we allow multiple mount points, so the "hugetlbfs_mount" option
    must take a list of strings (mount points). But previously it was just
    a string, so we must accept both types now.

    Signed-off-by: Michal Privoznik <mprivozn>

commit be0782e199243bdeb0f1bf85028fb0e7267f28b0
Author:     Michal Privoznik <mprivozn>
AuthorDate: Wed Jul 23 17:37:17 2014 +0200
Commit:     Daniel P. Berrange <berrange>
CommitDate: Tue Jul 29 11:25:16 2014 +0100

    Introduce virFileFindHugeTLBFS

    This iterates over the mount table looking for hugetlbfs mount points
    and also looks up the default huge page size.

    Signed-off-by: Michal Privoznik <mprivozn>

v1.2.7-rc1-14-g3517e1b
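As a rough illustration of what the feature does under the hood: for a two-cell guest like the one in the verification steps below, libvirt generates one memory-backend-file object per guest NUMA cell and binds it to the requested host nodes. The sketch below shows the general shape of the resulting QEMU arguments; the exact option set, paths, sizes and node numbers are illustrative assumptions here, not verbatim libvirt output, and vary with the libvirt/QEMU versions and the domain XML.

  qemu-kvm ... \
    -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages2M/libvirt/qemu,size=1024M,host-nodes=1,policy=bind \
    -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
    -object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/dev/hugepages1G/libvirt/qemu,size=2048M,host-nodes=0,policy=bind \
    -numa node,nodeid=1,cpus=2-3,memdev=ram-node1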
This feature is implemented in the latest libvirt.

# rpm -q libvirt qemu-kvm-rhev
libvirt-1.2.8-7.el7.x86_64
qemu-kvm-rhev-2.1.2-9.el7.x86_64
# uname -r
3.10.0-205.el7.x86_64

1. Enable 1G huge pages in the kernel by adding the following to the kernel cmdline:
'default_hugepagesz=1G hugepagesz=1G hugepages=4 hugepagesz=2M hugepages=1024'

2. Mount the 2M and 1G hugetlbfs instances:
# mkdir /dev/hugepages1G
# mount -t hugetlbfs -o pagesize=1G none /dev/hugepages1G
# mkdir /dev/hugepages2M
# mount -t hugetlbfs -o pagesize=2M none /dev/hugepages2M

3. Modify qemu.conf as:
# diff -Nura /etc/libvirt/qemu.conf /etc/libvirt/qemu.conf.bak
--- /etc/libvirt/qemu.conf.bak	2014-11-24 04:53:19.912564561 -0500
+++ /etc/libvirt/qemu.conf	2014-11-23 21:29:46.852704349 -0500
@@ -342,7 +342,7 @@
 # be specified at once, separated by comma and enclosed in square
 # brackets, for example:
 #
-# hugetlbfs_mount = ["/dev/hugepages2M", "/dev/hugepages1G"]
+hugetlbfs_mount = ["/dev/hugepages2M", "/dev/hugepages1G"]
 #
 # The size of huge page served by specific mount point is determined by
 # libvirt at the daemon startup.

# systemctl restart libvirtd

4. Check host capabilities:
# virsh capabilities | grep page
    <pages unit='KiB' size='4'/>
    <pages unit='KiB' size='2048'/>
    <pages unit='KiB' size='1048576'/>
    <pages unit='KiB' size='4'>15985225</pages>
    <pages unit='KiB' size='2048'>512</pages>
    <pages unit='KiB' size='1048576'>2</pages>
    <pages unit='KiB' size='4'>15990784</pages>
    <pages unit='KiB' size='2048'>512</pages>
    <pages unit='KiB' size='1048576'>2</pages>

5. Configure huge pages for the guest NUMA nodes:
   gNode #0: 1G backed by 2M huge pages, strictly pinned to host node #1.
   gNode #1: 2G backed by 1G huge pages, strictly pinned to host node #0.

# virsh edit r71
...
  <memory unit='KiB'>3145728</memory>
  <currentMemory unit='KiB'>3145728</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size='2048' unit='KiB' nodeset='0'/>
      <page size='1048576' unit='KiB' nodeset='1'/>
    </hugepages>
  </memoryBacking>
  <vcpu placement='static'>4</vcpu>
  <numatune>
    <memory mode='strict' nodeset='0-1'/>
    <memnode cellid='0' mode='strict' nodeset='1'/>
    <memnode cellid='1' mode='strict' nodeset='0'/>
  </numatune>
  <cpu mode='host-model'>
    <model fallback='allow'/>
    <numa>
      <cell id='0' cpus='0-1' memory='1048576'/>
      <cell id='1' cpus='2-3' memory='2097152'/>
    </numa>
  </cpu>
...

# virsh start r71
Domain r71 started

Check guest NUMA information:
<guest># numactl --hard
available: 2 nodes (0-1)
node 0 cpus: 0 1
node 0 size: 1023 MB
node 0 free: 692 MB
node 1 cpus: 2 3
node 1 size: 2047 MB
node 1 free: 917 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

If there are not enough huge pages for the guest NUMA nodes, there will be an error like:

2014-11-24T09:38:09.426075Z qemu-kvm: -object memory-backend-file,prealloc=yes,mem-path=/dev/hugepages1G/libvirt/qemu,size=2048M,id=ram-node1,host-nodes=0,policy=bind: unable to map backing store for hugepages: Cannot allocate memory

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0323.html
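A further optional sanity check on the verification above is to watch the per-node huge page counters on the host drop when the domain starts. The commands below are a generic sketch (standard sysfs paths, plus virsh/numastat if those tools are available in the installed packages); the node and page-size values are assumed to match the example configuration above.

# cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages
# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
# virsh freepages --all
# numastat -p $(pidof qemu-kvm)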