Bug 1076725 (libvirt-multinode-numa-policy)
Summary: | libvirt: Multi-node NUMA policy assignment | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Marcelo Tosatti <mtosatti> | |
Component: | libvirt | Assignee: | Michal Privoznik <mprivozn> | |
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> | |
Severity: | unspecified | Docs Contact: | ||
Priority: | unspecified | |||
Version: | 7.0 | CC: | chegu_vinod, dyuan, gsun, honzhang, jmiao, jsuchane, knoel, mtosatti, mzhan, rbalakri, sgordon | |
Target Milestone: | rc | Keywords: | FutureFeature, Upstream | |
Target Release: | --- | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | libvirt-1.2.7-1.el7 | Doc Type: | Enhancement | |
Doc Text: |
Feature:
Allow guest RAM to come from multiple host NUMA nodes.
Reason:
If you have a host with multiple NUMA nodes and you want to run a guest, the guest either runs on a single NUMA node or, when it does not fit, is spread across several nodes. However, there is a price to pay if you do not pin the guest onto host NUMA nodes: guest memory can migrate between host NUMA nodes as the host kernel scheduler pleases, and copying data between NUMA nodes is costly. Therefore you need a way to pin guest memory onto host NUMA nodes to prevent it from moving around. While libvirt allowed vCPU pinning, it did not allow memory pinning.
Result:
With this feature, libvirt allows guest memory to be pinned to the host NUMA nodes the user specifies.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1134665 (view as bug list) | Environment: | ||
Last Closed: | 2015-03-05 07:31:30 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 996259, 1076723 | |||
Bug Blocks: | 1078542, 1113520, 1134665 |
Description
Marcelo Tosatti
2014-03-14 22:09:09 UTC
We need to take another look at this and determine what the driver is.

(In reply to Marcelo Tosatti from comment #0)
> A virtual machine using 2 NUMA nodes, with different huge pages number for
> each NUMA node.

More generically, this can be expressed as the need to ensure that VMs requiring huge pages get them regardless of topology.

Can you be more specific, please? What does the qemu API look like? What is the usual use case? I've tried to dig out the qemu patches, but got lost in the primeval forest of qemu sources.

Just to give anybody interested an update, I've got some patches ready:

https://gitorious.org/libvirt/michal-staging/commits/multinode_1076725

Please note that the patches are purely a proof of concept. I haven't published them upstream yet.

Patches proposed upstream:

https://www.redhat.com/archives/libvir-list/2014-July/msg00906.html

Have the patches been accepted into upstream libvirt already?

Also, is it possible to have some of the VM's memory on a given NUMA node backed by huge pages (either 2M or 1GB pages) and the remaining backed by regular pages/THPs? If there is a document that explains it, please point me to it. Thanks!

(In reply to Vinod Chegu from comment #8)
> Have the patches been accepted into upstream libvirt already?
>
> Also, is it possible to have some of the VM's memory on a given NUMA node
> backed by huge pages (either 2M or 1GB pages) and the remaining backed by
> regular pages/THPs?

This should be answered by the QEMU code. Libvirt should support however QEMU behaves in this case. You should pose this question on the upstream qemu-devel e-mail list and explain your use case.

Assuming you mean only guest-aligned huge pages, 2MB or 1GB... (To leave THP out of this, I'll assume we're talking about 1GB pages.)

I think the answer should be "no". The reason to back guest pages with host huge pages is so they can be used as huge pages by the guest OS. If some 1GB pages are backed by 1GB host pages and others are not, there is no way to tell from the guest OS which pages are "fast" and which are "slow".

It is logically simpler to guarantee that all aligned 1GB pages are backed by host 1GB pages, or fail, for the strict case. I'm also assuming "strict"; otherwise you wouldn't be specifying particular host NUMA nodes.

> If there is a document that explains it, please point me to it. Thanks!

There was a thread on qemu-devel by Marcelo that explained the algorithm for assigning host huge pages to the guest, avoiding holes in the guest address space, etc.

It was in this thread: http://marc.info/?l=qemu-devel&m=138428870610032&w=2
(In reply to Karen Noel from comment #9)
> It is logically simpler to guarantee that all aligned 1GB pages are backed
> by host 1GB pages, or fail, for the strict case.

Yeah, libvirt follows this, and currently there's no way to assign a guest NUMA node a mixture of huge pages and regular system pages as backing.

Patches pushed upstream:

commit 3517e1b2f2211f30e40f1a141f6dd1e6358e96ee
Author:     Michal Privoznik <mprivozn>
AuthorDate: Wed Jul 23 17:37:21 2014 +0200
Commit:     Daniel P. Berrange <berrange>
CommitDate: Tue Jul 29 12:14:52 2014 +0100

    qemu: Implement ./hugepages/page/[@size, @unit, @nodeset]

    Signed-off-by: Michal Privoznik <mprivozn>

commit 136ad49740f017aabcac48d02d2df6ab7b0195e9
Author:     Michal Privoznik <mprivozn>
AuthorDate: Wed Jul 23 17:37:20 2014 +0200
Commit:     Daniel P. Berrange <berrange>
CommitDate: Tue Jul 29 12:02:34 2014 +0100

    domain: Introduce ./hugepages/page/[@size, @unit, @nodeset]

      <memoryBacking>
        <hugepages>
          <page size="1" unit="G" nodeset="0-3,5"/>
          <page size="2" unit="M" nodeset="4"/>
        </hugepages>
      </memoryBacking>

    Signed-off-by: Michal Privoznik <mprivozn>

commit 49baed2b298232acbcd910948b1a058a97ff331c
Author:     Michal Privoznik <mprivozn>
AuthorDate: Wed Jul 23 17:37:19 2014 +0200
Commit:     Daniel P. Berrange <berrange>
CommitDate: Tue Jul 29 12:00:42 2014 +0100

    virbitmap: Introduce virBitmapOverlaps

    This internal API just checks whether two bitmaps intersect or not.

    Signed-off-by: Michal Privoznik <mprivozn>

commit 725a211fc0c04568acdd3737da867684ada09c03
Author:     Michal Privoznik <mprivozn>
AuthorDate: Wed Jul 23 17:37:18 2014 +0200
Commit:     Daniel P. Berrange <berrange>
CommitDate: Tue Jul 29 11:58:35 2014 +0100

    qemu: Utilize virFileFindHugeTLBFS

    Use better detection of hugetlbfs mount points. Yes, there can be
    multiple mount points, each serving a different huge page size. Since
    we already have the ability to override the mount point in the
    qemu.conf file, this crazy backward-compatibility code is brought in.
    Now we allow multiple mount points, so the "hugetlbfs_mount" option
    must take a list of strings (mount points). But previously it was just
    a string, so we must accept both types now.

    Signed-off-by: Michal Privoznik <mprivozn>

commit be0782e199243bdeb0f1bf85028fb0e7267f28b0
Author:     Michal Privoznik <mprivozn>
AuthorDate: Wed Jul 23 17:37:17 2014 +0200
Commit:     Daniel P. Berrange <berrange>
CommitDate: Tue Jul 29 11:25:16 2014 +0100

    Introduce virFileFindHugeTLBFS

    This iterates over the mount table looking for hugetlbfs mount points
    and also looks up the default huge page size.

    Signed-off-by: Michal Privoznik <mprivozn>

v1.2.7-rc1-14-g3517e1b
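As a rough illustration of what the feature does under the hood: for a two-cell guest like the one in the verification steps below, libvirt generates one memory-backend-file object per guest NUMA cell and binds it to the requested host nodes. The sketch below shows the general shape of the resulting QEMU arguments; the exact option set, paths, sizes and node numbers are illustrative assumptions here, not verbatim libvirt output, and vary with the libvirt/QEMU versions and the domain XML.

  qemu-kvm ... \
    -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages2M/libvirt/qemu,size=1024M,host-nodes=1,policy=bind \
    -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
    -object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/dev/hugepages1G/libvirt/qemu,size=2048M,host-nodes=0,policy=bind \
    -numa node,nodeid=1,cpus=2-3,memdev=ram-node1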
This feature is implemented in the latest libvirt.

# rpm -q libvirt qemu-kvm-rhev
libvirt-1.2.8-7.el7.x86_64
qemu-kvm-rhev-2.1.2-9.el7.x86_64
# uname -r
3.10.0-205.el7.x86_64

1. Enable 1G huge pages in the kernel by adding the following to the kernel cmdline:
'default_hugepagesz=1G hugepagesz=1G hugepages=4 hugepagesz=2M hugepages=1024'

2. Mount the 2M and 1G hugetlbfs instances:
# mkdir /dev/hugepages1G
# mount -t hugetlbfs -o pagesize=1G none /dev/hugepages1G
# mkdir /dev/hugepages2M
# mount -t hugetlbfs -o pagesize=2M none /dev/hugepages2M

3. Modify qemu.conf as:
# diff -Nura /etc/libvirt/qemu.conf /etc/libvirt/qemu.conf.bak
--- /etc/libvirt/qemu.conf.bak	2014-11-24 04:53:19.912564561 -0500
+++ /etc/libvirt/qemu.conf	2014-11-23 21:29:46.852704349 -0500
@@ -342,7 +342,7 @@
 # be specified at once, separated by comma and enclosed in square
 # brackets, for example:
 #
-# hugetlbfs_mount = ["/dev/hugepages2M", "/dev/hugepages1G"]
+hugetlbfs_mount = ["/dev/hugepages2M", "/dev/hugepages1G"]
 #
 # The size of huge page served by specific mount point is determined by
 # libvirt at the daemon startup.

# systemctl restart libvirtd

4. Check host capabilities:
# virsh capabilities | grep page
    <pages unit='KiB' size='4'/>
    <pages unit='KiB' size='2048'/>
    <pages unit='KiB' size='1048576'/>
    <pages unit='KiB' size='4'>15985225</pages>
    <pages unit='KiB' size='2048'>512</pages>
    <pages unit='KiB' size='1048576'>2</pages>
    <pages unit='KiB' size='4'>15990784</pages>
    <pages unit='KiB' size='2048'>512</pages>
    <pages unit='KiB' size='1048576'>2</pages>

5. Configure huge pages for the guest NUMA nodes:
   gNode #0: 1G backed by 2M huge pages, strictly pinned to host node #1.
   gNode #1: 2G backed by 1G huge pages, strictly pinned to host node #0.

# virsh edit r71
...
  <memory unit='KiB'>3145728</memory>
  <currentMemory unit='KiB'>3145728</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size='2048' unit='KiB' nodeset='0'/>
      <page size='1048576' unit='KiB' nodeset='1'/>
    </hugepages>
  </memoryBacking>
  <vcpu placement='static'>4</vcpu>
  <numatune>
    <memory mode='strict' nodeset='0-1'/>
    <memnode cellid='0' mode='strict' nodeset='1'/>
    <memnode cellid='1' mode='strict' nodeset='0'/>
  </numatune>
  <cpu mode='host-model'>
    <model fallback='allow'/>
    <numa>
      <cell id='0' cpus='0-1' memory='1048576'/>
      <cell id='1' cpus='2-3' memory='2097152'/>
    </numa>
  </cpu>
...

# virsh start r71
Domain r71 started

Check guest NUMA information:
<guest># numactl --hard
available: 2 nodes (0-1)
node 0 cpus: 0 1
node 0 size: 1023 MB
node 0 free: 692 MB
node 1 cpus: 2 3
node 1 size: 2047 MB
node 1 free: 917 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

If there are not enough huge pages for the guest NUMA nodes, there will be an error like:

2014-11-24T09:38:09.426075Z qemu-kvm: -object memory-backend-file,prealloc=yes,mem-path=/dev/hugepages1G/libvirt/qemu,size=2048M,id=ram-node1,host-nodes=0,policy=bind: unable to map backing store for hugepages: Cannot allocate memory

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0323.html
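A further optional sanity check on the verification above is to watch the per-node huge page counters on the host drop when the domain starts. The commands below are a generic sketch (standard sysfs paths, plus virsh/numastat if those tools are available in the installed packages); the node and page-size values are assumed to match the example configuration above.

# cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages
# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
# virsh freepages --all
# numastat -p $(pidof qemu-kvm)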