Red Hat Bugzilla – Bug 1076957
Expose huge pages information through libvirt API
Last modified: 2016-04-26 10:36:56 EDT
Description of problem:

Customer requests the ability to determine the following via the libvirt API:

- The host's large page size.
- The total number of large pages available, per host and ideally per NUMA node.
- The total number of large pages free (versus in use), per host and ideally per NUMA node.

It is anticipated that additional changes to lower-level components, including qemu, may be required to facilitate the above.

Additional information:

Gap 1. It is not possible to obtain through libvirt the large page size configured on the host. This is necessary to know whether a host will accept a VM using large pages of 1GiB size. Command showing this information on Linux:

$ sudo hugeadm --page-sizes
1073741824

The large page size is expressed in bytes.

Gap 2. It is not possible to obtain through libvirt the number of large pages configured per NUMA node. This feature and the next one are necessary to know whether a host will accept a VM using a specific number of 1GiB large pages on a specific NUMA node. Command showing this information on Linux:

$ hugepage_sz=1048576; nodes_nr=2; for node_id in `seq 0 $((nodes_nr-1))`; do echo -n "Node $node_id: "; cat /sys/devices/system/node/node$node_id/hugepages/hugepages-$((hugepage_sz))kB/nr_hugepages; done
Node 0: 28
Node 1: 28

where nodes_nr is the number of NUMA nodes and hugepage_sz is the large page size in kB.

Gap 3. It is not possible to obtain through libvirt the number of free large pages per NUMA node. This feature and the previous one are necessary to know whether a host will accept a VM using a specific number of 1GiB large pages on a specific NUMA node. Command showing this information on Linux:

$ hugepage_sz=1048576; nodes_nr=2; for node_id in `seq 0 $((nodes_nr-1))`; do echo -n "Node $node_id: "; cat /sys/devices/system/node/node$node_id/hugepages/hugepages-$((hugepage_sz))kB/free_hugepages; done
Node 0: 24
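For reference, the per-node counters the commands above read live in sysfs and can just as well be read programmatically. Below is a minimal C sketch, assuming two NUMA nodes and 1GiB (1048576kB) pages as in the examples above; the helper name is ours, for illustration, and is not part of libvirt:

#include <stdio.h>

/* Read one per-node hugepage counter from sysfs, e.g. "nr_hugepages"
 * or "free_hugepages". Returns the count, or -1 on error. */
static long read_hugepage_counter(int node, unsigned long page_kb,
                                  const char *counter)
{
    char path[256];
    long value = -1;
    FILE *fp;

    snprintf(path, sizeof(path),
             "/sys/devices/system/node/node%d/hugepages/"
             "hugepages-%lukB/%s", node, page_kb, counter);
    if (!(fp = fopen(path, "r")))
        return -1;
    if (fscanf(fp, "%ld", &value) != 1)
        value = -1;
    fclose(fp);
    return value;
}

int main(void)
{
    int node;

    /* Assumes two NUMA nodes and 1GiB pages, as in the commands above. */
    for (node = 0; node < 2; node++)
        printf("Node %d: %ld total, %ld free\n", node,
               read_hugepage_counter(node, 1048576, "nr_hugepages"),
               read_hugepage_counter(node, 1048576, "free_hugepages"));
    return 0;
}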
I've just proposed patches upstream: https://www.redhat.com/archives/libvir-list/2014-May/msg00991.html

It's available in C as the proposed virNodeHugeTLB() API (eventually merged as virNodeGetFreePages, see below), in virsh (`virsh hugepages`), and in libvirt-python too. To get overall info pass -1 as NODE#; to get info on a specific node, pass a valid NODE#:

# virsh hugepages
Supported hugepage sizes:
 hugepage_size      1048576
 hugepage_available 4
 hugepage_free      4

 hugepage_size      2048
 hugepage_available 12
 hugepage_free      12

As we can see, the host supports 1GiB and 2MiB hugepages (and none of them are in use right now).
Another attempt: https://www.redhat.com/archives/libvir-list/2014-June/msg00435.html
Yet another one: https://www.redhat.com/archives/libvir-list/2014-June/msg00710.html
So I've just pushed patches upstream:

commit 38fa03f4b0f5f84642cd99b6b8704f5028984770
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Tue Jun 10 16:16:44 2014 +0200
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Thu Jun 19 15:10:50 2014 +0200

    nodeinfo: Implement nodeGetFreePages

    And add stubs to other drivers like: lxc, qemu, uml and vbox.

    Signed-off-by: Michal Privoznik <mprivozn@redhat.com>

commit 9e3efe53ded95e6b3284f7f55f625da87018e484
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Mon Jun 9 17:56:43 2014 +0200
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Thu Jun 19 15:10:50 2014 +0200

    virsh: Expose virNodeGetFreePages

    The new API is exposed under the 'freepages' command.

    Signed-off-by: Michal Privoznik <mprivozn@redhat.com>

commit 34f2d0319d2098c77c8cc27d8350616029125a2b
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Mon Jun 9 17:14:47 2014 +0200
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Thu Jun 19 15:10:49 2014 +0200

    Introduce virNodeGetFreePages

    The aim of the API is to get information on the number of free pages
    on the system. The API behaves similarly to virNodeGetCellsFreeMemory():
    the user passes the starting NUMA cell, the count of cells they are
    interested in, and the page sizes (yes, multiple sizes can be queried
    at once); the counts are returned in an array.

    Signed-off-by: Michal Privoznik <mprivozn@redhat.com>

commit 02129b7c0e581898f03468e0bfb5472dc9903339
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Fri Jun 6 18:12:51 2014 +0200
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Thu Jun 19 15:10:49 2014 +0200

    virCaps: expose pages info

    There are two places where you'll find info on page sizes. The first
    one is under the <cpu/> element, where all supported page sizes are
    listed. The second one is under each <cell/> element, which refers to
    a concrete NUMA node; there, the size of the page pool is reported.
    So the capabilities XML looks something like this:

    <capabilities>
      <host>
        <uuid>01281cda-f352-cb11-a9db-e905fe22010c</uuid>
        <cpu>
          <arch>x86_64</arch>
          <model>Westmere</model>
          <vendor>Intel</vendor>
          <topology sockets='1' cores='1' threads='1'/>
          ...
          <pages unit='KiB' size='4'/>
          <pages unit='KiB' size='2048'/>
          <pages unit='KiB' size='1048576'/>
        </cpu>
        ...
        <topology>
          <cells num='4'>
            <cell id='0'>
              <memory unit='KiB'>4054408</memory>
              <pages unit='KiB' size='4'>1013602</pages>
              <pages unit='KiB' size='2048'>3</pages>
              <pages unit='KiB' size='1048576'>1</pages>
              <distances/>
              <cpus num='1'>
                <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
              </cpus>
            </cell>
            <cell id='1'>
              <memory unit='KiB'>4071072</memory>
              <pages unit='KiB' size='4'>1017768</pages>
              <pages unit='KiB' size='2048'>3</pages>
              <pages unit='KiB' size='1048576'>1</pages>
              <distances/>
              <cpus num='1'>
                <cpu id='1' socket_id='0' core_id='0' siblings='1'/>
              </cpus>
            </cell>
            ...
          </cells>
        </topology>
        ...
      </host>
      <guest/>
    </capabilities>

    Signed-off-by: Michal Privoznik <mprivozn@redhat.com>

commit 35f1095e12abf333903915f96f029612648346d4
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Fri Jun 6 18:09:01 2014 +0200
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Thu Jun 19 15:10:49 2014 +0200

    virnuma: Introduce pages helpers

    For future work we need two functions: one that fetches the total
    number of pages and the number of free pages for a given NUMA node
    and page size (virNumaGetPageInfo()), and one to learn which page
    sizes are supported on a given node (virNumaGetPages()).

    Note that the system page size is disabled at the moment as there's
    one connected issue: if you have a NUMA node with huge pages
    allocated, the kernel still reports the normal memory size for that
    node, ignoring the fact that huge pages take their memory from the
    system pool. Until we resolve this, it's safer not to confuse users,
    and hence not to report any system pages yet.

    Signed-off-by: Michal Privoznik <mprivozn@redhat.com>

commit 99a63aed2d3a660b61a21f30da677d9e625510a6
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Mon Jun 16 14:02:34 2014 +0200
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Thu Jun 19 15:10:49 2014 +0200

    nodeinfo: Rename nodeGetFreeMemory to nodeGetMemory

    For future work we want to get info on not only the free memory but
    the overall memory size too. That's why the function must have a new
    signature too.

    Signed-off-by: Michal Privoznik <mprivozn@redhat.com>

commit 356c6f389fcff5ca74b393a0d94f7542c1be9d81
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Mon Jun 16 14:29:15 2014 +0200
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Thu Jun 19 15:10:49 2014 +0200

    virnuma: Introduce virNumaNodeIsAvailable

    The set of NUMA node IDs is not continuous on all hosts. This is
    critical, because our code currently assumes the set doesn't contain
    holes. For instance in nodeGetFreeMemory() we can see the following
    pattern:

        if ((max_node = virNumaGetMaxNode()) < 0)
            return 0;

        for (n = 0; n <= max_node; n++) {
            ...
        }

    while it should be something like this:

        if ((max_node = virNumaGetMaxNode()) < 0)
            return 0;

        for (n = 0; n <= max_node; n++) {
            if (!virNumaNodeIsAvailable(n))
                continue;
            ...
        }

    Signed-off-by: Michal Privoznik <mprivozn@redhat.com>

v1.2.5-166-g38fa03f
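For clarity, here is a minimal C sketch of calling the new virNodeGetFreePages() API described in the commit messages above. The connection URI, the two-cell host, and the chosen page sizes are assumptions for illustration; page sizes are passed in KiB, and the counts array is filled cell by cell, npages entries per cell:

#include <stdio.h>
#include <libvirt/libvirt.h>

int main(void)
{
    /* Page sizes in KiB: 2MiB and 1GiB hugepages. */
    unsigned int pages[] = { 2048, 1048576 };
    unsigned int npages = 2;
    int startCell = 0;            /* first NUMA cell to query */
    unsigned int cellCount = 2;   /* assumes a two-node host */
    unsigned long long counts[4]; /* npages * cellCount entries */
    unsigned int i, j;
    virConnectPtr conn;

    if (!(conn = virConnectOpen("qemu:///system")))
        return 1;

    /* Query free 2M and 1G pages on cells 0 and 1 in one call. */
    if (virNodeGetFreePages(conn, npages, pages, startCell,
                            cellCount, counts, 0) < 0) {
        virConnectClose(conn);
        return 1;
    }

    /* counts[] is laid out cell by cell, npages entries per cell. */
    for (i = 0; i < cellCount; i++)
        for (j = 0; j < npages; j++)
            printf("Node %d: %uKiB: %llu free\n",
                   startCell + (int)i, pages[j], counts[i * npages + j]);

    virConnectClose(conn);
    return 0;
}

Build with e.g. `gcc freepages.c $(pkg-config --cflags --libs libvirt)`; this needs a libvirt containing the commits above (1.2.6 or newer).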
Hi Michal,

I found that freepages reports a wrong number of hugepages.

1. set up hugepages:

# echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
# echo 513 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
# echo 1 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
# echo 2 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

2. check from capabilities; it's correct:

# virsh capabilities | grep page
    <pages unit='KiB' size='4'/>
    <pages unit='KiB' size='2048'/>
    <pages unit='KiB' size='1048576'/>
      <pages unit='KiB' size='4'>15985225</pages>
      <pages unit='KiB' size='2048'>512</pages>
      <pages unit='KiB' size='1048576'>2</pages>
      <pages unit='KiB' size='4'>15728128</pages>
      <pages unit='KiB' size='2048'>513</pages>
      <pages unit='KiB' size='1048576'>3</pages>

3. check from freepages; it's wrong:

# virsh freepages --all
Node 0:
4KiB: 15461962
2048KiB: 512
1048576KiB: 0

Node 1:
4KiB: 15197930
2048KiB: 1
1048576KiB: 3
(In reply to Jincheng Miao from comment #11)
> 1. set up hugepages:
> # echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> ...

While this tells the kernel to allocate hugepages, the operation may not succeed due to memory fragmentation. Since it is easier to find smaller chunks of contiguous free memory, the 2M allocations are more likely to succeed than the 1G ones.

> 2. check from capabilities; it's correct:
> ...

If you check it against what the kernel reports, is there any difference? I mean, what number is shown in the nr_hugepages file on nodes 0 and 1 for the 1G hugepage? Does it correspond to what libvirt reports?

cat /sys/devices/system/node/node{0,1}/hugepages/hugepages-1048576kB/nr_hugepages; virsh capabilities | grep pages; virsh freepages --all

Moreover, it takes some time for the kernel to allocate the pages, so it's better to run the commands above back to back, as in the one-liner here.
(In reply to Michal Privoznik from comment #12)
> If you check it against what the kernel reports, is there any difference?
> ...
> Moreover, it takes some time for the kernel to allocate the pages, so it's
> better to run the commands above back to back.

Yes, you are right. After waiting for a while, the 1G hugepage counts from nr_hugepages and freepages are consistent. Thanks for your advice.
This feature is implemented:

1. add hugepage allocation to the kernel command line:

default_hugepagesz=1G hugepagesz=1G hugepages=2 hugepagesz=2M hugepages=300

2. configure hugepage mount points for libvirt:

# vim /etc/libvirt/qemu.conf
...
hugetlbfs_mount = ["/dev/hugepages2M", "/dev/hugepages1G"]
...
# mkdir /dev/hugepages2M
# mount -t hugetlbfs -o pagesize=2M none /dev/hugepages2M
# mkdir /dev/hugepages1G
# mount -t hugetlbfs -o pagesize=1G none /dev/hugepages1G

3. check it via virsh capabilities (the same XML can be fetched programmatically; see the sketch after step 4):

# virsh capabilities
...
    <cpu>
      ...
      <pages unit='KiB' size='4'/>
      <pages unit='KiB' size='2048'/>
      <pages unit='KiB' size='1048576'/>
    </cpu>
...
    <topology>
      <cells num='1'>
        <cell id='0'>
          <memory unit='KiB'>7863696</memory>
          <pages unit='KiB' size='4'>1288036</pages>
          <pages unit='KiB' size='2048'>300</pages>
          <pages unit='KiB' size='1048576'>2</pages>
...

4. use freepages to query free pages:

# virsh freepages --all
Node 0:
4KiB: 482318
2048KiB: 300
1048576KiB: 2
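As a companion to step 3, a minimal C sketch that fetches the same capabilities XML through virConnectGetCapabilities(); the connection URI is an assumption, and parsing the <pages> elements out of the returned XML is left to the reader:

#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn;
    char *caps;

    if (!(conn = virConnectOpen("qemu:///system")))
        return 1;

    /* Returns the full capabilities XML; the <pages> elements under
     * <cpu> and under each NUMA <cell> carry the hugepage info. */
    if (!(caps = virConnectGetCapabilities(conn))) {
        virConnectClose(conn);
        return 1;
    }

    printf("%s\n", caps);

    free(caps);
    virConnectClose(conn);
    return 0;
}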
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0323.html