Bug 1076957 (libvirt-api-hugepages)
| Summary: | Expose huge pages information through libvirt API | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Stephen Gordon <sgordon> |
| Component: | libvirt | Assignee: | Michal Privoznik <mprivozn> |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | ||
| Version: | 7.0 | CC: | dyuan, gsun, jdenemar, jmiao, mprivozn, mzhan, rbalakri, tvvcox, weizhan, xuzhang |
| Target Milestone: | rc | Keywords: | FutureFeature, Upstream |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | libvirt-1.2.7-1.el7 | Doc Type: | Enhancement |
| Doc Text: |
Feature:
Expose huge page information through a libvirt API
Reason:
When allocating bigger chunks of memory, huge pages come in handy: they carry less overhead than memory allocated in standard-sized pages. However, huge pages require hardware (CPU) cooperation, so not every page size is available everywhere - it depends on the host CPU, host OS configuration, and so on. When users want to back their guests' memory with huge pages, they therefore need to know which huge page sizes are available, so libvirt needs to learn to gather this kind of information and expose it.
Result:
Huge page information is exposed in the capabilities XML ('virsh capabilities'), and the number of free pages is exposed via a new API ('virsh freepages'). Note that each host NUMA node has its own huge pages pool.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2015-03-05 07:31:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1057941, 1078542 | ||
Description
Stephen Gordon
2014-03-16 18:45:30 UTC
I've just proposed patches upstream:
https://www.redhat.com/archives/libvir-list/2014-May/msg00991.html

It's available in C as the virNodeHugeTLB() API, in virsh (`virsh hugepages`), and in libvirt-python too. To get overall info, pass -1 as NODE#; to get info on a specific node, pass a valid NODE#:

# virsh hugepages
Supported hugepage sizes:
hugepage_size      1048576
hugepage_available 4
hugepage_free      4
hugepage_size      2048
hugepage_available 12
hugepage_free      12

As we can see, the host supports 1GiB and 2MiB hugepages (and none of them is being used right now).

Another attempt:
https://www.redhat.com/archives/libvir-list/2014-June/msg00435.html

Yet another one:
https://www.redhat.com/archives/libvir-list/2014-June/msg00710.html

So I've just pushed patches upstream:
commit 38fa03f4b0f5f84642cd99b6b8704f5028984770
Author: Michal Privoznik <mprivozn>
AuthorDate: Tue Jun 10 16:16:44 2014 +0200
Commit: Michal Privoznik <mprivozn>
CommitDate: Thu Jun 19 15:10:50 2014 +0200
nodeinfo: Implement nodeGetFreePages
And add stubs to other drivers: lxc, qemu, uml and vbox.
Signed-off-by: Michal Privoznik <mprivozn>
commit 9e3efe53ded95e6b3284f7f55f625da87018e484
Author: Michal Privoznik <mprivozn>
AuthorDate: Mon Jun 9 17:56:43 2014 +0200
Commit: Michal Privoznik <mprivozn>
CommitDate: Thu Jun 19 15:10:50 2014 +0200
virsh: Expose virNodeGetFreePages
The new API is exposed under the 'freepages' command.
Signed-off-by: Michal Privoznik <mprivozn>
commit 34f2d0319d2098c77c8cc27d8350616029125a2b
Author: Michal Privoznik <mprivozn>
AuthorDate: Mon Jun 9 17:14:47 2014 +0200
Commit: Michal Privoznik <mprivozn>
CommitDate: Thu Jun 19 15:10:49 2014 +0200
Introduce virNodeGetFreePages
The aim of the API is to get information on the number of free pages
on the system. The API behaves similarly to
virNodeGetCellsFreeMemory(): the user passes the starting NUMA cell,
the count of cells they are interested in, and the page sizes (yes,
multiple sizes can be queried at once), and the counts are returned
in an array.
Signed-off-by: Michal Privoznik <mprivozn>
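For illustration, a minimal C caller of this API could look as follows. This is a sketch only: the qemu:///system URI and the trimmed error handling are illustrative choices, not part of the commit.

#include <stdio.h>
#include <libvirt/libvirt.h>

/* Query how many 2MiB and 1GiB pages are free on NUMA node 0.
 * Build with: gcc freepages.c -lvirt */
int main(void)
{
    virConnectPtr conn = virConnectOpenReadOnly("qemu:///system");
    if (!conn)
        return 1;

    unsigned int pages[] = { 2048, 1048576 };  /* page sizes, in KiB */
    unsigned long long counts[2] = { 0 };      /* npages * cellCount entries */

    /* startCell = 0, cellCount = 1: ask about NUMA node 0 only */
    if (virNodeGetFreePages(conn, 2, pages, 0, 1, counts, 0) < 0) {
        virConnectClose(conn);
        return 1;
    }

    printf("node 0: %llu free 2MiB pages, %llu free 1GiB pages\n",
           counts[0], counts[1]);
    virConnectClose(conn);
    return 0;
}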
commit 02129b7c0e581898f03468e0bfb5472dc9903339
Author: Michal Privoznik <mprivozn>
AuthorDate: Fri Jun 6 18:12:51 2014 +0200
Commit: Michal Privoznik <mprivozn>
CommitDate: Thu Jun 19 15:10:49 2014 +0200
virCaps: expose pages info
There are two places where you'll find info on page sizes. The first
one is under <cpu/> element, where all supported pages sizes are
listed. Then the second one is under each <cell/> element which refers
to concrete NUMA node. At this place, the size of page's pool is
reported. So the capabilities XML looks something like this:
<capabilities>
<host>
<uuid>01281cda-f352-cb11-a9db-e905fe22010c</uuid>
<cpu>
<arch>x86_64</arch>
<model>Westmere</model>
<vendor>Intel</vendor>
<topology sockets='1' cores='1' threads='1'/>
...
<pages unit='KiB' size='4'/>
<pages unit='KiB' size='2048'/>
<pages unit='KiB' size='1048576'/>
</cpu>
...
<topology>
<cells num='4'>
<cell id='0'>
<memory unit='KiB'>4054408</memory>
<pages unit='KiB' size='4'>1013602</pages>
<pages unit='KiB' size='2048'>3</pages>
<pages unit='KiB' size='1048576'>1</pages>
<distances/>
<cpus num='1'>
<cpu id='0' socket_id='0' core_id='0' siblings='0'/>
</cpus>
</cell>
<cell id='1'>
<memory unit='KiB'>4071072</memory>
<pages unit='KiB' size='4'>1017768</pages>
<pages unit='KiB' size='2048'>3</pages>
<pages unit='KiB' size='1048576'>1</pages>
<distances/>
<cpus num='1'>
<cpu id='1' socket_id='0' core_id='0' siblings='1'/>
</cpus>
</cell>
...
</cells>
</topology>
...
</host>
<guest/>
</capabilities>
Signed-off-by: Michal Privoznik <mprivozn>
commit 35f1095e12abf333903915f96f029612648346d4
Author: Michal Privoznik <mprivozn>
AuthorDate: Fri Jun 6 18:09:01 2014 +0200
Commit: Michal Privoznik <mprivozn>
CommitDate: Thu Jun 19 15:10:49 2014 +0200
virnuma: Introduce pages helpers
For future work we need two functions: one that fetches the total
number of pages and the number of free pages for a given NUMA node
and page size (virNumaGetPageInfo()), and one that learns which page
sizes are supported on a given node (virNumaGetPages()).
Note that the system page size is disabled at the moment, as there's
one connected issue: if you have a NUMA node with huge pages
allocated, the kernel reports the normal memory size for that node,
basically ignoring the fact that huge pages take their memory from
the system pool. Until we resolve this, it's safer to not confuse
users and hence not report any system pages yet.
Signed-off-by: Michal Privoznik <mprivozn>
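The per-node counters these helpers consume are exposed by the kernel under sysfs (the same files used in the reproduction steps below). The commit bodies above don't include the helper internals, so this is only a hedged sketch of the underlying kernel interface, not libvirt's code:

#include <stdio.h>

/* Read one numeric counter file from sysfs. */
static int read_counter(const char *path, unsigned long long *val)
{
    FILE *fp = fopen(path, "r");
    if (!fp)
        return -1;
    int ok = (fscanf(fp, "%llu", val) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}

/* Print total and free 2MiB pages for NUMA node 0. */
int main(void)
{
    int node = 0, size_kib = 2048;
    unsigned long long total, free_pages;
    char path[256];

    snprintf(path, sizeof(path),
             "/sys/devices/system/node/node%d/hugepages/hugepages-%dkB/nr_hugepages",
             node, size_kib);
    if (read_counter(path, &total) < 0)
        return 1;

    snprintf(path, sizeof(path),
             "/sys/devices/system/node/node%d/hugepages/hugepages-%dkB/free_hugepages",
             node, size_kib);
    if (read_counter(path, &free_pages) < 0)
        return 1;

    printf("node %d: %llu x %dKiB pages, %llu free\n",
           node, total, size_kib, free_pages);
    return 0;
}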
commit 99a63aed2d3a660b61a21f30da677d9e625510a6
Author: Michal Privoznik <mprivozn>
AuthorDate: Mon Jun 16 14:02:34 2014 +0200
Commit: Michal Privoznik <mprivozn>
CommitDate: Thu Jun 19 15:10:49 2014 +0200
nodeinfo: Rename nodeGetFreeMemory to nodeGetMemory
For future work we want to get info not only on the free memory but
on the overall memory size too. That's why the function needs a new
signature as well.
Signed-off-by: Michal Privoznik <mprivozn>
commit 356c6f389fcff5ca74b393a0d94f7542c1be9d81
Author: Michal Privoznik <mprivozn>
AuthorDate: Mon Jun 16 14:29:15 2014 +0200
Commit: Michal Privoznik <mprivozn>
CommitDate: Thu Jun 19 15:10:49 2014 +0200
virnuma: Introduce virNumaNodeIsAvailable
The set of NUMA node IDs is not continuous on all hosts. This is
critical, because our code currently assumes the set doesn't contain
holes. For instance, in nodeGetFreeMemory() we can see the following
pattern:
if ((max_node = virNumaGetMaxNode()) < 0)
return 0;
for (n = 0; n <= max_node; n++) {
...
}
while it should be something like this:
if ((max_node = virNumaGetMaxNode()) < 0)
return 0;
for (n = 0; n <= max_node; n++) {
if (!virNumaNodeIsAvailable(n))
continue;
...
}
Signed-off-by: Michal Privoznik <mprivozn>
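The commit above doesn't show the helper's body. As a rough, purely illustrative stand-in (not the patch's implementation), node presence can be probed through sysfs, which makes the "holes in the ID set" problem visible:

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* A node ID is "available" if its sysfs directory exists; on hosts
 * with non-continuous numbering some IDs in [0, max_node] won't be. */
static bool node_is_available(int node)
{
    char path[128];
    snprintf(path, sizeof(path), "/sys/devices/system/node/node%d", node);
    return access(path, F_OK) == 0;
}

int main(void)
{
    for (int n = 0; n < 8; n++)
        printf("node %d: %s\n", n, node_is_available(n) ? "present" : "absent");
    return 0;
}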
v1.2.5-166-g38fa03f
Hi Michal,
I found that freepages reports a wrong number of hugepages.
1. setup hugepages
# echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
# echo 513 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
# echo 1 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
# echo 2 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
2. check from capabilities, and it's correct
# virsh capabilities | grep page
<pages unit='KiB' size='4'/>
<pages unit='KiB' size='2048'/>
<pages unit='KiB' size='1048576'/>
<pages unit='KiB' size='4'>15985225</pages>
<pages unit='KiB' size='2048'>512</pages>
<pages unit='KiB' size='1048576'>2</pages>
<pages unit='KiB' size='4'>15728128</pages>
<pages unit='KiB' size='2048'>513</pages>
<pages unit='KiB' size='1048576'>3</pages>
3. check from freepages, and it's wrong
# virsh freepages --all
Node 0:
4KiB: 15461962
2048KiB: 512
1048576KiB: 0
Node 1:
4KiB: 15197930
2048KiB: 1
1048576KiB: 3
(In reply to Jincheng Miao from comment #11)
> 1. setup hugepages
> # echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> # echo 513 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
> # echo 1 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
> # echo 2 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

While this tells the kernel to allocate hugepages, the operation may not succeed due to memory fragmentation. Since it's easier to find smaller chunks of contiguous free memory, a 2M allocation is more likely to succeed than a 1G one.

> 2. check from capabilities, and it's correct
> # virsh capabilities | grep page
> ...

If you check it against what the kernel reports, is there any difference? I mean, what number is shown in the nr_hugepages file on nodes 0 and 1 for the 1G hugepage? Does it correspond to what libvirt reports?

cat /sys/devices/system/node/node{0,1}/hugepages/hugepages-1048576kB/nr_hugepages ; virsh capabilities | grep pages; virsh freepages --all

Moreover, it takes some time for the kernel to allocate the pages, so it's better to run the commands above in one go.

(In reply to Michal Privoznik from comment #12)
> If you check it against what the kernel reports, is there any difference?
> ...
> Moreover, it takes some time for the kernel to allocate the pages, so it's
> better to run the commands above in one go.

Yes, you are right. After waiting for a while, the 1G hugepage count is consistent between nr_hugepages and freepages. Thanks for your advice.

This feature is implemented:
1. add hugepages allocation in kernel cmdline:
'default_hugepagesz=1G hugepagesz=1G hugepages=2 hugepagesz=2M hugepages=300'
2. configure hugepage mount point for libvirt
# vim /etc/libvirt/qemu.conf
...
hugetlbfs_mount = ["/dev/hugepages2M", "/dev/hugepages1G"]
...
# mkdir /dev/hugepages2M
# mount -t hugetlbfs -o pagesize=2M none /dev/hugepages2M
# mkdir /dev/hugepages1G
# mount -t hugetlbfs -o pagesize=1G none /dev/hugepages1G
3. check it via virsh capabilities
# virsh capabilities
...
<cpu>
...
<pages unit='KiB' size='4'/>
<pages unit='KiB' size='2048'/>
<pages unit='KiB' size='1048576'/>
</cpu>
...
<topology>
<cells num='1'>
<cell id='0'>
<memory unit='KiB'>7863696</memory>
<pages unit='KiB' size='4'>1288036</pages>
<pages unit='KiB' size='2048'>300</pages>
<pages unit='KiB' size='1048576'>2</pages>
...
4. use freepages to query free pages
# virsh freepages --all
Node 0:
4KiB: 482318
2048KiB: 300
1048576KiB: 2
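The same verification can also be done programmatically. As a minimal sketch (the read-only qemu:///system connection and the minimal error handling are illustrative choices), virConnectGetCapabilities() returns the capabilities XML shown in step 3, including the <pages/> elements:

#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn = virConnectOpenReadOnly("qemu:///system");
    if (!conn)
        return 1;

    /* Full capabilities XML; the caller must free the returned string. */
    char *caps = virConnectGetCapabilities(conn);
    if (caps) {
        puts(caps);
        free(caps);
    }

    virConnectClose(conn);
    return 0;
}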
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0323.html