The HugeTLB subsystem in the Linux kernel provides larger page sizes for user-space processes. On x86_64, for example, it provides 2MB and 1GB pages. Larger pages are an important feature for high-performance computing and virtualization.
Before processes can allocate larger pages, the system administrator has to reserve them manually. One way of doing this, and the recommended way for most use cases, is to write the number of pages to reserve to a file in sysfs. For example, this reserves 10 1GB pages on node0:
# echo 10 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
And this reserves 256 2MB pages on node4:
# echo 256 > /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages
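For a sense of scale, the memory set aside by a reservation is simply the page count times the page size. A minimal sketch of that arithmetic follows; reservation_mib is a hypothetical helper name used only for illustration, not part of any existing tool.

```shell
#!/bin/bash
# Illustrative helper (hypothetical name): memory consumed by a reservation,
# in MiB. Usage: reservation_mib <page_count> <page_size_kB>
# The page size is the number that appears in the hugepages-<size>kB
# directory name in sysfs.
reservation_mib() {
    local count="$1" size_kb="$2"
    echo $(( count * size_kb / 1024 ))
}

reservation_mib 10 1048576   # ten 1GB pages  -> 10240
reservation_mib 256 2048     # 256 2MB pages  -> 512
```

So the two commands above take roughly 10 GiB and 512 MiB away from the general-purpose memory of their respective nodes.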
The problem is that we need a way to make this configuration persistent. We also need the reservation to happen as early as possible during boot, to limit the effect of memory fragmentation.
PS: I'm not sure what the right component for this is; it could be libhugetlbfs, libhugetlbfs-utils, or maybe systemd itself.
This is an _example_ of what I've provided to a customer in the past:
1. Create a file named /usr/lib/systemd/system/hugetlb-gigantic-pages.service
with the following contents:
Description=HugeTLB Gigantic Pages Reservation
2. Create a file named /usr/lib/systemd/hugetlb-reserve-pages with the
following contents:
#!/bin/bash
nodes_path=/sys/devices/system/node
if [ ! -d "$nodes_path" ]; then
    echo "ERROR: $nodes_path does not exist"
    exit 1
fi
# reserve_pages <number of pages> <node name>
reserve_pages()
{
    echo "$1" > "$nodes_path/$2/hugepages/hugepages-1048576kB/nr_hugepages"
}
# This example reserves 2 1G pages on node0 and 1 1G page on node1. You
# can modify it to your needs or add more lines to reserve memory in
# other nodes. Don't forget to uncomment the lines, otherwise they won't
# be executed.
# reserve_pages 2 node0
# reserve_pages 1 node1
3. Run the following commands to enable early boot reservation:
# chmod +x /usr/lib/systemd/hugetlb-reserve-pages
# systemctl enable hugetlb-gigantic-pages
4. Modify /usr/lib/systemd/hugetlb-reserve-pages according to the
comments in the file
5. Reboot the machine
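For step 1, only the Description= line of the unit file is shown above. A minimal sketch of what a complete unit could look like is given here, assuming a oneshot service that runs early in boot and before the hugetlbfs mount. Every directive other than Description= and the script path from step 2 is an assumption rather than the original contents, and write_hugetlb_unit is a hypothetical helper name used so the unit text can be written to any path.

```shell
#!/bin/bash
# Sketch only: apart from Description= and the ExecStart path taken from
# step 2, the directives below are assumptions about a reasonable unit.
write_hugetlb_unit() {
    local dest="$1"
    cat > "$dest" <<'EOF'
[Unit]
Description=HugeTLB Gigantic Pages Reservation
# Assumption: run outside the default dependency set so the unit can
# start early, before the hugetlbfs mount unit.
DefaultDependencies=no
Before=dev-hugepages.mount
ConditionPathExists=/sys/devices/system/node

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/lib/systemd/hugetlb-reserve-pages

[Install]
WantedBy=sysinit.target
EOF
}

# Example (as root):
# write_hugetlb_unit /usr/lib/systemd/system/hugetlb-gigantic-pages.service
```

Type=oneshot fits here because the script writes to sysfs and exits. RemainAfterExit=yes keeps the unit reported as active afterwards, which makes it easier to see whether the reservation step ran.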
(In reply to Luiz Capitulino from comment #1)
> 4. Modify /usr/lib/systemd/hugetlb-reserve-pages according to the
> comments in the file
Did you let the customer edit the file manually, or was there a wrapper for doing that? Anyway, using systemd for persistent but configurable hugepage allocation is smart, and I don't see any obstacle to doing it like that.
The customer would edit the file manually. Yes, this is hacky but it was a one-time solution for a specific customer...
Why can't you set this through sysctl.conf?
Because the per-NUMA node interface is in sysfs, not /proc/sys.
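To illustrate the difference: sysctl.conf can only reach files under /proc/sys (for example vm.nr_hugepages, which is system-wide), while the per-node pools live under /sys/devices/system/node. A minimal sketch that enumerates those per-node pools follows; list_node_hugepages and the SYSFS_NODES override are hypothetical names introduced here so the function can also be pointed at a test directory.

```shell
#!/bin/bash
# Sketch: print "<node> <pool> <count>" for every per-node huge page pool.
# SYSFS_NODES is an assumed override, defaulting to the real sysfs path.
list_node_hugepages() {
    local base="${SYSFS_NODES:-/sys/devices/system/node}"
    local f node pool
    for f in "$base"/node*/hugepages/hugepages-*/nr_hugepages; do
        # Skip the unexpanded glob when no pool exists.
        [ -r "$f" ] || continue
        pool=$(basename "$(dirname "$f")")
        node=$(basename "$(dirname "$(dirname "$(dirname "$f")")")")
        echo "$node $pool $(cat "$f")"
    done
}

# Example: list_node_hugepages
```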
Well, this is not a generic thing that should be handled by systemd. I think a better solution would be a unit file in libhugetlbfs-utils.
Lukáš, I agree with you. I think the best solution is a unit file in libhugetlbfs-utils.
This is not a libhugetlbfs library or hugetlbfs kernel problem. It's an issue with sysctl, or even tuned, where the tunable parameters are stored and preserved across reboots. Let me look further into where this should be enhanced.
Is this looking like something that might ship in RHEL 7.5? Persistent, per-NUMA node hugepage configuration is something that we'd like to be able to rely on from an OpenStack perspective.
As noted in comment 14, this is not libhugetlbfs ground, and it seems that systemd is not the right place to carry these requirements (see comment 9). Probably, this should be dealt with by the QEMU KVM-RT team under their own tooling and/or documentation.
I'm switching the component over to qemu-kvm, but if neither QEMU team thinks this request fits there, then please just close this ticket.
This is certainly not a qemu or kvm issue, since kvm guests are not the only users of hugepages in Linux (I'd even guess HugeTLB existed before KVM).
We (like most users) solved this problem by using init scripts such as /etc/rc.d/rc.local. Since this workaround has always worked and since we seem unable to offer a better solution, I'll just close the BZ.
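A minimal sketch of what such an rc.local fragment could look like is shown here. The function name reserve_pages, its three-argument form, and the SYSFS_NODES override are illustrative assumptions, not an existing interface. The value is read back after writing because the kernel can allocate fewer pages than requested when memory is fragmented, which is more likely this late in boot.

```shell
#!/bin/bash
# Sketch of an /etc/rc.d/rc.local fragment. reserve_pages and SYSFS_NODES
# are illustrative names, not an existing tool or interface.
SYSFS_NODES="${SYSFS_NODES:-/sys/devices/system/node}"

# Usage: reserve_pages <page_count> <node name> <page_size_kB>
reserve_pages() {
    local count="$1" node="$2" size_kb="$3"
    local file="$SYSFS_NODES/$node/hugepages/hugepages-${size_kb}kB/nr_hugepages"
    local got
    if [ ! -w "$file" ]; then
        echo "ERROR: cannot write $file" >&2
        return 1
    fi
    echo "$count" > "$file"
    # Read the value back: the kernel may have allocated fewer pages.
    got=$(cat "$file")
    if [ "$got" -ne "$count" ]; then
        echo "WARNING: $node: wanted $count pages of ${size_kb}kB, got $got" >&2
        return 1
    fi
}

# Uncomment and adapt to the machine:
# reserve_pages 256 node0 2048
# reserve_pages 2 node0 1048576
```

On systems where rc.local is run by systemd, the file typically has to be executable for it to run at all.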