Bug 1232350

Summary: Provide persistency for hugepages configuration
Product: Red Hat Enterprise Linux 7
Reporter: Luiz Capitulino <lcapitulino>
Component: qemu-kvm
Assignee: Hai Huang <hhuang>
Status: CLOSED WONTFIX
QA Contact: Virtualization Bugs <virt-bugs>
Severity: unspecified
Priority: unspecified
Version: 7.2
CC: alex.williamson, aquini, cye, dhoward, hhuang, lcapitulino, leiwang, lilu, liwan, lnykryn, lwoodman, owalsh, stephenfin, systemd-maint-list, tburke, virt-maint
Target Milestone: rc
Keywords: FutureFeature
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Clone Of:
: 1435272 (view as bug list) Environment:
Last Closed: 2018-08-02 13:14:21 UTC
Type: Bug
Bug Blocks: 1175461, 1296180, 1394638, 1472889

Description Luiz Capitulino 2015-06-16 14:27:09 UTC
The HugeTLB subsystem in the Linux kernel provides bigger page sizes for user-space processes. On x86_64, for example, it provides 2MB and 1GB pages. Bigger pages are an important feature for high-performance computing and virtualization.

Before processes can allocate bigger pages, the pages have to be reserved manually by the system administrator. One way of doing this, and the recommended way for most use cases, is to write the number of pages to reserve to sysfs. For example, this reserves 10 1GB pages on node0:

# echo 10 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages

And this reserves 256 2MB pages in node4:

# echo 256 > /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages
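
To check what was actually reserved, the per-node counters can be read back from the same sysfs directories. The following is only a sketch: the function name show_hugepages and its optional directory argument are made up for illustration, while nr_hugepages and free_hugepages are the files the kernel exposes.

```shell
#!/bin/bash
# Print reserved and free huge pages for every NUMA node and page size.
# The optional argument only exists so the function can be pointed at a
# scratch directory; it defaults to the real sysfs location.
show_hugepages()
{
	local nodes_path=${1:-/sys/devices/system/node}
	local dir node size
	for dir in "$nodes_path"/node*/hugepages/hugepages-*; do
		[ -r "$dir/nr_hugepages" ] || continue
		node=${dir#"$nodes_path"/}    # node0/hugepages/hugepages-2048kB
		node=${node%%/*}              # node0
		size=${dir##*/hugepages-}     # 2048kB
		echo "$node $size total=$(cat "$dir/nr_hugepages") free=$(cat "$dir/free_hugepages")"
	done
}

show_hugepages
```

The kernel may allocate fewer pages than requested when memory is fragmented, so reading the value back is the only way to know whether a reservation fully succeeded.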

The problem is that we need a way to make this configuration persistent. Not only that, but the reservation also needs to happen as early as possible during boot, so that memory is less fragmented when the pages are allocated.

PS: I'm not sure what the right component for this is; it could be libhugetlbfs, libhugetlbfs-utils or maybe systemd itself.

Comment 1 Luiz Capitulino 2015-06-16 14:31:14 UTC
This is an _example_ of what I've provided to a customer in the past:

1. Create a file named /usr/lib/systemd/system/hugetlb-gigantic-pages.service
   with the following contents:

[Unit]
Description=HugeTLB Gigantic Pages Reservation
DefaultDependencies=no
Before=dev-hugepages.mount
ConditionPathExists=/sys/devices/system/node
ConditionKernelCommandLine=hugepagesz=1G

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/lib/systemd/hugetlb-reserve-pages

[Install]
WantedBy=sysinit.target

2. Create a file named /usr/lib/systemd/hugetlb-reserve-pages with the
   following contents:

#!/bin/bash

nodes_path=/sys/devices/system/node
if [ ! -d "$nodes_path" ]; then
	echo "ERROR: $nodes_path does not exist" >&2
	exit 1
fi

# reserve_pages <count> <node>: reserve <count> 1G pages on <node>
reserve_pages()
{
	echo "$1" > "$nodes_path/$2/hugepages/hugepages-1048576kB/nr_hugepages"
}

# This example reserves 2 1G pages on node0 and 1 1G page on node1. You
# can modify it to your needs or add more lines to reserve memory in
# other nodes. Don't forget to uncomment the lines, otherwise they won't
# be executed.
# reserve_pages 2 node0
# reserve_pages 1 node1

3. Run the following commands to enable early boot reservation:

# chmod +x /usr/lib/systemd/hugetlb-reserve-pages
# systemctl enable hugetlb-gigantic-pages

4. Modify /usr/lib/systemd/hugetlb-reserve-pages according to the
   comments in the file

5. Reboot the machine

Comment 3 Petr Holasek 2015-07-14 13:53:50 UTC
(In reply to Luiz Capitulino from comment #1)
> 
> 4. Modify /usr/lib/systemd/hugetlb-reserve-pages according to the
>    comments in the file

Did you let the customer edit the file manually, or was there a wrapper for doing that? Anyway, using systemd for persistent but configurable hugepage allocation is smart and I don't see any obstacle to doing it like that.

Comment 4 Luiz Capitulino 2015-07-14 21:00:04 UTC
The customer would edit the file manually. Yes, this is hacky but it was a one-time solution for a specific customer...
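
A less hacky variant of the script from comment 1 could read the reservations from a configuration file instead of being edited in place. This is only a sketch under stated assumptions: the path /etc/hugetlb-reserve.conf, its three-column format and the function name reserve_from_config are all hypothetical and not shipped by any package.

```shell
#!/bin/bash
# Reserve huge pages as listed in a configuration file.
# Each non-comment line has the form:  <node> <page size in kB> <count>
#   node0 1048576 2
#   node1 1048576 1
# $1 = config file (hypothetical path), $2 = nodes directory (for dry runs)
reserve_from_config()
{
	local conf=${1:-/etc/hugetlb-reserve.conf}
	local nodes_path=${2:-/sys/devices/system/node}
	local node size count target got rc=0

	[ -r "$conf" ] || { echo "ERROR: cannot read $conf" >&2; return 1; }

	while read -r node size count; do
		# Skip blank lines and comments.
		case $node in ''|'#'*) continue ;; esac
		target=$nodes_path/$node/hugepages/hugepages-${size}kB/nr_hugepages
		if [ ! -w "$target" ]; then
			echo "ERROR: $target is not writable" >&2
			rc=1
			continue
		fi
		echo "$count" > "$target"
		# The kernel may allocate fewer pages than requested, so read back.
		got=$(cat "$target")
		if [ "$got" -lt "$count" ]; then
			echo "WARNING: $node: wanted $count pages of ${size}kB, got $got" >&2
			rc=1
		fi
	done < "$conf"
	return $rc
}

# Uncomment to apply the configuration when the script runs:
# reserve_from_config
```

With something like this as the ExecStart= script, the unit file from comment 1 could stay unchanged and only the configuration file would need editing.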

Comment 7 Lukáš Nykrýn 2016-02-01 12:05:12 UTC
Why can't you set this through sysctl.conf?

Comment 8 Luiz Capitulino 2016-02-02 08:05:07 UTC
Because the per-NUMA node interface is in sysfs, not /proc/sys.

Comment 9 Lukáš Nykrýn 2016-02-02 08:34:52 UTC
Well, this is not a generic thing that should be handled by systemd. I think a better solution would be a unit file in libhugetlbfs-utils.

Comment 10 Luiz Capitulino 2016-02-16 14:37:37 UTC
Lukáš, I agree with you. I think the best solution is a unit file in libhugetlbfs-utils.

Comment 14 Larry Woodman 2017-04-10 20:05:26 UTC
This is not a libhugetlbfs library or hugetlbfs kernel problem. It's an issue with sysctl or even tuned, where the tunable parameters are stored and held over reboots. Let me look further into where this should be enhanced.


Larry Woodman

Comment 15 Stephen Finucane 2017-10-18 10:52:34 UTC
Is this looking like something that might ship in RHEL 7.5? Persistent, per-NUMA node hugepage configuration is something that we'd like to be able to rely on from an OpenStack perspective.

Comment 16 Rafael Aquini 2018-08-01 22:20:04 UTC
As noted in comment 14, this is not libhugetlbfs ground, and it seems that systemd is not the right place to carry these requirements either -- see comment 9. Probably, this should be dealt with by the QEMU KVM-RT team under their own tooling and/or documentation.

I'm switching the component over to qemu-kvm, but if the QEMU team does not think this request fits there either, then please just get this ticket closed.

Regards,
-- Rafael

Comment 17 Luiz Capitulino 2018-08-02 13:14:21 UTC
This is certainly not a qemu or kvm issue, since kvm guests are not the only users of hugepages in Linux (I'd even guess HugeTLB existed before KVM).

We (like most users) solved this problem by using init scripts such as /etc/rc.d/rc.local. Since this workaround has always worked and since we seem unable to give a better solution, I'll just close the BZ.
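
For reference, the rc.local workaround might look roughly like the sketch below. The helper name reserve and its optional fourth argument are made up for illustration, and the counts are the example values from the description. Note that rc.local runs late in boot, so large reservations are more likely to come up short than with an early-boot unit.

```shell
#!/bin/bash
# Possible addition to /etc/rc.d/rc.local (sketch). On RHEL 7 the file must
# be executable (chmod +x /etc/rc.d/rc.local) or it is not run at boot.

# reserve <count> <node> <page size in kB> [nodes directory]
reserve()
{
	local nodes_path=${4:-/sys/devices/system/node}
	echo "$1" > "$nodes_path/$2/hugepages/hugepages-${3}kB/nr_hugepages"
}

# Example values only; uncomment and adjust to the machine.
# reserve 10 node0 1048576
# reserve 256 node4 2048
```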