Bug 1772133

Summary: Transparent Huge Pages set to [always] is sub-optimal for many applications
Product: Red Hat Enterprise Linux 8 Reporter: Mark Nelson <mnelson>
Component: kernel Assignee: Andrea Arcangeli <aarcange>
kernel sub component: Memory Management QA Contact: Ping Fang <pifang>
Status: CLOSED NOTABUG Docs Contact:
Severity: unspecified    
Priority: unspecified CC: aarcange, aquini, mm-maint, pifang
Version: ---   
Target Milestone: rc   
Target Release: 8.0   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-04-03 01:40:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Mark Nelson 2019-11-13 17:47:17 UTC
Description of problem:

Transparent Huge Pages provides real benefit to certain applications by potentially reducing TLB misses and improving performance. For other applications, it can bloat memory usage and cause performance regressions.  By default, the kernel enables THP for applications that explicitly ask for it via MADV_HUGEPAGE:

> "madvise" will enter direct reclaim like "always" but only for regions
> that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.

https://www.kernel.org/doc/Documentation/vm/transhuge.txt
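
The mode in effect on a given system can be read from sysfs, where the kernel marks the active choice with square brackets (the [always] notation used in this report). Below is a minimal Python sketch, assuming the usual path /sys/kernel/mm/transparent_hugepage/enabled; the helper names parse_thp_mode and current_thp_mode are made up for illustration.

```python
import re
from pathlib import Path

# Usual sysfs location of the THP mode; assumed to exist on THP-enabled kernels.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text):
    """Return the bracketed (active) mode from e.g. '[always] madvise never'."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no active mode found in {text!r}")
    return match.group(1)


def current_thp_mode(path=THP_ENABLED):
    """Read the active mode from sysfs, or return None if the file is unreadable."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        return None
```

Calling current_thp_mode() on the hosts being compared shows whether they are running with always, madvise, or never.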

RHEL, CentOS, and CoreOS (but not Fedora) all appear to override this behavior and set THP to [always].  This unfortunately causes issues with a large variety of software including, but not limited to:

splunk: https://docs.splunk.com/Documentation/Splunk/7.3.2/ReleaseNotes/SplunkandTHP
mongodb: https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
couchbase: https://docs.couchbase.com/server/current/install/thp-disable.html
oracle: https://blogs.oracle.com/linux/performance-issues-with-transparent-huge-pages-thp
nuodb: http://doc.nuodb.com/4.0/Content/OpenShift-disable-THP.htm
Go runtime: https://github.com/golang/go/issues/8832
jemalloc: https://blog.digitalocean.com/transparent-huge-pages-and-alternative-memory-allocators/
node.js: https://github.com/nodejs/node/issues/11077
tcmalloc: https://github.com/gperftools/gperftools/issues/1073

More recently, we've also seen memory usage bloat in Ceph (using tcmalloc) when THP is set to always, potentially resulting in OOM when running inside containers.  There are various ways to work around this at the application level, including MADV_NOHUGEPAGE or a prctl flag.  Requiring these workarounds to disable THP for a given application is counter-intuitive for several reasons:

1) It deviates from the default kernel behavior without a strong justification as to why.

2) It puts the onus on developers to explicitly stop the kernel from engaging in sub-optimal behavior.

3) It's incredibly confusing to have a system-wide default that claims to "always" enable a setting that many applications may or may not silently disable through workarounds.
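
To illustrate the kind of per-application workaround being discussed, here is a hedged Python sketch of the prctl route. It assumes the flag meant here is PR_SET_THP_DISABLE (values 41 and 42 are taken from <linux/prctl.h>) and calls it through ctypes; the function names are hypothetical, and the calls simply report failure on systems without prctl.

```python
import ctypes

# Option values from <linux/prctl.h>.
PR_SET_THP_DISABLE = 41
PR_GET_THP_DISABLE = 42


def _prctl(option, arg2=0):
    """Call prctl(2); return its result, or None when prctl is unavailable."""
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        prctl = libc.prctl
    except (OSError, AttributeError, TypeError):
        return None  # not Linux, or the C library does not export prctl
    prctl.restype = ctypes.c_int
    prctl.argtypes = [ctypes.c_int] + [ctypes.c_ulong] * 4
    # The unused trailing arguments must be zero for these two options.
    return prctl(option, arg2, 0, 0, 0)


def disable_thp_for_process():
    """Ask the kernel not to use THP for this process. True on success."""
    return _prctl(PR_SET_THP_DISABLE, 1) == 0


def thp_disabled_for_process():
    """True if the THP-disable flag is set; False if clear or unknown."""
    return _prctl(PR_GET_THP_DISABLE) == 1
```

Per prctl(2), the flag is inherited across fork and preserved across execve, so a small wrapper could set it and then exec the real application. That is exactly the sort of extra step the points above argue should not be necessary.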

Finally, when another prominent distribution was faced with a similar choice, they ran stream and malloc tests showing improvement at various allocation sizes when THP was disabled.  Ultimately that led them to switch back to the kernel default (i.e. madvise) with no apparent performance regressions:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1703742

Version-Release number of selected component (if applicable):


How reproducible:

This is a well-known issue that can be reproduced with a variety of software.  Steps to reproduce in Ceph are listed below.

Steps to Reproduce:
1. Install a single-OSD Ceph cluster.
2. Run a background write workload using hsbench or fio sufficient to fill the ceph-osd caches.
3. Compare memory usage of the OSD process when THP is set to [always] vs [madvise].
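
For step 3, one possible way to make the comparison is to read the Rss and AnonHugePages totals of the OSD process from procfs. The sketch below is only an illustration and not the method used to produce the results linked under "Actual results"; it assumes the kernel exposes /proc/<pid>/smaps_rollup, and the helper names are hypothetical.

```python
from pathlib import Path


def summarize_smaps(text, fields=("Rss", "AnonHugePages")):
    """Sum the given kB fields over smaps or smaps_rollup formatted text."""
    totals = dict.fromkeys(fields, 0)
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        # Only count lines of the form "<Field>:   <number> kB".
        if sep and key in totals and len(parts) == 2 and parts[1] == "kB":
            totals[key] += int(parts[0])
    return totals


def process_memory(pid):
    """Return Rss and AnonHugePages (in kB) for a process id, or 'self'."""
    text = Path(f"/proc/{pid}/smaps_rollup").read_text()
    return summarize_smaps(text)
```

Running process_memory(<ceph-osd pid>) once under each THP mode, after the caches have filled, gives two sets of numbers to compare.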

Actual results:

https://docs.google.com/spreadsheets/d/1Xl3nWapi7ZKEmpnsSHHWO96iopEG0hK6GeDWhWKSfDo/edit?usp=sharing

Expected results:

These are the expected results when THP is set to [always] instead of [madvise] and the application does not explicitly override the kernel settings.  Optimally THP would only be used in situations where it provides a benefit and not a regression.

Additional Information:

https://unix.stackexchange.com/questions/495816/which-distributions-enable-transparent-huge-pages-for-all-applications
https://www.percona.com/blog/2019/03/06/settling-the-myth-of-transparent-hugepages-for-databases/
https://blog.nelhage.com/post/transparent-hugepages/
https://alexandrnikitin.github.io/blog/transparent-hugepages-measuring-the-performance-impact/
https://dl.acm.org/citation.cfm?id=3359640

Comment 1 Mark Nelson 2019-11-13 19:04:33 UTC
Update:

While the kernel documentation claims that madvise is the default, the actual code in mm/Kconfig shows that "always" is the default choice, so I retract the statement about differing from the kernel.  See:

https://github.com/torvalds/linux/blob/master/mm/Kconfig#L385-L407

Still, I think the rest stands.
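
The build-time default chosen in mm/Kconfig can be checked for a particular kernel by looking at its config for CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS or CONFIG_TRANSPARENT_HUGEPAGE_MADVISE. Here is a small Python sketch; the function name is made up, and the location of the config file (often /boot/config-<release> on distribution kernels) is an assumption, so the text is passed in by the caller.

```python
def thp_default_from_config(config_text):
    """Return 'always', 'madvise', or None from kernel .config text."""
    choices = {
        "CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS": "always",
        "CONFIG_TRANSPARENT_HUGEPAGE_MADVISE": "madvise",
    }
    for line in config_text.splitlines():
        name, sep, value = line.strip().partition("=")
        # Disabled options appear as "# CONFIG_... is not set" and are skipped.
        if sep and value == "y" and name in choices:
            return choices[name]
    return None
```

A result of None means neither option is set to y in the given text, for example when THP support is not built in at all.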

Comment 2 Rafael Aquini 2019-11-19 03:02:25 UTC
Patch posted upstream suggesting the config change:

  https://lkml.org/lkml/2019/11/18/1031