Bug 1626127 - glibc: [RFE] Improve malloc performance in low-memory scenarios for threaded arenas.
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: glibc
Version: 8.2
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 8.2
Assignee: glibc team
QA Contact: qe-baseos-tools-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-09-06 15:51 UTC by Paulo Andrade
Modified: 2023-07-18 14:30 UTC (History)
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-24 14:28:40 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
malloc_bad_mmap_pthread.c (4.03 KB, text/plain), 2018-09-06 15:51 UTC, Paulo Andrade
malloc_nomem.c (1.87 KB, text/plain), 2018-09-06 15:55 UTC, Paulo Andrade


Links
System ID Private Priority Status Summary Last Updated
Sourceware 25594 0 P2 NEW [RFE] Improve malloc performance in low-memory scenarios for threaded arenas. 2020-06-28 01:13:08 UTC

Description Paulo Andrade 2018-09-06 15:51:47 UTC
Created attachment 1481351 [details]
malloc_bad_mmap_pthread.c

The test case "appears" to work with newer glibc due to the per-thread cache,
but once the cache is exhausted the problem also occurs with newer glibc.
The issue happens on RHEL 7 as well.

  Talking to the user, we see that if mmap fails and the thread's arena is
switched to main_arena, all threads would likely converge on main_arena.

  In a second test case, we also see that under "normal conditions" a thread
only switches from main_arena to another arena when mmap fails, even if
another arena has free space. It is possible to trigger a switch to another
arena by calling malloc again after it returns NULL, but that is unspecified
behaviour; besides, retrying could be valid logic in a program that knows
other threads may release memory.
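The retry-on-NULL pattern described above can be sketched as follows. This is hypothetical application-side code (the wrapper name and retry count are invented for illustration), not part of glibc:

```c
#include <sched.h>
#include <stdlib.h>

/* Hypothetical application wrapper: retry malloc on NULL in the hope
 * that glibc falls back to another arena, or that another thread frees
 * memory in the meantime.  As noted above, relying on a retry landing
 * in a different arena is unspecified behaviour. */
static void *malloc_retry(size_t size, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        void *p = malloc(size);
        if (p != NULL)
            return p;
        sched_yield();  /* give other threads a chance to free memory */
    }
    return NULL;
}
```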

Comment 2 Paulo Andrade 2018-09-06 15:55:28 UTC
Created attachment 1481354 [details]
malloc_nomem.c

  This is the second test case, showing that a thread can only switch from
main_arena to another arena, even if there is space in another arena.
This test case is a bit difficult to use for observing that only the switch
away from main_arena occurs, and it might need to be run with
MALLOC_ARENA_MAX=N, where N is the number of threads; otherwise it fails
earlier while attempting to create a new arena.
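A hypothetical invocation of this test case (a thread count of 8 and the obvious binary name are assumed; adjust N to match the program's actual thread count):

```shell
# One arena per thread, so that arena creation itself does not fail
# before the behaviour of interest is reached.
export MALLOC_ARENA_MAX=8
./malloc_nomem
```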

Comment 3 Paulo Andrade 2018-09-06 15:57:15 UTC
  For the first test case, quoting the user so as not to lose context:

"""
Summary
malloc performance degradation in specific state after mmap fails

Details
The attached C program reproduces a state where every malloc request uses three mmap system calls, which results in an order of magnitude slowdown in malloc performance.
                                                                                                                                                                                 
Example Transcript:
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.10 (Santiago) 

$ uname -a
Linux hostname 2.6.32-754.2.1.el6.x86_64 #1 SMP Tue Jul 3 16:37:52 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux 

$ /usr/bin/gcc --version
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23)
[...]

$ /usr/bin/gcc -O3 -g -Wall -std=gnu99 -fPIC -m64 malloc_bad_mmap_pthread.c -lpthread -lrt -o malloc_bad_mmap_pthread

$ ./malloc_bad_mmap_pthread
Last 1000 mallocs and frees took 0.000076 seconds (13138705 mallocs/sec)
Last 1000 mallocs and frees took 0.000027 seconds (36837840 mallocs/sec)
Last 1000 mallocs and frees took 0.000027 seconds (36937170 mallocs/sec)
Last 1000 mallocs and frees took 0.000027 seconds (36964477 mallocs/sec)
create_nonmain_arena malloc result 0x7f67bc0008c0
thread result 0x7f67bc0008c0
Got failure after 2100261 blocks
freeing pre_list (from main_arena)
Last 1000 mallocs and frees took 0.002264 seconds (441742 mallocs/sec)
Last 1000 mallocs and frees took 0.002227 seconds (448980 mallocs/sec)
Last 1000 mallocs and frees took 0.002231 seconds (448216 mallocs/sec)
Last 1000 mallocs and frees took 0.002226 seconds (449241 mallocs/sec)

Expected behaviour
The malloc performance at the end should be fairly similar to the performance at the beginning.

Actual behaviour
After reaching the “bad” state, malloc+free is over 80 times slower than the best case (36964477/449241 = 82.3).

Analysis
The problem occurs when the current thread’s most-recently-used “arena” is not the “main” arena, this non-main arena has no free space, the main arena *does* have space, and mmap is failing. In this state every call to malloc uses three system calls. The reason is that malloc tries mmap before falling back to the main arena, but then does not update the most-recently-used arena for the current thread. I further suspect this can cause malloc to fail when in fact other arenas exist with available space, although the test program does not attempt to demonstrate this.

Here’s some relevant code from /usr/src/debug/glibc-2.12-2-gc4ccff1/malloc/malloc.c

Void_t*
public_mALLOc(size_t bytes)
{
...
  arena_get(ar_ptr, bytes);              //// Thread’s most-recently-used arena

  victim = _int_malloc(ar_ptr, bytes);   //// Tries mmap three times
  if(!victim && ar_ptr != NULL) {
    /* Maybe the failure is due to running out of mmapped areas. */
    if(ar_ptr != &main_arena) {
      (void)mutex_unlock(&ar_ptr->mutex);
      ar_ptr = &main_arena;
      (void)mutex_lock(&ar_ptr->mutex);
      victim = _int_malloc(ar_ptr, bytes);  //// Tries the main_arena (which has space)
                                         //// *No* update to thread’s most-recently-used arena
    } else {
...

You can see the pattern of failing mmap calls, where it tries three different sizes each time, in the strace output. For example:

$ strace -e trace=mmap -o strace.log ./malloc_bad_mmap_pthread >/dev/null
$ tail -20 strace.log
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
+++ exited with 0 +++

"""

Comment 4 Carlos O'Donell 2018-09-07 09:33:07 UTC
Paulo,

Thanks for the detailed submission. It will take us a while to work through this issue, and there is likely an upstream problem in handling the failing mmap (the low-memory scenario) which has to be fixed before any backport.

We'll start looking at this.

Comment 10 DJ Delorie 2018-09-21 21:14:30 UTC
I built a version of glibc with a patch that allows threads to migrate to the main heap, and put the source and binary RPMs here:

http://people.redhat.com/dj/bz1626127/

Please try this build and see if the migration change is sufficient to solve your problem.  Note: this is just a test build, and in no way implies that we'll be able to get an official fix in RHEL 6.  Also note: without knowing more about the state of your systems when they're low on resources, I can't say whether this patch will solve your problem or just expose the next one.  Once you start running out of resources, things go downhill pretty quickly.

Comment 12 Carlos O'Donell 2018-10-09 01:33:03 UTC
Red Hat Enterprise Linux 6 is in Maintenance Support Phase 2. Only urgent priority bug fixes will be considered.

Any changes to glibc's malloc at the level of algorithmic changes need to go through significant upstream review and involve:

* real world performance testing.

* expert review.

* simulator testing using saved customer workloads (Red Hat proprietary).

All of this may likely preclude any fixing in RHEL6 since the RHEL6 code base is very conservative at this point when it comes to any algorithmic changes.

Lastly the performance of glibc's malloc under low-memory conditions does not really fall under the scope of an urgent priority bug fix for RHEL6. There are several workarounds in this case, including adding more memory to the system, or switching to a maximum of 1 arena to avoid the pessimistic worst case during low-memory (even if this worsens performance overall).

The glibc malloc algorithm has a specific policy of not moving a thread to a different memory pool in the event of a transient failure (an mmap failure). That means any thread using a non-main arena will only temporarily switch to the main arena before retrying in its own arena. The inverse is not true: a thread using the main arena will, after sbrk and morecore fail, switch permanently to an mmap-based arena. It is far more likely that a failure in an mmap arena is temporary, and that a failure in the sbrk/morecore-based arena is permanent (usually because a mapping has interrupted the growth of the heap).

We understand this means that when the memory limit is reached, performance drops drastically: the thread that is out of memory is unable to find memory, all the retries fail, and only the eventual main-arena retry succeeds. This is not the behaviour we want for the allocator overall; it would ultimately be better if threads were globally rebalanced onto appropriate arenas to reduce both contention and fragmentation. Such an enhancement is outside the scope of work for RHEL 6, but could be considered for upstream.
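The main-arena vs. non-main-arena distinction above can be sketched as follows (glibc-specific behaviour assumed; the helper names are invented for illustration). The first allocation in a new thread normally triggers creation of an mmap-backed non-main arena, while the main thread allocates from the sbrk-grown main arena:

```c
#include <pthread.h>
#include <stdlib.h>

/* In a fresh thread, the first malloc is typically served from a newly
 * created non-main (mmap-backed) arena on glibc. */
static void *worker(void *arg)
{
    (void)arg;
    return malloc(64);
}

/* Allocate once from the calling thread's arena (the main, sbrk-based
 * arena when called from the main thread) and once from a new thread's
 * arena; returns 1 if both allocations succeed. */
static int allocate_from_two_arenas(void **main_p, void **thread_p)
{
    pthread_t t;

    *main_p = malloc(64);
    if (pthread_create(&t, NULL, worker, NULL) != 0)
        return 0;
    pthread_join(t, thread_p);
    return *main_p != NULL && *thread_p != NULL;
}
```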

Relevant allocator metrics for these cases of low-memory are internal fragmentation and external fragmentation, both of which are difficult to compute and require tracking exactly the committed RSS pages and internal allocator details. Efficiency as computed from either VSZ or RSS is not normally used to compare any algorithmic changes. Any rebalancing algorithm work we do upstream will look at fragmentation closely across a variety of workloads. We currently have an ongoing project to look at RSS usage, and we can add this requirement as part of the analysis (looking at rebalancing in low-memory) since it may also help yield lower RSS usage if the rebalancing can happen dynamically over time.

Thank you for testing the testfix packages which have allowed us to verify the problem you are seeing is what we expect. Unfortunately we don't expect that we will be able to make any measurable changes on the RHEL6 memory allocator in the area of algorithmic thread vs arena balancing. The risks of any changes to the allocator are simply too high. For example we did fix upstream bug 19048 (https://sourceware.org/bugzilla/show_bug.cgi?id=19048) in RHEL6, but this was a clear bug fix to the free list handling, and reduced contention without materially impacting the way allocations were handled.

Would you be OK with moving this bug to RHEL 7 so we can consider a fix there?

We would in all cases be working with upstream first.

Comment 16 DJ Delorie 2018-10-12 00:32:24 UTC
I built a version of glibc with a patch that allows you to specify the heap size, and put the source and binary RPMs here (same location, different files):

http://people.redhat.com/dj/bz1626127/

To use this, set the MALLOC_HEAP_MAX environment variable to the desired heap size, which must be a power of two.  Example:

$ export MALLOC_HEAP_MAX=67108864

The above is the default value on 64-bit systems.  Setting it to a smaller value may help threads use a limited address space more efficiently.
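Since the test build requires a power-of-two value, a candidate can be checked before exporting it. The helper below is illustrative, not part of the test build:

```c
#include <stddef.h>

/* Returns nonzero if v is a power of two: a power of two has exactly
 * one bit set, so clearing the lowest set bit (v & (v - 1)) must
 * yield zero. */
static int is_power_of_two(unsigned long v)
{
    return v != 0 && (v & (v - 1)) == 0;
}
```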

Note: this is just a test build, and in no way implies that we'll be able to get an official fix in RHEL 6.

Comment 17 Carlos O'Donell 2018-11-06 17:17:28 UTC
Is there any status update from the customer for this new test build we provided?

We would really like to know whether changing the heap sizes has an effect on their particular use case.

If the customer does not object I would like to move this bug to RHEL 7 or later because that's the only place where we will have the ability to make such structural changes to the allocator.

Comment 18 Paulo Andrade 2018-11-07 14:30:46 UTC
There is no extra feedback about the test build.
But the customer has already agreed that moving this to RHEL 7 makes more sense.

Comment 22 Carlos O'Donell 2020-02-24 14:28:40 UTC
We are going to track this issue in the upstream bugzilla for glibc.

I've filed the bug here:
[RFE] Improve malloc performance in low-memory scenarios for threaded arenas.
https://sourceware.org/bugzilla/show_bug.cgi?id=25594

This needs to be fixed upstream in a cohesive way that integrates with ongoing malloc improvements.

I'm going to close this issue as CLOSED/UPSTREAM and we will track upstream.

