Bug 2102320
| Summary: | Request for comments/feedback: tcache and memory usage not shrinking |  |  |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Paulo Andrade <pandrade> |
| Component: | glibc | Assignee: | glibc team <glibc-bugzilla> |
| Status: | NEW --- | QA Contact: | qe-baseos-tools-bugs |
| Severity: | medium | Docs Contact: |  |
| Priority: | unspecified |  |  |
| Version: | 8.6 | CC: | aogburn, ashankar, codonell, dj, fweimer, mmillson, pfrankli, romash, sipoyare |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | --- |  |  |
| Hardware: | All |  |  |
| OS: | Linux |  |  |
| Whiteboard: |  |  |  |
| Fixed In Version: |  | Doc Type: | If docs needed, set a value |
| Doc Text: |  | Story Points: | --- |
| Clone Of: |  | Environment: |  |
| Last Closed: |  | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: |  |
| Verified Versions: |  | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |  |
| Cloudforms Team: | --- | Target Upstream Version: |  |
| Embargoed: |  |  |  |
| Attachments: | glibc-2.28-sfdc03235605.patch (three revisions, attached in the comments below) |  |  |
Description
Paulo Andrade
2022-06-29 16:41:56 UTC
In general there are a lot of patterns out there that will defeat any kind of caching in the malloc implementation, from tcache to bins to arenas. It would be helpful to have a reproducer, an explanation of how that reproducer may represent a specific class of workloads and then how your proposed change will improve things for that class of workloads. Could you please rephrase your problem in that context?

(In reply to Siddhesh Poyarekar from comment #1)
> of caching in the malloc implementation, from tcache to bins to arenas. It
> would be helpful to have a reproducer, an explanation of how that reproducer
> may represent a specific class of workloads and then how your proposed
> change will improve things for that class of workloads...

... without making it significantly worse for other classes of workloads of course.

Created attachment 1893474 [details]
glibc-2.28-sfdc03235605.patch
I believe I am not fully understanding how tcache works. Or if my idea
even works.
A test build with this patch crashes when loading glibc.
The problem should be in this hunk, where the tcache entry is converted
back to a chunk:

+	  /* Now swap the last tcache entry with the value we were
+	     free'ing. */
+	  p = mem2chunk (tmp);
+	  size = chunksize (p);

Can you please enlighten me :)
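
For background, in the 2.28-era sources the tcache is, roughly, a per-thread array of singly linked lists, and the helpers that maintain it only ever touch the head of each list. The following is a simplified paraphrase of those internals (asserts and details omitted), not an exact copy:

    typedef struct tcache_entry
    {
      struct tcache_entry *next;        /* lives inside the freed user memory */
    } tcache_entry;

    typedef struct tcache_perthread_struct
    {
      char counts[TCACHE_MAX_BINS];     /* chunks currently cached per bin */
      tcache_entry *entries[TCACHE_MAX_BINS];
    } tcache_perthread_struct;

    static __thread tcache_perthread_struct *tcache = NULL;

    /* Both helpers operate only on the head of a list, which is why
       reaching the tail (as the patch above tries to) needs a walk.  */
    static __always_inline void
    tcache_put (mchunkptr chunk, size_t tc_idx)
    {
      tcache_entry *e = (tcache_entry *) chunk2mem (chunk);
      e->next = tcache->entries[tc_idx];
      tcache->entries[tc_idx] = e;
      ++(tcache->counts[tc_idx]);
    }

    static __always_inline void *
    tcache_get (size_t tc_idx)
    {
      tcache_entry *e = tcache->entries[tc_idx];
      tcache->entries[tc_idx] = e->next;
      --(tcache->counts[tc_idx]);
      return (void *) e;
    }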
A summary of how the tcache is stored can be found in the "Thread Local Cache" section of: https://sourceware.org/glibc/wiki/MallocInternals

The existing tcache macros only act on the head of the linked list. Your idea accesses chunks at the tail of that list, which is more complex.

I'm the customer who opened the Red Hat ticket and posted a sample program that reproduces the issue.
I explicitly wrote a very simple example to demonstrate our problem, but in our application testing the problem does not occur until we apply stress that forces high memory utilization. The tcache serves to make this utilization permanent from the system's point of view.
We have a large application, developed over many years, with multiple long-lived threads which allocate large and small blocks. If we run at overload, we see rapid memory growth, but with RHEL 7 (glibc 2.17) the growth stabilizes, and when the load is stopped the RSS for the process shrinks. With RHEL 8 (glibc 2.28) the stabilization does not happen, and if we stop the load, the RSS never shrinks.
When I examine the heaps for the threads that were processing the load, I see several heaps per arena, with the heaps consisting of a large free block, followed by a small (below tcache max) block that is in the tcache, followed by top. Trimming during free does not happen because of the tcache'd block. Calling malloc_trim() will free memory because of its more aggressive trimming strategy, but the trimming done from _int_free does not work.
The issue arising here, I think, is that the per-thread cache blocks are "in use" as far as the rest of the heap is concerned, so when an arena's heap top is small and sits above a per-thread cache block, the normal trimming can't happen.
Here's my reasoning:
The tcache'd blocks are in use as far as the normal heap code is concerned. They are not in fastbins, nor are they in bins.
The normal trim code in _int_free will not trim below the highest in-use block below top in an arena, so trimming is stopped by a tcache'd block near the top of the arena.
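
The pattern can be sketched with a deliberately artificial, single-threaded illustration (hypothetical, not the reporter's actual sample program; mallopt, malloc_trim and /proc/self/status are standard glibc/Linux facilities, and the sizes are arbitrary):

    #include <malloc.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static void
    print_rss (const char *when)
    {
      char line[256];
      FILE *f = fopen ("/proc/self/status", "r");
      if (f == NULL)
        return;
      while (fgets (line, sizeof line, f))
        if (strncmp (line, "VmRSS:", 6) == 0)
          printf ("%s %s", when, line);
      fclose (f);
    }

    int
    main (void)
    {
      mallopt (M_MMAP_MAX, 0);            /* keep large blocks on the heap, not mmap */

      char *big = malloc (4 * 1024 * 1024);
      memset (big, 1, 4 * 1024 * 1024);   /* touch it so RSS really grows */
      char *small = malloc (64);          /* tcache-sized chunk right below top */

      free (big);                         /* 4 MiB is free, but the still "in use"
                                             small chunk sits between it and top,
                                             so _int_free cannot trim */
      free (small);                       /* goes straight into the tcache and
                                             keeps looking "in use" to the heap */
      print_rss ("after free:       ");   /* RSS stays ~4 MiB above baseline */

      malloc_trim (0);                    /* the more aggressive per-chunk pass
                                             does release the space */
      print_rss ("after malloc_trim:");
      return 0;
    }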
I'll note that the fastbins are implemented in a similar fashion and would have the same problem, except that there is a call to malloc_consolidate right before the trim code in _int_free. This call is conditioned on the size of the chunk being freed (after it is coalesced with those before and after it) as follows:
if ((unsigned long)(size) >= FASTBIN_CONSOLIDATION_THRESHOLD) {
described with this comment:
/*
If freeing a large space, consolidate possibly-surrounding
chunks. Then, if the total unused topmost memory exceeds trim
threshold, ask malloc_trim to reduce top.
Unless max_fast is 0, we don't know if there are fastbins
bordering top, so we cannot tell for sure whether threshold
has been reached unless fastbins are consolidated. But we
don't want to consolidate on each free. As a compromise,
consolidation is performed if FASTBIN_CONSOLIDATION_THRESHOLD
is reached.
*/
But there is nothing like this for tcache, so a tcache'd chunk prevents the trim, since it won't be coalesced/consolidated with top.
If I (whose experience with glibc's malloc is limited to several applications from my employer) were suggesting a fix, I'd consider something executed just before the trimming code in _int_free that looks at the block just below ar->top and, if the block below that is free (and large?), attempts to remove the in-between block from the tcache, mark it free, and coalesce it with the preceding free block. I'm not sure this is possible working back from ar->top, so it might be necessary to walk through the tcache to find such blocks, which is of course more expensive.
All such fixes trade future malloc efficiency for current system memory usage. In our long-lived applications the memory usage is more critical, but I can easily envision many cases where that is not so.
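
A rough sketch of the shape of that idea, written against 2.28-era internal names (mstate, mchunkptr, tcache->entries, chunk_at_offset). This is untested pseudocode, not a patch, and it deliberately leaves open how the detached chunk would be fed back through the free/coalesce path without being immediately re-cached:

    /* Detach from the current thread's tcache a cached chunk that directly
       borders av->top, so the space below it can later coalesce with top.  */
    static mchunkptr
    tcache_detach_chunk_below_top (mstate av)
    {
      if (tcache == NULL)
        return NULL;

      for (size_t tc_idx = 0; tc_idx < TCACHE_MAX_BINS; ++tc_idx)
        for (tcache_entry **ep = &tcache->entries[tc_idx]; *ep != NULL;
             ep = &(*ep)->next)
          {
            mchunkptr p = mem2chunk (*ep);
            /* Is this cached chunk immediately followed by the arena's top?  */
            if (chunk_at_offset (p, chunksize (p)) == av->top)
              {
                *ep = (*ep)->next;           /* unlink it from the tcache list */
                --(tcache->counts[tc_idx]);
                return p;                    /* caller frees/coalesces it */
              }
          }
      return NULL;
    }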
The bug in my test patch was not setting to NULL the next field of the entry in tcache->entries[tc_idx] just before the (real) last one being removed. That caused a subsequent free to traverse all tcache entries and think there was a double free. I am testing a new patch that also keeps a 'prev' pointer, in order to set that next field to NULL. The patch would be simpler if mp_.tcache_count were guaranteed to be either zero or larger than one; it needs proper testing to check whether it works when mp_.tcache_count is one. The test build is finishing, and with default values it appears to work during the glibc build. But it is just a heuristic idea, as a somewhat simple patch, of what might prevent the pattern seen in the reports of memory usage not shrinking.

Created attachment 1893499 [details]
glibc-2.28-sfdc03235605.patch
Probably there is a bug in the somewhat confusing case of 'prev' == 'tmp',
which would happen if 'mp_.tcache_count == 1', but this version finished a
glibc build.
I might also be missing something elsewhere; basic testing of the built
glibc should show it.
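
In list terms, the tail removal discussed in the last two comments looks roughly like this generic sketch (illustrative only, not the attached patch); the crash described above corresponds to skipping the 'prev->next = NULL' step:

    typedef struct tcache_entry
    {
      struct tcache_entry *next;
    } tcache_entry;

    /* Remove and return the last entry of a singly linked list, keeping a
       'prev' pointer while walking.  In the real tcache this walk would be
       bounded by tcache->counts[tc_idx].  */
    static tcache_entry *
    pop_tail (tcache_entry **head)
    {
      tcache_entry *prev = NULL;
      tcache_entry *tmp = *head;

      if (tmp == NULL)
        return NULL;
      while (tmp->next != NULL)     /* walk to the last entry */
        {
          prev = tmp;
          tmp = tmp->next;
        }
      if (prev == NULL)
        *head = NULL;               /* list had a single entry */
      else
        prev->next = NULL;          /* the step the first patch missed */
      return tmp;
    }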
Note that in RHEL 8, the tcache can be disabled via tunables:
Tunable glibc.malloc.tcache_count
The maximum number of chunks of each size to cache. The default is 7.
The upper limit is 65535. If set to zero, the per-thread cache is effectively
disabled.
Likewise fastbins, which may also affect this, can be disabled:
Tunable glibc.malloc.mxfast
One of the optimizations malloc uses is to maintain a series of "fast
bins" that hold chunks up to a specific size. The default and
maximum size which may be held this way is 80 bytes on 32-bit systems
or 160 bytes on 64-bit systems. Applications which value size over
speed may choose to reduce the size of requests which are serviced
from fast bins with this tunable. Note that the value specified
includes malloc's internal overhead, which is normally the size of one
pointer, so add 4 on 32-bit systems or 8 on 64-bit systems to the size
passed to malloc for the largest bin size to enable.
These can be set in the environment like this:
export GLIBC_TUNABLES=glibc.malloc.tcache_count=0:glibc.malloc.mxfast=0
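
Where changing the environment at process start is impractical, the fastbin limit can also be adjusted at runtime through the long-standing mallopt() interface; note that the tcache has no equivalent mallopt knob. A minimal sketch:

    #include <malloc.h>

    int
    main (void)
    {
      /* Service no requests from fastbins in this process.  There is no
         comparable runtime switch for the per-thread cache, only the
         glibc.malloc.tcache_* tunables read at startup.  */
      mallopt (M_MXFAST, 0);

      /* ... application code ... */
      return 0;
    }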
DJ, thanks for the comments. We are using the tunable to work around this. I filed the bug because I believe this is not correct behavior. We've never seen the issue with fastbins, but that is because malloc_consolidate moves the blocks out of fastbins, allowing a trim. That's why I think some sort of implementation that moves blocks out of the tcache when we are going to try to trim is the way to improve this. I'm going to try Paulo's fix as well as try to code my own fix tomorrow (it may take longer than that :) )

Note that, due to their nature, no one thread can access another thread's tcache. This will make consolidating them trickier.

Maybe we would need an extra mallopt option to force a drop of the tcache. For short-lived threads this would not be very useful. Having different tunables, or an adaptation of the current ones, so that the tcache has different values for different threads, would also be overkill. An option to somehow disable the tcache for a thread would be useful; maybe somehow call tcache_thread_shutdown(), then reset state and let the tcache restart. Or something wild, like a free(0) being a hint to flush the tcache; likely not a good idea... it would need something different.

Created attachment 1896379 [details]
glibc-2.28-sfdc03235605.patch
A glibc package with the test patch, which attempts not to keep stale
memory in the tcache, unfortunately does not appear to reduce memory usage.
Probably the best approach is to have some way for a long-living thread
to specifically ask for its tcache to be dropped, or to let the tcache use
different parameters for different threads, or maybe just to disable the
tcache...
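
Until something along those lines exists, one workaround consistent with the observations above is for a long-lived worker thread to call the public malloc_trim() when it goes idle; a minimal sketch (the hook name is hypothetical):

    #include <malloc.h>

    /* Hypothetical hook a long-lived worker thread could call after a burst
       of work.  malloc_trim(0) walks the heaps and releases free regions back
       to the OS, which sidesteps the trimming that _int_free skips when a
       tcache'd chunk borders top (as described in the comments above).  */
    static void
    worker_idle_hook (void)
    {
      if (malloc_trim (0) != 0)
        {
          /* Some memory was returned to the system.  */
        }
    }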