This bug has been migrated to another issue-tracking site. It has been closed here and may no longer be monitored.

If you would like to receive updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you are a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you are not, please head to the "RHEL project" in Red Hat Jira and file new tickets there.

Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September, as per pre-agreed dates. Bugs against the components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED".

If you cannot log in to RH Jira, please consult article #7032570. Failing that, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user-management inquiry; the e-mail creates a ServiceNow ticket with Red Hat.

Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and have "MigratedToJIRA" set in "Keywords". The link to the successor Jira issue will be found under "Links", will have a little "two-footprint" icon next to it, and will direct you to the "RHEL project" in Red Hat Jira (issue links are of the form "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). The same link will be available in a blue banner at the top of the page informing you that the bug has been migrated.
Bug 2102320 - Request for comments/feedback: tcache and memory usage not shrinking
Summary: Request for comments/feedback: tcache and memory usage not shrinking
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: glibc
Version: 8.6
Hardware: All
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: glibc team
QA Contact: qe-baseos-tools-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-29 16:41 UTC by Paulo Andrade
Modified: 2023-09-11 17:13 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-11 17:13:28 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Attachments (Terms of Use)
glibc-2.28-sfdc03235605.patch (1.12 KB, patch)
2022-06-29 19:29 UTC, Paulo Andrade
no flags Details | Diff
glibc-2.28-sfdc03235605.patch (1.19 KB, patch)
2022-06-29 23:48 UTC, Paulo Andrade
no flags Details | Diff
glibc-2.28-sfdc03235605.patch (1.24 KB, patch)
2022-07-12 13:53 UTC, Paulo Andrade
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker   RHEL-3007 0 None Migrated None 2023-09-11 17:03:41 UTC
Red Hat Issue Tracker RHELPLAN-126644 0 None None None 2022-06-29 17:12:22 UTC
Sourceware 26969 0 P1 UNCONFIRMED A common malloc pattern can make memory not given back to OS 2022-06-29 16:46:36 UTC

Description Paulo Andrade 2022-06-29 16:41:56 UTC
When tcache is enabled, memory tends not to shrink very well; instead it
stays fragmented in long-running threads that go through several different
malloc allocation patterns.

  The closest similar issue I could find was
https://bugs.linaro.org/show_bug.cgi?id=3950
[LKFT: libhugetlbfs: heapshrink-2M-64 failed]

  While talking with the user experiencing the problem, the first idea
I had (just a heuristic) was to change this code in _int_free():

	if (tcache->counts[tc_idx] < mp_.tcache_count)
	  {
	    tcache_put (p, tc_idx);
	    return;
	  }

so that it swaps the 'p' argument with the oldest entry in the tcache
instead of returning. Any other local variables would need adjusting as
well, at least the 'size' variable.

Comment 1 Siddhesh Poyarekar 2022-06-29 16:56:08 UTC
In general there are a lot of patterns out there that will defeat any kind of caching in the malloc implementation, from tcache to bins to arenas.  It would be helpful to have a reproducer, an explanation of how that reproducer may represent a specific class of workloads and then how your proposed change will improve things for that class of workloads.  Could you please rephrase your problem in that context?

Comment 2 Siddhesh Poyarekar 2022-06-29 16:57:51 UTC
(In reply to Siddhesh Poyarekar from comment #1)
> of caching in the malloc implementation, from tcache to bins to arenas.  It
> would be helpful to have a reproducer, an explanation of how that reproducer
> may represent a specific class of workloads and then how your proposed
> change will improve things for that class of workloads...

... without making it significantly worse for other classes of workloads of course.

Comment 4 Paulo Andrade 2022-06-29 19:29:24 UTC
Created attachment 1893474 [details]
glibc-2.28-sfdc03235605.patch

I believe I am not fully understanding how tcache works, or whether my
idea can even work.

A test build with this patch crashes when loading glibc.

The problem should be in this hunk:

+	    /* Now swap the last tcache entry with the value we were
+	       free'ing. */
+	    p = mem2chunk (tmp);
+	    size = chunksize (p);

Can you please enlighten me :)

Comment 5 DJ Delorie 2022-06-29 19:46:00 UTC
A summary of how the tcache is stored can be found in the "Thread Local Cache" section of:
https://sourceware.org/glibc/wiki/MallocInternals

The existing tcache macros only act on the head of the linked list.  Your idea accesses chunks at the tail of that list, which is more complex.

Comment 6 Cliff Romash 2022-06-29 23:30:31 UTC
I'm the customer who opened the Red Hat ticket and posted a sample program that reproduces the issue.

I wrote a deliberately simple example to demonstrate our problem, but in our application testing the problem does not occur until we apply stress that forces high memory utilization. The tcache then makes this utilization permanent from the system's point of view.

We have a large application, developed over many years, with multiple long-lived threads which allocate large and small blocks. If we run at overload, we see rapid memory growth, but with RHEL 7 (glibc 2.17) the growth stabilizes, and when the load is stopped the RSS of the process shrinks. With RHEL 8 (glibc 2.28) the stabilization does not happen, and if we stop the load, the RSS never shrinks.

When I examine the heaps for the threads that were processing the load, I see several heaps per arena, with the heaps consisting of a large free block, followed by a small (below tcache max) block that is in the tcache, followed by top. Trimming during free does not happen because of the tcache'd block. Calling malloc_trim will free memory because of its more aggressive trimming strategy, but the trimming done from _int_free does not work.

The issue here, I think, is that the per-thread cache blocks are "in use" as far as the rest of the heap is concerned, so when an arena's top is small and sits above a per-thread cache block, the normal trimming can't happen.


Here's my reasoning:

The tcache'd blocks are in use as far as the normal heap code is concerned: they are not in fastbins, nor are they in bins.
The normal trim code in _int_free will not trim below the highest in-use block under top in an arena, so it is stopped by a tcache'd block near the top of the arena.

I'll note that the fastbins are implemented in a similar fashion and would have the same problem, except that there is a call to malloc_consolidate right before the trim code in _int_free. This call is conditioned on the size of the chunk being freed (after it is coalesced with those before and after it), as follows:
                    if ((unsigned long)(size) >= FASTBIN_CONSOLIDATION_THRESHOLD) {

described with this comment:  
/*
      If freeing a large space, consolidate possibly-surrounding
      chunks. Then, if the total unused topmost memory exceeds trim
      threshold, ask malloc_trim to reduce top.

      Unless max_fast is 0, we don't know if there are fastbins
      bordering top, so we cannot tell for sure whether threshold
      has been reached unless fastbins are consolidated.  But we
      don't want to consolidate on each free.  As a compromise,
      consolidation is performed if FASTBIN_CONSOLIDATION_THRESHOLD
      is reached.
    */

But there is nothing like this for the tcache, so a tcache'd chunk prevents the trim, since it won't be coalesced/consolidated with top.

If I (whose experience with glibc's malloc is limited to a few of my employer's applications) were suggesting a fix, I'd consider something executed just before the trimming code in _int_free that looks at the block just before ar->top and, if the block before that is free (and large?), attempts to remove the block from the tcache, mark it free, and coalesce it with the preceding free block. I'm not sure this is possible working back from ar->top, so it might be necessary to walk through the tcache to find such blocks, which is of course more expensive. All such fixes trade future malloc efficiency for current system memory usage. In our long-lived applications the memory usage is more critical, but I can easily envision many cases where that is not so.

Comment 8 Paulo Andrade 2022-06-29 23:32:30 UTC
  The bug in my test patch was that, when removing the (real) last entry
from tcache->entries[tc_idx], the next field of the entry before it was
not set to NULL.

  That caused a subsequent free to traverse all tcache entries and conclude
there was a double free.

  I am testing a new patch that also keeps a 'prev' pointer, in order to set
the next field to NULL.

  The patch would be simpler if mp_.tcache_count were always either zero or
larger than one; proper testing is needed to check that it also works when
mp_.tcache_count is one.

  The test build is finishing, and with default values the patch appears to
work during the glibc build.

  But it is just a heuristic idea, a somewhat simple patch that might prevent
the memory-not-shrinking pattern seen in the reports.

Comment 9 Paulo Andrade 2022-06-29 23:48:41 UTC
Created attachment 1893499 [details]
glibc-2.28-sfdc03235605.patch

There is probably a bug in the somewhat confusing case of 'prev' == 'tmp',
which would happen if 'mp_.tcache_count == 1', but this version finished a
glibc build.

I might also be missing something elsewhere, but basic testing of the
built glibc should show it.

Comment 10 DJ Delorie 2022-06-30 02:42:16 UTC
Note that in RHEL 8, the tcache can be disabled via tunables:

 Tunable glibc.malloc.tcache_count

 The maximum number of chunks of each size to cache. The default is 7.
 The upper limit is 65535.  If set to zero, the per-thread cache is effectively
 disabled.

Likewise fastbins, which may also affect this, can be disabled:

 Tunable glibc.malloc.mxfast

 One of the optimizations malloc uses is to maintain a series of ``fast
 bins'' that hold chunks up to a specific size.  The default and
 maximum size which may be held this way is 80 bytes on 32-bit systems
 or 160 bytes on 64-bit systems.  Applications which value size over
 speed may choose to reduce the size of requests which are serviced
 from fast bins with this tunable.  Note that the value specified
 includes malloc's internal overhead, which is normally the size of one
 pointer, so add 4 on 32-bit systems or 8 on 64-bit systems to the size
 passed to malloc for the largest bin size to enable.

These can be set in the environment like this:

export GLIBC_TUNABLES=glibc.malloc.tcache_count=0:glibc.malloc.mxfast=0

Comment 11 Cliff Romash 2022-06-30 03:32:57 UTC
DJ,

Thanks for the comments. We are using the tunable to work around this; I filed the bug because I believe this is not correct behavior. We've never seen the issue with fastbins, but that is because malloc_consolidate moves the blocks out of the fastbins, allowing a trim. That's why I think some implementation that moves blocks out of the tcache when we are about to try to trim is the way to improve this.

I'm going to try Paulo's fix as well as try to code my own fix tomorrow (it may take longer than that :)  )

Comment 12 DJ Delorie 2022-06-30 05:01:18 UTC
Note that, due to their nature, no one thread can access the other thread's tcaches.  This will make consolidating them trickier.

Comment 13 Paulo Andrade 2022-07-01 14:10:57 UTC
Maybe we would need an extra mallopt option to force the tcache to be
dropped.

For short-lived threads this would not be very useful.

Having different tunables, or adapting the current ones, so that the
tcache could use different values for different threads, would also be
overkill.

An option to somehow disable the tcache for a single thread would be
useful: maybe call tcache_thread_shutdown(), then reset the state and
let the tcache restart.

Or something wild, like free(0) being a hint to flush the tcache; likely
not a good idea... it would need something different.

Comment 14 Paulo Andrade 2022-07-12 13:53:50 UTC
Created attachment 1896379 [details]
glibc-2.28-sfdc03235605.patch

  A glibc package with the test patch, which attempts to avoid keeping stale
memory in the tcache, unfortunately does not appear to reduce memory usage.

  Probably the best approach is to have some way for a long-lived thread to
specifically ask for the tcache to be dropped, to give the tcache different
parameters for different threads, or maybe to just disable the tcache...

Comment 17 RHEL Program Management 2023-09-11 17:02:08 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 18 RHEL Program Management 2023-09-11 17:13:28 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.

