Bug 1359877 - unbalanced and poor utilization of memory in glibc arenas may cause memory bloat and subsequent OOM.
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: glibc
Version: 6.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 6.9
Assignee: Florian Weimer
QA Contact: qe-baseos-tools-bugs
URL:
Whiteboard:
Depends On:
Blocks: 1269194 1275350
 
Reported: 2016-07-25 15:19 UTC by Sumeet Keswani
Modified: 2019-12-16 06:11 UTC (History)
12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-10-12 11:08:13 UTC
Target Upstream Version:


Attachments
attached is a reproducer and sample output of a run (7.51 MB, application/x-tar)
2016-07-29 19:46 UTC, Sumeet Keswani


Links
Sourceware 20425 (last updated 2016-07-30 01:41:37 UTC)

Description Sumeet Keswani 2016-07-25 15:19:30 UTC
Description of problem:

Several of our customers upgraded their glibc to get the latest fixes 
https://rhn.redhat.com/errata/RHBA-2016-0834.html

They have now started seeing memory leaks correlated with the time/date of the glibc upgrade.

I am opening this BZ as a placeholder for a highly likely memory leak in glibc-2.12-1.192. I will be adding more information shortly.

Version-Release number of selected component (if applicable):
glibc-2.12-1.192


How reproducible:
All four customers who upgraded have started seeing memory leaks.


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Carlos O'Donell 2016-07-25 15:34:24 UTC
Please note that glibc-2.12-1 is RHEL6, but this issue indicates RHEL7. Could you please clarify exactly which RHEL version you are seeing the issue with?

Comment 2 Sumeet Keswani 2016-07-25 15:41:14 UTC
(In reply to Carlos O'Donell from comment #1)
> Please note that glibc-2.12-1 is RHEL6, but this issue indicates RHEL7.
> Could you please clarify exactly which RHEL version you are seeing the issue
> with?

Oops, yes, it's RHEL6. I am unable to change the Product (it says I don't have access). Can you change the Product to RHEL 6, or should I close this and open a new BZ?

Yes, we are running experiments which already show the leak. 
We are not trying to find out where exactly it is, which would be more useful info to have in order to fix the leak.

Comment 3 Sumeet Keswani 2016-07-25 15:42:26 UTC
(In reply to Sumeet Keswani from comment #2)
> (In reply to Carlos O'Donell from comment #1)
> > Please note that glibc-2.12-1 is RHEL6, but this issue indicates RHEL7.
> > Could you please clarify exactly which RHEL version you are seeing the issue
> > with?
> 
> oops, yes its RHEL6. I am unable to change the Product. (says i dont have
> access). Can you change the Product to RHEL 6 or should I close this and
> open a new BZ.
> 
> Yes, we are running experiments which already show the leak. 
> We are not trying to find out where exactly it is, which would be more
> useful info to have in order to fix the leak.

typo fixed....
   We are not trying -> We are now trying

Comment 4 Carlos O'Donell 2016-07-25 16:01:18 UTC
(In reply to Sumeet Keswani from comment #3)
> > Yes, we are running experiments which already show the leak. 
> > We are not trying to find out where exactly it is, which would be more
> > useful info to have in order to fix the leak.
> 
> typo fixed....
>    We are not trying -> We are now trying

Please also be aware that this is a public bug report. If you need this report to be confidential please apply the correct groups.

We look forward to reading your analysis and helping with the issue.

Comment 6 Sumeet Keswani 2016-07-29 15:26:19 UTC
Someone else found it too; here is a description:
http://codearcana.com/posts/2016/07/11/arena-leak-in-glibc.html


We now have a reproducer; I will be attaching it shortly.

Comment 7 Sumeet Keswani 2016-07-29 19:44:55 UTC
Attached is the reproducer that contains the files needed to reproduce the issue...

pthread_arena.exe.out is a run with stats.
pthread_arena.exe.out.tcl.xls is those stats post-processed.

It's a 24 core machine with 96GiB of memory. 
Thus 192 arenas (8*24). 
The median size of an arena is about 16M. 

If you look at iteration 1 (called set 1), you'll see that all arenas are right around the median. 
I.e., the 500 threads pretty much balanced on the 192 arenas. 

But on the 2nd and subsequent iterations there are several arenas that are 4+ times the median (and a few that dropped to almost nothing). 
What this says is that after the initial population of arenas, the algorithm for choosing an arena when one is needed is very poor, 
causing over-subscription of many arenas and under-subscription of a few. 


Impact on the application:
Most database applications account for the memory they use.
The application _is_ freeing memory.  The glibc allocator is not doing a good job of reusing it.

Consequently, the amount of memory _used_ by the application far exceeds what the application accounts for
(e.g. the application believes it uses 3G, but the RSS is actually 7G, due to the poor utilization caused by this bug).
This can result in an OOM error/exception when the application goes to allocate more memory and there isn't any available on the machine.


If subscription could become balanced, that might solve the problem.

Comment 8 Sumeet Keswani 2016-07-29 19:46:04 UTC
Created attachment 1185653 [details]
attached is a reproducer and sample output of a run

Comment 9 Sumeet Keswani 2016-07-29 20:14:32 UTC
I would like to report this bug upstream at https://sourceware.org/bugzilla; 
we suspect the issue is present in the latest version of glibc too.

Do you know how to do that? For some reason that site does not seem to be open to new users.

Comment 10 Carlos O'Donell 2016-07-29 21:01:17 UTC
(In reply to Sumeet Keswani from comment #9)
> I would like to report this bug to the glibc forums at
> (https://sourceware.org/bugzilla) 
> we suspect this is there in the latest version of glibc too.
> 
> Do you know how to do that.  for some reason that forum does not seem to be
> open to new user.

Please try again. It was temporarily disabled due to spam issues.

Comment 12 Sumeet Keswani 2016-07-30 01:41:38 UTC
https://sourceware.org/bugzilla/show_bug.cgi?id=20425

Comment 14 Florian Weimer 2016-08-02 09:45:37 UTC
Comment on attachment 1185653 [details]
attached is a reproducer and sample output of a run

Note that the upstream bug https://sourceware.org/bugzilla/show_bug.cgi?id=20424 has the actual reproducer, and discussion continues there.

Comment 16 David Linden 2016-08-05 13:40:19 UTC
Off by one, the upstream bug is https://sourceware.org/bugzilla/show_bug.cgi?id=20425

Comment 22 Florian Weimer 2016-10-12 11:08:13 UTC
The upstream bug is in WAITING state, pending additional information supplied from the reporter.

We cannot address this issue until we have a working reproducer.  As noted in the upstream bug, the observed change in behavior is likely due to a deliberate performance improvement which decreases arena contention, but increases the number of arenas.
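For processes hit by this kind of arena bloat, a commonly used mitigation (it limits the footprint, it does not fix the imbalance reported here) is to cap the number of arenas with the glibc environment tunable MALLOC_ARENA_MAX. A sketch, with `sleep 1` standing in for the real application binary:

```shell
# MALLOC_ARENA_MAX caps how many malloc arenas glibc will create for
# this process, trading some lock contention for a smaller footprint.
# "sleep 1" is a placeholder for the actual application.
MALLOC_ARENA_MAX=2 sleep 1
```

The right cap is workload-dependent: too low reintroduces the arena contention the per-thread-arena design was meant to avoid.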

Comment 23 Trinh Dao 2016-10-25 14:59:11 UTC
Sumeet, do you agree to let Red Hat close your bug due to insufficient data?

Comment 24 Sumeet Keswani 2017-09-05 01:58:39 UTC
Yes, the observed change in behavior is due to a deliberate performance improvement. In glibc, the use of arenas improves concurrent performance (by design). 

But this leads to two problems. 

First, the application memory footprint significantly increases. This would have been fine if it were the only issue.

Second, within an arena, glibc does not return memory to the kernel as you would expect. It requests more memory even when significant parts of the arena are free (and could potentially be reused).

Together, these two problems lead the application to OOM, compared to older versions of glibc that did not have this performance improvement. Hence it appears as a regression or bug to many of our users.

We have instrumented and profiled our application and tested it with different allocators (jemalloc in particular). For the exact same workload under jemalloc, memory use is flat, suggesting that glibc is simply not reusing memory when it should.

This suggests that the algorithm used to request more memory for a given arena is flawed: memory is requested from the OS even when free memory is available. This, in conjunction with the performance improvement that creates more arenas, results in OOMs.

I am not sure how to address this. Clearly the fix must come from upstream, but resolving it requires observing and profiling an application over days. A standalone reproducer was attempted, but upstream required that it be produced with the latest version of glibc, which most of our customers don't use.

