Bug 1359877

Summary: unbalanced and poor utilization of memory in glibc arenas may cause memory bloat and subsequent OOM.
Product: Red Hat Enterprise Linux 6
Reporter: Sumeet Keswani <sumeet.keswani>
Component: glibc
Assignee: Florian Weimer <fweimer>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: qe-baseos-tools-bugs
Severity: high
Priority: high
Version: 6.8
CC: ashankar, cww, david.linden, fweimer, jkachuck, jscalf, mknutson, mnewsome, pfrankli, qguo, sumeet.keswani, trinh.dao
Target Milestone: rc
Target Release: 6.9
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2016-10-12 11:08:13 UTC
Type: Bug
Bug Blocks: 1269194, 1275350    
Attachments:
attached is a reproducer and sample output of a run (flags: none)

Description Sumeet Keswani 2016-07-25 15:19:30 UTC
Description of problem:

Several of our customers upgraded their glibc to get the latest fixes 
https://rhn.redhat.com/errata/RHBA-2016-0834.html

They have now started seeing memory leaks correlated with the date of the glibc upgrade.

I am opening this BZ as a placeholder for a highly likely memory leak in glibc-2.12-1.192. I will be adding more information shortly.

Version-Release number of selected component (if applicable):
glibc-2.12-1.192


How reproducible:
All four customers who upgraded have started seeing memory leaks.


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Carlos O'Donell 2016-07-25 15:34:24 UTC
Please note that glibc-2.12-1 is RHEL6, but this issue indicates RHEL7. Could you please clarify exactly which RHEL version you are seeing the issue with?

Comment 2 Sumeet Keswani 2016-07-25 15:41:14 UTC
(In reply to Carlos O'Donell from comment #1)
> Please note that glibc-2.12-1 is RHEL6, but this issue indicates RHEL7.
> Could you please clarify exactly which RHEL version you are seeing the issue
> with?

Oops, yes, it's RHEL6. I am unable to change the Product (it says I don't have access). Can you change the Product to RHEL 6, or should I close this and open a new BZ?

Yes, we are running experiments which already show the leak. 
We are not trying to find out where exactly it is, which would be more useful info to have in order to fix the leak.

Comment 3 Sumeet Keswani 2016-07-25 15:42:26 UTC
(In reply to Sumeet Keswani from comment #2)
> (In reply to Carlos O'Donell from comment #1)
> > Please note that glibc-2.12-1 is RHEL6, but this issue indicates RHEL7.
> > Could you please clarify exactly which RHEL version you are seeing the issue
> > with?
> 
> Oops, yes, it's RHEL6. I am unable to change the Product (it says I don't
> have access). Can you change the Product to RHEL 6, or should I close this
> and open a new BZ?
> 
> Yes, we are running experiments which already show the leak. 
> We are not trying to find out where exactly it is, which would be more
> useful info to have in order to fix the leak.

typo fixed....
   We are not trying -> We are now trying

Comment 4 Carlos O'Donell 2016-07-25 16:01:18 UTC
(In reply to Sumeet Keswani from comment #3)
> > Yes, we are running experiments which already show the leak. 
> > We are not trying to find out where exactly it is, which would be more
> > useful info to have in order to fix the leak.
> 
> typo fixed....
>    We are not trying -> We are now trying

Please also be aware that this is a public bug report. If you need this report to be confidential please apply the correct groups.

We look forward to reading your analysis and helping with the issue.

Comment 6 Sumeet Keswani 2016-07-29 15:26:19 UTC
Someone else found it too; here is a description:
http://codearcana.com/posts/2016/07/11/arena-leak-in-glibc.html


We now have a reproducer; I will be attaching it shortly.

Comment 7 Sumeet Keswani 2016-07-29 19:44:55 UTC
Attached is the reproducer that contains the files needed to reproduce the issue...

pthread_arena.exe.out is a run with stats.
pthread_arena.exe.out.tcl.xls contains those stats post-processed.

It's a 24-core machine with 96 GiB of memory,
so there are 192 arenas (8 per core * 24 cores, the 64-bit default).
The median size of an arena is about 16 MiB.

If you look at iteration 1 (called set 1), you'll see that all arenas are right around the median;
i.e., the 500 threads are spread pretty evenly across the 192 arenas.

But on the 2nd and subsequent iterations several arenas are 4+ times the median (and a few dropped to almost nothing).
What this says is that after the initial population of arenas, the algorithm for choosing an arena when one is needed is very poor,
over-subscribing many arenas and under-subscribing a few.
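
For reference, the following is a minimal sketch of this kind of workload (it is not the attached pthread_arena.exe reproducer; the thread count, allocation sizes, and set count are illustrative). Each set spawns 500 threads that allocate and free memory, then dumps per-arena statistics with malloc_info() so the arena sizes can be compared across sets. Build with something like: gcc -O2 -pthread arena_sketch.c

#define _GNU_SOURCE
#include <malloc.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define THREADS 500               /* illustrative, matching the description above */
#define ALLOCS_PER_THREAD 1000    /* illustrative */

static void *worker(void *arg)
{
    void *blocks[ALLOCS_PER_THREAD];
    int i;
    (void)arg;

    /* Mixed-size allocations, all freed before the thread exits, so any
       growth of an arena across sets reflects memory retained by glibc
       rather than live application data. */
    for (i = 0; i < ALLOCS_PER_THREAD; i++)
        blocks[i] = malloc(64 + (i % 64) * 256);
    for (i = 0; i < ALLOCS_PER_THREAD; i++)
        free(blocks[i]);
    return NULL;
}

int main(void)
{
    pthread_t tids[THREADS];
    int set, i;

    for (set = 1; set <= 5; set++) {
        for (i = 0; i < THREADS; i++)
            pthread_create(&tids[i], NULL, worker, NULL);
        for (i = 0; i < THREADS; i++)
            pthread_join(tids[i], NULL);

        /* malloc_info() emits one <heap> element per arena; comparing the
           per-heap sizes of set 1 against later sets shows how evenly (or
           unevenly) the threads were spread over the arenas. */
        printf("==== set %d ====\n", set);
        malloc_info(0, stdout);
    }
    return 0;
}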


Impact on the application:
Most database applications account for the memory they use.
The application _is_ freeing memory.  The glibc allocator is not doing a good job of reusing it.

Consequently the amount of memory _used_ by the application far exceeds what the application accounts for
(e.g. the application believes it is using 3 GiB but the RSS is actually 7 GiB, due to the poor utilization caused by this bug).
This can result in an OOM error/exception when the application goes to allocate more memory and there is none available on the machine.


If arena subscription could be kept balanced, that might solve the problem.
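
A minimal sketch of how that gap can be measured (illustrative only; the 3 GiB / 7 GiB figures above come from the real application, not from this program): compare the bytes the application believes are live against the VmRSS the kernel reports, and let malloc_stats() show how much of the difference is sitting inside the arenas.

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static long read_vmrss_kib(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kib = -1;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL)
        if (sscanf(line, "VmRSS: %ld kB", &kib) == 1)
            break;
    fclose(f);
    return kib;
}

int main(void)
{
    /* Stand-in for real work: a real application would track the bytes it
       believes are live through its own allocation wrappers. */
    long long accounted_bytes = 0;
    int i;

    for (i = 0; i < 1000; i++) {
        void *p = malloc(1 << 20);           /* 1 MiB */
        if (p == NULL)
            break;
        memset(p, 0, 1 << 20);
        accounted_bytes += 1 << 20;
        free(p);
        accounted_bytes -= 1 << 20;
    }

    printf("application-accounted: %lld KiB\n", accounted_bytes / 1024);
    printf("kernel-reported RSS:   %ld KiB\n", read_vmrss_kib());

    /* malloc_stats() prints per-arena "system bytes" vs. "in use bytes" to
       stderr, showing how much of the gap is held inside the arenas. */
    malloc_stats();
    return 0;
}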

Comment 8 Sumeet Keswani 2016-07-29 19:46:04 UTC
Created attachment 1185653 [details]
attached is a reproducer and sample output of a run

Comment 9 Sumeet Keswani 2016-07-29 20:14:32 UTC
I would like to report this bug upstream on the glibc bug tracker at https://sourceware.org/bugzilla;
we suspect it is present in the latest version of glibc too.

Do you know how to do that?  For some reason that tracker does not seem to be open to new users.

Comment 10 Carlos O'Donell 2016-07-29 21:01:17 UTC
(In reply to Sumeet Keswani from comment #9)
> I would like to report this bug upstream on the glibc bug tracker at
> https://sourceware.org/bugzilla;
> we suspect it is present in the latest version of glibc too.
> 
> Do you know how to do that?  For some reason that tracker does not seem to
> be open to new users.

Please try again. It was temporarily disabled due to spam issues.

Comment 12 Sumeet Keswani 2016-07-30 01:41:38 UTC
https://sourceware.org/bugzilla/show_bug.cgi?id=20425

Comment 14 Florian Weimer 2016-08-02 09:45:37 UTC
Comment on attachment 1185653 [details]
attached is a reproducer and sample output of a run

Note that the upstream bug https://sourceware.org/bugzilla/show_bug.cgi?id=20424 has the actual reproducer, and discussion continues there.

Comment 16 David Linden 2016-08-05 13:40:19 UTC
Off by one, the upstream bug is https://sourceware.org/bugzilla/show_bug.cgi?id=20425

Comment 22 Florian Weimer 2016-10-12 11:08:13 UTC
The upstream bug is in WAITING state, pending additional information from the reporter.

We cannot address this issue until we have a working reproducer.  As noted in the upstream bug, the observed change in behavior is likely due to a deliberate performance improvement which decreases arena contention, but increases the number of arenas.

Comment 23 Trinh Dao 2016-10-25 14:59:11 UTC
Sumeet, do you agree to let Red Hat close your bug due to insufficient data?

Comment 24 Sumeet Keswani 2017-09-05 01:58:39 UTC
Yes, the observed change in behavior is due to a deliberate performance improvement: glibc's use of arenas improves concurrent performance (by design).

But this leads to two problems.

First, the application's memory footprint increases significantly. This would have been fine if it were the only issue.

Second, within an arena, glibc does not return memory to the kernel as you would expect. It requests more memory even when significant parts of the arena are free (and could potentially be reused).

Together these two problems lead the application to OOM, compared to older versions of glibc which did not have this performance improvement. Hence it appears as a regression or bug to many of our users.

We have instrumented and profiled our application and tested it with different allocators (jemalloc in particular). For the exact same workload, memory use under jemalloc is flat, suggesting that glibc is simply not reusing memory when it should.

This suggests that the algorithm used to request more memory for a given arena is flawed: memory is requested from the OS even when free memory is already available. In conjunction with the performance improvement that creates more arenas, this results in OOMs.
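
For completeness, here is a sketch of two experiments that help separate the two problems, assuming a glibc build that honors the M_ARENA_MAX mallopt parameter / MALLOC_ARENA_MAX environment variable (the limit of 4 is illustrative, not a recommendation):

#include <malloc.h>
#include <stdio.h>

int main(void)
{
    /* Experiment 1: cap the number of arenas.  The same effect can be had
       without rebuilding by setting MALLOC_ARENA_MAX in the environment
       before starting the process. */
    if (mallopt(M_ARENA_MAX, 4) == 0)
        fprintf(stderr, "M_ARENA_MAX not honored by this glibc build\n");

    /* ... the normal application workload would run here ... */

    /* Experiment 2: once the workload has freed its memory, ask glibc to
       return free pages to the kernel and check whether RSS drops. */
    malloc_trim(0);
    malloc_stats();
    return 0;
}

Neither knob fixes the underlying reuse problem, but if capping the arenas or trimming changes the RSS, it points at the arena count or at in-arena retention respectively.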

I am not sure how to address this. Clearly the fix must come from upstream, but resolving it requires observing and profiling an application over days. A standalone reproducer was attempted, but upstream required that it be produced with the latest version of glibc, which most of our customers don't use.