Bug 1359877
Summary: Unbalanced and poor utilization of memory in glibc arenas may cause memory bloat and subsequent OOM
Product: Red Hat Enterprise Linux 6
Component: glibc
Version: 6.8
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: high
Reporter: Sumeet Keswani <sumeet.keswani>
Assignee: Florian Weimer <fweimer>
QA Contact: qe-baseos-tools-bugs
CC: ashankar, cww, david.linden, fweimer, jkachuck, jscalf, mknutson, mnewsome, pfrankli, qguo, sumeet.keswani, trinh.dao
Target Milestone: rc
Target Release: 6.9
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-10-12 11:08:13 UTC
Bug Blocks: 1269194, 1275350
Description (Sumeet Keswani, 2016-07-25 15:19:30 UTC)
Comment 1 (Carlos O'Donell):

Please note that glibc-2.12-1 is RHEL 6, but this issue indicates RHEL 7. Could you please clarify exactly which RHEL version you are seeing the issue with?

Comment 2 (Sumeet Keswani):

Oops, yes, it is RHEL 6. I am unable to change the Product (it says I don't have access). Can you change the Product to RHEL 6, or should I close this and open a new BZ?

Yes, we are running experiments which already show the leak. We are now trying to find out where exactly it is, which would be more useful information to have in order to fix the leak.

Comment 3 (Sumeet Keswani):

Typo fixed in comment 2: "We are not trying" should read "We are now trying".

Comment 4 (Carlos O'Donell):

Please also be aware that this is a public bug report. If you need this report to be confidential, please apply the correct groups. We look forward to reading your analysis and helping with the issue.

Comment 5 (Sumeet Keswani):

Someone else found it too; here is a description:
http://codearcana.com/posts/2016/07/11/arena-leak-in-glibc.html

Comment 6 (Sumeet Keswani):

We now have a reproducer; I will be attaching it shortly.

Comment 7 (Sumeet Keswani):

Attached is the reproducer that contains the files needed to reproduce the issue. pthread_arena.exe.out is a run with stats; pthread_arena.exe.out.tcl.xls is those stats post-processed.

It is a 24-core machine with 96 GiB of memory, and thus 192 arenas (8 * 24). The median size of an arena is about 16 MB. If you look at iteration 1 (called set 1), you will see that all arenas are right around the median; i.e., the 500 threads balanced fairly evenly across the 192 arenas. But on the second and subsequent iterations, several arenas grow to four or more times the median, and a few drop to almost nothing. What this says is that after the initial population of arenas, the algorithm that chooses an arena when one is needed is very poor, causing over-subscription of many arenas and under-subscription of a few.

Impact on the application: most database applications account for the memory they use. The application _is_ freeing memory; the glibc allocator is not doing a good job of reusing it. Consequently, the amount of memory _used_ by the application far exceeds what the application accounts for (i.e., the application believes it uses 3 GB but the RSS is actually 7 GB, due to the poor utilization resulting from this bug). This can produce an OOM error/exception when the application goes to allocate more memory and there is none available on the machine. If subscription could become balanced, that might solve the problem.

Comment 8 (Sumeet Keswani):

Created attachment 1185653 [details]: a reproducer and sample output of a run.
Comment 9 (Sumeet Keswani):

I would like to report this bug to the upstream glibc Bugzilla (https://sourceware.org/bugzilla); we suspect it is present in the latest version of glibc too. Do you know how to do that? For some reason that tracker does not seem to be open to new users.

Comment 10:

Please try again. It was temporarily disabled due to spam issues.

Comment 11 (on attachment 1185653 [details]):

Note that the upstream bug https://sourceware.org/bugzilla/show_bug.cgi?id=20424 has the actual reproducer, and discussion continues there.

Comment 12:

Off by one: the upstream bug is https://sourceware.org/bugzilla/show_bug.cgi?id=20425.

Comment 13:

The upstream bug is in WAITING state, pending additional information from the reporter. We cannot address this issue until we have a working reproducer. As noted in the upstream bug, the observed change in behavior is likely due to a deliberate performance improvement which decreases arena contention but increases the number of arenas.

Comment 14:

Sumeet, do you agree to let Red Hat close your bug due to insufficient data?

Comment 15 (Sumeet Keswani):

Yes, the observed change in behavior is due to a deliberate performance improvement. In glibc, the use of arenas improves concurrent performance (by design). But this leads to two problems. First, the application's memory footprint increases significantly; this would have been tolerable if it were the only issue. Second, within an arena, glibc does not return memory to the kernel as you would expect: it requests more memory even when significant parts of the arena are free (and could potentially be reused). Together, these two behaviors lead the application to OOM, compared to older versions of glibc that did not have this performance improvement. Hence it appears as a regression or bug to many of our users.

We have instrumented and profiled our application and tested it with different allocators (jemalloc in particular). For the exact same workload, memory use under jemalloc is flat, suggesting that something in glibc is simply not reusing memory when it should. This suggests that the algorithm used to request more memory for a given arena is flawed: memory is requested from the OS even when free memory is available. This, in conjunction with the performance improvement that creates more arenas, results in OOMs.

I am not sure how to address this. Clearly the fix must come from upstream, but resolving it requires observing and profiling an application over days. A standalone reproducer was attempted, but glibc upstream required that it be produced with the latest version of glibc, which most of our customers do not use.