Bug 799327

Summary: MALLOC_ARENA_MAX=1 does not work in RHEL 6.2
Product: Red Hat Enterprise Linux 6
Component: glibc
Version: 6.2
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Peter Klotz <peter.klotz>
Assignee: Jeff Law <law>
QA Contact: qe-baseos-tools-bugs
CC: fweimer, mfranc
Target Milestone: rc
Doc Type: Bug Fix
Last Closed: 2012-03-02 16:21:55 UTC
Bug Blocks: 800240
Attachments:
Test program copied from glibc bug report #11261 (http://sourceware.org/bugzilla/show_bug.cgi?id=11261)

Description Peter Klotz 2012-03-02 13:56:56 UTC
Created attachment 567069 [details]
Test program copied from glibc bug report #11261 (http://sourceware.org/bugzilla/show_bug.cgi?id=11261)

Description of problem:

Our company develops a medical image archive solution involving several multithreaded x86_64 server processes.

We recently discovered that under high load situations memory usage (RSS) increases massively and never returns to its original value.

After intensive testing including tools like valgrind (memcheck, helgrind, drd) we have ruled out a memory leak.

Interestingly, if we use MALLOC_CHECK_=1, our servers behave perfectly normally and do not show this behavior. RSS returns to its original value even after high load.

I am not very happy running production servers with a debugging option like MALLOC_CHECK_. Under RHEL 5 this is not even an option, since every program started with MALLOC_CHECK_ set initially writes the message "malloc: using debugging hooks", which breaks our system.

According to malloc_stats(), MALLOC_CHECK_ reduces the number of arenas to 1.

If I understand it correctly, MALLOC_ARENA_MAX=1 would be the correct setting for production environments to achieve the same effect. However, setting MALLOC_ARENA_MAX=1 still leads to malloc using several arenas.

Version-Release number of selected component (if applicable):
glibc-2.12-1.47.el6.x86_64
RHEL 6.2

How reproducible:
Always. 
I borrowed a test program from glibc bug #11261 (http://sourceware.org/bugzilla/show_bug.cgi?id=11261)
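The attachment is not reproduced here; below is only a minimal sketch in the same spirit (thread count, allocation size, and file name are illustrative, not the actual attachment), compiled like the attachment with "gcc -pthread":

    /* Hypothetical malloc_arena_sketch.c: several threads malloc/free
       concurrently so that contention on the main arena makes glibc create
       additional arenas; malloc_stats() then prints one "Arena N:" block
       per arena on stderr.  With MALLOC_ARENA_MAX=1 honored, it stays at 1. */
    #include <malloc.h>
    #include <pthread.h>
    #include <stdlib.h>

    #define THREADS    8
    #define ITERATIONS 100000

    static void *worker(void *arg)
    {
        int i;
        (void) arg;
        for (i = 0; i < ITERATIONS; i++)
            free(malloc(4096));
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[THREADS];
        int i;

        for (i = 0; i < THREADS; i++)
            pthread_create(&tids[i], NULL, worker, NULL);
        for (i = 0; i < THREADS; i++)
            pthread_join(tids[i], NULL);

        malloc_stats();
        return 0;
    }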

Steps to Reproduce:
1. export MALLOC_ARENA_MAX=1
2. gcc -pthread malloc_arena_max.c -o malloc_arena_max
3. ./malloc_arena_max
4. On a 4 core machine the program uses 3 malloc arenas
5. ./malloc_arena_max -x
6. On a 4 core machine the program uses 11 malloc arenas

Actual results:
A program started with "MALLOC_ARENA_MAX=1" uses several malloc arenas.

Expected results:
A program started with "MALLOC_ARENA_MAX=1" uses only one malloc arena.

Additional info:
It seems that our production servers behave a lot like the attached test program, especially when started with option "-x".

By the way, in Arch Linux (glibc 2.15) MALLOC_ARENA_MAX=1 works as expected.

Comment 2 Jeff Law 2012-03-02 16:21:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0058.html

Comment 3 Peter Klotz 2012-03-05 12:04:02 UTC
Thanks for the information about the errata; I was not aware of it. Will there be such an erratum for RHEL 5, too? The bug is also present there.

After using the updated glibc, MALLOC_ARENA_MAX works as expected and can no longer be exceeded.

However, I am still observing strange behavior. Performing a well-defined test against one of our servers shows massive RSS memory usage. When the process is run with MALLOC_CHECK_=3, no such behavior occurs.

No environment variable set (4 CPU cores -> 32 arenas)
   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 After startup:
 11334 pkl       20   0  594m  10m 8420 S  0.3  0.1   0:00.70 InboundServer.c
 After test run:
 11334 pkl       20   0 2233m 481m 9100 S  0.3  6.1   0:25.23 InboundServer.c

export MALLOC_ARENA_MAX=1 (one arena due to errata RPM)
   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 After startup:
 11472 pkl       20   0  208m 9.8m 8136 S  0.8  0.1   0:00.07 InboundServer.c
 After test run:
 11472 pkl       20   0 3481m 3.2g 9120 S  0.3 41.2   0:57.83 InboundServer.c

export MALLOC_CHECK_=3 (always a single arena)
   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 After startup:
 11714 pkl       20   0  209m 9.9m 8128 S  0.3  0.1   0:00.06 InboundServer.c
 After test run:
 11714 pkl       20   0  258m  19m 8872 S  0.0  0.3   0:33.15 InboundServer.c

The test run with "MALLOC_CHECK_=3" is perfectly reproducible. The other two cases lead to different RSS values in each run.

This makes me suspect that there may be another bug hidden somewhere in glibc that is only triggered if MALLOC_CHECK_ is not used.

Comment 4 Jeff Law 2012-03-06 05:15:30 UTC
Until recently there hasn't been any customer interest in this issue for Red Hat Enterprise Linux 5.  I can certainly put it on the list of issues to review/prioritize for the next release of Red Hat Enterprise Linux 5.

MALLOC_CHECK_=3 uses a completely different allocator that avoids most of the optimizations for the sake of providing an environment that makes debugging malloc problems easier.

Setting M_TRIM_THRESHOLD to zero and rerunning your tests would tell us if it's a problem with free memory at the top of heap not being returned promptly to the system.

Setting M_MMAP_THRESHOLD to a lower value will force more allocations (anything larger than M_MMAP_THRESHOLD) to be made via mmap.  This can often help with fragmentation of the heap, as mmapped areas can always be returned, while only the free memory between the last live object and the top of the heap can be returned when allocating via the heap.

Setting M_MMAP_MAX to a high value can also help if you've got a large number of live mmapped objects.  Basically M_MMAP_MAX is the maximum number of live objects allocated by mmap at any given time.

You can also use malloc_stats() to get information about where/how memory has been allocated.  mallinfo() gives similar information, but in a form more readily usable by a program than by a human.
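
For illustration, a minimal sketch of setting these knobs from a program (the specific values are examples only, not recommendations):

    #include <malloc.h>
    #include <stdio.h>

    int main(void)
    {
        struct mallinfo mi;

        mallopt(M_TRIM_THRESHOLD, 0);          /* return free heap top to the OS eagerly   */
        mallopt(M_MMAP_THRESHOLD, 64 * 1024);  /* requests above 64 kB are served via mmap */
        mallopt(M_MMAP_MAX, 1024 * 1024);      /* allow many simultaneous mmap'd chunks    */

        /* ... the application workload would run here ... */

        malloc_stats();                        /* human-readable summary on stderr         */
        mi = mallinfo();                       /* similar data, usable from the program    */
        printf("heap size %d, in use %d, free %d bytes\n",
               mi.arena, mi.uordblks, mi.fordblks);
        return 0;
    }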

Comment 5 Peter Klotz 2012-03-06 16:15:57 UTC
Hello Jeff

Thank you very much for your hints regarding this problem and for creating the RHEL 5 bug.

The problem I am seeing disappears when setting M_TRIM_THRESHOLD.

What is still strange is the following:

The value of M_TRIM_THRESHOLD in the mallopt() call does not matter. My problem disappears for values of 0, 128 kB and even 30 MB. Since 128 kB is the default value of M_TRIM_THRESHOLD, I assumed this value would make the massive RSS usage reappear.

Digging through the glibc source I found the following:

Setting M_TRIM_THRESHOLD with mallopt() also sets internal flag mp_.no_dyn_threshold to 1. Initially this flag is always zero.

This is the only code where this flag is used (malloc/malloc.c):

    /* This runs in free() when an mmap'd chunk is released: unless the
       dynamic threshold logic has been disabled, the mmap threshold is
       raised to the freed chunk's size and the trim threshold to twice that. */
    if (!mp_.no_dyn_threshold
        && p->size > mp_.mmap_threshold
        && p->size <= DEFAULT_MMAP_THRESHOLD_MAX)
      {
        mp_.mmap_threshold = chunksize (p);
        mp_.trim_threshold = 2 * mp_.mmap_threshold;
      }

Since mp_.no_dyn_threshold is 1 after the mallopt() call, this code is no longer executed. 

Could this flag in fact be the trigger, not the value of M_TRIM_THRESHOLD?

Comment 6 Jeff Law 2012-03-07 18:29:53 UTC
Sigh.  I forgot that tweaking many of the M_* tunables (via mallopt() or the corresponding MALLOC_* environment variables) will turn off the dynamic sbrk/mmap heuristics.

Basically those heuristics attempt to identify transient vs long lived allocations with the former landing in the heap and the latter in mmap'd space.

It starts with the assumption that allocations below 128 kB are transient and those above 128 kB are long lived.  When memory that was allocated via mmap is freed, the dynamic threshold goes up (with a clamp at 64 MB).

So what's happening is the mmap threshold is being held constant after you set M_TRIM_THRESHOLD.  If that's keeping your process from growing, that's a good indication that you've got a fragmentation issue. 

By keeping the mmap threshold from growing, more allocations occur via mmap, and those are trivially released back to the system.  If the mmap threshold grows, those allocations occur via sbrk instead and are subject to the restriction that only the free memory beyond the last live allocation can be returned to the system.

It can be worth the effort to spend some time tuning the trim and mmap thresholds, particularly for long running processes.
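
A small single-threaded sketch (not taken from this bug, values illustrative) of the restriction described above: while the topmost block in the sbrk heap is still live, none of the free space below it can be returned to the kernel.

    #include <malloc.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NBLOCKS 100000

    static void *blocks[NBLOCKS];

    int main(void)
    {
        int i;

        for (i = 0; i < NBLOCKS; i++)
            blocks[i] = malloc(1024);      /* small blocks, served from the sbrk heap */

        for (i = 0; i < NBLOCKS - 1; i++)
            free(blocks[i]);               /* free everything below the topmost block */

        malloc_trim(0);
        printf("free but unreturnable: %d bytes\n", mallinfo().fordblks);

        free(blocks[NBLOCKS - 1]);         /* now the heap top itself becomes free    */
        malloc_trim(0);
        printf("after freeing the top: %d bytes\n", mallinfo().fordblks);
        return 0;
    }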

Comment 7 Peter Klotz 2012-04-02 09:31:18 UTC
Hello Jeff

Sorry for the late response, but it took us a while to update a production server to RHEL 5.8 and get our latest binaries up and running there.

Internally I now have two quite different test programs that show memory fragmentation and are "cured" by setting M_TRIM_THRESHOLD to 128 kB. I observe identical behavior on RHEL 5.8 and 6.2. Both test programs isolate certain parts of our production server code that I suspected to be part of the problem.

To my disappointment, the M_TRIM_THRESHOLD fix (with 128 kB and even with a value of 0) does not work in our RHEL 5.8 production server. It still shows growing RSS usage, so there seems to be another aspect of the problem that I have not been able to reproduce in my tests so far.

But what is much worse: I am seeing calls block in memcpy() for quite some time when a process using the M_TRIM_THRESHOLD setting runs for a longer period under high load. Only a few of several thousand client requests end up like this. Even if all clients are stopped, server threads calling memcpy() keep running for hours.

Switching back to a process that does not call mallopt() does not help either. We must reboot the machine to get sane memory allocation behavior back.

So far this behavior only occurs in our production system (2 Opteron CPUs / 8 cores, 32 GB RAM). I never managed to reproduce anything like it internally, but I have to admit that I do not have similar hardware to test on.

This is a typical backtrace taken from a single thread in a live process:

Thread 2 (Thread 0x4cc6c940 (LWP 1459)):
#0  0x0000003b0b87c629 in memcpy () from /lib64/libc.so.6
#1  0x0000003b1049c260 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep::_M_clone(std::allocator<char> const&, unsigned long) ()
   from /usr/lib64/libstdc++.so.6
#2  0x0000003b1049cb45 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::reserve(unsigned long) () from /usr/lib64/libstdc++.so.6
#3  0x0000003b1049cfaf in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::append(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib64/libstdc++.so.6
#4  0x0000000000dd82e8 in Poco::XML::CharacterData::appendData(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#5  0x0000000000dd9567 in Poco::XML::DOMBuilder::characters(char const*, int, int) ()
#6  0x0000000000df0445 in doContent(XML_ParserStruct*, int, encoding const*, char const*, char const*, char const**, unsigned char) ()
#7  0x0000000000df1404 in contentProcessor(XML_ParserStruct*, char const*, char const*, char const**) ()

Interestingly, all backtraces that end in memcpy() look the same, although we allocate memory via several other code paths.

Is this memcpy() blocking issue a known side effect of using mallopt(), or should this be considered a kernel bug? I assume it has something to do with the kernel, since otherwise why would only a reboot fix the blocking issue?

Comment 8 Jeff Law 2012-04-02 17:58:20 UTC
I'm not aware of any interaction between memcpy & mallopt, nor anything which would cause memcpy to block.  Certainly there's no blocking/locking within memcpy itself.

The only thing I can think of would be if the memory was lazily allocated by the kernel, with real memory only being used to back the allocation upon receiving a page fault.  I don't know of a way to check this directly, and it's just a theory; I don't know whether the kernel actually works this way for mmap'd regions or not.

We might get some clues if you could attach to the blocked process via gdb and do something like

x/10i $pc
i all-r

Then dump /proc/<pid>/maps

That should allow us to determine what address is locked up, and we ought to be able to determine if it's a heap, stack or other address.

Comment 9 Peter Klotz 2012-04-03 21:15:39 UTC
Hello Jeff

The blocking memcpy() issue was in fact caused by a poorly scaling algorithm in our software that was triggered by client requests. This particular kind of request started arriving exactly when I was investigating the fragmentation issue, so I falsely assumed a connection with the mallopt() call.

I hope that by eliminating this algorithm I got rid of another memory fragmentation candidate. The code performed far too many string copy operations.

I will continue to monitor memory usage of our production system.

Comment 10 Peter Klotz 2013-09-09 13:55:38 UTC
Hello Jeff

Finally, I have good news. I was able to find and fix one memory leak in our production server. Setting M_TRIM_THRESHOLD to 128 kB now limits the memory usage of our production server process to approximately 2 GB. This value is reached a few hours after starting the server. Currently the server process has been up and running for 4 months.

Thanks a lot for your help!

Regards, Peter.

Comment 11 Jeff Law 2013-09-09 15:25:46 UTC
Peter,

Glad to hear it!