Bug 1255506 - major performance problem in glibc malloc _int_free
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: glibc
Version: 6.7
Platform: All Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Assigned To: Carlos O'Donell
QA Contact: qe-baseos-tools
Type: Bug
Doc Type: Bug Fix
Reported: 2015-08-20 15:00 EDT by Paulo Andrade
Modified: 2016-11-24 07:11 EST
Last Closed: 2015-08-25 12:11:53 EDT
CC: 6 users

Attachments
  test.c (2.58 KB, text/plain) - 2015-08-20 15:04 EDT, Paulo Andrade
  test.c (2.61 KB, text/plain) - 2015-08-21 14:48 EDT, Paulo Andrade
  test.c (3.66 KB, text/plain) - 2015-08-21 16:43 EDT, Paulo Andrade
  glibc-arena.patch (1.66 KB, patch) - 2015-09-29 15:14 EDT, Paulo Andrade
  glibc-simpler-arena.patch (716 bytes, patch) - 2015-09-29 15:16 EDT, Paulo Andrade
  0001-Avoid-corruption-of-free_list.patch (1.74 KB, patch) - 2015-10-02 14:15 EDT, Paulo Andrade

Description Paulo Andrade 2015-08-20 15:00:48 EDT
To compare with jemalloc, either install the EPEL package,
build it from source at https://github.com/jemalloc/jemalloc,
or use the attached binary (from the customer).

Using a new test case, to allow rebuilding from source,
I see on rhel7:

---8<---
$ perf record ./test 1000 10000 1000
ThreadPairs=1000; NumElements=10000; AllocationSize=1000
1000 threads finished. Spent time: 9 seconds, 524076 miliseconds
Sleep 10 seconds and run again

^C
...
$ perf report | sed -e '/^#/d' -e 's/[[:space:]]*$//' | head -10
    25.39%     test  [kernel.kallsyms]   [k] _raw_spin_lock
     7.70%     test  libc-2.17.so        [.] _int_free
     5.82%     test  libc-2.17.so        [.] _int_malloc
     3.19%     test  libc-2.17.so        [.] malloc
     2.64%     test  [kernel.kallsyms]   [k] futex_wait_setup
     2.31%     test  libc-2.17.so        [.] __lll_lock_wait_private
     2.27%     test  libc-2.17.so        [.] malloc_consolidate
     2.08%     test  [kernel.kallsyms]   [k] try_to_wake_up
     1.92%     test  [kernel.kallsyms]   [k] futex_wake
     1.90%     test  libc-2.17.so        [.] free

$ LD_PRELOAD=/usr/lib64/libjemalloc.so.1 perf record ./test 1000 10000 1000
ThreadPairs=1000; NumElements=10000; AllocationSize=1000
1000 threads finished. Spent time: 0 seconds, 840462 miliseconds
Sleep 10 seconds and run again

^C
..
$ perf report | sed -e '/^#/d' -e 's/[[:space:]]*$//' | head -10
    15.38%     test  libjemalloc.so.1    [.] free
    11.20%     test  libjemalloc.so.1    [.] malloc
     4.98%     test  ld-2.17.so          [.] __tls_get_addr
     2.43%     test  [kernel.kallsyms]   [k] clear_page_c_e
     1.98%     test  libjemalloc.so.1    [.] 0x0000000000026838
     1.80%     test  [kernel.kallsyms]   [k] smp_call_function_many
     1.75%     test  test                [.] run
     1.46%     test  [kernel.kallsyms]   [k] page_fault
     1.32%     test  libjemalloc.so.1    [.] 0x000000000000d580
     1.29%     test  test                [.] cleanup
---8<---
Comment 1 Paulo Andrade 2015-08-20 15:04:18 EDT
Created attachment 1065358 [details]
test.c

$ gcc -pthread test.c -o test

This test case mimics the behaviour of the binary-only
test case from the customer, which is written in C++ and
linked against proprietary libraries.
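
The attached test.c is not inlined in this report. As a rough, hypothetical approximation of the workload it appears to exercise (inferred only from the ThreadPairs/NumElements/AllocationSize parameters and the run/cleanup symbols in the profiles, not from the actual source), each thread pair allocates in one thread and frees in the other:

/* Hypothetical sketch only -- NOT the attached test.c.  It shows the
 * cross-thread allocate/free pattern described in this report: a "run"
 * thread allocates a batch of blocks, and a "cleanup" thread frees them. */
#include <pthread.h>
#include <stdlib.h>

#define NUM_ELEMENTS 10000   /* assumed, matching the transcript defaults */
#define ALLOC_SIZE   1000

struct batch { void *ptr[NUM_ELEMENTS]; };

static void *run(void *arg)       /* allocating half of a thread pair */
{
    struct batch *b = arg;
    int i;
    for (i = 0; i < NUM_ELEMENTS; i++)
        b->ptr[i] = malloc(ALLOC_SIZE);
    return NULL;
}

static void *cleanup(void *arg)   /* freeing half of a thread pair */
{
    struct batch *b = arg;
    int i;
    for (i = 0; i < NUM_ELEMENTS; i++)
        free(b->ptr[i]);
    return NULL;
}

int main(void)                    /* one pair; the real test runs many */
{
    static struct batch b;
    pthread_t a, f;
    pthread_create(&a, NULL, run, &b);
    pthread_join(a, NULL);        /* allocation done ... */
    pthread_create(&f, NULL, cleanup, &b);
    pthread_join(f, NULL);        /* ... then freed on a different thread */
    return 0;
}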
Comment 3 Carlos O'Donell 2015-08-20 16:47:29 EDT
(In reply to Paulo Andrade from comment #0)
> $ perf record ./test 1000 10000 1000
> ThreadPairs=1000; NumElements=10000; AllocationSize=1000
> 1000 threads finished. Spent time: 9 seconds, 524076 miliseconds
> Sleep 10 seconds and run again
> 
> ^C
> ...
> $ perf report | sed -e '/^#/d' -e 's/[[:space:]]*$//' | head -10
>     25.39%     test  [kernel.kallsyms]   [k] _raw_spin_lock
>      7.70%     test  libc-2.17.so        [.] _int_free
>      5.82%     test  libc-2.17.so        [.] _int_malloc
>      3.19%     test  libc-2.17.so        [.] malloc
>      2.64%     test  [kernel.kallsyms]   [k] futex_wait_setup
>      2.31%     test  libc-2.17.so        [.] __lll_lock_wait_private
>      2.27%     test  libc-2.17.so        [.] malloc_consolidate
>      2.08%     test  [kernel.kallsyms]   [k] try_to_wake_up
>      1.92%     test  [kernel.kallsyms]   [k] futex_wake
>      1.90%     test  libc-2.17.so        [.] free

Are you reporting a regression against a baseline?

The above is relatively normal given what glibc's allocator is attempting to do. The general purpose allocator aggressively consolidates on free in an attempt to return memory back to the OS or make it available for subsequent allocations. This has a performance penalty that is paid at free().

The design parameters of other allocators are different from those of the general purpose allocator. Allocators like tcmalloc or jemalloc don't generally consider the global memory load on the system. Those allocators are used on a special per-application basis, and rarely return memory to the OS unless absolutely necessary.

The above trace is normal for rhel6, rhel7, and upstream glibc (just tested), which all use the general purpose allocator. During free we must take the arena lock in order to consolidate the recently freed chunk with the other chunks; this yields a larger chunk that can be used to service later, larger allocations.

The problem that the test case exposes is: what if you have thousands of threads freeing at the same time? In that case you get contention during free. This is simply a side effect of the general-purpose allocator's decision to keep total system memory load low as a global optimization.

We can use this test case as input into tunable parameters for the allocator, particularly with respect to aggressive consolidation during free. Until then your application may have specific needs that are only met by special purpose allocators like jemalloc, tcmalloc, hoard, nedmalloc, or lockless.

Does that answer your question?
Comment 4 Paulo Andrade 2015-08-20 17:03:16 EDT
Hi Carlos,

I told Gil, the person reporting the problem, about this
bug report. So Gil can also give some feedback.

The test case is just a way to make the "problem" more
visible. The report is that after updating from rhel5 to
rhel6, their software went from 20-30% average CPU usage
to 99% usage all the time. The software uses an allocation
pattern similar to what happens in the test case.

Maybe the bug should be changed to an RFE, to add an
environment variable that makes glibc malloc behave more
like other malloc variants with respect to free()?
Comment 5 Siddhesh Poyarekar 2015-08-20 22:43:47 EDT
There are two bottlenecks this test may be hitting.  With 1000 threads, it is surely seeing a lot of contention between threads for malloc and free.  The other bottleneck is the consolidation that Carlos mentioned, which happens not only during free, but also during malloc.

However, I am not sure malloc/free itself is your concern if what you're looking for is an explanation for the high CPU usage.  I'd start by trying to figure out the reason for the _raw_spin_lock in the kernel, which takes up 25% of the execution time.  If you want to throw out some wild guesses, try tweaking environment variables like MALLOC_TRIM_THRESHOLD_ to a high enough value to reduce how often address space is given back to the system, thus reducing the need to context switch into the kernel.  See the libc manual (info libc) for more environment variables that you can use to tweak glibc malloc behaviour.
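
These environment variables also have mallopt() counterparts; a minimal sketch of setting them from inside the program follows, with arbitrary illustrative values rather than tuned recommendations:

/* Illustrative values only; roughly equivalent to exporting
 * MALLOC_TRIM_THRESHOLD_ / MALLOC_MMAP_THRESHOLD_ / MALLOC_MMAP_MAX_ in
 * the environment.  Call early, before heavy allocation starts. */
#include <malloc.h>

int main(void)
{
    mallopt(M_TRIM_THRESHOLD, 128 * 1024 * 1024); /* trim back to the OS rarely */
    mallopt(M_MMAP_THRESHOLD, 4 * 1024 * 1024);   /* fewer mmap-backed allocations */
    mallopt(M_MMAP_MAX, 0);                       /* or disable mmap for malloc entirely */
    /* ... application / test workload ... */
    return 0;
}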
Comment 6 Carlos O'Donell 2015-08-21 10:00:08 EDT
(In reply to Siddhesh Poyarekar from comment #5)
> There are two bottlenecks this test may be hitting.  With a 1000 threads, it
> is surely seeing a lot of contention between threads for malloc and free. 
> The other bottleneck is the consolidation that Carlos mentioned, which
> happens not only during free, but also during malloc.
> 
> However, I am not sure if malloc/free itself is your concern if what you're
> looking at is an explanation for the high CPU usage.  I'd look at trying to
> figure out the reason for the _raw_spin_lock in the kernel, which takes up
> 25% of the execution time.  If you want to throw some wild guesses, then try
> tweaking environment variables like MALLOC_TRIM_THRESHOLD_ to a high enough
> value to reduce instances of giving address space back to the system, thus
> reducing the need to context switch into the kernel.  See the libc manual
> (info libc) for more environment variables that you can use to tweak glibc
> malloc behaviour.

I've tried several combinations of MALLOC_TRIM_THRESHOLD_, MALLOC_MMAP_MAX_, and MALLOC_MMAP_THRESHOLD_, but mostly you end up trading one performance problem for another. In the latter case you make a lot of mmap calls, eventually run out of VA space, and then fall back to sbrk anyway (usually after ~650000 mmaps).

I also just tried a modified _int_free that spins in userspace before taking the lock; it gets slightly better performance, but not much more than ~3%.
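
For illustration, the general "spin briefly in userspace before blocking" idea looks roughly like the sketch below; this is not the actual glibc modification, just the pattern expressed with pthread primitives:

#include <pthread.h>
#include <sched.h>

/* Hypothetical sketch of a bounded userspace spin before falling back to
 * a blocking (futex-based) lock; not the modified _int_free itself. */
static void spin_then_lock(pthread_mutex_t *m)
{
    int i;
    for (i = 0; i < 100; i++) {
        if (pthread_mutex_trylock(m) == 0)
            return;             /* acquired without entering the kernel */
        sched_yield();          /* back off briefly and retry */
    }
    pthread_mutex_lock(m);      /* still contended: futex wait in the kernel */
}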

The glibc malloc certainly doesn't use _raw_spin_lock in any direct way. In fact there are instances where I can run the test and it takes 3s, and sometimes it takes 6s, with the latter showing 25% _raw_spin_lock usage. It just depends on how the threads get scheduled with respect to their needs and lock ordering, so you can get lucky and get back 50% of the performance. It *looks* like _raw_spin_lock is part of futex_wake in this case, but we don't know for sure; we only have profile data. That would explain why the backoff spin before the lock buys only ~3%: the real cost may be the thread waking the futex as part of the unlock.

The RHEL5 allocator was much, much simpler and therefore performed significantly worse in the average threaded workload. In the RHEL6 era we switched entirely to the newer ptmalloc2 code, so there is that difference, but you also have a kernel difference.

What I'd like you to try is this: take your test case, compile it with -static on RHEL5, run it, and measure performance; then copy it to RHEL6, run it again, and measure performance (apples to apples: same runtime, different kernel).

Thoughts?

The kernel is a big factor here.
Comment 7 Carlos O'Donell 2015-08-21 10:01:52 EDT
Then the next test is to build on RHEL6 with -static, run it, measure performance, and then copy that binary to RHEL5 and run it there (it should still run with the new runtime and the old kernel, but I'm not entirely sure).
Comment 8 Paulo Andrade 2015-08-21 12:04:58 EDT
I did some extra tests, with static binaries.

Built statically on rhel5, and run on rhel6:
======
[root@ceph10 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.5 (Santiago)
[root@ceph10 ~]# rpm -q glibc kernel
glibc-2.12-1.163.el6.x86_64
kernel-2.6.32-554.el6.x86_64

[root@ceph10 ~]# file test
test: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, for GNU/Linux 2.6.9, not stripped

[root@ceph10 ~]# perf record ./test 1000 10000 1000
ThreadPairs=1000; NumElements=10000; AllocationSize=1000
1000 threads finished. Spent time: 22 seconds, 211653 miliseconds
Sleep 10 seconds and run again

^C[ perf record: Woken up 42 times to write data ]
[ perf record: Captured and wrote 11.088 MB perf.data (~484439 samples) ]
[root@ceph10 ~]# perf report | sed -e '/^#/d' -e 's/[[:space:]]*$//' | head -10
     9.55%     test  test               [.] _int_malloc
     6.52%     test  test               [.] _int_free
     6.31%     test  test               [.] malloc
     4.25%     test  [kernel.kallsyms]  [k] __bitmap_empty
     4.12%     test  [kernel.kallsyms]  [k] flush_tlb_others_ipi
     3.08%     test  [kernel.kallsyms]  [k] native_write_msr_safe
     2.95%     test  [kernel.kallsyms]  [k] _spin_lock
     2.85%     test  [kernel.kallsyms]  [k] clear_page_c_e
     2.84%     test  [kernel.kallsyms]  [k] page_fault
     2.38%     test  [kernel.kallsyms]  [k] rwsem_down_failed_common

Running the rhel5 static binary on rhel6, without perf:
[root@ceph10 ~]# ./test 1000 10000 1000
ThreadPairs=1000; NumElements=10000; AllocationSize=1000
1000 threads finished. Spent time: 15 seconds, 569992 miliseconds
Sleep 10 seconds and run again

^C
======

Now testing again, with statically built on rhel6:
======
[root@ceph10 ~]# gcc -o test test.c -pthread -static
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../lib64/libpthread.a(libpthread.o): In function `sem_open':
(.text+0x77ad): warning: the use of `mktemp' is dangerous, better use `mkstemp'

[root@ceph10 ~]# ./test 1000 10000 1000
ThreadPairs=1000; NumElements=10000; AllocationSize=1000
1000 threads finished. Spent time: 30 seconds, 193343 miliseconds
Sleep 10 seconds and run again

^C
[root@ceph10 ~]# perf report | sed -e '/^#/d' -e 's/[[:space:]]*$//' | head -10
     6.31%     test  test               [.] __correctly_grouped_prefixmb
     4.25%     test  [kernel.kallsyms]  [k] __bitmap_empty
     4.12%     test  [kernel.kallsyms]  [k] flush_tlb_others_ipi
     3.83%     test  test               [.] __libc_start_main
     3.08%     test  [kernel.kallsyms]  [k] native_write_msr_safe
     2.95%     test  [kernel.kallsyms]  [k] _spin_lock
     2.86%     test  test               [.] abort
     2.85%     test  [kernel.kallsyms]  [k] clear_page_c_e
     2.84%     test  [kernel.kallsyms]  [k] page_fault
     2.82%     test  test               [.] __cxa_atexit
======


Now, rhel6 static binary on rhel5:
======
[root@ceph15 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 5.11 (Tikanga)
[root@ceph15 ~]# rpm -q glibc kernel
glibc-2.5-123
glibc-2.5-123
kernel-2.6.18-398.el5

[root@ceph15 ~]# ./test 1000 10000 1000
ThreadPairs=1000; NumElements=10000; AllocationSize=1000
1000 threads finished. Spent time: 17 seconds, 907153 miliseconds
Sleep 10 seconds and run again

======

Again on rhel5, with the static rhel6 binary, this time with basic oprofile tests:
=====
[root@ceph15 ~]# cat oprofile.sh 
#!/bin/sh
trap "echo Interrupted" INT
VMLINUX=$(rpm -ql kernel-debuginfo | grep vmlinux)
opcontrol --vmlinux=$VMLINUX
# reset counters
opcontrol --reset
# start counters
opcontrol --start
$@
opcontrol --dump
opcontrol --shutdown

[root@ceph15 ~]# ./oprofile.sh ./test 1000 10000 1000
Using default event: CPU_CLK_UNHALTED:100000:0:1:1
Using 2.6+ OProfile kernel interface.
Reading module info.
Using log file /var/lib/oprofile/samples/oprofiled.log
Daemon started.
Profiler running.
ThreadPairs=1000; NumElements=10000; AllocationSize=1000
1000 threads finished. Spent time: 13 seconds, 337492 miliseconds
Sleep 10 seconds and run again

Interrupted
Stopping profiling.
Killing daemon.

[root@ceph15 ~]# opreport 
CPU: Intel Architectural Perfmon, speed 1200 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
CPU_CLK_UNHALT...|
  samples|      %|
------------------
    41061 95.2935 vmlinux
     1166  2.7060 test
      360  0.8355 libc-2.5.so
      160  0.3713 bash
       96  0.2228 oprofiled
       71  0.1648 oprofile
       35  0.0812 ld-2.5.so
       28  0.0650 ehci_hcd
       21  0.0487 jbd
[...]

[root@ceph15 ~]# opgprof ./test
[root@ceph15 ~]# gprof -p ./test
Flat profile:

Each sample counts as 1 samples.
  %   cumulative   self              self     total           
 time   samples   samples    calls  T1/call  T1/call  name    
 39.02    455.00   455.00                             _int_malloc
 21.70    708.00   253.00                             malloc
 14.24    874.00   166.00                             mprotect
  7.12    957.00    83.00                             run
  6.86   1037.00    80.00                             _int_free
  6.86   1117.00    80.00                             grow_heap
  1.03   1129.00    12.00                             getpagesize
  0.94   1140.00    11.00                             cfree
  0.43   1145.00     5.00                             munmap
  0.34   1149.00     4.00                             arena_get2
  0.26   1152.00     3.00                             _dl_deallocate_tls
  0.26   1155.00     3.00                             pthread_join
  0.17   1157.00     2.00                             __deallocate_stack
  0.17   1159.00     2.00                             cleanup
  0.17   1161.00     2.00                             main
  0.09   1162.00     1.00                             _IO_new_file_xsputn
  0.09   1163.00     1.00                             __lll_unlock_wake_private
  0.09   1164.00     1.00                             malloc_consolidate
  0.09   1165.00     1.00                             new_heap
  0.09   1166.00     1.00                             pthread_create
[...]
[root@ceph15 ~]# opgprof /usr/lib/debug/lib/modules/2.6.18-406.el5/vmlinux
[root@ceph15 ~]# gprof -p /usr/lib/debug/lib/modules/2.6.18-406.el5/vmlinux|head -20
Flat profile:

Each sample counts as 1 samples.
  %   cumulative   self              self     total           
 time   samples   samples    calls  T1/call  T1/call  name    
 47.12  19346.00 19346.00                             acpi_processor_power_activate
  3.59  20819.00  1473.00                             schedule
  2.45  21827.00  1008.00                             plist_add
  2.30  22771.00   944.00                             free_pages_bulk
  1.96  23575.00   804.00                             reschedule_interrupt
  1.94  24371.00   796.00                             unmap_vmas
  1.73  25080.00   709.00                             find_busiest_group
  1.69  25772.00   692.00                             page_remove_rmap
  1.36  26329.00   557.00                             acpi_processor_idle_simple
  1.35  26882.00   553.00                             free_hot_cold_page
  1.15  27355.00   473.00                             acpi_table_print
  1.14  27823.00   468.00                             cpuset_update_task_memory_state
  1.01  28237.00   414.00                             page_fault
  1.01  28650.00   413.00                             clear_page
  0.93  29033.00   383.00                             release_pages
======

Hopefully the extra verbosity above will not make things confusing :)
Comment 9 Paulo Andrade 2015-08-21 12:16:09 EDT
A combination that was not tested in the previous comment:
rhel5 static binary on rhel5:

======
[root@ceph15 ~]# gcc -o test test.c -pthread -static
[root@ceph15 ~]# ./test 1000 10000 1000
ThreadPairs=1000; NumElements=10000; AllocationSize=1000
1000 threads finished. Spent time: 12 seconds, 91237 miliseconds
Sleep 10 seconds and run again

[root@ceph15 ~]# ./oprofile.sh ./test 1000 10000 1000
Using default event: CPU_CLK_UNHALTED:100000:0:1:1
Using 2.6+ OProfile kernel interface.
Reading module info.
Using log file /var/lib/oprofile/samples/oprofiled.log
Daemon started.
Profiler running.
ThreadPairs=1000; NumElements=10000; AllocationSize=1000
1000 threads finished. Spent time: 19 seconds, 184624 miliseconds
Sleep 10 seconds and run again

Interrupted
Stopping profiling.
Killing daemon.
[root@ceph15 ~]# opgprof ./test
[root@ceph15 ~]# gprof -p ./test
Flat profile:

Each sample counts as 1 samples.
  %   cumulative   self              self     total           
 time   samples   samples    calls  T1/call  T1/call  name    
 38.44    472.00   472.00                             _int_malloc
 20.36    722.00   250.00                             malloc
 12.79    879.00   157.00                             mprotect
  8.63    985.00   106.00                             _int_free
  7.74   1080.00    95.00                             run
  7.08   1167.00    87.00                             grow_heap
  1.14   1181.00    14.00                             getpagesize
  0.65   1189.00     8.00                             arena_get2
  0.49   1195.00     6.00                             cleanup
  0.49   1201.00     6.00                             main
  0.49   1207.00     6.00                             munmap
  0.41   1212.00     5.00                             malloc_consolidate
  0.33   1216.00     4.00                             new_heap
  0.24   1219.00     3.00                             pthread_create
  0.16   1221.00     2.00                             __deallocate_stack
  0.16   1223.00     2.00                             cfree
  0.16   1225.00     2.00                             pthread_join
  0.08   1226.00     1.00                             _IO_file_doallocate
  0.08   1227.00     1.00                             __pthread_enable_asynccancel
  0.08   1228.00     1.00                             _dl_deallocate_tls
...

[root@ceph15 ~]# opgprof /usr/lib/debug/lib/modules/2.6.18-406.el5/vmlinux
[root@ceph15 ~]# gprof -p /usr/lib/debug/lib/modules/2.6.18-406.el5/vmlinux|head -20
Flat profile:

Each sample counts as 1 samples.
  %   cumulative   self              self     total           
 time   samples   samples    calls  T1/call  T1/call  name    
 41.54  16064.00 16064.00                             acpi_processor_power_activate
  3.72  17504.00  1440.00                             plist_add
  3.61  18900.00  1396.00                             free_pages_bulk
  3.59  20289.00  1389.00                             schedule
  3.01  21454.00  1165.00                             unmap_vmas
  2.51  22426.00   972.00                             page_remove_rmap
  1.91  23164.00   738.00                             free_hot_cold_page
  1.85  23878.00   714.00                             reschedule_interrupt
  1.61  24499.00   621.00                             find_busiest_group
  1.50  25078.00   579.00                             release_pages
  1.29  25576.00   498.00                             acpi_processor_idle_simple
  1.19  26038.00   462.00                             cpuset_update_task_memory_state
  1.03  26438.00   400.00                             acpi_table_print
  0.99  26820.00   382.00                             page_fault
  0.91  27171.00   351.00                             clear_page
======
Comment 10 Carlos O'Donell 2015-08-21 14:38:24 EDT
In summary:

rhel6 static binary on rhel6 was 30.193342 seconds.
rhel6 static binary on rhel5 was 17.907153 seconds.
rhel5 static binary on rhel6 was 15.569992 seconds.
rhel5 static binary on rhel5 was 12.091237 seconds.

Please run this and capture statistically significant data (at least 25 runs per configuration). For example, on a linux 4.x (upstream) kernel with glibc 2.22 (upstream) the mean is 12.5 seconds with a standard deviation of 2.7 seconds and a standard error of 0.606 seconds, which is close to what you're seeing on rhel5.

With more data you'll be able to determine if there is any real difference. I expect there is, but I expect the difference is simply in the newer algorithm in rhel6, which we aren't going to change.

However, this discussion might inform us on:

(a) Tunables for rhel6 malloc.
(b) Tunables for rhel7 malloc.

Note: The test writes "miliseconds" but it actually means "microseconds."
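
For reference, a minimal sketch (a hypothetical helper, not part of the attached test.c; link with -lm) of computing the mean, standard deviation, and standard error over the per-run times:

#include <math.h>
#include <stdio.h>

/* Hypothetical helper: summary statistics over n run times, in seconds. */
static void summarize(const double *t, int n)
{
    int i;
    double sum = 0.0, sq = 0.0, mean, sd, se;
    for (i = 0; i < n; i++)
        sum += t[i];
    mean = sum / n;
    for (i = 0; i < n; i++)
        sq += (t[i] - mean) * (t[i] - mean);
    sd = sqrt(sq / (n - 1));        /* sample standard deviation */
    se = sd / sqrt((double)n);      /* standard error of the mean */
    printf("mean=%.3f sd=%.3f se=%.3f seconds\n", mean, sd, se);
}

int main(void)
{
    double t[] = { 12.1, 11.9, 12.8, 13.0, 12.4 };   /* example values */
    summarize(t, 5);
    return 0;
}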
Comment 11 Paulo Andrade 2015-08-21 14:48:51 EDT
Created attachment 1065686 [details]
test.c

Added an extra argument to the test case to make it
easier to run a specific number of test passes.
Corrected the message to say that the fractional time is
in microseconds, not milliseconds.
Comment 12 Paulo Andrade 2015-08-21 16:43:04 EDT
Created attachment 1065722 [details]
test.c

Slightly better output, easier to parse...

Doing the rhel5 and rhel6 combinations first as both machines
are the same hardware. I will try to get identical hardware
to test rawhide and/or rhel7.

rhel5 static binary on rhel5:
======
[root@ceph15 ~]# ./test 1000 10000 1000 30
ThreadPairs=1000; NumElements=10000; AllocationSize=1000; NumTests=30
[30] Spent time:  22.730352 seconds
[29] Spent time:   2.701390 seconds
[28] Spent time:   2.038268 seconds
[27] Spent time:   1.907446 seconds
[26] Spent time:   1.736177 seconds
[25] Spent time:   1.760505 seconds
[24] Spent time:   1.567544 seconds
[23] Spent time:   1.651975 seconds
[22] Spent time:   1.696975 seconds
[21] Spent time:   1.558696 seconds
[20] Spent time:   1.584615 seconds
[19] Spent time:   1.346623 seconds
[18] Spent time:   1.521489 seconds
[17] Spent time:   1.566561 seconds
[16] Spent time:   1.519711 seconds
[15] Spent time:   1.496557 seconds
[14] Spent time:   1.545853 seconds
[13] Spent time:   1.566517 seconds
[12] Spent time:   1.394628 seconds
[11] Spent time:   1.353267 seconds
[10] Spent time:   1.450958 seconds
[ 9] Spent time:   1.626107 seconds
[ 8] Spent time:   1.464145 seconds
[ 7] Spent time:   1.555224 seconds
[ 6] Spent time:   1.438600 seconds
[ 5] Spent time:   1.524656 seconds
[ 4] Spent time:   1.303620 seconds
[ 3] Spent time:   1.350697 seconds
[ 2] Spent time:   1.426530 seconds
[ 1] Spent time:   1.341451 seconds
   Average time:   2.290934 seconds
       Min time:   1.303620 seconds
       Max time:  22.730352 seconds

rhel5 static binary on rhel6:
======
[root@ceph10 ~]# scp ceph15.hq.gsslab.rdu.redhat.com:test ./test5
[root@ceph10 ~]# ./test5 1000 10000 1000 30
ThreadPairs=1000; NumElements=10000; AllocationSize=1000; NumTests=30
[30] Spent time:  15.658938 seconds
[29] Spent time:   2.456758 seconds
[28] Spent time:   2.511950 seconds
[27] Spent time:   2.236376 seconds
[26] Spent time:   1.760475 seconds
[25] Spent time:   1.459611 seconds
[24] Spent time:   1.921309 seconds
[23] Spent time:   1.366780 seconds
[22] Spent time:   1.626451 seconds
[21] Spent time:   1.382810 seconds
[20] Spent time:   1.331229 seconds
[19] Spent time:   1.280195 seconds
[18] Spent time:   1.287955 seconds
[17] Spent time:   1.314498 seconds
[16] Spent time:   1.277706 seconds
[15] Spent time:   1.316141 seconds
[14] Spent time:   1.268374 seconds
[13] Spent time:   1.302675 seconds
[12] Spent time:   1.298801 seconds
[11] Spent time:   1.327455 seconds
[10] Spent time:   1.290133 seconds
[ 9] Spent time:   1.308165 seconds
[ 8] Spent time:   1.376440 seconds
[ 7] Spent time:   1.266894 seconds
[ 6] Spent time:   1.300155 seconds
[ 5] Spent time:   1.355953 seconds
[ 4] Spent time:   1.324674 seconds
[ 3] Spent time:   1.268207 seconds
[ 2] Spent time:   1.280307 seconds
[ 1] Spent time:   1.314688 seconds
   Average time:   1.949090 seconds
       Min time:   1.266894 seconds
       Max time:  15.658938 seconds

rhel6 static binary on rhel5:
======
[root@ceph15 ~]# scp ceph10.hq.gsslab.rdu.redhat.com:test ./test6
[root@ceph15 ~]# ./test6 1000 10000 1000 30
ThreadPairs=1000; NumElements=10000; AllocationSize=1000; NumTests=30
[30] Spent time:  20.480393 seconds
[29] Spent time:  10.743429 seconds
[28] Spent time:  11.665014 seconds
[27] Spent time:  12.169247 seconds
[26] Spent time:  12.099680 seconds
[25] Spent time:  10.930062 seconds
[24] Spent time:  11.670270 seconds
[23] Spent time:  10.928671 seconds
[22] Spent time:  12.113895 seconds
[21] Spent time:  11.133409 seconds
[20] Spent time:  11.945694 seconds
[19] Spent time:  11.910168 seconds
[18] Spent time:  12.114059 seconds
[17] Spent time:  12.089180 seconds
[16] Spent time:  10.946918 seconds
[15] Spent time:  10.857889 seconds
[14] Spent time:  10.935703 seconds
[13] Spent time:  12.018039 seconds
[12] Spent time:  12.091831 seconds
[11] Spent time:  10.992099 seconds
[10] Spent time:  10.938686 seconds
[ 9] Spent time:  12.046167 seconds
[ 8] Spent time:  12.026277 seconds
[ 7] Spent time:  11.685663 seconds
[ 6] Spent time:  12.153392 seconds
[ 5] Spent time:  12.064646 seconds
[ 4] Spent time:  12.043551 seconds
[ 3] Spent time:  10.835416 seconds
[ 2] Spent time:  11.585647 seconds
[ 1] Spent time:  12.145870 seconds
   Average time:  11.912046 seconds
       Min time:  10.743429 seconds
       Max time:  20.480393 seconds

rhel6 static binary on rhel6:
======
[root@ceph10 ~]# ./test 1000 10000 1000 30
ThreadPairs=1000; NumElements=10000; AllocationSize=1000; NumTests=30
[30] Spent time:  17.157231 seconds
[29] Spent time:  22.458463 seconds
[28] Spent time:  22.938226 seconds
[27] Spent time:  23.201313 seconds
[26] Spent time:  22.878378 seconds
[25] Spent time:  22.917504 seconds
[24] Spent time:  23.044797 seconds
[23] Spent time:  22.783037 seconds
[22] Spent time:  22.938767 seconds
[21] Spent time:  22.742818 seconds
[20] Spent time:  22.877600 seconds
[19] Spent time:  22.829448 seconds
[18] Spent time:  22.683854 seconds
[17] Spent time:  22.914053 seconds
[16] Spent time:  22.596160 seconds
[15] Spent time:  22.706944 seconds
[14] Spent time:  22.769750 seconds
[13] Spent time:  22.617896 seconds
[12] Spent time:  22.658839 seconds
[11] Spent time:  22.545624 seconds
[10] Spent time:  22.816544 seconds
[ 9] Spent time:  22.773896 seconds
[ 8] Spent time:  22.684363 seconds
[ 7] Spent time:  22.562513 seconds
[ 6] Spent time:  22.670612 seconds
[ 5] Spent time:  22.893113 seconds
[ 4] Spent time:  22.921917 seconds
[ 3] Spent time:  22.643330 seconds
[ 2] Spent time:  22.624289 seconds
[ 1] Spent time:  22.513589 seconds
   Average time:  22.578917 seconds
       Min time:  17.157231 seconds
       Max time:  23.201313 seconds
Comment 13 Carlos O'Donell 2015-08-21 17:13:43 EDT
(In reply to Paulo Andrade from comment #12)
> rhel5 static binary on rhel5:
>    Average time:   2.290934 seconds
>        Min time:   1.303620 seconds
>        Max time:  22.730352 seconds
> 
> rhel5 static binary on rhel6:
>    Average time:   1.949090 seconds
>        Min time:   1.266894 seconds
>        Max time:  15.658938 seconds
> 
> rhel6 static binary on rhel5:
>    Average time:  11.912046 seconds
>        Min time:  10.743429 seconds
>        Max time:  20.480393 seconds
> 
> rhel6 static binary on rhel6:
>    Average time:  22.578917 seconds
>        Min time:  17.157231 seconds
>        Max time:  23.201313 seconds

It looks like both kernel and glibc effects cause the performance degradation from rhel5 to rhel6.

We still don't clearly have an answer though.

The static rhel6 binary on rhel5 shows:
 39.02    455.00   455.00                             _int_malloc
 21.70    708.00   253.00                             malloc
 14.24    874.00   166.00                             mprotect

The static rhel5 binary on rhel5 shows:
 38.44    472.00   472.00                             _int_malloc
 20.36    722.00   250.00                             malloc
 12.79    879.00   157.00                             mprotect

The differences in raw percentages for the binaries aren't that big.

Yet the rhel6 static binary takes 9 seconds longer? It's possible for the percentages to stay the same but the amount of time to go up across the board, but that's odd.

I think this comparison is simply not going to lead us anywhere.

A developer would have to sit down with your example, real hardware, and rhel6, and work to get the malloc times down by investigating small changes throughout.

This is certainly in the realm of an RFE, since there is no real bug, the rhel6 code has new features, and a different performance profile that isn't going to change easily or quickly. However, there might be a bug lurking somewhere causing this performance issue, so I don't quite recommend we mark this as an RFE yet.

I'm curious to see the rhel7 results.
Comment 14 Paulo Andrade 2015-08-21 23:15:50 EDT
I could not get a host automatically built to rawhide or f23,
but I can/will manually update the kernel and glibc later.

f22 static binary on f22:
======
[root@ceph6 ~]# cat /etc/redhat-release 
Fedora release 22 (Twenty Two)
[root@ceph6 ~]# rpm -q kernel glibc
kernel-4.1.5-200.fc22.x86_64
glibc-2.21-7.fc22.x86_64
[root@ceph6 ~]# ./test 1000 10000 1000 30
ThreadPairs=1000; NumElements=10000; AllocationSize=1000; NumTests=30
[30] Spent time:   6.654833 seconds
[29] Spent time:  42.000954 seconds
[28] Spent time:  40.656350 seconds
[27] Spent time:  41.162497 seconds
[26] Spent time:  41.493064 seconds
[25] Spent time:  40.012881 seconds
[24] Spent time:  39.775425 seconds
[23] Spent time:  42.420672 seconds
[22] Spent time:  40.774745 seconds
[21] Spent time:  42.079258 seconds
[20] Spent time:  40.944375 seconds
[19] Spent time:  41.416803 seconds
[18] Spent time:  41.213615 seconds
[17] Spent time:  41.339998 seconds
[16] Spent time:  40.850510 seconds
[15] Spent time:  40.715527 seconds
[14] Spent time:  41.756486 seconds
[13] Spent time:  41.426690 seconds
[12] Spent time:  40.015386 seconds
[11] Spent time:  41.063395 seconds
[10] Spent time:  40.491722 seconds
[ 9] Spent time:  40.850378 seconds
[ 8] Spent time:  42.017397 seconds
[ 7] Spent time:  41.064231 seconds
[ 6] Spent time:  42.106575 seconds
[ 5] Spent time:  40.664243 seconds
[ 4] Spent time:  42.232519 seconds
[ 3] Spent time:  41.310565 seconds
[ 2] Spent time:  40.426628 seconds
[ 1] Spent time:  42.308742 seconds
   Average time:  40.041580 seconds
       Min time:   6.654833 seconds
       Max time:  42.420672 seconds

rhel7 static binary on rhel7:
======
[root@ceph5 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.2 Beta (Maipo)
[root@ceph5 ~]# rpm -q kernel glibc
kernel-3.10.0-306.0.1.el7.x86_64
glibc-2.17-105.el7.x86_64
[root@ceph5 ~]# ./test 1000 10000 1000 30
ThreadPairs=1000; NumElements=10000; AllocationSize=1000; NumTests=30
[30] Spent time:   5.944928 seconds
[29] Spent time:  22.425697 seconds
[28] Spent time:  22.075971 seconds
[27] Spent time:  22.115132 seconds
[26] Spent time:  22.143182 seconds
[25] Spent time:  22.118118 seconds
[24] Spent time:  22.015780 seconds
[23] Spent time:  22.122134 seconds
[22] Spent time:  22.046902 seconds
[21] Spent time:  22.010156 seconds
[20] Spent time:  22.059269 seconds
[19] Spent time:  22.131619 seconds
[18] Spent time:  22.020342 seconds
[17] Spent time:  22.003312 seconds
[16] Spent time:  22.066389 seconds
[15] Spent time:  22.025687 seconds
[14] Spent time:  22.005933 seconds
[13] Spent time:  21.896071 seconds
[12] Spent time:  22.044801 seconds
[11] Spent time:  21.886791 seconds
[10] Spent time:  22.008665 seconds
[ 9] Spent time:  21.765397 seconds
[ 8] Spent time:  21.852094 seconds
[ 7] Spent time:  21.872844 seconds
[ 6] Spent time:  21.912621 seconds
[ 5] Spent time:  21.896480 seconds
[ 4] Spent time:  21.869378 seconds
[ 3] Spent time:  21.869905 seconds
[ 2] Spent time:  21.948436 seconds
[ 1] Spent time:  21.896254 seconds
   Average time:  21.468357 seconds
       Min time:   5.944928 seconds
       Max time:  22.425697 seconds


Since the rhel5 static binary is the "fastest", only it is tested below.

rhel5 static binary on f22:
======
[root@ceph6 ~]# ./test5 1000 10000 1000 30
ThreadPairs=1000; NumElements=10000; AllocationSize=1000; NumTests=30
[30] Spent time:  35.721570 seconds
[29] Spent time:   3.144000 seconds
[28] Spent time:   1.813663 seconds
[27] Spent time:   1.657953 seconds
[26] Spent time:   1.495697 seconds
[25] Spent time:   1.127324 seconds
[24] Spent time:   1.002130 seconds
[23] Spent time:   1.045363 seconds
[22] Spent time:   0.913646 seconds
[21] Spent time:   0.909600 seconds
[20] Spent time:   0.916443 seconds
[19] Spent time:   1.291384 seconds
[18] Spent time:   0.864766 seconds
[17] Spent time:   0.861863 seconds
[16] Spent time:   0.837656 seconds
[15] Spent time:   0.890059 seconds
[14] Spent time:   0.854494 seconds
[13] Spent time:   3.031161 seconds
[12] Spent time:   0.917437 seconds
[11] Spent time:   1.922547 seconds
[10] Spent time:   1.374201 seconds
[ 9] Spent time:   0.829082 seconds
[ 8] Spent time:   0.934746 seconds
[ 7] Spent time:   2.126772 seconds
[ 6] Spent time:   0.875436 seconds
[ 5] Spent time:   2.164686 seconds
[ 4] Spent time:   0.843479 seconds
[ 3] Spent time:   0.895152 seconds
[ 2] Spent time:   0.868875 seconds
[ 1] Spent time:   1.325992 seconds
   Average time:   2.448601 seconds
       Min time:   0.829082 seconds
       Max time:  35.721570 seconds

rhel5 static binary on rhel7:
======
[root@ceph5 ~]#  ./test5 1000 10000 1000 30
ThreadPairs=1000; NumElements=10000; AllocationSize=1000; NumTests=30
[30] Spent time:   9.558542 seconds
[29] Spent time:   3.633194 seconds
[28] Spent time:   2.466325 seconds
[27] Spent time:   1.706239 seconds
[26] Spent time:   1.471576 seconds
[25] Spent time:   1.306369 seconds
[24] Spent time:   1.318518 seconds
[23] Spent time:   1.194602 seconds
[22] Spent time:   1.119062 seconds
[21] Spent time:   1.248265 seconds
[20] Spent time:   1.342042 seconds
[19] Spent time:   1.150096 seconds
[18] Spent time:   1.071043 seconds
[17] Spent time:   1.179559 seconds
[16] Spent time:   1.060430 seconds
[15] Spent time:   0.992817 seconds
[14] Spent time:   1.020293 seconds
[13] Spent time:   1.073603 seconds
[12] Spent time:   1.112817 seconds
[11] Spent time:   1.129159 seconds
[10] Spent time:   0.986533 seconds
[ 9] Spent time:   1.738998 seconds
[ 8] Spent time:   1.089909 seconds
[ 7] Spent time:   1.013870 seconds
[ 6] Spent time:   1.007646 seconds
[ 5] Spent time:   1.135050 seconds
[ 4] Spent time:   0.959197 seconds
[ 3] Spent time:   1.041139 seconds
[ 2] Spent time:   1.177035 seconds
[ 1] Spent time:   0.984434 seconds
   Average time:   1.576292 seconds
       Min time:   0.959197 seconds
       Max time:   9.558542 seconds
Comment 15 Paulo Andrade 2015-08-22 00:05:55 EDT
f22 "converted" to f23.

f23 static binary on f23:
======
[root@ceph6 ~]# rpm -q glibc kernel
glibc-2.21.90-21.fc23.x86_64
kernel-4.2.0-0.rc5.git0.2.fc23.x86_64
[root@ceph6 ~]# gcc test.c -o test -pthread -static
test.c:38:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
 cleanup(thread_t *thread)
 ^
[root@ceph6 ~]# ./test 1000 10000 1000 30
ThreadPairs=1000; NumElements=10000; AllocationSize=1000; NumTests=30
[30] Spent time:  13.870852 seconds
[29] Spent time:  34.157159 seconds
[28] Spent time:  33.880962 seconds
[27] Spent time:  34.684896 seconds
[26] Spent time:  33.277182 seconds
[25] Spent time:  34.499069 seconds
[24] Spent time:  34.612034 seconds
[23] Spent time:  34.364118 seconds
[22] Spent time:  34.673192 seconds
[21] Spent time:  34.717557 seconds
[20] Spent time:  34.449741 seconds
[19] Spent time:  34.392952 seconds
[18] Spent time:  34.968097 seconds
[17] Spent time:  35.025524 seconds
[16] Spent time:  34.018016 seconds
[15] Spent time:  33.946447 seconds
[14] Spent time:  34.389165 seconds
[13] Spent time:  34.541654 seconds
[12] Spent time:  34.205828 seconds
[11] Spent time:  34.351176 seconds
[10] Spent time:  34.764020 seconds
[ 9] Spent time:  34.361421 seconds
[ 8] Spent time:  34.769760 seconds
[ 7] Spent time:  35.047672 seconds
[ 6] Spent time:  33.926975 seconds
[ 5] Spent time:  34.381010 seconds
[ 4] Spent time:  34.259990 seconds
[ 3] Spent time:  35.000876 seconds
[ 2] Spent time:  34.363796 seconds
[ 1] Spent time:  34.753851 seconds
   Average time:  33.755197 seconds
       Min time:  13.870852 seconds
       Max time:  35.047672 seconds

f23 converted to rawhide (quite impressive BTW).

rawhide static binary on rawhide:
======
[root@ceph6 ~]# rpm -q glibc kernel
glibc-2.22.90-3.fc24.x86_64
kernel-4.2.0-0.rc7.git2.1.fc24.x86_64
[root@ceph6 ~]# gcc -o test test.c -pthread -static
test.c:38:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
 cleanup(thread_t *thread)
 ^
[root@ceph6 ~]# ./test5 1000 10000 1000 30
ThreadPairs=1000; NumElements=10000; AllocationSize=1000; NumTests=30
[30] Spent time:  18.639154 seconds
[29] Spent time:   2.308003 seconds
[28] Spent time:   1.916355 seconds
[27] Spent time:   2.032811 seconds
[26] Spent time:   1.001016 seconds
[25] Spent time:   1.320737 seconds
[24] Spent time:   0.827509 seconds
[23] Spent time:   1.590169 seconds
[22] Spent time:   0.920391 seconds
[21] Spent time:   2.124993 seconds
[20] Spent time:   0.846258 seconds
[19] Spent time:   0.823557 seconds
[18] Spent time:   0.824612 seconds
[17] Spent time:   0.894913 seconds
[16] Spent time:   2.190311 seconds
[15] Spent time:   0.990620 seconds
[14] Spent time:   0.875947 seconds
[13] Spent time:   2.551606 seconds
[12] Spent time:   0.826549 seconds
[11] Spent time:   0.902446 seconds
[10] Spent time:   1.103954 seconds
[ 9] Spent time:   0.880703 seconds
[ 8] Spent time:   0.870502 seconds
[ 7] Spent time:   1.949845 seconds
[ 6] Spent time:   0.900315 seconds
[ 5] Spent time:   0.903792 seconds
[ 4] Spent time:   1.950597 seconds
[ 3] Spent time:   0.888902 seconds
[ 2] Spent time:   0.893260 seconds
[ 1] Spent time:   2.048023 seconds
   Average time:   1.893292 seconds
       Min time:   0.823557 seconds
       Max time:  18.639154 seconds
Comment 16 Siddhesh Poyarekar 2015-08-24 01:10:22 EDT
OK, here's a wild guess - can you set MALLOC_ARENA_MAX to 1000 (i.e. same as the number of threads) and see what difference it makes for the rhel-6/rhel-7/fedora binary?

The guess is that rhel-5 code simply spawns off a new arena whenever it detects contention and reaches a point where it has an arena for each of the 1000 threads.  That is why it takes longer for the first run, but is very fast in subsequent runs.  The rhel-5 code does not have a maximum for the number of arenas IIRC.

rhel-6, on the other hand, will limit the number of arenas to the MALLOC_ARENA_MAX value, which defaults to 8 times the number of cores on 64-bit.
Comment 17 Paulo Andrade 2015-08-24 08:31:12 EDT
Hi Siddesh,
Indeed it does exactly as you guessed. It appears to be
exponential. I tested several values up to 200, and had
only marginal gain, but with 1 arena per thread, it runs
like the reported rhel5 and rawhide on rhel6 and rhel7.

(The rhel6 machine started as rhel6.5, but was updated to the
latest kernel and glibc.)

static rhel6 binary on rhel6:
======
[root@ceph10 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.5 (Santiago)
[root@ceph10 ~]# MALLOC_ARENA_MAX=1000 ./test 1000 10000 1000 30
ThreadPairs=1000; NumElements=10000; AllocationSize=1000; NumTests=30
[30] Spent time:  39.698830 seconds
[29] Spent time:   0.289303 seconds
[28] Spent time:   0.266758 seconds
[27] Spent time:   0.247873 seconds
[26] Spent time:   0.253778 seconds
[25] Spent time:   0.246414 seconds
[24] Spent time:   0.253071 seconds
[23] Spent time:   0.253006 seconds
[22] Spent time:   0.257633 seconds
[21] Spent time:   0.251150 seconds
[20] Spent time:   0.249118 seconds
[19] Spent time:   0.247551 seconds
[18] Spent time:   0.249857 seconds
[17] Spent time:   0.260324 seconds
[16] Spent time:   0.255005 seconds
[15] Spent time:   0.258537 seconds
[14] Spent time:   0.270087 seconds
[13] Spent time:   0.259674 seconds
[12] Spent time:   0.248127 seconds
[11] Spent time:   0.260382 seconds
[10] Spent time:   0.285116 seconds
[ 9] Spent time:   0.264231 seconds
[ 8] Spent time:   0.275049 seconds
[ 7] Spent time:   0.247222 seconds
[ 6] Spent time:   0.247061 seconds
[ 5] Spent time:   0.263296 seconds
[ 4] Spent time:   0.252353 seconds
[ 3] Spent time:   0.248094 seconds
[ 2] Spent time:   0.245643 seconds
[ 1] Spent time:   0.252830 seconds
   Average time:   1.571927 seconds
       Min time:   0.245643 seconds
       Max time:  39.698830 seconds

static rhel7 binary on rhel7:
======
[root@ceph5 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.2 Beta (Maipo)
[root@ceph5 ~]#  MALLOC_ARENA_MAX=1000 ./test 1000 10000 1000 30
ThreadPairs=1000; NumElements=10000; AllocationSize=1000; NumTests=30
[30] Spent time:  11.532985 seconds
[29] Spent time:   0.210103 seconds
[28] Spent time:   0.207964 seconds
[27] Spent time:   0.207277 seconds
[26] Spent time:   0.207676 seconds
[25] Spent time:   0.208346 seconds
[24] Spent time:   0.207832 seconds
[23] Spent time:   0.207473 seconds
[22] Spent time:   0.209500 seconds
[21] Spent time:   0.208177 seconds
[20] Spent time:   0.211084 seconds
[19] Spent time:   0.207228 seconds
[18] Spent time:   0.207065 seconds
[17] Spent time:   0.208569 seconds
[16] Spent time:   0.206819 seconds
[15] Spent time:   0.206759 seconds
[14] Spent time:   0.210272 seconds
[13] Spent time:   0.208458 seconds
[12] Spent time:   0.208612 seconds
[11] Spent time:   0.206611 seconds
[10] Spent time:   0.207909 seconds
[ 9] Spent time:   0.209637 seconds
[ 8] Spent time:   0.206801 seconds
[ 7] Spent time:   0.207348 seconds
[ 6] Spent time:   0.207319 seconds
[ 5] Spent time:   0.207329 seconds
[ 4] Spent time:   0.208122 seconds
[ 3] Spent time:   0.206469 seconds
[ 2] Spent time:   0.206239 seconds
[ 1] Spent time:   0.208249 seconds
   Average time:   0.585487 seconds
       Min time:   0.206239 seconds
       Max time:  11.532985 seconds
Comment 18 Carlos O'Donell 2015-08-24 08:38:22 EDT
(In reply to Paulo Andrade from comment #17)
> Hi Siddesh,
> Indeed it does exactly as you guessed. It appears to be
> exponential. I tested several values up to 200, and had
> only marginal gain, but with 1 arena per thread, it runs
> like the reported rhel5 and rawhide on rhel6 and rhel7.

I didn't see this behaviour because I considered the thread startup to be part of the cost of the test: I disabled test looping and took only the first result, running the test over and over again. My apologies for the confusion.
Comment 19 Siddhesh Poyarekar 2015-08-24 09:07:38 EDT
(In reply to Paulo Andrade from comment #17)
> Hi Siddesh,
> Indeed it does exactly as you guessed. It appears to be
> exponential. I tested several values up to 200, and had
> only marginal gain, but with 1 arena per thread, it runs
> like the reported rhel5 and rawhide on rhel6 and rhel7.

Thank you for confirming that.  Anything less than an arena per thread implies contention and that will always be the highest cost with so many threads.  I am surprised that rawhide behaves well out of the box; looks like we did something awesome and did not even know it!  Or maybe the timing characteristics on the rawhide kernel just happen to be such that contention is somehow minimized.  I don't remember any improvements directly in the allocator that may have done this.

I think you have your resolution for the problem though - instruct the user to set MALLOC_ARENA_MAX to a high enough value that allows each thread to have its own arena.  So if the application has a maximum of 2000 threads, set MALLOC_ARENA_MAX to 2000.  It will then allocate arenas as and when needed, up to the maximum of 2000.

It will bloat VSZ, but that shouldn't matter to them since the rhel-5 allocator should have resulted in a similar size for VSZ.
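
If setting the environment variable is awkward for a given deployment, the same limit can also be raised from inside the application via glibc's mallopt(); a minimal sketch (the 2000 mirrors the example above and must be applied before the worker threads start allocating):

#include <malloc.h>
#include <stdio.h>

int main(void)
{
    /* Roughly equivalent to MALLOC_ARENA_MAX=2000 in the environment;
     * M_ARENA_MAX is a glibc-specific mallopt() parameter. */
    if (mallopt(M_ARENA_MAX, 2000) != 1)
        fprintf(stderr, "mallopt(M_ARENA_MAX) failed\n");
    /* ... create up to 2000 threads ... */
    return 0;
}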
Comment 20 Carlos O'Donell 2015-08-24 23:06:31 EDT
(In reply to Siddhesh Poyarekar from comment #19)
> (In reply to Paulo Andrade from comment #17)
> > Hi Siddesh,
> > Indeed it does exactly as you guessed. It appears to be
> > exponential. I tested several values up to 200, and had
> > only marginal gain, but with 1 arena per thread, it runs
> > like the reported rhel5 and rawhide on rhel6 and rhel7.
> 
> Thank you for confirming that.  Anything less than an arena per thread
> implies contention and that will always be the highest cost with so many
> threads.  I am surprised that rawhide behaves well out of the box; looks
> like we did something awesome and did not even know it!  Or maybe the timing
> characteristics on the rawhide kernel just happen to be such that contention
> is somehow minimized.  I don't remember any improvements directly in the
> allocator that may have done this.
> 
> I think you have your resolution for the problem though - instruct the user
> to set MALLOC_ARENA_MAX to a high enough value that allows each thread to
> have its own arena.  So if the application has a maximum of 2000 threads,
> set MALLOC_ARENA_MAX to 2000.  It will then allocate arenas as and when
> needed, up to the maximum of 2000.
> 
> It will bloat VSZ, but that shouldn't matter to them since the rhel-5
> allocator should have resulted in a similar size for VSZ.

These are actually undocumented interfaces.

So for a start I've submitted a patch to the linux man-pages project to document them in mallopt as the others are documented.

http://marc.info/?l=linux-man&m=144043230615365&w=2

I'll fix the glibc manual next.
Comment 21 Siddhesh Poyarekar 2015-08-25 00:33:45 EDT
(In reply to Carlos O'Donell from comment #20)
> These are actually undocumented interfaces.

Not in RHEL:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html-single/6.0_Release_Notes/

"""
    13.4. GNU C Library (glibc)

    An enhanced dynamic memory allocation (malloc) behaviour enabling higher scalability across many sockets and cores. This is achieved by assigning threads their own memory pools and by avoiding locking in some situations. The amount of additional memory used for the memory pools (if any) can be controlled using the environment variables MALLOC_ARENA_TEST and MALLOC_ARENA_MAX. MALLOC_ARENA_TEST specifies that a test for the number of cores is performed once the number of memory pools reaches this value. MALLOC_ARENA_MAX sets the maximum number of memory pools used, regardless of the number of cores. 
"""
Comment 22 Carlos O'Donell 2015-08-25 00:49:37 EDT
(In reply to Siddhesh Poyarekar from comment #21)
> (In reply to Carlos O'Donell from comment #20)
> > These are actually undocumented interfaces.
> 
> Not in RHEL:
> 
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/
> html-single/6.0_Release_Notes/

A release note for RHEL 6.0 doesn't count as documentation :-)

The correct reference is here:

https://documentation-devel.engineering.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Developer_Guide/appendix.html

That page lists M_ARENA_MAX and M_ARENA_TEST but doesn't explain them (see bug https://bugzilla.redhat.com/show_bug.cgi?id=803565); it references the glibc manual online instead.

Therefore fixing the glibc manual will fix this also.
Comment 23 Carlos O'Donell 2015-08-25 12:11:53 EDT
In summary, the solution is to increase the maximum number of arenas to match the number of threads, which reduces contention. The side effect is an increase in memory usage, but also an increase in performance.

Given that this issue is resolved I'm going to mark this CLOSED/NOTABUG. If you want us to do any enhancements please open another bug related to this. The arena limit was designed to prevent applications from consuming more memory by adding more arenas.
Comment 24 Paulo Andrade 2015-09-29 15:14 EDT
Created attachment 1078466 [details]
glibc-arena.patch

This patch was built to experiment with another issue,
corruption of free_list when multiple threads exit at
the same time, where it makes free_list a cyclic list.
But in the end it looks like the same problem
(free_list becomes corrupted and cyclic, and only one
arena is used by most threads).

It does have the interesting behaviour of running quite
fast without forcing an arena per thread, though:

[root@ceph5 ~]# ./test 1000 10000 1000 10
ThreadPairs=1000; NumElements=10000; AllocationSize=1000; NumTests=10
[10] Spent time:   3.943808 seconds
[ 9] Spent time:   0.271434 seconds
[ 8] Spent time:   0.270598 seconds
[ 7] Spent time:   0.270830 seconds
[ 6] Spent time:   0.271183 seconds
[ 5] Spent time:   0.262546 seconds
[ 4] Spent time:   0.255984 seconds
[ 3] Spent time:   0.254850 seconds
[ 2] Spent time:   0.250589 seconds
[ 1] Spent time:   0.254036 seconds
   Average time:   0.630602 seconds
       Min time:   0.250589 seconds
       Max time:   3.943808 seconds

or if using one arena per thread:

[root@ceph5 ~]# MALLOC_ARENA_MAX=10000 ./test 1000 10000 1000 10
ThreadPairs=1000; NumElements=10000; AllocationSize=1000; NumTests=10
[10] Spent time:  38.120722 seconds
[ 9] Spent time:   0.306507 seconds
[ 8] Spent time:   0.267070 seconds
[ 7] Spent time:   0.294152 seconds
[ 6] Spent time:   0.272214 seconds
[ 5] Spent time:   0.293910 seconds
[ 4] Spent time:   0.269947 seconds
[ 3] Spent time:   0.297407 seconds
[ 2] Spent time:   0.265813 seconds
[ 1] Spent time:   0.294111 seconds
   Average time:   4.068200 seconds
       Min time:   0.265813 seconds
       Max time:  38.120722 seconds


Now reverting to unpatched rhel6 glibc:
[root@ceph5 ~]# rpm -Uvh rpms/*
Preparing...                ########################################### [100%]
   1:glibc-common           ########################################### [ 11%]
   2:glibc                  ########################################### [ 22%]
   3:glibc-headers          ########################################### [ 33%]
   4:glibc-devel            ########################################### [ 44%]
   5:glibc-debuginfo-common ########################################### [ 56%]
   6:glibc-debuginfo        ########################################### [ 67%]
   7:glibc-static           ########################################### [ 78%]
   8:glibc-utils            ########################################### [ 89%]
   9:nscd                   ########################################### [100%]
[root@ceph5 ~]# ./test 1000 10000 1000 10
ThreadPairs=1000; NumElements=10000; AllocationSize=1000; NumTests=10
[10] Spent time:  23.868231 seconds
[ 9] Spent time:  29.669367 seconds
[ 8] Spent time:  27.554297 seconds
[ 7] Spent time:  28.324487 seconds
[ 6] Spent time:  28.505810 seconds
[ 5] Spent time:  27.086120 seconds
[ 4] Spent time:  29.426261 seconds
[ 3] Spent time:  28.773775 seconds
[ 2] Spent time:  29.258996 seconds
[ 1] Spent time:  24.990351 seconds
   Average time:  27.745787 seconds
       Min time:  23.868231 seconds
       Max time:  29.669367 seconds
[root@ceph5 ~]# MALLOC_ARENA_MAX=10000 ./test 1000 10000 1000 10
ThreadPairs=1000; NumElements=10000; AllocationSize=1000; NumTests=10
[10] Spent time:  38.204146 seconds
[ 9] Spent time:   0.253042 seconds
[ 8] Spent time:   0.249577 seconds
[ 7] Spent time:   0.242928 seconds
[ 6] Spent time:   0.249131 seconds
[ 5] Spent time:   0.245501 seconds
[ 4] Spent time:   0.247253 seconds
[ 3] Spent time:   0.246819 seconds
[ 2] Spent time:   0.247284 seconds
[ 1] Spent time:   0.245560 seconds
   Average time:   4.043140 seconds
       Min time:   0.242928 seconds
       Max time:  38.204146 seconds
Comment 25 Paulo Andrade 2015-09-29 15:16 EDT
Created attachment 1078467 [details]
glibc-simpler-arena.patch

This one is simpler, and the cost of traversing the
list should be negligible. Avoiding a global pointer
is good because it makes it very unlikely that this
patch has undesired side effects.
Comment 26 Paulo Andrade 2015-10-02 14:15 EDT
Created attachment 1079485 [details]
0001-Avoid-corruption-of-free_list.patch

Updated the proposed patch.
