Red Hat Bugzilla – Bug 170249
malloc performance regression?
Last modified: 2012-06-20 12:10:13 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041111 Firefox/1.0
Description of problem:
We produce a database server product. Periodically I run benchmark tests
to test the performance of our product. When I did these recently I
noticed an issue. That led me to conduct a series of further tests. As
a baseline I compiled and ran our server product on Fedora Core 2. The
test machine was configured as
2 x Opteron 246
single 15,000 RPM SCSI disk
Only essential services were running (e.g. sendmail and many other
daemons were disabled).
I then ran a 32-user concurrent benchmark test (twice), the average time
I upgraded to Fedora Core 3. Again non-essential services including
SELinux were disabled. I re-ran the benchmark test using the Fedora Core
2 compiled up version of our product. The average time taken was
I upgraded to Fedora Core 4. The average time taken was
I installed Red Hat Enterprise Linux 4 update 1. I compiled up our
product and re-ran the benchmark test. The average time taken was
Basically, the newer the Fedora/Red Hat version the worse the
performance. To test out whether malloc was the performance problem I
wrote 2 small malloc programs: malloc (a single thread malloc program),
malloc_thr (a multi-thread malloc program). Both program source files
are attached to this bug item. I compiled the programs as follows
# g++ -O2 malloc.cc -o malloc
# g++ -O2 malloc_thr.cc -lpthread -o malloc_thr
I compiled these programs on Fedora Core 2 and then ran each program
multiple times on Fedora Core 2, Fedora Core 3, Fedora Core 4 and Red
Hat Enterprise Linux 4 Update 1 (on the same machine).
average malloc program times
FC2: 1 min 35 secs
FC3: 1 min 56 secs
FC4: 1 min 39 secs
RH4: 1 min 47 secs
average malloc_thr program times
FC2: 1 min 52 secs
FC3: 2 min 18 secs
FC4: 1 min 59 secs
RH4: 2 min 26 secs
It does appear that malloc performance has regressed in the versions
after Fedora Core 2. I notice in the Fedora Core 3 and Red Hat EL 4
release notes the following text
> The version of glibc provided with Fedora Core 3 performs additional
> internal sanity checks to prevent and detect data corruption as early
> as possible.
I re-ran some tests with 'export MALLOC_CHECK_=0' and saw no performance
difference. These tests, when run on the same machine with different
Fedora Core/Red Hat releases, appear to highlight a malloc slowdown.
My question is whether malloc is the cause of my performance regression
on Linux distribution versions after Fedora Core 2. Is there anything I
can do in userland to obtain superior performance in Red Hat EL4 (e.g. a
compilation flag)? This performance issue leads to Fedora Core 4 being
17% slower than Fedora Core 2 when running our benchmarks.
I am more than willing to run some further tests down here if needed.
Or is it the case that I will just have to live with the slower
performance? It should also be noted that our product benchmark tests
showed Fedora Core 2 to be superior to Windows XP (with default allocator)
and Solaris 10 (with ptmalloc) when run on the same hardware. However,
the more recent Fedora Core versions (3 & 4) and Red Hat EL 4 now trail
Windows XP/Solaris 10 performance-wise. I would prefer Red Hat EL 4 to
have Fedora Core 2 like performance.
So this request is not really a proper bug item, it is more an
information gathering exercise (initially) related to performance.
We look forward to any information provided back to us. Thanks for your time.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. compile the test programs attached via
# g++ -O2 malloc.cc -o malloc
# g++ -O2 malloc_thr.cc -lpthread -o malloc_thr
2. Run the programs on Fedora Core 2, 3, 4 and Red Hat EL 4
Actual Results: The results appear to show that Fedora Core 2 is fast whilst
the later Fedora and Red Hat versions are slow.
Expected Results: It would be nice to have fast malloc performance such as that
displayed by Fedora Core 2.
Created attachment 119758 [details]
malloc program text.
Created attachment 119759 [details]
malloc_thr program text.
When you want to test malloc performance, you should be using malloc, not new,
as otherwise you are using new performance and not malloc performance.
E.g. FC4 has a completely different STL allocator (by default) from the other releases.
However, I did use the same Fedora Core 2 libstdc++.so.6 in all Fedora
tests, but not on Red Hat EL.
Nevertheless I will test out raw 'malloc' performance. Maybe 'new' could
be the issue. I will get back to you with some numbers soonish, doing
a number of OS installs takes time.
Just letting you know that our database server product is a moderately
large (1-million line) C++ product. Our product does not use STL, our
product has custom Vector, String etc implementations.
Again, I cannot state with confidence what is causing the performance
issue, I am just guessing a little that it is malloc.
I converted the original sample programs to C variants.
It looks like single threaded malloc performance is not an issue.
However I did see some issues with the multi-threaded test.
I have attached a C file named malloc_thr.c.
I compiled the program on Fedora Core 2 via,
# gcc -O2 malloc_thr.c -lpthread -o malloc_thr
I ran the program on a 2-cpu Opteron 10 times with the
I upgraded to Fedora Core 3 and re-ran the same Fedora Core 2
Upgrade to Fedora Core 4
Install Red Hat EL4 update 1 and run the same Fedora Core 2
RH EL4 times
As you can see it appears that Fedora Core 3 and Red Hat EL 4 are
providing times that appear to be inferior to Fedora Core 2 and 4.
My test creates 10 threads and is run on a 2-cpu machine, so the
machine being run on is stressed.
Could my issue actually be kernel scheduling differences between
the various revisions? Are there any differences of note? The fast
Fedora Core 4 times do not marry up with the slow benchmarking with our
product. Overall it appears that I may need to do some more analysis
at my end to better determine the issue I am seeing.
As it stands, my test program highlights an issue with FC3/RH4, however
FC4 does not have that same issue. If FC4 was also slow then I believe
there would be something to investigate. However, there is not.
Maybe this issue should be closed.
Created attachment 119794 [details]
Multi-thread malloc C program
I have had more of a play with this issue and have learnt quite a bit.
The issue highlighted by the benchmarks done on our database server
indicate that the problem is not glibc malloc, or C++ new, or selinux
(etc etc). The slowdown in performance is kernel related.
I'll quickly refresh things. I carried out some benchmark tests against
our database server product. One test creates 24 client processes
(simultaneously) on a separate 2-cpu machine (running Solaris). The 24
client processes connect up to our database server (which spawns a
thread per connection) which handles the client load (each client will
be issuing a number of queries). The database server is running on
a 2-cpu Opteron 246 machine with 4GB of RAM and a 72GB SCSI disk.
On Fedora Core 2 handling that 24-user load takes 15 minutes.
On Red Hat EL4 Update 1 handling that 24-user load takes 17 minutes.
On Fedora Core 4 handling that 24-user load takes 18 minutes.
I then installed the Fedora Core 2 kernel onto the Fedora Core 4
installation and rebooted with the FC2 kernel. I re-ran the test,
the result came out as 15 minutes.
Note, all tests were run multiple times.
Something has happened to the kernel between FC2 and FC3. The newer
kernels do not handle extremely high user load as well as the older
kernel. For low user loadings (e.g. the number of users and number of cpus
match) there is no issue. Only when the user/thread load is far larger
than the number of processors does an issue arise.
Has something changed with kernel process scheduling?
Is there something I can change environmentally in /proc?
It may be the case that this request should be closed since glibc is not
the culprit (apologies for the red herring). Also, I don't have a sample
program to supply. I may need to hack something together to highlight
this performance regression.
Anyway I guess the point of this latest text is to find out from you
guys whether any of the above text rings a bell. If there is any
information you can provide that would be great.
Yep, the newer kernel is the issue.
I have done quite a bit more testing. This time I compiled up a
whole series of vanilla kernels, from 2.6.5 all the way through
to 2.6.14. Standard x86-64 kernel builds were performed; I did
not fiddle with any configuration settings.
Again I conducted our heavy query load test. A 32-user load was
performed on the 2 CPU Opteron box. Note, a query load does not
do any disk writing, however quite a bit of disk reading occurs
(to read index structures from disk to memory).
I conducted each 32-user load 4 times prior to testing the next
kernel. I averaged the times for each test. The following results were observed:
The timings observed with the standard kernels mirror the times
observed with the equivalent Fedora Core 2/4 and Red Hat 4
kernels. That is the Red Hat EL 4 2.6.9 kernel produced timings
that were extremely similar to the vanilla 2.6.9 kernel. Hence, I
believe the issue to be with the standard kernel.
The numbers above fall into distinct groups.
2.6.5 and 2.6.6 are basically the same.
Something happened with 2.6.7 to affect performance negatively
(maybe scheduler domains?). 2.6.8 mirrors 2.6.7.
The 2.6.9 kernel (which is also used with Red Hat EL 4) suffered
another noticeable drop off in performance.
The 2.6.10 kernel appears to result in 2.6.7-like performance.
The 2.6.11 kernel suffers a major drop in performance.
Likewise the 2.6.13 kernel suffers from an even more obvious performance drop.
Basically I believe that each of: 2.6.7, 2.6.9, 2.6.11 and 2.6.13
are the releases of interest. Each of which progressively results
in worse performance. Hmmm, every odd numbered release, weird.
The difference in performance between 2.6.5 and 2.6.14 is
startling (a nearly 50% drop in performance).
This feels like a scheduler issue, noting that the
concurrent 32-user load is trying to be satisfied by a 2 CPU machine.
Note, our test tries to mimic a real world use case. It is not an
artificial test. I believe this test is highlighting a legitimate
issue with the 2.6 series of kernels.
I am more than willing to test out some patches if need be. This
drop off in performance is concerning, it would be great to find
out what is going on here.
P.S. I believe this bug item should be renamed from
'malloc performance regression?' to 'kernel regression'
What is your NUMA setting (on or off)? Regardless, you might want to try to
pin each of the processes to a particular CPU to normalize for NUMA behavior.
I did a standard Red Hat install.
How do I deduce the current system NUMA status?
How do I toggle between NUMA enabled and disabled?
Pinning our database server to a particular CPU is not
ideal. Our server is a single-process-multi-threaded
instance. Restricting the single process to a single
CPU will result in reduced performance for high
concurrency user loadings of our server (e.g it is
better for the process to use both CPUs rather than
a single CPU).
Again, the drop in performance between early 2.6
kernels and later kernels is extremely marked (and
repeatable). Note, the early 2.6 kernels produced
performance similar to Windows and superior to Solaris
(on the same hardware). Later 2.6 now trails both
Windows and Solaris.
Maybe the rules related to how a thread is tied
to a CPU could also be a culprit? That could also
be an avenue of exploration.
I look forward to testing the effect of NUMA once
I know what to tweak.
You can use numactl to control numa on AMD64, large cpu IPF or Power systems.
We also recommend RHEL4 U2 as it fixes some known numa scheduler problems,
especially for AMD64. I recommend the following steps to tune:
1) Please retry with RHEL4 Update 2 with the default settings;
2) Disable numa using the numactl --interleave=all command.
NOTE this may actually decrease performance, memory latency may increase, but
it should be more consistent if you're consuming all the shared memory in the
system.
3) Use numastat to monitor your local vs remote data access patterns.
4) You can guide either cpu or memory binding with numactl (see man page).
5) Alternatively you can disable numa at the grub.conf level by adding
"numa=off" on the boot line if (2) above is indicative of your real user
workload.
On a 2-cpu AMD64 system here;
[root@perf3 ~]# numactl
usage: numactl [--interleave=nodes] [--preferred=node] [--cpubind=nodes]
[--membind=nodes] [--localalloc] command args ...
numactl [--length length] [--offset offset] [--mode shmmode] [--strict]
--shm shmkeyfile | --file tmpfsfile | --shmid id
memory policy is --interleave, --preferred, --membind, --localalloc
nodes is a comma delimited list of node numbers or none/all.
length can have g (GB), m (MB) or k (KB) suffixes
[root@perf3 ~]# numactl -show
preferred node: 0
nodebind: 0 1
membind: 0 1
[root@perf3 ~]# numastat
                     node0          node1
numa_hit         276324468      244707583
numa_miss         84952169       82108892
numa_foreign      82108892       84952169
interleave_hit      316735         310880
local_node       276183958      244602086
other_node        85092679       82214389
Thanks for the quick response, that's just the information I needed,
the grub configuration sounds like the easiest thing to toggle.
It may take a few days to get some results again (I need to
set things up from scratch). I'll let you guys know how
things have gone after I gather some results.
We appreciate the information.
I have carried out a series of tests with Red Hat EL4 Update 2, Fedora
Core 2 and Fedora Core 4. I used both Red Hat/Fedora and vanilla kernels.
Unfortunately toggling between numa enabled and disabled had no effect
on performance at all.
Case in point,
32-user load on Red Hat EL4 update 2
23 minutes 00 seconds (numa enabled)
23 minutes 02 seconds (numa disabled)
Note, I ran "dmesg | grep -i numa" just to make sure that numa was being
turned off when required (e.g. 'NUMA turned off' would be emitted when
"numa=off" was specified in the grub.conf kernel line).
Similar results were encountered with vanilla kernels, as well as using
a Fedora Core 4 base.
I also compiled up and ran a new 2.6.15 kernel, it averaged
28 minutes 25 seconds.
This compares to previous kernels
I guess we are back to where we were before. Something has occurred
at kernel revisions 2.6.7, 2.6.9, 2.6.11 and 2.6.13 that has made
our high-thread-concurrency loads perform ever worse with NUMA not
being the cause.
Numa was a good avenue to explore. Unfortunately I do not believe it
to be the cause of this performance regression.
Interesting. During the run, what's the split between user/sys/wait and where
is the extra time getting allocated as the kernel revisions increase? I'm
wondering if what you're seeing here is interaction between the process and IO
scheduler. Depending on where the time is increasing, I'd be interested in
looking at the effects of using sched_setscheduler() to change the user space processes
to use the round robin scheduler (which will also substantially impact the
behavior of the CFQ scheduler).
(My logic for this incidentally is that you're seeing the time increase faster
on your db test load than your straight malloc test).
Sorry, I'm assuming you've checked this, but it's worth asking: Is there any
swap being used during this test? In our testing we've seen cases where swap
is used even when real memory is theoretically available.
> Interesting. During the run, what's the split between
> user/sys/wait and where is the extra time getting allocated as the
> kernel revisions increase?
I take it running the 'time' command will be sufficient to supply this information?
I'll do three runs, one with the standard Red Hat EL4 Update 2
2.6.9-22 kernel. I'll do another run with the vanilla 2.6.5 kernel
and lastly one with the 2.6.15 kernel.
> Sorry, I'm assuming you've checked this, but its worth asking: Is
> there any swap being used during this test? In our testing we've
> seen cases where swap is used even when real memory is
> theoretically available.
The 'free' command will supply this information?
My 2xOpteron 246 system comes with 4GB of RAM. I have configured the
system with 8GB of swap.
I'll monitor swap usage.
> Depending on where the time is increasing, I'd be interested in
> looking at the effects of using sched_setscheduler() to change the user
> space processes to use the round robin scheduler (which will also
> substantially impact the behavior of the CFQ scheduler).
Yes, I should also carry out my tests with the "AS" and "DEADLINE"
I/O schedulers. Let me first get some usr/sys/wait numbers.
> My logic for this incidentally is that you're seeing the time
> increase faster on your db test load than your straight malloc
The straight malloc tests did show a small dip in performance
between Fedora Core 2 and Fedora Core 3. However Fedora Core 4 shows
excellent malloc performance. Hence, my original assumption that
'malloc' performance had regressed was incorrect. Also, the malloc
tests as you said completely avoided the disk subsystem.
The database server does interact with the disk system, but only in a
read sense. My tests are straight query tests, i.e. they read database
index structures to satisfy queries. No disk writing occurs.
Thanks for the suggestions. It really is appreciated.
I have carried out the series of tests I mentioned in my previous
mail, using our standard 32-user load.
Three kernels tested, the real/user/sys times follow
2.6.15 Deadline scheduler
In all cases no swap was used at all.
Usage of round-robin scheduling was suggested as another avenue for
us to explore. I quickly hacked the following program:

#include <sched.h>

int main(int argc, char** argv)
{
    struct sched_param sp;
    sp.sched_priority = 0;
    sched_setscheduler(<<PID_VALUE>>, SCHED_RR, &sp);
    return 0;
}
I obtained the following time
Was I correctly using the sched_setscheduler function?
Any guidance would be appreciated.
It has been a while since the last correspondence. I am just
enquiring to see what the state of play was. We are still
keen down here to help.
This item of mine is very old now.
Please close/resolve it thanks.
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life.
Please See https://access.redhat.com/support/policy/updates/errata/
If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.