Bug 170249

Summary: malloc performance regression?
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.0
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Keywords: FutureFeature
Doc Type: Enhancement
Reporter: Dennis <dennis>
Assignee: Larry Woodman <lwoodman>
QA Contact: Brian Brock <bbrock>
CC: dshaks, jbaron, mingo, steve.russell
Last Closed: 2012-06-20 16:10:13 UTC
Attachments:
  malloc program text
  malloc_thr program text
  Multi-thread malloc C program

Description Dennis 2005-10-10 07:06:00 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041111 Firefox/1.0

Description of problem:
We produce a database server product. Periodically I run benchmark tests
to test the performance of our product. When I did these recently I
noticed an issue. That lead me to conduct a series of further tests. As
a baseline I compiled and ran our server product on Fedora Core 2. The
test machine was configured as

2 x Opteron 246
4 GB RAM
single 15,000 RPM SCSI disk

Only essential services were running (e.g. sendmail and many other
daemons were disabled).

I then ran a 32-user concurrent benchmark test (twice), the average time
taken was

20 minutes

I upgraded to Fedora Core 3. Again non-essential services including
SELinux were disabled. I re-ran the benchmark test using the Fedora Core
2 compiled up version of our product. The average time taken was

23 minutes

I upgraded to Fedora Core 4. The average time taken was

24 minutes

I installed Red Hat Enterprise Linux 4 Update 1. I compiled up our
product and re-ran the benchmark test. The average time taken was

22 minutes

Basically, the newer the Fedora/Red Hat version, the worse the
performance. To test whether malloc was the performance problem I
wrote two small malloc programs: malloc (a single-threaded malloc
test) and malloc_thr (a multi-threaded malloc test). Both program
source files are attached to this bug item. I compiled the programs
as follows

# g++ -O2 malloc.cc -o malloc
# g++ -O2 malloc_thr.cc -lpthread -o malloc_thr

I compiled these programs on Fedora Core 2 and then ran each program
multiple times on Fedora Core 2, Fedora Core 3, Fedora Core 4 and Red
Hat Enterprise Linux 4 Update 1 (on the same machine).

average malloc program times

FC2: 1 min 35 secs
FC3: 1 min 56 secs
FC4: 1 min 39 secs
RH4: 1 min 47 secs

average malloc_thr program times

FC2: 1 min 52 secs
FC3: 2 min 18 secs
FC4: 1 min 59 secs
RH4: 2 min 26 secs

It does appear that malloc performance has regressed in the versions
after Fedora Core 2. I notice in the Fedora Core 3 and Red Hat EL 4
release notes the following text

> The version of glibc provided with Fedora Core 3 performs additional
> internal sanity checks to prevent and detect data corruption as early
> as possible.

I re-ran some tests with 'export MALLOC_CHECK_=0' and saw no performance
difference. These tests, run on the same machine under different
Fedora Core/Red Hat releases, appear to highlight a malloc slowdown.

My question is whether malloc is the cause of my performance regression
on Linux distribution versions after Fedora Core 2. Is there anything I
can do in userland to obtain better performance on Red Hat EL 4 (e.g. a
compilation flag)? This performance issue leaves Fedora Core 4 roughly
17% slower than Fedora Core 2 when running our benchmarks.

I am more than willing to run some further tests down here if needed.
Or is it the case that I will just have to live with the slower
performance? It should also be noted that our product benchmark tests
showed Fedora Core 2 to be superior to Windows XP (with the default
allocator) and Solaris 10 (with ptmalloc) when run on the same hardware.
However, the more recent Fedora Core versions (3 & 4) and Red Hat EL 4
now trail Windows XP/Solaris 10 performance-wise. I would prefer Red Hat
EL 4 to have Fedora Core 2-like performance.

So this request is not really a proper bug item; it is more an
information-gathering exercise (initially) related to performance.

We look forward to any information provided back to us. Thanks for your
time.

Dennis.

Version-Release number of selected component (if applicable):
glibc-2.3.4-2.9

How reproducible:
Always

Steps to Reproduce:
1. Compile the test programs attached via

# g++ -O2 malloc.cc -o malloc
# g++ -O2 malloc_thr.cc -lpthread -o malloc_thr

2. Run the programs on Fedora Core 2, 3, 4 and Red Hat EL 4
  

Actual Results:  The results show that Fedora Core 2 is fast whilst
the later Fedora and Red Hat versions are slower.

Expected Results:  It would be nice to have fast malloc performance such
as that displayed by Fedora Core 2.

Additional info:

Comment 1 Dennis 2005-10-10 07:08:00 UTC
Created attachment 119758 [details]
malloc program text.

Comment 2 Dennis 2005-10-10 07:08:52 UTC
Created attachment 119759 [details]
malloc_thr program text.

Comment 3 Jakub Jelinek 2005-10-10 07:40:19 UTC
When you want to test malloc performance, you should be using malloc, not new,
as otherwise you are measuring new performance and not malloc performance.
E.g. FC4 has a completely different STL allocator (by default) from the other
distros.

Comment 4 Dennis 2005-10-10 08:11:09 UTC
Good point. 

However, I did use the same Fedora Core 2 libstdc++.so.6 in all Fedora
tests, but not on Red Hat EL. 

Nevertheless I will test out raw 'malloc' performance. Maybe 'new' could
be the issue. I will get back to you with some numbers soonish; doing
a number of OS installs takes time.

Just letting you know that our database server product is a moderately
large (1-million line) C++ product. Our product does not use the STL;
it has custom Vector, String, etc. implementations.

Again, I cannot state with confidence what is causing the performance
issue; I am just guessing that it is malloc.

Thanks,

Dennis.

Comment 5 Dennis 2005-10-11 07:35:17 UTC
I converted the original sample programs to C variants.

It looks like single threaded malloc performance is not an issue.

However I did see some issues with the multi-threaded test.
I have attached a C file named malloc_thr.c.

I compiled the program on Fedora Core 2 via,

# gcc -O2 malloc_thr.c -lpthread -o malloc_thr

I ran the program on a 2-cpu Opteron 10 times with the
following results

FC2 times
2min 05sec
2min 08sec
2min 07sec
2min 10sec
2min 04sec
2min 13sec
2min 11sec
2min 13sec
2min 12sec
2min 13sec

I upgraded to Fedora Core 3 and re-ran the same Fedora Core 2-compiled
program

FC3 times
2min 10sec
2min 36sec
2min 07sec
2min 28sec
3min 01sec
2min 50sec
2min 30sec
2min 31sec
2min 25sec
2min 31sec

Upgrade to Fedora Core 4

FC4 times
2min 07sec
2min 05sec
2min 06sec
2min 07sec
2min 07sec
2min 08sec
2min 04sec
2min 07sec
2min 06sec
2min 11sec

Install Red Hat EL4 Update 1 and run the same Fedora Core 2-compiled
program

RH EL4 times
2min 45sec
2min 28sec
2min 36sec
2min 35sec
2min 32sec
2min 38sec
2min 40sec
2min 36sec
2min 34sec
2min 35sec

As you can see, Fedora Core 3 and Red Hat EL 4 are producing times
that appear inferior to Fedora Core 2 and 4. My test creates 10
threads and is run on a 2-CPU machine, so the machine is being
stressed.

Could my issue actually be kernel scheduling differences between
the various revisions? Are there any differences of note? The fast
Fedora Core 4 times do not marry up with the slow benchmark results for
our product. Overall, it appears that I may need to do some more
analysis at my end to better determine the issue I am seeing.

As it stands, my test program highlights an issue with FC3/RHEL4; however,
FC4 does not have that same issue. If FC4 were also slow then I believe
there would be something to investigate. However, there is not.

Maybe this issue should be closed. 

Dennis.

Comment 6 Dennis 2005-10-11 07:36:16 UTC
Created attachment 119794 [details]
Multi-thread malloc C program
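
(The attachment itself is not reproduced in this report. For context only,
a minimal sketch of this kind of multi-threaded malloc/free stress test
might look like the following; this is a hypothetical reconstruction, not
the attached malloc_thr.c.)

/* Hypothetical sketch -- not the attached malloc_thr.c. */
#include <pthread.h>
#include <stdlib.h>

#define NTHREADS   10
#define ITERATIONS 5000000

static void *worker(void *arg)
{
    int i;
    for (i = 0; i < ITERATIONS; i++) {
        /* Vary the request size so several allocator size bins get hit. */
        void *p = malloc(64 + (i % 16) * 32);
        if (p == NULL)
            abort();
        free(p);
    }
    return NULL;
}

int main(void)
{
    pthread_t tids[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}

It would be built and timed the same way as the attachment, e.g.

# gcc -O2 malloc_thr_sketch.c -lpthread -o malloc_thr_sketch
# time ./malloc_thr_sketch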

Comment 7 Dennis 2005-10-21 05:30:09 UTC
I have had more of a play with this issue and have learnt quite a bit.
The benchmarks done on our database server indicate that the problem
is not glibc malloc, C++ new, or SELinux (etc.). The slowdown in
performance is kernel related.

I'll quickly refresh things. I carried out some benchmark tests against
our database server product. One test creates 24 client processes
(simultaneously) on a separate 2-CPU machine (running Solaris). The 24
client processes connect to our database server (which spawns a
thread per connection), which handles the client load (each client
issues a number of queries). The database server is running on
a 2-CPU Opteron 246 machine with 4GB of RAM and a 72GB SCSI disk.

On Fedora Core 2 handling that 24-user load takes 15 minutes.
On Red Hat EL4 Update 1 handling that 24-user load takes 17 minutes.
On Fedora Core 4 handling that 24-user load takes 18 minutes.

I then installed the Fedora Core 2 kernel onto the Fedora Core 4
installation and rebooted with the FC2 kernel. I re-ran the test,
the result came out as 15 minutes.

Note, all tests were run multiple times.

Something has happened to the kernel between FC2 and FC3. The newer
kernels do not handle extremely high user load as well as the older
kernel. For low user loadings (e.g. when the number of users and the
number of CPUs match) there is no issue. Only when the user/thread load
is far larger than the number of processors does an issue arise.

Has something changed with kernel process scheduling?

Is there something I can change environmentally in /proc?

It may be the case that this request should be closed since glibc is not
the culprit (apologies for the red herring). Also, I don't have a sample
program to supply. I may need to hack something together to highlight
this performance regression.

Anyway, the point of this latest update is to find out from you
guys whether any of the above rings a bell. If there is any
information you can provide, that would be great.

Yep, the newer kernel is the issue.

Dennis.

Comment 8 Dennis 2005-11-04 04:32:35 UTC
I have done quite a bit more testing. This time I compiled up a
whole series of vanilla kernels, from 2.6.5 all the way through
to 2.6.14.  Standard x86-64 kernel builds were performed; I did
not fiddle with any configuration settings.

Again I conducted our heavy query-load test. A 32-user load was
performed on the 2-CPU Opteron box. Note, a query load does not
do any disk writing; however, quite a bit of disk reading occurs
(to read index structures from disk into memory).

I conducted each 32-user load 4 times prior to testing the next
kernel. I averaged the times for the test. The following results
were seen

kernel   mins:secs
2.6.5    20:46
2.6.6    20:47
2.6.7    21:50
2.6.8    22:13
2.6.9    22:50
2.6.10   21:28
2.6.11   23:29
2.6.12   24:24
2.6.13   28:48
2.6.14   29:17

The timings observed with the vanilla kernels mirror the times
observed with the equivalent Fedora Core 2/4 and Red Hat 4
kernels. That is, the Red Hat EL 4 2.6.9 kernel produced timings
that were extremely similar to the vanilla 2.6.9 kernel. Hence, I
believe the issue lies in the standard kernel.

The numbers above fall into distinct groups.

2.6.5 and 2.6.6 are basically the same.

Something happened with 2.6.7 to affect performance negatively
(maybe scheduler domains?). 2.6.8 mirrors 2.6.7.

The 2.6.9 kernel (which is also used with Red Hat EL 4) suffered
another noticeable drop off in performance.

The 2.6.10 kernel appears to result in 2.6.7-like performance.

The 2.6.11 kernel suffers a major drop in performance.

Likewise, the 2.6.13 kernel suffers from an even more obvious
performance issue.

Basically, I believe that 2.6.7, 2.6.9, 2.6.11 and 2.6.13 are the
releases of interest, each of which progressively results in worse
performance. Hmmm, every odd-numbered release, weird.
The difference in performance between 2.6.5 and 2.6.14 is
startling (nearly a 50% drop in performance).

This feels like a scheduler issue, noting that the concurrent
32-user load is being satisfied by a 2-CPU machine.

Note, our test tries to mimic a real world use case. It is not an
artificial test. I believe this test is highlighting a legitimate
issue with the 2.6 series of kernels.

I am more than willing to test out some patches if need be. This
drop-off in performance is concerning; it would be great to find
out what is going on here.

Dennis.

P.S. I believe this bug item should be renamed from 
'malloc performance regression?' to 'kernel regression'

Comment 9 Steve Russell 2006-01-11 18:42:44 UTC
What is your NUMA setting (on or off)? Regardless, you might want to try 
pinning each of the processes to a particular CPU to normalize for NUMA behavior.

Steve

Comment 10 Dennis 2006-01-12 01:04:07 UTC
Steve,

I did a standard Red Hat install.

How do I deduce the current system NUMA status?
How do I toggle between NUMA enabled and disabled?

Pinning our database server to a particular CPU is not
ideal. Our server is a single-process, multi-threaded
instance. Restricting the single process to a single
CPU will result in reduced performance for high-concurrency
user loadings of our server (e.g. it is
better for the process to use both CPUs rather than
a single CPU).

Again, the drop in performance between early 2.6
kernels and later kernels is extremely marked (and
repeatable). Note, the early 2.6 kernels produced
performance similar to Windows and superior to Solaris
(on the same hardware). Later 2.6 kernels now trail both
Windows and Solaris.

Maybe the rules related to how a thread is tied
to a CPU could also be a culprit? That could also
be an avenue of exploration.
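
(For reference, one userland way to experiment with explicit CPU binding,
assuming a glibc that provides the three-argument sched_setaffinity, is a
rough sketch like the following; the surrounding workload is hypothetical.)

/* Sketch: pin the calling process (pid 0) to CPU 0 before starting work. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(0, &mask);                 /* allow CPU 0 only */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* ... start the threads / workload from here ... */
    return 0;
}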

I look forward to testing the effect of NUMA once
I know what to tweak.

Dennis.

Comment 11 John Shakshober 2006-01-12 01:56:43 UTC
You can use numactl to control NUMA on AMD64, large-CPU IPF or Power systems.
We also recommend RHEL4 U2 as it fixes some known NUMA scheduler problems, 
especially for AMD64.  I recommend the following steps to tune:

1) Please retry with RHEL4 Update 2 with the default settings;
2) Disable NUMA using the numactl --interleave=all command.
NOTE this may actually decrease performance and memory latency may increase, but 
it should be more consistent if you're consuming all the shared memory in the 
system.
3) Use numastat to monitor your local vs remote data access patterns.
4) You can guide either CPU or memory binding with numactl (see the man page
and the example after the output below).
5) Alternatively you can disable NUMA at the grub.conf level by adding
   "numa=off" on the boot line if (2) above is indicative of your real user 
load.

Shak

On a 2-cpu AMD64 system here;
[root@perf3 ~]# numactl
usage: numactl [--interleave=nodes] [--preferred=node] [--cpubind=nodes]
               [--membind=nodes] [--localalloc] command args ...
       numactl [--show]
       numactl [--hardware]
       numactl [--length length] [--offset offset] [--mode shmmode] [--strict]
               --shm shmkeyfile | --file tmpfsfile | --shmid id
               [--huge] [--touch]
               memory policy

memory policy is --interleave, --preferred, --membind, --localalloc
nodes is a comma delimited list of node numbers or none/all.
length can have g (GB), m (MB) or k (KB) suffixes
[root@perf3 ~]# numactl -show
policy: default
preferred node: 0
interleavemask: 
interleavenode: 0
nodebind: 0 1 
membind: 0 1 
[root@perf3 ~]# numastat
                         node1         node0
numa_hit             276324468     244707583
numa_miss             84952169      82108892
numa_foreign          82108892      84952169
interleave_hit          316735        310880
local_node           276183958     244602086
other_node            85092679      82214389
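
(As a concrete example of steps 2 and 4 above, using a hypothetical
server binary name:)

# numactl --interleave=all ./dbserver        (step 2: interleave memory across nodes)
# numactl --cpubind=0 --membind=0 ./dbserver (step 4: bind CPU and memory to node 0)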

Comment 12 Dennis 2006-01-12 02:45:05 UTC
Thanks for the quick response; that's just the information I needed.
The grub configuration sounds like the easiest thing to toggle.

It may take a few days to get some results again (I need to
set things up from scratch). I'll let you guys know how 
things have gone after I gather some results.

We appreciate the information.

Dennis.

Comment 13 Dennis 2006-01-17 06:52:43 UTC
I have carried out a series of tests with Red Hat EL4 Update 2, Fedora
Core 2 and Fedora Core 4. I used both Red Hat/Fedora and vanilla
kernels.

Unfortunately, toggling between NUMA enabled and disabled had no effect
on performance at all.

Case in point,

32-user load on Red Hat EL4 update 2

23 minutes 00 seconds (numa enabled)
23 minutes 02 seconds (numa disabled)

Note, I ran "dmesg | grep -i numa" just to make sure that NUMA was being
turned off when required (e.g. 'NUMA turned off' would be emitted when
"numa=off" was specified on the grub.conf kernel line).

Similar results were encountered with vanilla kernels, as well as using
a Fedora Core 4 base.

I also compiled up and ran a new 2.6.15 kernel; it averaged

28 minutes 25 seconds.

This compares to previous kernels

kernel   mins:secs
2.6.5    20:46
2.6.6    20:47
2.6.7    21:50
2.6.8    22:13
2.6.9    22:50
2.6.10   21:28
2.6.11   23:29
2.6.12   24:24
2.6.13   28:48
2.6.14   29:17

I guess we are back to where we were before. Something has occurred
at kernel revisions 2.6.7, 2.6.9, 2.6.11 and 2.6.13 that has made
our high-thread-concurrency loads perform ever worse, and NUMA is
not the cause.

NUMA was a good avenue to explore. Unfortunately, I do not believe it
to be the cause of this performance regression.

Dennis.

Comment 14 Steve Russell 2006-01-17 14:37:06 UTC
Interesting.  During the run, what's the split between user/sys/wait and where 
is the extra time getting allocated as the kernel revisions increase? I'm 
wondering if what you're seeing here is an interaction between the process and 
the I/O scheduler.  Depending on where the time is increasing, I'd be interested 
in looking at the effects of using sched_setscheduler() to change the user-space 
processes to use the round-robin scheduler (which will also substantially impact 
the behavior of the CFQ scheduler). 

(My logic for this, incidentally, is that you're seeing the time increase faster 
on your DB test load than on your straight malloc test.)

Tx

Steve

Comment 15 Steve Russell 2006-01-17 14:52:43 UTC
Sorry, I'm assuming you've checked this, but it's worth asking: Is there any 
swap being used during this test?  In our testing we've seen cases where swap 
is used even when real memory is theoretically available.  

Tx

Steve

Comment 16 Dennis 2006-01-17 23:31:06 UTC
> Interesting.  During the run, what's the split between
> user/sys/wait and where is the extra time getting allocated as the
> kernel revisions increase?

I take it running the 'time' command will be sufficient to supply
these values?

I'll do three runs, one with the standard Red Hat EL4 Update 2
2.6.9-22 kernel. I'll do another run with the vanilla 2.6.5 kernel
and lastly one with the 2.6.15 kernel.

> Sorry, I'm assuming you've checked this, but its worth asking: Is
> there any swap being used during this test?  In our testing we've
> seen cases where swap is used even when real memory is
> theoretically available.

The 'free' command will supply this information?

My 2xOpteron 246 system comes with 4GB of RAM. I have configured the
system with 8GB of SWAP.

I'll monitor swap usage.
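
(For what it is worth, a simple way to watch both during a run, assuming
vmstat is available: its us/sy/wa columns give the user/system/I/O-wait
split and its si/so columns show swap-in/swap-out activity.)

# vmstat 5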

> Depending on where the time is increasing, I'd be interested in
> looking at the effects of using setsched() to change the user
> space processes to use the round robin scheduler (which will also
> substantially impact the behavior of the CFQ scheduler).

Yes, I should also carry out my tests with the "AS" and "DEADLINE"
I/O schedulers. Let me first get some usr/sys/wait numbers with
"CFQ".

> My logic for this incidentally is that you're seeing the time
> increase faster on your db test load than your straight malloc
> test.

The straight malloc tests did show a small dip in performance
between Fedora Core 2 and Fedora Core 3. However, Fedora Core 4 shows
excellent malloc performance. Hence, my original assumption that
'malloc' performance had regressed was incorrect. Also, the malloc
tests, as you said, completely avoid the disk subsystem.

The database server does interact with the disk subsystem, but only in a
read sense. My tests are straight query tests, e.g. reading database
index structures to satisfy queries. No disk writing occurs.

Thanks for the suggestions. It really is appreciated.

Dennis.

Comment 17 Dennis 2006-01-18 07:13:54 UTC
I have carried out the series of tests I mentioned in my previous
mail, using our standard 32-user load.

Three kernels were tested; the real/user/sys times follow

2.6.5
=====

real    21m28.007s
user    34m25.911s
sys     6m31.746s

2.6.9
=====

real    22m51.229s
user    33m50.504s
sys     5m21.423s

2.6.15
======

real    27m23.989s
user    33m3.532s
sys     5m42.729s

2.6.15 Deadline scheduler
=========================

real    27m25.961s
user    33m0.760s
sys     5m36.549s

In all cases no swap was used at all.

Usage of round-robin scheduling was suggested as another avenue for
us to explore. I quickly hacked the following program

#include <sched.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <stdlib.h>

int
main(int argc, char** argv)
{
    struct sched_param sp;
    sp.sched_priority = 0;
    sched_setscheduler(<<PID_VALUE>>, SCHED_RR, &sp);
    return 0;
}

I obtained the following time

2.6.15
======
real    27m57.030s
user    33m1.628s
sys     5m56.386s

Was I correctly using the sched_setscheduler function?
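
(For reference, a minimal sketch of the usual calling convention: SCHED_RR
requires a static priority of at least sched_get_priority_min(SCHED_RR),
which is 1 on Linux, so a priority of 0 is rejected with EINVAL; pid 0
refers to the calling process; setting a real-time policy normally requires
root; and under NPTL the call affects only the single kernel thread
identified by pid, not every thread in the process.)

/* Sketch: put the calling process under SCHED_RR and check for errors. */
#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param sp;

    sp.sched_priority = sched_get_priority_min(SCHED_RR);  /* 1 on Linux */
    if (sched_setscheduler(0, SCHED_RR, &sp) != 0) {        /* 0 = this process */
        perror("sched_setscheduler");
        return 1;
    }
    return 0;
}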

Any guidance would be appreciated.

Dennis.

Comment 18 Dennis 2006-02-22 07:00:24 UTC
It has been a while since the last correspondence. I am just 
enquiring to see what the state of play is. We are still
keen to help down here.

Thanks,

Dennis.

Comment 19 Dennis 2011-03-08 06:29:20 UTC
This item of mine is very old now. 

Please close/resolve it, thanks.

Dennis.

Comment 20 Jiri Pallich 2012-06-20 16:10:13 UTC
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life. 
Please See https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.