Bug 170249
Summary: malloc performance regression?

Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.0
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Keywords: FutureFeature
Doc Type: Enhancement

Reporter: Dennis <dennis>
Assignee: Larry Woodman <lwoodman>
QA Contact: Brian Brock <bbrock>
CC: dshaks, jbaron, mingo, steve.russell

Last Closed: 2012-06-20 16:10:13 UTC
Created attachment 119758 [details]
malloc program text.
Created attachment 119759 [details]
malloc_thr program text.
When you want to test malloc performance, you should be using malloc, not new, as otherwise you are measuring new performance rather than malloc performance. E.g. FC4 has a completely different STL allocator (by default) from the other distros.

Good point. However, I did use the same Fedora Core 2 libstdc++.so.6 in all Fedora tests, though not on Red Hat EL. Nevertheless I will test out raw 'malloc' performance; maybe 'new' could be the issue. I will get back to you with some numbers soonish, doing a number of OS installs takes time. Just letting you know that our database server product is a moderately large (1-million line) C++ product. Our product does not use STL; it has custom Vector, String etc. implementations. Again, I cannot state with confidence what is causing the performance issue, I am just guessing a little that it is malloc. Thanks, Dennis.

I converted the original sample programs to C variants. It looks like single-threaded malloc performance is not an issue. However, I did see some issues with the multi-threaded test. I have attached a C file named malloc_thr.c. I compiled the program on Fedora Core 2 via

```
# gcc -O2 malloc_thr.c -lpthread -o malloc_thr
```

and ran it on a 2-CPU Opteron 10 times with the following results:

- FC2 times: 2min 05sec, 2min 08sec, 2min 07sec, 2min 10sec, 2min 04sec, 2min 13sec, 2min 11sec, 2min 13sec, 2min 12sec, 2min 13sec

I upgraded to Fedora Core 3 and re-ran the same Fedora Core 2 compiled program:

- FC3 times: 2min 10sec, 2min 36sec, 2min 07sec, 2min 28sec, 3min 01sec, 2min 50sec, 2min 30sec, 2min 31sec, 2min 25sec, 2min 31sec

Upgraded to Fedora Core 4:

- FC4 times: 2min 07sec, 2min 05sec, 2min 06sec, 2min 07sec, 2min 07sec, 2min 08sec, 2min 04sec, 2min 07sec, 2min 06sec, 2min 11sec

Installed Red Hat EL4 Update 1 and ran the same Fedora Core 2 compiled program:

- RH EL4 times: 2min 45sec, 2min 28sec, 2min 36sec, 2min 35sec, 2min 32sec, 2min 38sec, 2min 40sec, 2min 36sec, 2min 34sec, 2min 35sec

As you can see, Fedora Core 3 and Red Hat EL 4 produce times that are inferior to Fedora Core 2 and 4. My test creates 10 threads and is run on a 2-CPU machine, so the machine is being stressed. Could my issue actually be kernel scheduling differences between the various revisions? Are there any differences of note? The fast Fedora Core 4 times do not marry up with the slow benchmarking of our product. Overall it appears that I may need to do some more analysis at my end to better determine the issue I am seeing. As it stands, my test program highlights an issue with FC3/RH4, however FC4 does not have that same issue. If FC4 were also slow then I believe there would be something to investigate. However, there is not. Maybe this issue should be closed. Dennis.

Created attachment 119794 [details]
Multi-thread malloc C program
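The attached C file itself is not reproduced in this report. As a rough idea of the kind of test described above (10 threads hammering the allocator on a 2-CPU box), a minimal sketch might look like the following; the allocation sizes, iteration count, and per-thread slot count are illustrative assumptions, not values taken from attachment 119794.

```c
/* Sketch of a multi-threaded malloc stress test in the spirit of the
 * malloc_thr.c attachment described above.  NTHREADS matches the "10
 * threads" mentioned in the comment; ITERATIONS, NSLOTS and the block
 * sizes are illustrative assumptions only. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS   10
#define ITERATIONS 5000000
#define NSLOTS     64

static void *worker(void *arg)
{
    void *slots[NSLOTS] = { 0 };
    long i;
    int  j;

    for (i = 0; i < ITERATIONS; i++) {
        j = (int)(i % NSLOTS);
        free(slots[j]);                       /* free(NULL) is a no-op */
        slots[j] = malloc(16 + (i % 4096));   /* vary the request size */
    }
    for (j = 0; j < NSLOTS; j++)
        free(slots[j]);
    return NULL;
}

int main(void)
{
    pthread_t tids[NTHREADS];
    int t;

    for (t = 0; t < NTHREADS; t++) {
        if (pthread_create(&tids[t], NULL, worker, NULL) != 0) {
            perror("pthread_create");
            return 1;
        }
    }
    for (t = 0; t < NTHREADS; t++)
        pthread_join(tids[t], NULL);
    return 0;
}
```

Compiled and timed the same way as the original (`gcc -O2 malloc_thr_sketch.c -lpthread -o malloc_thr_sketch`, then `time ./malloc_thr_sketch`, with the file name being a placeholder), a loop like this mainly exercises the allocator's arena locking and the scheduler's handling of more runnable threads than CPUs, which is the situation discussed in the rest of the thread.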
I have had more of a play with this issue and have learnt quite a bit. The benchmarks done on our database server indicate that the problem is not glibc malloc, or C++ new, or SELinux (etc.). The slowdown in performance is kernel related.

I'll quickly refresh things. I carried out some benchmark tests against our database server product. One test creates 24 client processes (simultaneously) on a separate 2-CPU machine (running Solaris). The 24 client processes connect to our database server (which spawns a thread per connection), and the server handles the client load (each client issues a number of queries). The database server is running on a 2-CPU Opteron 246 machine with 4GB of RAM and a 72GB SCSI disk.

- On Fedora Core 2 handling that 24-user load takes 15 minutes.
- On Red Hat EL4 Update 1 handling that 24-user load takes 17 minutes.
- On Fedora Core 4 handling that 24-user load takes 18 minutes.

I then installed the Fedora Core 2 kernel onto the Fedora Core 4 installation and rebooted with the FC2 kernel. I re-ran the test; the result came out as 15 minutes. Note, all tests were run multiple times.

Something has happened to the kernel between FC2 and FC3. The newer kernels do not handle extremely high user load as well as the older kernel. For low user loadings (e.g. where the number of users and the number of CPUs match) there is no issue. Only when the user/thread load is far larger than the number of processors does an issue arise. Has something changed with kernel process scheduling? Is there something I can change environmentally in /proc? It may be the case that this request should be closed since glibc is not the culprit (apologies for the red herring). Also, I don't have a sample program to supply; I may need to hack something together to highlight this performance regression. Anyway, I guess the point of this latest text is to find out from you guys whether any of the above rings a bell. If there is any information you can provide that would be great. Yep, the newer kernel is the issue. Dennis.

I have done quite a bit more testing. This time I compiled up a whole series of vanilla kernels, from 2.6.5 all the way through to 2.6.14. Standard x86-64 kernel builds were performed; I did not fiddle with any configuration settings. Again I conducted our heavy query load test. A 32-user load was performed on the 2-CPU Opteron box. Note, a query load does not do any disk writing, however quite a bit of disk reading occurs (to read index structures from disk into memory). I conducted each 32-user load 4 times prior to testing the next kernel and averaged the times. The following results were seen:

| kernel | mins:secs |
|---|---|
| 2.6.5 | 20:46 |
| 2.6.6 | 20:47 |
| 2.6.7 | 21:50 |
| 2.6.8 | 22:13 |
| 2.6.9 | 22:50 |
| 2.6.10 | 21:28 |
| 2.6.11 | 23:29 |
| 2.6.12 | 24:24 |
| 2.6.13 | 28:48 |
| 2.6.14 | 29:17 |

The timings observed with the standard kernels mirror the times observed with the equivalent Fedora Core 2/4 and Red Hat 4 kernels; that is, the Red Hat EL 4 2.6.9 kernel produced timings that were extremely similar to the vanilla 2.6.9 kernel. Hence, I believe the issue to be with the standard kernel. The numbers above fall into distinct groups. 2.6.5 and 2.6.6 are basically the same. Something happened with 2.6.7 to affect performance negatively (maybe scheduler domains?). 2.6.8 mirrors 2.6.7. The 2.6.9 kernel (which is also used in Red Hat EL 4) suffered another noticeable drop-off in performance. 2.6.10 appears to give 2.6.7-like performance. 2.6.11 suffers a major drop in performance, and likewise 2.6.13 suffers an even more obvious performance issue. Basically I believe that 2.6.7, 2.6.9, 2.6.11 and 2.6.13 are the releases of interest, each of which progressively results in worse performance. Hmmm, every odd-numbered release, weird. The difference in performance between 2.6.5 and 2.6.14 is startling (nearly a 50% drop). This feels like a scheduler issue, noting that the concurrent 32-user load is trying to be satisfied by a 2-CPU machine. Note, our test tries to mimic a real-world use case; it is not an artificial test. I believe this test is highlighting a legitimate issue with the 2.6 series of kernels. I am more than willing to test out some patches if need be. This drop-off in performance is concerning; it would be great to find out what is going on here. Dennis.

P.S. I believe this bug item should be renamed from 'malloc performance regression?' to 'kernel regression'.

What is your NUMA setting (on or off)? Regardless, you might want to try to pin each of the processes to a particular CPU to normalize for NUMA behavior. Steve

Steve, I did a standard Red Hat install. How do I deduce the current system NUMA status? How do I toggle between NUMA enabled and disabled? Pinning our database server to a particular CPU is not ideal. Our server is a single-process, multi-threaded instance. Restricting the single process to a single CPU will result in reduced performance for high-concurrency user loadings of our server (i.e. it is better for the process to use both CPUs rather than a single CPU). Again, the drop in performance between early 2.6 kernels and later kernels is extremely marked (and repeatable). Note, the early 2.6 kernels produced performance similar to Windows and superior to Solaris (on the same hardware); later 2.6 now trails both Windows and Solaris. Maybe the rules related to how a thread is tied to a CPU could also be a culprit? That could also be an avenue of exploration. I look forward to testing the effect of NUMA once I know what to tweak. Dennis.

You can use numactl to control NUMA on AMD64, large-CPU IPF or Power systems. We also recommend RHEL4 U2 as it fixes some known NUMA scheduler problems, especially for AMD64. I recommend the following steps to tune:

1) Please retry with RHEL4 Update 2 with the default settings.
2) Disable NUMA using the numactl --interleave=all command. NOTE this may actually decrease performance, memory latency may increase, but it should be more consistent if you're consuming all the shared memory in the system.
3) Use numastat to monitor your local vs remote data access patterns.
4) You can guide either CPU or memory binding with numactl (see man page).
5) Alternatively you can disable NUMA at the grub.conf level by adding "numa=off" to the boot line, if (2) above is indicative of your real user load.

Shak

On a 2-cpu AMD64 system here:

```
[root@perf3 ~]# numactl
usage: numactl [--interleave=nodes] [--preferred=node] [--cpubind=nodes]
       [--membind=nodes] [--localalloc] command args ...
       numactl [--show]
       numactl [--hardware]
       numactl [--length length] [--offset offset] [--mode shmmode] [--strict]
       --shm shmkeyfile | --file tmpfsfile | --shmid id [--huge] [--touch]
       memory policy

memory policy is --interleave, --preferred, --membind, --localalloc
nodes is a comma delimited list of node numbers or none/all.
length can have g (GB), m (MB) or k (KB) suffixes

[root@perf3 ~]# numactl -show
policy: default
preferred node: 0
interleavemask:
interleavenode: 0
nodebind: 0 1
membind: 0 1

[root@perf3 ~]# numastat
                    node1       node0
numa_hit        276324468   244707583
numa_miss        84952169    82108892
numa_foreign     82108892    84952169
interleave_hit     316735      310880
local_node      276183958   244602086
other_node       85092679    82214389
```

Thanks for the quick response, that's just the information I needed; the grub configuration sounds like the easiest thing to toggle. It may take a few days to get some results again (I need to set things up from scratch). I'll let you guys know how things have gone after I gather some results. We appreciate the information. Dennis.

I have carried out a series of tests with Red Hat EL4 Update 2, Fedora Core 2 and Fedora Core 4. I used both Red Hat/Fedora and vanilla kernels. Unfortunately, toggling between NUMA enabled and disabled had no effect on performance at all. Case in point, a 32-user load on Red Hat EL4 Update 2:

- 23 minutes 00 seconds (NUMA enabled)
- 23 minutes 02 seconds (NUMA disabled)

Note, I ran "dmesg | grep -i numa" just to make sure that NUMA was being turned off when required (e.g. 'NUMA turned off' would be emitted when "numa=off" was specified on the grub.conf kernel line). Similar results were encountered with vanilla kernels, as well as using a Fedora Core 4 base. I also compiled up and ran a new 2.6.15 kernel; it averaged 28 minutes 25 seconds. This compares to the previous kernels:

| kernel | mins:secs |
|---|---|
| 2.6.5 | 20:46 |
| 2.6.6 | 20:47 |
| 2.6.7 | 21:50 |
| 2.6.8 | 22:13 |
| 2.6.9 | 22:50 |
| 2.6.10 | 21:28 |
| 2.6.11 | 23:29 |
| 2.6.12 | 24:24 |
| 2.6.13 | 28:48 |
| 2.6.14 | 29:17 |

I guess we are back to where we were before. Something has occurred at kernel revisions 2.6.7, 2.6.9, 2.6.11 and 2.6.13 that has made our high-thread-concurrency loads perform ever worse, and NUMA is not the cause. NUMA was a good avenue to explore; unfortunately I do not believe it to be the cause of this performance regression. Dennis.

Interesting. During the run, what's the split between user/sys/wait and where is the extra time getting allocated as the kernel revisions increase? I'm wondering if what you're seeing here is interaction between the process and IO scheduler. Depending on where the time is increasing, I'd be interested in looking at the effects of using setsched() to change the user-space processes to use the round-robin scheduler (which will also substantially impact the behavior of the CFQ scheduler). (My logic for this, incidentally, is that you're seeing the time increase faster on your db test load than your straight malloc test.) Tx Steve

Sorry, I'm assuming you've checked this, but it's worth asking: is there any swap being used during this test? In our testing we've seen cases where swap is used even when real memory is theoretically available. Tx Steve

> Interesting. During the run, what's the split between
> user/sys/wait and where is the extra time getting allocated as the
> kernel revisions increase?

I take it running the 'time' command will be sufficient to supply these values? I'll do three runs: one with the standard Red Hat EL4 Update 2 2.6.9-22 kernel, another with the vanilla 2.6.5 kernel, and lastly one with the 2.6.15 kernel.

> Sorry, I'm assuming you've checked this, but it's worth asking: is
> there any swap being used during this test? In our testing we've
> seen cases where swap is used even when real memory is
> theoretically available.

The 'free' command will supply this information? My 2x Opteron 246 system comes with 4GB of RAM; I have configured the system with 8GB of swap. I'll monitor swap usage.

> Depending on where the time is increasing, I'd be interested in
> looking at the effects of using setsched() to change the user-space
> processes to use the round-robin scheduler (which will also
> substantially impact the behavior of the CFQ scheduler).

Yes, I should also carry out my tests with the "AS" and "DEADLINE" I/O schedulers. Let me first get some usr/sys/wait numbers with "CFQ".

> My logic for this, incidentally, is that you're seeing the time
> increase faster on your db test load than your straight malloc
> test.

The straight malloc tests did show a small dip in performance between Fedora Core 2 and Fedora Core 3. However, Fedora Core 4 shows excellent malloc performance; hence my original assumption that 'malloc' performance had regressed was incorrect. Also, the malloc tests, as you said, completely avoid the disk subsystem. The database server does interact with the disk system, but only in a read sense. My tests are straight query tests, e.g. reading database index structures to satisfy queries; no disk writing occurs. Thanks for the suggestions. It really is appreciated. Dennis.

I have carried out the series of tests I mentioned in my previous mail, using our standard 32-user load. Three kernels were tested; the real/user/sys times follow:

```
2.6.5
=====
real    21m28.007s
user    34m25.911s
sys      6m31.746s

2.6.9
=====
real    22m51.229s
user    33m50.504s
sys      5m21.423s

2.6.15
======
real    27m23.989s
user    33m3.532s
sys      5m42.729s

2.6.15 Deadline scheduler
=========================
real    27m25.961s
user    33m0.760s
sys      5m36.549s
```

In all cases no swap was used at all. Usage of round-robin scheduling was suggested as another avenue for us to explore. I quickly hacked the following program:

```c
#include <sched.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    struct sched_param sp;
    sp.sched_priority = 0;
    sched_setscheduler(<<PID_VALUE>>, SCHED_RR, &sp);
    return 0;
}
```

I obtained the following time:

```
2.6.15
======
real    27m57.030s
user    33m1.628s
sys      5m56.386s
```

Was I correctly using the sched_setscheduler function? Any guidance would be appreciated. Dennis.

It has been a while since the last correspondence. I am just enquiring to see what the state of play is. We are still keen down here to help. Thanks, Dennis.

This item of mine is very old now. Please close/resolve it, thanks. Dennis.

Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release you requested us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/errata/ If you would like Red Hat to reconsider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.
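On the sched_setscheduler() question a few comments up: as written, the hacked program almost certainly fails silently. For SCHED_RR the priority must lie inside the range reported by sched_get_priority_min()/sched_get_priority_max() (1-99 on Linux), so a priority of 0 makes the call return -1 with EINVAL, and the return value is never checked; the caller also needs sufficient privileges (root/CAP_SYS_NICE), and on Linux the call only changes the policy of the single thread identified by the given PID, not any worker threads the server has already spawned. A minimal corrected sketch (the file name set_rr.c and the command-line PID handling are illustrative, not from the original report) might look like this:

```c
/* Sketch of a corrected sched_setscheduler() invocation.  Takes the
 * target PID on the command line instead of the <<PID_VALUE>>
 * placeholder used in the comment above. */
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    struct sched_param sp;
    pid_t pid;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    pid = (pid_t)atoi(argv[1]);

    /* SCHED_RR needs a priority in the real-time range (1-99 on Linux);
     * 0 is only valid for SCHED_OTHER and makes the call fail with EINVAL. */
    sp.sched_priority = sched_get_priority_min(SCHED_RR);

    if (sched_setscheduler(pid, SCHED_RR, &sp) == -1) {
        /* Typical failures: EINVAL (bad priority), EPERM (not root),
         * ESRCH (no such process). */
        fprintf(stderr, "sched_setscheduler(%d): %s\n",
                (int)pid, strerror(errno));
        return 1;
    }
    return 0;
}
```

Run as root against the server's PID (e.g. `./set_rr <server-pid>`); the chrt(1) utility provides the same functionality from the command line without a custom program.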
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041111 Firefox/1.0

Description of problem:

We produce a database server product. Periodically I run benchmark tests to test the performance of our product. When I did these recently I noticed an issue, which led me to conduct a series of further tests.

As a baseline I compiled and ran our server product on Fedora Core 2. The test machine was configured as:

- 2 x Opteron 246
- 4GB RAM
- single 15,000 RPM SCSI disk

Only essential services were running (e.g. sendmail and many other daemons were disabled). I then ran a 32-user concurrent benchmark test (twice); the average time taken was 20 minutes.

I upgraded to Fedora Core 3. Again non-essential services, including SELinux, were disabled. I re-ran the benchmark test using the Fedora Core 2 compiled version of our product. The average time taken was 23 minutes.

I upgraded to Fedora Core 4. The average time taken was 24 minutes.

I installed Red Hat Enterprise Linux 4 Update 1, compiled up our product and re-ran the benchmark test. The average time taken was 22 minutes.

Basically, the newer the Fedora/Red Hat version, the worse the performance.

To test whether malloc was the performance problem I wrote 2 small malloc programs: malloc (a single-thread malloc program) and malloc_thr (a multi-thread malloc program). Both program source files are attached to this bug item. I compiled the programs as follows:

```
# g++ -O2 malloc.cc -o malloc
# g++ -O2 malloc_thr.cc -lpthread -o malloc_thr
```

I compiled these programs on Fedora Core 2 and then ran each program multiple times on Fedora Core 2, Fedora Core 3, Fedora Core 4 and Red Hat Enterprise Linux 4 Update 1 (on the same machine).

Average malloc program times:

- FC2: 1 min 35 secs
- FC3: 1 min 56 secs
- FC4: 1 min 39 secs
- RH4: 1 min 47 secs

Average malloc_thr program times:

- FC2: 1 min 52 secs
- FC3: 2 min 18 secs
- FC4: 1 min 59 secs
- RH4: 2 min 26 secs

It does appear that malloc performance has regressed in the versions after Fedora Core 2. I notice the following text in the Fedora Core 3 and Red Hat EL 4 release notes:

> The version of glibc provided with Fedora Core 3 performs additional
> internal sanity checks to prevent and detect data corruption as early
> as possible.

I re-ran some tests with 'export MALLOC_CHECK_=0' and saw no performance difference. These tests, when run on the same machine with different Fedora Core/Red Hat versions, appear to highlight a malloc slowdown. My question is whether malloc is the cause of my performance regression on Linux distribution versions after Fedora Core 2. Is there anything I can do in userland to obtain superior performance on Red Hat EL4 (e.g. a compilation flag)? This performance issue leads to Fedora Core 4 being 17% slower than Fedora Core 2 when running our benchmarks. I am more than willing to run some further tests down here if needed. Or is it the case that I will just have to live with the slower performance?

It should also be noted that our product benchmark tests showed Fedora Core 2 to be superior to Windows XP (with the default allocator) and Solaris 10 (with ptmalloc) when run on the same hardware. However, the more recent Fedora Core versions (3 & 4) and Red Hat EL 4 now trail Windows XP/Solaris 10 performance-wise. I would prefer Red Hat EL 4 to have Fedora Core 2-like performance. So this request is not really a proper bug item; it is more an information-gathering exercise (initially) related to performance. We look forward to any information provided back to us. Thanks for your time. Dennis.

Version-Release number of selected component (if applicable):
glibc-2.3.4-2.9

How reproducible:
Always

Steps to Reproduce:
1. Compile the attached test programs via
   # g++ -O2 malloc.cc -o malloc
   # g++ -O2 malloc_thr.cc -lpthread -o malloc_thr
2. Run the programs on Fedora Core 2, 3, 4 and Red Hat EL 4.

Actual Results:
Fedora Core 2 is fast whilst the later Fedora and Red Hat versions are slow.

Expected Results:
It would be nice to have fast malloc performance such as that displayed by Fedora Core 2.

Additional info:
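The malloc.cc and malloc_thr.cc attachments are not reproduced in this report. For readers without access to them, a minimal sketch of the single-threaded flavour of test described above (a tight malloc/free loop, timed externally with `time`) could look like the following; it is written as a plain C variant (the actual attachments are C++ compiled with g++, though C variants were also used later in the thread), and the iteration count and request sizes are illustrative assumptions.

```c
/* Illustrative single-threaded malloc/free stress loop; the iteration
 * count and request sizes are assumptions, not the values used in the
 * malloc.cc attachment. */
#include <stdio.h>
#include <stdlib.h>

#define ITERATIONS 50000000L
#define NSLOTS     256

int main(void)
{
    void *slots[NSLOTS] = { 0 };
    long i;
    int  j;

    for (i = 0; i < ITERATIONS; i++) {
        j = (int)(i % NSLOTS);
        free(slots[j]);                       /* free(NULL) is a no-op */
        slots[j] = malloc(16 + (i % 2048));   /* vary the request size */
    }
    for (j = 0; j < NSLOTS; j++)
        free(slots[j]);

    puts("done");
    return 0;
}
```

Run under `time ./malloc` on each distribution, a loop like this isolates allocator cost from the rest of the product, which is what the FC2/FC3/FC4/RH4 comparison reported above relies on.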