Description of problem:
On RHEL 4, when the included program is compiled as a 64 bit binary and executed on a system equipped with Intel processors, performance is considerably worse than when the same program is compiled as a 32 bit binary and executed on the same machine. Execution times on 64 bit AMD hardware are virtually the same for the 64 and 32 bit binaries.

How reproducible:
The customer sees a much larger impact on his specific hardware platform (the 64 bit binary is roughly 4x slower than the 32 bit one). I am able to reproduce a significant performance difference on the hardware I have tested, but it is closer to 2x.

Steps to Reproduce:
(I have two lab machines reserved to reproduce this, if you need the addresses to confirm.)

# gcc -O3 sorttest.c -lm -o sorttest64
# gcc -O3 sorttest.c -lm -m32 -o sorttest32
# time ./sorttest64 20
# time ./sorttest32 20

Compare the resulting execution times on Intel and AMD hardware.

Actual results:
On Intel hardware:

# time ./sorttest64 20
real 3m30.056s
user 3m29.295s
sys 0m0.724s

# time ./sorttest32 20
real 1m30.943s
user 1m30.164s
sys 0m0.762s

On AMD hardware:

# time ./sorttest64 20
real 2m20.643s
user 2m19.949s
sys 0m0.644s

# time ./sorttest32 20
real 2m0.964s
user 2m0.070s
sys 0m0.850s

Since these are different boxes with different specs, the absolute execution time differences from one box to the other are not important. The important thing to note is how much better the 32 bit code performs than the 64 bit code on the Intel hardware, while the AMD hardware shows very little difference.

Expected results:
Not such a huge performance hit for the 64 bit code on Intel hardware.

Additional info:
The customer has tested this on the same Intel hardware with RHEL 4, RHEL 5, two versions of SUSE, F8, and a few other distributions. The poor performance of the 64 bit code is only apparent in RHEL.

To try to flesh this out further, I copied:

/lib64/ld-linux-x86-64.so.2
/lib64/libc.so.6
/lib64/libm.so.6

from a 64 bit F8 box to /root/f8 on the same Intel box that I ran the above tests on. I then ran:

# time /root/f8/ld-linux-x86-64.so.2 --library-path /root/f8 /root/sorttest64 20
real 1m15.522s
user 1m14.777s
sys 0m0.732s

With the above, it looks like I can also reproduce the customer's statement that this is not a problem on F8.

-----------------------------------------------------------
/* compile: cc sorttest.c -O3 -lm -o sorttest */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 10000000 /* vector of 10-M */

int scr[N], i, cnt, n, posx[N], srt_f(const void *, const void *);
double a, aaa, posy[N];

int srt_f(const void *a, const void *b)
{
    aaa = posy[*((int *)a)] - posy[*((int *)b)];
    if ( aaa < 0. ) return(-1);
    return( aaa > 0. );
}

int main(int argc, char *argv[])
{
    cnt = ( argc == 1 ) ? 1 : atoi(argv[1]);
    for ( n = 0; n < cnt; ++n ) {
        for ( i = 0; i < N; ++i ) {
            posx[i] = i;
            a += .001;
            posy[i] = sin(a);
        }
        qsort((void *) posx, i, sizeof(i), srt_f);
        for ( i = 0; i < N; ++i )
            scr[posx[i]] = (1000*i)/N;
    }
}

Comment #1 From Alan Matsuoka (alanm) on 2008-02-27 15:01 EST

OK. Here's the short answer.

The reason they see the performance problem with the RHEL, x86_64 combination and not with Fedora is quite simple. There was a change in libc for Fedora Core 8 for msort and qsort.

http://sourceware.org/cgi-bin/cvsweb.cgi/libc/ChangeLog.diff?cvsroot=glibc&only_with_tag=fedora-branch&r1=1.8782.2.274&r2=1.8782.2.275

+2007-10-04  Jakub Jelinek  <jakub>
+
+    * stdlib/msort.c: Include stdint.h.
+    (struct msort_param): New type.
+    (msort_with_tmp): Use struct msort_param pointer for unchanging
+    parameters.  Add optimized handling for several common sizes
+    and indirect sorting mode.
+    (qsort): Adjust msort_with_tmp callers.  For big S use indirect
+    sorting.
+    Suggested by Belazougui Djamel .
+
+    * stdlib/Makefile (tests): Add tst-qsort2.
+    * stdlib/tst-qsort2.c: New test.

http://sourceware.org/cgi-bin/cvsweb.cgi/libc/stdlib/msort.c.diff?cvsroot=glibc&only_with_tag=fedora-branch&r1=1.21&r2=1.21.2.1

See http://www.cygwin.com/ml/libc-alpha/2007-05/msg00022.html for the rationale.

Doing a bit of hacking and code substitution with the FC 8 version of msort and qsort:

gcc -O3 -I. -g -o sorttestnew sorttest.c msortnew.c qsortnew.c -lm
msortnew.c: In function ‘msort_with_tmp’:
msortnew.c:142: warning: cast to pointer from integer of different size
msortnew.c:148: warning: cast to pointer from integer of different size

[alanm@dt1 TEST]$ time ./sorttestnew 20
66.331u 0.637s 1:07.10 99.7% 0+0k 0+0io 0pf+0w

[alanm@dt1 TEST]$ gcc -m32 -O3 -I. -g -o sorttestnew.m32 sorttest.c msortnew.c qsortnew.c -lm
[alanm@dt1 TEST]$ time ./sorttestnew.m32 20
69.026u 0.659s 1:09.71 99.9% 0+0k 0+0io 2pf+0w

## This is the RHEL5 libc
[alanm@dt1 TEST]$ time ./sorttest 20
187.127u 0.642s 3:07.80 99.9% 0+0k 0+0io 0pf+0w

[alanm@dt1 TEST]$ time ./sorttest.m32 20
71.298u 0.683s 1:12.12 99.7% 0+0k 0+0io 4pf+0w

Note that there is roughly a 3X improvement in 64 bit mode for this particular testcase (187.1s vs. 66.3s of user time). This is likely a better result than you would get from a rebuilt libc, because qsort and msort are not compiled with -fPIC in the hacked together case.

I'll post the longer answer (how I arrived at this conclusion) later today.

Customer has confirmed that these test packages appear to provide much better performance and resolve their issues. I'm assuming that they simply extracted /lib64/ld-linux-x86-64.so.2, /lib64/libc.so.6, and /lib64/libm.so.6 from the glibc package though, because they are reporting a dependency problem when trying to upgrade to these test packages that appears to be related to the 32 bit versions of glibc they have installed. Nonetheless, I don't see anything immediately wrong with their method of testing, so it still seems valid that this fix will work for them.
If you want me to have them test by actually installing the rpms on their system, we'll just have to solve these dependencies (glibc-common = 2.5-18.el5_1.1 is needed by (installed) glibc-2.5-18.el5_1.1.i686).

Alan, you have to provide a really good reason for us to take this high risk. I'm closing the bug. If you have a really good reason, reopen the bug.
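
For readers following the ChangeLog entry quoted in comment #1, here is a minimal sketch of what "indirect sorting" generally means: sort an array of pointers to the elements, then move each element into place once. This is not the glibc code. The struct record type, cmp_record_ptr and indirect_sort are hypothetical names made up for illustration, and the sketch uses the standard qsort on the pointer array rather than glibc's merge sort.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record type: large enough that moving it around is costly. */
struct record {
    double key;
    char payload[120];
};

/* Comparator for an array of pointers to records (one extra indirection). */
static int cmp_record_ptr(const void *a, const void *b)
{
    const struct record *ra = *(const struct record *const *)a;
    const struct record *rb = *(const struct record *const *)b;
    return (ra->key > rb->key) - (ra->key < rb->key);
}

/* Sort recs[0..n) by key. Only pointers are shuffled during the sort;
 * each record is copied at most once (plus one temporary per cycle)
 * afterwards. Returns 0 on success, -1 if the pointer array cannot be
 * allocated. */
static int indirect_sort(struct record *recs, size_t n)
{
    assert(recs != NULL || n == 0);
    if (n < 2)
        return 0;

    const struct record **ptrs = malloc(n * sizeof *ptrs);
    if (ptrs == NULL)
        return -1;
    for (size_t i = 0; i < n; ++i)
        ptrs[i] = &recs[i];

    /* Step 1: sort the pointers, not the records. */
    qsort(ptrs, n, sizeof *ptrs, cmp_record_ptr);

    /* Step 2: apply the permutation in place by following its cycles.
     * ptrs[j] names the record that belongs at position j. */
    for (size_t i = 0; i < n; ++i) {
        if (ptrs[i] == &recs[i])
            continue;                 /* already in place or already handled */
        struct record tmp = recs[i];  /* save the record opening the cycle */
        size_t j = i;
        for (;;) {
            size_t k = (size_t)(ptrs[j] - recs); /* source index for slot j */
            ptrs[j] = &recs[j];       /* mark slot j as done */
            if (k == i) {             /* cycle closes: use the saved record */
                recs[j] = tmp;
                break;
            }
            recs[j] = recs[k];
            j = k;
        }
    }

    free(ptrs);
    return 0;
}
```

Calling indirect_sort(recs, n) leaves recs ordered by key. The trade-off is one extra pointer per element plus a dereference in every comparison, in exchange for far fewer large copies. Note that the reproducer above sorts 4-byte ints, so a technique like this matters mainly for much larger elements.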