Bug 436115

Summary: poor 64 bit Intel performance on sorting program (RHEL4)
Product: Red Hat Enterprise Linux 4
Reporter: Alan Matsuoka <alanm>
Component: glibc
Assignee: Jakub Jelinek <jakub>
Status: CLOSED WONTFIX
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: high
Version: 4.4
CC: drepper, fweimer, tao
Target Milestone: rc
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2008-08-03 03:42:50 UTC
Bug Blocks: 391511    

Description Alan Matsuoka 2008-03-05 14:36:06 UTC
Description of problem:
On RHEL 4, when the included program is compiled as a 64 bit binary and executed
on a system equipped with Intel processors, performance is considerably worse
than when the same program is compiled as a 32 bit binary and executed on the
same machine.  Execution times on 64 bit AMD hardware are virtually the same for
the 64 and 32 bit binaries.


How reproducible:
Customer is seeing a much larger impact on his specific hardware platform
(the 64 bit binary is roughly 4x slower than the 32 bit one).  I am able to
reproduce a significant performance difference on the hardware I have tested,
but it is on the order of 2x.


Steps to Reproduce:
(I have two lab machines reserved to reproduce this, if you need the addresses
to confirm)

# gcc -O3 sorttest.c -lm -o sorttest64
# gcc -O3 sorttest.c -lm -m32 -o sorttest32
# time ./sorttest64 20
# time ./sorttest32 20

Compare resulting execution times on Intel and AMD hardware.


Actual results:

on Intel hardware:

# time ./sorttest64 20

real    3m30.056s
user    3m29.295s
sys     0m0.724s
# time ./sorttest32 20

real    1m30.943s
user    1m30.164s
sys     0m0.762s


on AMD hardware:

# time ./sorttest64 20

real    2m20.643s
user    2m19.949s
sys     0m0.644s
# time ./sorttest32 20

real    2m0.964s
user    2m0.070s
sys     0m0.850s

Since these are different boxes with different specs, the actual execution time
differences from one box to the other are not important.  Rather, the important
thing to note is how much better the performance of the 32 bit code is compared
to the 64 bit code on the Intel hardware, while the AMD hardware shows very
little difference.


Expected results:

Not such a huge performance hit for the 64 bit code on Intel hardware.


Additional info:

Customer has tested this on the same Intel hardware with RHEL 4, RHEL 5, two
versions of SuSE, F8, and a few other distributions.  The poor performance of
the 64 bit code is apparent only in RHEL.

To try to flesh this out further, I copied:
/lib64/ld-linux-x86-64.so.2
/lib64/libc.so.6
/lib64/libm.so.6

from a 64 bit F8 box to /root/f8 on the same Intel box that I ran the above
tests on.  I then ran:

# time /root/f8/ld-linux-x86-64.so.2 --library-path /root/f8 /root/sorttest64 20

real    1m15.522s
user    1m14.777s
sys     0m0.732s

With the above, it looks like I can also reproduce the customer's statement that
this is not a problem on F8.

-----------------------------------------------------------

/* compile:  cc sorttest.c -O3 -lm -o sorttest  */

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 10000000 /* vector of 10 million entries */

int scr[N], i, cnt, n, posx[N];
double a, aaa, posy[N];

/* qsort comparator: order the indices in posx[] by the value of the
   corresponding posy[] entry */
int srt_f(const void *a, const void *b)
{
    aaa = posy[*((int *) a)] - posy[*((int *) b)];
    if (aaa < 0.)
        return -1;
    return aaa > 0.;
}

int main(int argc, char *argv[])
{
    cnt = (argc == 1) ? 1 : atoi(argv[1]);
    for (n = 0; n < cnt; ++n) {
        /* fill posx[] with indices and posy[] with sine samples */
        for (i = 0; i < N; ++i) {
            posx[i] = i;
            a += .001;
            posy[i] = sin(a);
        }
        /* sort the 4-byte indices; this is the call being timed */
        qsort((void *) posx, i, sizeof(i), srt_f);
        for (i = 0; i < N; ++i)
            scr[posx[i]] = (1000 * i) / N;
    }
    return 0;
}


Comment 1 Alan Matsuoka 2008-02-27 15:01 EST

OK. Here's the short answer. The performance problem they are seeing with
the RHEL/x86_64 combination, and not with Fedora, is quite simple: msort and
qsort were changed in glibc for Fedora Core 8.

http://sourceware.org/cgi-bin/cvsweb.cgi/libc/ChangeLog.diff?cvsroot=glibc&only_with_tag=fedora-branch&r1=1.8782.2.274&r2=1.8782.2.275
+2007-10-04  Jakub Jelinek  <jakub>
+
+ * stdlib/msort.c: Include stdint.h.
+ (struct msort_param): New type.
+ (msort_with_tmp): Use struct msort_param pointer for unchanging
+ parameters.  Add optimized handling for several common sizes
+ and indirect sorting mode.
+ (qsort): Adjust msort_with_tmp callers.  For big S use indirect
+ sorting.
+ Suggested by Belazougui Djamel .
+
+ * stdlib/Makefile (tests): Add tst-qsort2.
+ * stdlib/tst-qsort2.c: New test.
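
To make concrete what "optimized handling for several common sizes" means,
here is a minimal sketch (my own illustration, not the actual glibc code):
the old msort_with_tmp moved every element through a generic byte-wise copy,
while the new code special-cases common element sizes, so merging the 4-byte
ints this testcase sorts turns into plain integer assignment:

#include <stdint.h>
#include <string.h>

/* Sketch: merge two sorted runs of 4-byte elements into tmp, moving
   each element with a single uint32_t store instead of a generic
   per-element byte copy.  Hypothetical helper modeled on the ChangeLog
   entry above; not the real glibc source. */
static void
merge_u32 (const uint32_t *b1, size_t n1,
           const uint32_t *b2, size_t n2, uint32_t *tmp,
           int (*cmp) (const void *, const void *))
{
  while (n1 > 0 && n2 > 0)
    {
      if (cmp (b1, b2) <= 0)
        { *tmp++ = *b1++; --n1; }
      else
        { *tmp++ = *b2++; --n2; }
    }
  /* Copy whichever run is left over in one shot. */
  memcpy (tmp, n1 > 0 ? b1 : b2, (n1 > 0 ? n1 : n2) * sizeof (uint32_t));
}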

http://sourceware.org/cgi-bin/cvsweb.cgi/libc/stdlib/msort.c.diff?cvsroot=glibc&only_with_tag=fedora-branch&r1=1.21&r2=1.21.2.1

see http://www.cygwin.com/ml/libc-alpha/2007-05/msg00022.html for the rationale
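
The "For big S use indirect sorting" part of the ChangeLog means: when the
element size is large, sort an array of pointers to the elements and apply
the resulting permutation once at the end, instead of shuffling the big
elements around during every merge step.  Roughly like this (again my own
sketch under the same caveat, not the glibc implementation; the file-scope
comparator makes it non-reentrant):

#include <stdlib.h>
#include <string.h>

static int (*user_cmp) (const void *, const void *);

/* Compare the elements the two stored pointers point at. */
static int
ptr_cmp (const void *a, const void *b)
{
  return user_cmp (*(const void *const *) a, *(const void *const *) b);
}

/* Sketch of indirect sorting: each swap during the sort moves one
   pointer instead of 'size' bytes; the data itself is moved only once,
   sequentially, at the end. */
static int
indirect_sort (void *base, size_t n, size_t size,
               int (*cmp) (const void *, const void *))
{
  char *b = base;
  char **ptrs = malloc (n * sizeof (char *));
  char *copy = malloc (n * size);
  size_t i;

  if (ptrs == NULL || copy == NULL)
    {
      free (ptrs);
      free (copy);
      return -1;
    }

  for (i = 0; i < n; ++i)
    ptrs[i] = b + i * size;

  user_cmp = cmp;
  qsort (ptrs, n, sizeof (char *), ptr_cmp);

  /* Apply the permutation in a single pass, then copy back. */
  for (i = 0; i < n; ++i)
    memcpy (copy + i * size, ptrs[i], size);
  memcpy (b, copy, n * size);

  free (ptrs);
  free (copy);
  return 0;
}

(A real implementation can avoid the full second buffer by permuting in
place; either way, the point is that the cost of moving data no longer
scales with the element size during the sort itself.)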

Doing a bit of hacking and code substitution with the FC 8 version of msort
and qsort:

gcc -O3 -I. -g -o sorttestnew sorttest.c msortnew.c qsortnew.c -lm
msortnew.c: In function ‘msort_with_tmp’:
msortnew.c:142: warning: cast to pointer from integer of different size
msortnew.c:148: warning: cast to pointer from integer of different size

[alanm@dt1 TEST]$ time ./sorttestnew 20
66.331u 0.637s 1:07.10 99.7%    0+0k 0+0io 0pf+0w
[alanm@dt1 TEST]$ gcc -m32 -O3 -I. -g -o sorttestnew.m32 sorttest.c msortnew.c
qsortnew.c -lm
[alanm@dt1 TEST]$ time ./sorttestnew.m32 20
69.026u 0.659s 1:09.71 99.9%    0+0k 0+0io 2pf+0w

## This is the RHEL5 libc
[alanm@dt1 TEST]$ time ./sorttest 20
187.127u 0.642s 3:07.80 99.9%   0+0k 0+0io 0pf+0w
[alanm@dt1 TEST]$ time ./sorttest.m32 20
71.298u 0.683s 1:12.12 99.7%    0+0k 0+0io 4pf+0w


Note that there is a 3X improvement in 64 bit mode for this particular testcase.
This is actually a better result than you would get from a real build, because
qsort and msort are not compiled with -fPIC in the hacked-together case.

I'll post the longer answer (how I arrived at this conclusion) later today.

Customer has confirmed that these test packages appear to provide much
better performance and resolve their issues.  I'm assuming that they
simply extracted /lib64/ld-linux-x86-64.so.2, /lib64/libc.so.6, and
/lib64/libm.so.6 from the glibc package though, because they are reporting
a dependency problem when trying to upgrade to these test packages that
appears to be related to the 32 bit versions of glibc they have
installed.

Nonetheless, I don't see anything immediately wrong with their method
of testing, so it still seems valid that this fix will work for them.  If
you want me to have them test by actually installing the rpms on their
system, we'll just have to solve these dependencies (glibc-common =
2.5-18.el5_1.1 is needed by (installed) glibc-2.5-18.el5_1.1.i686).

Comment 2 Ulrich Drepper 2008-08-03 03:42:50 UTC
Alan, you have to provide a really good reason for us to take this high risk.  I'm closing the bug.  If you have a really good reason, reopen the bug.