Bug 435182 - [RHEL5] poor 64 bit Intel performance on sorting program
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: glibc
Version: 5.1
Hardware: All  OS: Linux
Priority: high  Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Jakub Jelinek
QA Contact: Brian Brock
Depends On:
Blocks: 391501
Reported: 2008-02-27 15:00 EST by Alan Matsuoka
Modified: 2016-11-24 11:11 EST
CC: 5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-01-20 15:50:12 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Alan Matsuoka 2008-02-27 15:00:31 EST
Description of problem:
On RHEL 5, when the included program is compiled as a 64 bit binary and executed
on a system equipped with Intel processors, performance is considerably worse
than when the same program is compiled as a 32 bit binary and executed on the
same machine.  Execution times on 64 bit AMD hardware are virtually the same for
the 64 and 32 bit binaries.


How reproducible:
Customer is seeing a much larger impact on his specific hardware platform
(roughly 4x slower for the 64 bit binary than for the 32 bit one).  I am able
to reproduce a significant performance difference on the hardware I have
tested, but it is closer to 2x.


Steps to Reproduce:
(I have two lab machines reserved to reproduce this, if you need the addresses
to confirm)

# gcc -O3 sorttest.c -lm -o sorttest64
# gcc -O3 sorttest.c -lm -m32 -o sorttest32
# time ./sorttest64 20
# time ./sorttest32 20

Compare resulting execution times on Intel and AMD hardware.


Actual results:

on Intel hardware:

# time ./sorttest64 20

real    3m30.056s
user    3m29.295s
sys     0m0.724s
# time ./sorttest32 20

real    1m30.943s
user    1m30.164s
sys     0m0.762s


on AMD hardware:

# time ./sorttest64 20

real    2m20.643s
user    2m19.949s
sys     0m0.644s
# time ./sorttest32 20

real    2m0.964s
user    2m0.070s
sys     0m0.850s

Since these are different boxes with different specs, the actual execution time
differences from one box to the other are not important.  Rather, the important
thing to note is how much better the performance of the 32 bit code is compared
to the 64 bit code on the Intel hardware, while the AMD hardware shows very
little difference.


Expected results:

Not such a huge performance hit for the 64 bit code on Intel hardware.


Additional info:

Customer has tested this on the same Intel hardware with RHEL 4, RHEL 5, two
versions of SuSE, F8, and a few other distributions.  The poor performance of
the 64 bit code is only apparent in RHEL.

To flesh this out further, I copied:
/lib64/ld-linux-x86-64.so.2
/lib64/libc.so.6
/lib64/libm.so.6

from a 64 bit F8 box to /root/f8 on the same Intel box that I ran the above
tests on.  I then ran:

# time /root/f8/ld-linux-x86-64.so.2 --library-path /root/f8 /root/sorttest64 20

real    1m15.522s
user    1m14.777s
sys     0m0.732s

With the above, it looks like I can also reproduce the customer's statement that
this is not a problem on F8.

-----------------------------------------------------------

/* compile:  cc sorttest.c -O3 -lm -o sorttest  */

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 10000000 /* vector of 10 M elements */

int scr[N], i, cnt, n, posx[N];
double a, aaa, posy[N];

/* Order indices by the corresponding posy value. */
int srt_f(const void *a, const void *b)
{
    aaa = posy[*((int *) a)] - posy[*((int *) b)];
    if (aaa < 0.)
        return -1;
    return aaa > 0.;
}

int main(int argc, char *argv[])
{
    cnt = (argc == 1) ? 1 : atoi(argv[1]);
    for (n = 0; n < cnt; ++n) {
        for (i = 0; i < N; ++i) {
            posx[i] = i;
            a += .001;
            posy[i] = sin(a);
        }
        qsort((void *) posx, i, sizeof(i), srt_f);
        for (i = 0; i < N; ++i)
            scr[posx[i]] = (1000 * i) / N;
    }
    return 0;
}
Comment 1 Alan Matsuoka 2008-02-27 15:01:07 EST
OK. Here's the short answer. The performance problem that they are seeing with
the RHEL/x86_64 combination and not with Fedora is quite simple: there was a
change to the msort and qsort implementations in glibc for Fedora Core 8.

http://sourceware.org/cgi-bin/cvsweb.cgi/libc/ChangeLog.diff?cvsroot=glibc&only_with_tag=fedora-branch&r1=1.8782.2.274&r2=1.8782.2.275
+2007-10-04  Jakub Jelinek  <jakub@redhat.com>
+
+ * stdlib/msort.c: Include stdint.h.
+ (struct msort_param): New type.
+ (msort_with_tmp): Use struct msort_param pointer for unchanging
+ parameters.  Add optimized handling for several common sizes
+ and indirect sorting mode.
+ (qsort): Adjust msort_with_tmp callers.  For big S use indirect
+ sorting.
+ Suggested by Belazougui Djamel .
+
+ * stdlib/Makefile (tests): Add tst-qsort2.
+ * stdlib/tst-qsort2.c: New test.

http://sourceware.org/cgi-bin/cvsweb.cgi/libc/stdlib/msort.c.diff?cvsroot=glibc&only_with_tag=fedora-branch&r1=1.21&r2=1.21.2.1

see : http://www.cygwin.com/ml/libc-alpha/2007-05/msg00022.html
      for the rationale

Doing a bit of hacking and code substitution with the FC 8 version of msort and
qsort:

gcc -O3 -I. -g -o sorttestnew sorttest.c msortnew.c qsortnew.c -lm
msortnew.c: In function ‘msort_with_tmp’:
msortnew.c:142: warning: cast to pointer from integer of different size
msortnew.c:148: warning: cast to pointer from integer of different size

[alanm@dt1 TEST]$ time ./sorttestnew 20
66.331u 0.637s 1:07.10 99.7%    0+0k 0+0io 0pf+0w
[alanm@dt1 TEST]$ gcc -m32 -O3 -I. -g -o sorttestnew.m32 sorttest.c msortnew.c
qsortnew.c -lm
[alanm@dt1 TEST]$ time ./sorttestnew.m32 20
69.026u 0.659s 1:09.71 99.9%    0+0k 0+0io 2pf+0w

## This is the RHEL5 libc
[alanm@dt1 TEST]$ time ./sorttest 20
187.127u 0.642s 3:07.80 99.9%   0+0k 0+0io 0pf+0w
[alanm@dt1 TEST]$ time ./sorttest.m32 20
71.298u 0.683s 1:12.12 99.7%    0+0k 0+0io 4pf+0w


Note that there is a 3X improvement in 64 bit mode for this particular testcase.
This is actually somewhat better than what the real library would give, because
qsort and msort are not compiled with -fPIC in the hacked-together case.

I'll post the longer answer (how I arrived at this conclusion) later today.
Comment 3 RHEL Product and Program Management 2008-06-02 16:16:29 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 9 errata-xmlrpc 2009-01-20 15:50:12 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0080.html
Comment 11 Futaya Yamazaki 2009-02-12 18:47:08 EST
Do you think that this problem applies to RHEL4, too?
If so, do you have any plan to fix it on RHEL4?
We are having the same experience on RHEL4.
Comment 12 Chris Marcantonio 2009-02-13 11:32:39 EST
Yes, this problem does apply to RHEL4 as well.  The BZ filed against RHEL4 can be found here:

https://bugzilla.redhat.com/show_bug.cgi?id=436115

For RHEL4, this bug is CLOSED WONTFIX, as it is very late in the RHEL4 lifecycle.
