Bug 683808

Summary: gcc4.4.5 in rhel6.1 generates slower code than gcc4.1.2 in rhel5.6 on power6
Product: Red Hat Enterprise Linux 6 Reporter: Adam Okuliar <aokuliar>
Component: gcc Assignee: Jakub Jelinek <jakub>
Status: CLOSED NOTABUG QA Contact: qe-baseos-tools-bugs
Severity: medium Docs Contact:
Priority: unspecified    
Version: 6.1   
Target Milestone: rc   
Target Release: ---   
Hardware: ppc   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-03-10 15:01:07 UTC Type: ---
Attachments:
Description Flags
source, compiled binary, and test outputs none

Description Adam Okuliar 2011-03-10 12:25:25 UTC
Description of problem:
gcc4.4.5 in rhel6.1 generates slower code than gcc4.1.2 in rhel5.6

Version-Release number of selected component (if applicable):
4.4.5 20110214 in rhel6.1 and 

How reproducible:
100%

Steps to Reproduce:
1. Download the attached source code of the stream benchmark.
2. Compile it with gcc -O2 stream.c -o stream on rhel5.6, 6.0 and 6.1.
3. Run the stream benchmark: ./stream > stream_out
4. Compare the results on the same machine.
  
Actual results:
RHEL5.6:
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        5351.7121       0.0308       0.0299       0.0317
Scale:       5398.6392       0.0306       0.0296       0.0313
Add:         6120.4283       0.0400       0.0392       0.0407
Triad:       6155.9105       0.0398       0.0390       0.0405
ALL:         5798.0225       0.1412       0.1380       0.1439

RHEL6.0:
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        1402.1940       0.1143       0.1141       0.1145
Scale:       1458.0452       0.1098       0.1097       0.1101
Add:         1481.1158       0.1622       0.1620       0.1624
Triad:       1460.0755       0.1645       0.1644       0.1647
ALL:         1453.2925       0.5507       0.5505       0.5515

RHEL6.1:
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        1399.7752       0.1144       0.1143       0.1145
Scale:       1456.9816       0.1099       0.1098       0.1102
Add:         1479.1440       0.1624       0.1623       0.1626
Triad:       1458.4001       0.1647       0.1646       0.1648
ALL:         1451.7734       0.5514       0.5511       0.5522

Code generated on rhel6.x reaches only ~25% of the performance of code generated on rhel5.6.


Expected results:
performance of generated code improved to rhel5.6 levels

Additional info:
Attachment contains:
stream.c,stream.h - source code of stream benchmark
stream56,stream60,stream61 - binary compiled on rhel5.6,6.0,6.1
stream56_out, stream60_out, stream61_out - outputs of single run of benchmark

performance tested on 
ibm-js22-vios-02-lp1.rhts.eng.bos.redhat.com

Comment 1 Adam Okuliar 2011-03-10 12:28:28 UTC
Created attachment 483428
source, compiled binary, and test outputs

Comment 3 Jakub Jelinek 2011-03-10 14:31:49 UTC
Seems lfdx/stfdx are horribly slow on power6.
It is something for IBM to figure out; current GCC 4.6 behaves exactly the same.

Small testcase:

#define N 10000000
double a[N], b[N], c[N];

__attribute__((noinline))
void foo ()
{
  int j;
  for (j=0; j<N; j++)
    c[j] = a[j];
}

int
main ()
{
  int i;
  for (i = 0; i < 50; i++)
    foo ();
  return 0;
}

compile with -m32 -O2 -mtune=power6 or -m32 -O3 -mtune=power6.

With gcc 4.1.2 the inner loop in foo is:
.L2:
        lfd 0,0(9)
        addi 9,9,8
        stfd 0,0(11)
        addi 11,11,8
        bdnz .L2
while with 4.4-RH as well as current 4.6 trunk:
.L2:
        lfdx 0,11,9
        stfdx 0,10,9
        addi 9,9,8
        bdnz .L2

time ./test-4.1

real	0m1.118s
user	0m1.090s
sys	0m0.026s
time ./test-4.4

real	0m5.395s
user	0m5.348s
sys	0m0.038s

Comment 4 Jakub Jelinek 2011-03-10 15:01:07 UTC
Ok, apparently if you compile with -mcpu=power6 (i.e. tune for power6, which is already the default in RHEL6, and also stop supporting older CPUs), then
-mavoid-indexed-addresses is used by default and this penalty on power6 is no longer present.  Alternatively, you can pass -mavoid-indexed-addresses explicitly.