Description of problem:
gcc 4.4.5 in RHEL6.1 generates slower code than gcc 4.1.2 in RHEL5.6.

Version-Release number of selected component (if applicable):
4.4.5 20110214 in rhel6.1 and

How reproducible:
100%

Steps to Reproduce:
1. Download the source code (attached) of the stream benchmark.
2. Compile it with gcc -O2 stream.c -o stream on RHEL5.6, 6.0 and 6.1.
3. Run the stream benchmark: ./stream > stream_out
4. Compare the results on the same machine.

Actual results:

RHEL5.6:
Function    Rate (MB/s)   Avg time   Min time   Max time
Copy:       5351.7121     0.0308     0.0299     0.0317
Scale:      5398.6392     0.0306     0.0296     0.0313
Add:        6120.4283     0.0400     0.0392     0.0407
Triad:      6155.9105     0.0398     0.0390     0.0405
ALL:        5798.0225     0.1412     0.1380     0.1439

RHEL6.0:
Function    Rate (MB/s)   Avg time   Min time   Max time
Copy:       1402.1940     0.1143     0.1141     0.1145
Scale:      1458.0452     0.1098     0.1097     0.1101
Add:        1481.1158     0.1622     0.1620     0.1624
Triad:      1460.0755     0.1645     0.1644     0.1647
ALL:        1453.2925     0.5507     0.5505     0.5515

RHEL6.1:
Function    Rate (MB/s)   Avg time   Min time   Max time
Copy:       1399.7752     0.1144     0.1143     0.1145
Scale:      1456.9816     0.1099     0.1098     0.1102
Add:        1479.1440     0.1624     0.1623     0.1626
Triad:      1458.4001     0.1647     0.1646     0.1648
ALL:        1451.7734     0.5514     0.5511     0.5522

Code generated on RHEL6.x reaches only ~25% of the performance of code generated on RHEL5.6.

Expected results:
Performance of the generated code improved to RHEL5.6 levels.

Additional info:
Attachment contains:
stream.c, stream.h - source code of the stream benchmark
stream56, stream60, stream61 - binaries compiled on RHEL5.6, 6.0 and 6.1
stream56_out, stream60_out, stream61_out - outputs of a single run of the benchmark

Performance was tested on ibm-js22-vios-02-lp1.rhts.eng.bos.redhat.com.
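For reference, the ~25% figure can be checked directly from the rates reported above. The following is only a small sketch of that arithmetic; the helper names rate_ratio_percent and print_ratios are made up for illustration and are not part of stream.c.

```c
#include <stdio.h>

/* Hypothetical helper (not in stream.c): new rate as a percentage
   of the baseline rate. */
double rate_ratio_percent(double new_rate, double base_rate)
{
    return 100.0 * new_rate / base_rate;
}

/* Prints the RHEL6.1 / RHEL5.6 ratio for each kernel, using the
   Rate (MB/s) values copied from the results above. */
void print_ratios(void)
{
    static const char *name[] = { "Copy", "Scale", "Add", "Triad", "ALL" };
    static const double rhel56[] = { 5351.7121, 5398.6392, 6120.4283,
                                     6155.9105, 5798.0225 };
    static const double rhel61[] = { 1399.7752, 1456.9816, 1479.1440,
                                     1458.4001, 1451.7734 };
    int i;

    for (i = 0; i < 5; i++)
        printf("%-6s %5.1f%%\n", name[i],
               rate_ratio_percent(rhel61[i], rhel56[i]));
}
```

Calling print_ratios() from main gives values between roughly 23% and 27% per kernel, which matches the ~25% stated in the report.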
Created attachment 483428 [details] source, compiled binary, and test outputs
Seems lfdx/stfdx are horribly slow on power6. It is something for IBM to figure out; current GCC 4.6 behaves exactly the same.

Small testcase:

    #define N 10000000
    double a[N], b[N], c[N];

    __attribute__((noinline)) void
    foo ()
    {
      int j;
      for (j=0; j<N; j++)
        c[j] = a[j];
    }

    int
    main ()
    {
      int i;
      for (i = 0; i < 50; i++)
        foo ();
      return 0;
    }

Compile with -m32 -O2 -mtune=power6 or -m32 -O3 -mtune=power6.

With gcc 4.1.2 the inner loop in foo is:

    .L2:
        lfd 0,0(9)
        addi 9,9,8
        stfd 0,0(11)
        addi 11,11,8
        bdnz .L2

while with 4.4-RH as well as current 4.6 trunk:

    .L2:
        lfdx 0,11,9
        stfdx 0,10,9
        addi 9,9,8
        bdnz .L2

time ./test-4.1
real 0m1.118s
user 0m1.090s
sys 0m0.026s

time ./test-4.4
real 0m5.395s
user 0m5.348s
sys 0m0.038s
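To compare the two code sequences without relying on the shell's time, the copy loop can be timed from inside the program. The following is a rough sketch, not part of the attached testcase: copy_loop is the same loop as foo() but takes its arrays and length as parameters, and time_copy and copy_rate_mb_s are hypothetical helper names. It uses clock(), so it measures CPU time (comparable to the user figure), and the rate uses the STREAM convention that Copy moves 2 * sizeof(double) bytes per element. Note that the index is size_t here rather than int, which may affect the generated loop; check the assembly before drawing conclusions.

```c
#include <stddef.h>
#include <time.h>

/* Same loop body as foo() in the testcase, parameterised so it can be
   run on arrays of any size (an assumption made for this sketch). */
__attribute__((noinline))
void copy_loop(double *dst, const double *src, size_t n)
{
    size_t j;

    for (j = 0; j < n; j++)
        dst[j] = src[j];
}

/* Returns the CPU seconds spent in `reps` calls of copy_loop(). */
double time_copy(double *dst, const double *src, size_t n, int reps)
{
    clock_t start = clock();
    int i;

    for (i = 0; i < reps; i++)
        copy_loop(dst, src, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* STREAM-style Copy rate in MB/s: one double read and one double
   written per element, 1 MB = 1e6 bytes. */
double copy_rate_mb_s(size_t n, int reps, double seconds)
{
    return 2.0 * sizeof(double) * (double)n * reps / seconds / 1.0e6;
}
```

A caller would pass the testcase's arrays, for example time_copy(c, a, N, 50), and feed the result to copy_rate_mb_s(N, 50, seconds) to get a number comparable to the Copy row of the stream output.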
Ok, apparently if you compile with -mcpu=power6 (i.e. both tune for power6, which is the default in RHEL6, and stop supporting older CPUs), then -mavoid-indexed-addresses is used by default and this penalty on power6 is no longer present. Alternatively, you can compile with -mavoid-indexed-addresses explicitly.
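When checking which of these builds a given binary came from, it can help to have the program report how it was targeted. The sketch below assumes that GCC's PowerPC port predefines _ARCH_PWR6 when the selected -mcpu supports the POWER6 instruction set (so -mtune=power6 alone would not set it, and newer -mcpu values would set it too). It does not show whether -mavoid-indexed-addresses itself is in effect; inspect the generated assembly for lfdx/stfdx to confirm that. The function names are hypothetical.

```c
#include <stdio.h>

/* Returns 1 when the compiler predefined _ARCH_PWR6 for this build
   (assumed to follow from -mcpu=power6 or a later CPU), 0 otherwise. */
int built_for_power6_or_later(void)
{
#ifdef _ARCH_PWR6
    return 1;
#else
    return 0;
#endif
}

/* Prints a one-line summary of the build target. */
void report_target(void)
{
    if (built_for_power6_or_later())
        puts("built with a -mcpu that includes the POWER6 instruction set");
    else
        puts("not built with -mcpu=power6 or later (older CPUs still supported)");
}
```

Calling report_target() at the start of main makes it easier to tell a -mtune=power6 binary from a -mcpu=power6 one when comparing timings.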