Description of problem:
GCC 4.4.5 in RHEL 6.1 generates slower code than GCC 4.1.2 in RHEL 5.6.
Version-Release number of selected component (if applicable):
4.4.5 20110214 in rhel6.1 and
How reproducible:
100%
Steps to Reproduce:
1. Download the source code of the STREAM benchmark (attached).
2. Compile it with gcc -O2 stream.c -o stream on RHEL 5.6, 6.0 and 6.1.
3. Run the benchmark: ./stream > stream_out
4. Compare the results on the same machine.
Actual results:
RHEL5.6:
Function   Rate (MB/s)   Avg time   Min time   Max time
Copy:        5351.7121     0.0308     0.0299     0.0317
Scale:       5398.6392     0.0306     0.0296     0.0313
Add:         6120.4283     0.0400     0.0392     0.0407
Triad:       6155.9105     0.0398     0.0390     0.0405
ALL:         5798.0225     0.1412     0.1380     0.1439
RHEL6.0:
Function   Rate (MB/s)   Avg time   Min time   Max time
Copy:        1402.1940     0.1143     0.1141     0.1145
Scale:       1458.0452     0.1098     0.1097     0.1101
Add:         1481.1158     0.1622     0.1620     0.1624
Triad:       1460.0755     0.1645     0.1644     0.1647
ALL:         1453.2925     0.5507     0.5505     0.5515
RHEL6.1:
Function   Rate (MB/s)   Avg time   Min time   Max time
Copy:        1399.7752     0.1144     0.1143     0.1145
Scale:       1456.9816     0.1099     0.1098     0.1102
Add:         1479.1440     0.1624     0.1623     0.1626
Triad:       1458.4001     0.1647     0.1646     0.1648
ALL:         1451.7734     0.5514     0.5511     0.5522
Code generated on RHEL 6.x reaches only ~25% of the performance of code generated on RHEL 5.6.
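The ~25% figure can be checked directly from the reported rates. The following is only a small sketch: the array names and the relative_percent helper are made up for illustration, and the numbers are the Rate (MB/s) values for RHEL 5.6 and RHEL 6.1 copied from the results above.

```c
/* Rates in MB/s taken from the "Actual results" section
   (order: Copy, Scale, Add, Triad). */
static const double rhel56_rate[4] = { 5351.7121, 5398.6392, 6120.4283, 6155.9105 };
static const double rhel61_rate[4] = { 1399.7752, 1456.9816, 1479.1440, 1458.4001 };

/* Hypothetical helper: rate of the newer build expressed as a
   percentage of the rate of the older build. */
double relative_percent(double new_rate, double old_rate)
{
    return 100.0 * new_rate / old_rate;
}
```

Calling relative_percent(rhel61_rate[i], rhel56_rate[i]) for i = 0..3 gives a value of roughly a quarter for each of the four kernels, which matches the ~25% estimate.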
Expected results:
The generated code should perform at RHEL 5.6 levels.
Additional info:
Attachment contains:
stream.c, stream.h - source code of the STREAM benchmark
stream56, stream60, stream61 - binaries compiled on RHEL 5.6, 6.0, 6.1
stream56_out, stream60_out, stream61_out - outputs of a single run of the benchmark
performance tested on
ibm-js22-vios-02-lp1.rhts.eng.bos.redhat.com
Seems lfdx/stfdx are horribly slow on POWER6.
This is something for IBM to figure out; current GCC 4.6 behaves exactly the same.
Small testcase:
#define N 10000000
double a[N], b[N], c[N];

__attribute__((noinline))
void foo ()
{
  int j;
  for (j=0; j<N; j++)
    c[j] = a[j];
}

int
main ()
{
  int i;
  for (i = 0; i < 50; i++)
    foo ();
  return 0;
}
Compile with -m32 -O2 -mtune=power6 or -m32 -O3 -mtune=power6.
With gcc 4.1.2 the inner loop in foo is:
.L2:
        lfd 0,0(9)
        addi 9,9,8
        stfd 0,0(11)
        addi 11,11,8
        bdnz .L2
while with 4.4-RH as well as current 4.6 trunk:
.L2:
        lfdx 0,11,9
        stfdx 0,10,9
        addi 9,9,8
        bdnz .L2
time ./test-4.1
real 0m1.118s
user 0m1.090s
sys 0m0.026s
time ./test-4.4
real 0m5.395s
user 0m5.348s
sys 0m0.038s
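To time the copy kernel from inside the program rather than with the shell's time, something along these lines could be used. This is only a sketch: copy_loop and time_copy are hypothetical names that are not part of the testcase, and it measures CPU time with the standard clock() function. Because the arrays are passed as pointers instead of being globals, the compiler may choose different addressing than in the listings above, so the generated assembly (gcc -S) should be checked before drawing conclusions.

```c
#include <stddef.h>
#include <time.h>

/* Same copy kernel as foo() in the testcase, but parameterised so it
   can be timed on arrays of any size (hypothetical helper). */
__attribute__((noinline))
void copy_loop(double *dst, const double *src, size_t n)
{
    size_t j;
    for (j = 0; j < n; j++)
        dst[j] = src[j];
}

/* Returns the CPU seconds spent in `reps` calls of copy_loop(). */
double time_copy(double *dst, const double *src, size_t n, int reps)
{
    clock_t start = clock();
    int i;
    for (i = 0; i < reps; i++)
        copy_loop(dst, src, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Building the same file once with each compiler (or once with and once without -mavoid-indexed-addresses) and comparing the value returned by time_copy for the testcase's N and 50 repetitions should show the same kind of gap as the user times above.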
OK, apparently if you compile with -mcpu=power6 (i.e. both tune for POWER6, which is the default in RHEL 6, and stop supporting older CPUs), then -mavoid-indexed-addresses is used by default and this penalty on POWER6 is no longer present. Alternatively, you can compile with -mavoid-indexed-addresses explicitly.