| Summary: | [RHEL6:GCC-4.6] performance regression in gcc46 compiler | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Travis Gummels <tgummels> | ||||
| Component: | gcc | Assignee: | Jakub Jelinek <jakub> | ||||
| Status: | CLOSED WONTFIX | QA Contact: | qe-baseos-tools-bugs | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 6.1 | CC: | arozansk, jwest, law, patrickm, pmuller, rth, woodard | ||||
| Target Milestone: | rc | ||||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2011-09-01 14:55:12 UTC | Type: | --- | ||||
So far I've looked just at ptr_c_gr. Here is a self-contained testcase for that:
typedef double __attribute__ ((aligned (16))) D;
__attribute__((noinline, noclone))
void foo (D *out1, D *out2, D *out3, D *in1, D *in2, int len)
{
  for (int i = 0; i < len; ++i)
    {
      out1[i] = in1[i] * in2[i];
      out2[i] = in1[i] + in2[i];
      out3[i] = in1[i] - in2[i];
    }
}
#ifndef N
#define D 1
#define N 16
#endif
typedef union { int i __attribute__((aligned (N))); double d[0]; } U;
__attribute__((noinline, noclone))
void bar (U *out1, U *out2, U *out3, U *in1, U *in2, int len)
{
  for (int i = 0; i < len; ++i)
    {
      out1->d[i] = in1->d[i] * in2->d[i];
      out2->d[i] = in1->d[i] + in2->d[i];
      out3->d[i] = in1->d[i] - in2->d[i];
    }
}
double a[50000] __attribute__((aligned (32)));
int
main ()
{
  int i;
  for (i = 0; i < 500000; i++)
#ifdef D
    foo (a + 0, a + 10000, a + 20000, a + 30000, a + 40000, 10000);
#else
    bar ((U *) (a + 0), (U *) (a + 10000), (U *) (a + 20000),
         (U *) (a + 30000), (U *) (a + 40000), 10000);
#endif
  return 0;
}
for i in "" "-DN=16" "-DN=32"; do \
for j in "/usr/src/gcc-4.4-rh/obj/gcc/cc1plus -w -quiet" \
"g++ -S" "g++ -S -mavx"; do \
echo "$i" "$j"; eval $j -O3 -mtune=generic $i foo.c -o foo.s; \
g++ -o foo foo.s; ./foo; ( time ./foo ) 2>&1 | grep user; \
done; \
done
/usr/src/gcc-4.4-rh/obj/gcc/cc1plus -w -quiet
user 0m6.369s
g++ -S
user 0m7.803s
g++ -S -mavx
user 0m7.648s
-DN=16 /usr/src/gcc-4.4-rh/obj/gcc/cc1plus -w -quiet
user 0m6.310s
-DN=16 g++ -S
user 0m6.489s
-DN=16 g++ -S -mavx
user 0m5.442s
-DN=32 /usr/src/gcc-4.4-rh/obj/gcc/cc1plus -w -quiet
user 0m6.325s
-DN=32 g++ -S
user 0m6.156s
-DN=32 g++ -S -mavx
user 0m5.506s
In the case where N isn't defined, i.e. the original ptr_c_gr, I think the main difference is that g++ 4.4 used to vectorize it, but 4.6 doesn't.
It stopped being vectorized at:
http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=157088
aka http://gcc.gnu.org/PR43188
The question is whether we want to support this as a way of saying that the arguments are aligned. The testcase contains a different way of saying the same thing, which apparently works. We talked about this on IRC, and the result of the
discussion was that the best thing would be to add a new attribute, ptr_align,
where __attribute__((ptr_align (16))) would mean the parameter is 16-byte aligned. Or it could be represented as if (((uintptr_t) out1) & 15) __builtin_unreachable (); - a kind of assert alternative that would generate no code.
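For illustration, the __builtin_unreachable () variant described above could look roughly like the sketch below. The macro ASSUME_ALIGNED_16 and the function foo_hinted are names made up for this example, not anything GCC provides, and whether a particular compiler version actually uses the hint to vectorize with aligned accesses is an assumption that would need to be checked in the generated assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper macro (not a GCC builtin): marks the misaligned
   case as unreachable, so the compiler may assume ptr is 16-byte aligned.
   Behavior is undefined if the pointer is in fact misaligned.  */
#define ASSUME_ALIGNED_16(ptr) \
  do { if (((uintptr_t) (ptr)) & 15) __builtin_unreachable (); } while (0)

/* Same loop as foo in the testcase, with plain double pointers plus
   the alignment hints instead of the aligned typedef.  */
__attribute__((noinline))
void foo_hinted (double *out1, double *out2, double *out3,
                 double *in1, double *in2, int len)
{
  ASSUME_ALIGNED_16 (out1);
  ASSUME_ALIGNED_16 (out2);
  ASSUME_ALIGNED_16 (out3);
  ASSUME_ALIGNED_16 (in1);
  ASSUME_ALIGNED_16 (in2);
  for (int i = 0; i < len; ++i)
    {
      out1[i] = in1[i] * in2[i];
      out2[i] = in1[i] + in2[i];
      out3[i] = in1[i] - in2[i];
    }
}
```

The hint emits no code of its own; callers are responsible for passing 16-byte aligned pointers, as main in the testcase does with its aligned array.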
When the loop is vectorized by both the old and the new compiler, I'm not sure whether the differences aren't just measurement errors. I've run it again with 2000000
instead of 500000 iterations and got:
/usr/src/gcc-4.4-rh/obj/gcc/cc1plus -w -quiet
user 0m25.277s
g++ -S
user 0m31.888s
g++ -S -mavx
user 0m30.448s
-DN=16 /usr/src/gcc-4.4-rh/obj/gcc/cc1plus -w -quiet
user 0m25.444s
-DN=16 g++ -S
user 0m25.388s
-DN=16 g++ -S -mavx
user 0m21.580s
-DN=32 /usr/src/gcc-4.4-rh/obj/gcc/cc1plus -w -quiet
user 0m25.307s
-DN=32 g++ -S
user 0m25.461s
-DN=32 g++ -S -mavx
user 0m22.778s
The following is vectorized in both cases, but seems to be reproducibly slower with 4.6:
__attribute__((noinline, noclone))
void baz (double *out1, double *out2, double *out3, double *in1, double *in2, int len)
{
  for (int i = 0; i < len; ++i)
    {
      out1[i] = in1[i] * in2[i];
      out2[i] = in1[i] + in2[i];
      out3[i] = in1[i] - in2[i];
    }
}
double a[50000] __attribute__((aligned (32)));
int
main ()
{
  int i;
  for (i = 0; i < 500000; i++)
    baz (a + 0, a + 10000, a + 20000, a + 30000, a + 40000, 10000);
  return 0;
}
4.4:
Strip out best and worst realtime result
minimum: 6.645898640 sec real / 0.000062341 sec CPU
maximum: 6.921587077 sec real / 0.000157156 sec CPU
average: 6.746931725 sec real / 0.000134046 sec CPU
stdev : 0.076195043 sec real / 0.000021599 sec CPU
4.6:
Strip out best and worst realtime result
minimum: 6.947258042 sec real / 0.000073529 sec CPU
maximum: 7.463546534 sec real / 0.000160966 sec CPU
average: 7.225394743 sec real / 0.000138264 sec CPU
stdev : 0.113974332 sec real / 0.000015305 sec CPU
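To put the two averages above in perspective, the relative slowdown can be worked out directly. The sketch below uses only the average real times quoted above; the helper names relative_slowdown and print_baz_slowdown are made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: fractional slowdown of candidate relative to
   baseline, e.g. 0.10 means the candidate took 10% longer.  */
static double relative_slowdown (double baseline, double candidate)
{
  assert (baseline > 0.0);
  return (candidate - baseline) / baseline;
}

/* Prints the slowdown of 4.6 relative to 4.4 for the baz testcase,
   using the average real times reported in this comment.  */
static void print_baz_slowdown (void)
{
  const double avg_44 = 6.746931725;  /* 4.4 average, sec real */
  const double avg_46 = 7.225394743;  /* 4.6 average, sec real */
  printf ("4.6 vs 4.4: %.1f%% slower\n",
          100.0 * relative_slowdown (avg_44, avg_46));
}
```

With these inputs the slowdown comes out at about 7%. The gap between the averages (about 0.48 sec) is roughly four times the larger of the two reported standard deviations, which is consistent with the difference being reproducible rather than noise.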
On the #c2 testcase the slowdown seems to be caused by
http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=148211
i.e. misaligned store support in the vectorizer:
http://gcc.gnu.org/ml/gcc-patches/2009-06/msg00492.html
The patch talks about the need to add a cost model; not sure if one has been added for this.
__builtin_assume_aligned is now backported to 4.6-RH in gcc-4.6.1-2.fc{15,16} as well as gcc46-4.6.1-2.el{5,6}.
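As a rough sketch of how the __builtin_assume_aligned builtin mentioned above might be applied to the loop from the testcase: the function name foo_assume_aligned is made up for this example, and it assumes a compiler that provides the builtin (the backported packages named above, or upstream GCC 4.7 and later). Whether the resulting loop is vectorized with aligned accesses should be confirmed in the generated assembly.

```c
#include <assert.h>

/* Same loop as foo in the testcase, taking plain double pointers.
   __builtin_assume_aligned returns its first argument and lets the
   compiler assume the result is aligned to the given byte count.
   Behavior is undefined if the promise does not hold.  */
__attribute__((noinline))
void foo_assume_aligned (double *out1, double *out2, double *out3,
                         double *in1, double *in2, int len)
{
  double *o1 = (double *) __builtin_assume_aligned (out1, 16);
  double *o2 = (double *) __builtin_assume_aligned (out2, 16);
  double *o3 = (double *) __builtin_assume_aligned (out3, 16);
  double *i1 = (double *) __builtin_assume_aligned (in1, 16);
  double *i2 = (double *) __builtin_assume_aligned (in2, 16);
  for (int i = 0; i < len; ++i)
    {
      o1[i] = i1[i] * i2[i];
      o2[i] = i1[i] + i2[i];
      o3[i] = i1[i] - i2[i];
    }
}
```

The explicit casts keep the sketch valid when the file is compiled as C++, as the timing runs in this report do.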
For GCC 4.4-RH, __builtin_assume_aligned isn't really backportable though; it relies heavily on bit tracking in CCP, which was only introduced in GCC 4.6.
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.
Created attachment 504936 [details]
Reproducer files.

Description of problem:
While verifying that the rhds gcc46 compiler fixes a problem that it is supposed to fix, I noticed that the user's reproducer showed a performance regression on Intel but not on AMD. Since this reproducer is highly indicative of the coding style used by the C++ developers at LLNL, and they are expected to use 4.6 when it is more widely available to work around the problem with the 4.4 compiler that the reproducer was written for, we need to look into this, especially since LLNL primarily uses Intel while other labs use AMD. I already mentioned the problem to Jakub and he told me to make a BZ and get it to him.

AMD:
[ben@mandy Keasler]$ ./STLtest;./STLtest46
STL time is 2710000
STL restrict time is 2710000
PTR (no restrict) time is 2700000
PTR wrong restrict time is 1950000
PTR restrict time is 2010000
STL time is 2700000
STL restrict time is 1900000
PTR (no restrict) time is 2650000
PTR wrong restrict time is 1990000
PTR restrict time is 2870000

Intel:
[ben@snog Keasler]$ ./STLtest;./STLtest46
STL time is 5230000
STL restrict time is 5380000
PTR (no restrict) time is 5130000
PTR wrong restrict time is 4020000
PTR restrict time is 5370000
STL time is 6010000
STL restrict time is 5740000
PTR (no restrict) time is 6010000
PTR wrong restrict time is 4860000
PTR restrict time is 5990000

The actual numbers are not comparable, but the relative numbers between the runs are important. So on AMD with 4.4.5 and 4.6 STL takes:
2710000
2700000
very close. But on Intel 4.4.5 takes:
5230000
and 4.6 takes:
6010000
Most of the other values show a similar pattern, which suggests that 4.6 isn't doing as good a job of optimizing on Intel as 4.4.5.

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Run the program in Keasler2.tar.gz as noted above.

Actual results:
Performance regression

Expected results:
No performance regression

Additional info:

Business Justification:
LLNL develops, distributes and uses an HPC distribution named CHAOS based on RHEL. They are currently integrating RHEL 6 into CHAOS 5. The end users need bleeding edge everything. GCC 4.6 is supposed to solve a number of issues and provide functionality that CHAOS HPC end users have been screaming for. A performance regression is going to be troublesome for the end users, LLNL and to an extent Red Hat.