Bug 713586

Summary: [RHEL6:GCC-4.6] performance regression in gcc46 compiler
Product: Red Hat Enterprise Linux 6
Reporter: Travis Gummels <tgummels>
Component: gcc
Assignee: Jakub Jelinek <jakub>
Status: CLOSED WONTFIX
QA Contact: qe-baseos-tools-bugs
Severity: medium
Priority: medium
Version: 6.1
CC: arozansk, jwest, law, patrickm, pmuller, rth, woodard
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2011-09-01 14:55:12 UTC
Attachments:
Description Flags
Reproducer files. none

Description Travis Gummels 2011-06-15 20:34:07 UTC
Created attachment 504936
Reproducer files.

Description of problem:

While verifying that the rhds gcc46 compiler fixes a problem that it is supposed to fix, I noticed that the user's reproducer showed a performance regression on Intel but not on AMD. This reproducer is highly indicative of the coding style used by the C++ developers at LLNL, and they are expected to use 4.6 once it is more widely available, to work around the problem with the 4.4 compiler that the reproducer was written for. So we need to look into this, especially since LLNL primarily uses Intel while other labs use AMD.

I already mentioned the problem to Jakub and he told me to make a BZ and get it to him.

AMD:
[ben@mandy Keasler]$ ./STLtest;./STLtest46
STL time is 2710000
STL restrict time is 2710000
PTR (no restrict) time is 2700000
PTR wrong restrict time is 1950000
PTR restrict time is 2010000
STL time is 2700000
STL restrict time is 1900000
PTR (no restrict) time is 2650000
PTR wrong restrict time is 1990000
PTR restrict time is 2870000

Intel:
[ben@snog Keasler]$ ./STLtest;./STLtest46
STL time is 5230000
STL restrict time is 5380000
PTR (no restrict) time is 5130000
PTR wrong restrict time is 4020000
PTR restrict time is 5370000
STL time is 6010000
STL restrict time is 5740000
PTR (no restrict) time is 6010000
PTR wrong restrict time is 4860000
PTR restrict time is 5990000

The absolute numbers are not comparable between the two machines, but the relative numbers between the runs are what matters.
So on AMD, STL with 4.4.5 and 4.6 takes 2710000 and 2700000: very close.
But on Intel, 4.4.5 takes 5230000 and 4.6 takes 6010000.

Most of the other values show a similar pattern, which suggests that 4.6 isn't doing as good a job of optimizing on Intel as 4.4.5.
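
To put a number on the "STL time" comparison, here is a minimal sketch that turns the two timings into a percentage. The helper name slowdown_percent is made up for illustration and is not part of the reproducer.

```cpp
#include <cassert>

// Hypothetical helper (not part of the reproducer): percentage change in run
// time when going from the 4.4.5 build to the 4.6 build.
// A positive result means the 4.6 build is slower.
double slowdown_percent (double time_44, double time_46)
{
  assert (time_44 > 0.0);
  return (time_46 - time_44) / time_44 * 100.0;
}
```

With the STL figures above, slowdown_percent (5230000, 6010000) comes out at a little under 15 on Intel, while slowdown_percent (2710000, 2700000) is slightly negative on AMD, i.e. within noise.
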
Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Run the program in Keasler2.tar.gz as noted above.

Actual results:
Performance regression

Expected results:
No performance regression

Additional info:

Business Justification:

LLNL develops, distributes and uses an HPC distribution named CHAOS based on RHEL.  They are currently integrating RHEL 6 into CHAOS 5.  The end users need bleeding edge everything.  GCC 4.6 is supposed to solve a number of issues and provide functionality that CHAOS HPC end users have been screaming for.  A performance regression is going to be troublesome for the end users, LLNL and, to an extent, Red Hat.

Comment 1 Jakub Jelinek 2011-06-16 11:01:11 UTC
So far I've looked just at ptr_c_gr.  Here is a self-contained testcase for that:
typedef double __attribute__ ((aligned (16))) D;
__attribute__((noinline, noclone))
void foo (D *out1, D *out2, D *out3, D *in1, D *in2, int len)
{
  for (int i = 0; i < len; ++i)
    {
      out1[i] = in1[i] * in2[i];
      out2[i] = in1[i] + in2[i];
      out3[i] = in1[i] - in2[i];
    }
}
#ifndef N
#define D 1
#define N 16
#endif
typedef union { int i __attribute__((aligned (N))); double d[0]; } U;
__attribute__((noinline, noclone))
void bar (U *out1, U *out2, U *out3, U *in1, U *in2, int len)
{
  for (int i = 0; i < len; ++i)
    {
      out1->d[i] = in1->d[i] * in2->d[i];
      out2->d[i] = in1->d[i] + in2->d[i];
      out3->d[i] = in1->d[i] - in2->d[i];
    }
}
double a[50000] __attribute__((aligned (32)));
int
main ()
{
  int i;
  for (i = 0; i < 500000; i++)
#ifdef D
    foo (a + 0, a + 10000, a + 20000, a + 30000, a + 40000, 10000);
#else
    bar ((U *) (a + 0), (U *) (a + 10000), (U *) (a + 20000),
         (U *) (a + 30000), (U *) (a + 40000), 10000);
#endif
  return 0;
}

for i in "" "-DN=16" "-DN=32"; do \
  for j in "/usr/src/gcc-4.4-rh/obj/gcc/cc1plus -w -quiet" \
           "g++ -S" "g++ -S -mavx"; do \
    echo "$i" "$j"; eval $j -O3 -mtune=generic $i foo.c -o foo.s; \
    g++ -o foo foo.s; ./foo; ( time ./foo ) 2>&1 | grep user; \
  done; \
done
 /usr/src/gcc-4.4-rh/obj/gcc/cc1plus -w -quiet
user	0m6.369s
 g++ -S
user	0m7.803s
 g++ -S -mavx
user	0m7.648s
-DN=16 /usr/src/gcc-4.4-rh/obj/gcc/cc1plus -w -quiet
user	0m6.310s
-DN=16 g++ -S
user	0m6.489s
-DN=16 g++ -S -mavx
user	0m5.442s
-DN=32 /usr/src/gcc-4.4-rh/obj/gcc/cc1plus -w -quiet
user	0m6.325s
-DN=32 g++ -S
user	0m6.156s
-DN=32 g++ -S -mavx
user	0m5.506s

In the case where N isn't defined, i.e. the original ptr_c_gr, I think the main difference is that g++ 4.4 used to vectorize it, but 4.6 doesn't.
It stopped being vectorized at:
http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=157088
aka http://gcc.gnu.org/PR43188
The question is whether this is something we want to support as a way of saying the arguments are aligned.  The testcase contains a different way of saying the same thing, which apparently works.  We talked about this on IRC, and the result of the discussion was that the best thing would be to add a new attribute, ptr_align, where __attribute__((ptr_align (16))) would mean the parameter is 16-byte aligned.  Or it could be represented as if (((uintptr_t) out1) & 15) __builtin_unreachable (); - a kind of assert alternative that would generate no code.
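
A minimal sketch of the __builtin_unreachable () form mentioned above, applied to the same loop as foo. The names assume_aligned16 and foo_hint are hypothetical, and nothing in this discussion establishes whether a given GCC release actually uses such a hint when vectorizing.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not a GCC API: tells the optimizer that p is 16-byte
// aligned by making the misaligned case unreachable.  It is meant to emit no
// code, but calling it with a misaligned pointer is undefined behavior.
static inline void assume_aligned16 (const void *p)
{
  if (((std::uintptr_t) p) & 15) __builtin_unreachable ();
}

// Same loop as foo in the testcase, with plain double pointers plus the hint.
__attribute__((noinline))
void foo_hint (double *out1, double *out2, double *out3,
               double *in1, double *in2, int len)
{
  assert (len >= 0);
  assume_aligned16 (out1);
  assume_aligned16 (out2);
  assume_aligned16 (out3);
  assume_aligned16 (in1);
  assume_aligned16 (in2);
  for (int i = 0; i < len; ++i)
    {
      out1[i] = in1[i] * in2[i];
      out2[i] = in1[i] + in2[i];
      out3[i] = in1[i] - in2[i];
    }
}
```

The caller has to guarantee the alignment, for instance by passing arrays declared with __attribute__((aligned (16))) as in the testcase.
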

When the loop is vectorized with both the old and the new compiler, I'm not sure the differences aren't just measurement noise.  I've run it again with 2000000 instead of 500000 iterations and got:
 /usr/src/gcc-4.4-rh/obj/gcc/cc1plus -w -quiet
user	0m25.277s
 g++ -S
user	0m31.888s
 g++ -S -mavx
user	0m30.448s
-DN=16 /usr/src/gcc-4.4-rh/obj/gcc/cc1plus -w -quiet
user	0m25.444s
-DN=16 g++ -S
user	0m25.388s
-DN=16 g++ -S -mavx
user	0m21.580s
-DN=32 /usr/src/gcc-4.4-rh/obj/gcc/cc1plus -w -quiet
user	0m25.307s
-DN=32 g++ -S
user	0m25.461s
-DN=32 g++ -S -mavx
user	0m22.778s

Comment 2 Jakub Jelinek 2011-06-16 11:20:25 UTC
The following is vectorized in both cases, but seems to be reproducibly slower with 4.6:
__attribute__((noinline, noclone))
void baz (double *out1, double *out2, double *out3, double *in1, double *in2, int len)
{
  for (int i = 0; i < len; ++i)
    {
      out1[i] = in1[i] * in2[i];
      out2[i] = in1[i] + in2[i];
      out3[i] = in1[i] - in2[i];
    }
}

double a[50000] __attribute__((aligned (32)));
int
main ()
{
  int i;
  for (i = 0; i < 500000; i++)
    baz (a + 0, a + 10000, a + 20000, a + 30000, a + 40000, 10000);
  return 0;
}

4.4:
Strip out best and worst realtime result
minimum: 6.645898640 sec real / 0.000062341 sec CPU
maximum: 6.921587077 sec real / 0.000157156 sec CPU
average: 6.746931725 sec real / 0.000134046 sec CPU
stdev  : 0.076195043 sec real / 0.000021599 sec CPU

4.6:
Strip out best and worst realtime result
minimum: 6.947258042 sec real / 0.000073529 sec CPU
maximum: 7.463546534 sec real / 0.000160966 sec CPU
average: 7.225394743 sec real / 0.000138264 sec CPU
stdev  : 0.113974332 sec real / 0.000015305 sec CPU

Comment 3 Jakub Jelinek 2011-06-16 14:29:00 UTC
On the #c2 testcase the slowdown seems to be caused by
http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=148211
i.e. misaligned store support in the vectorizer:
http://gcc.gnu.org/ml/gcc-patches/2009-06/msg00492.html
The patch talks about the need to add a cost model; I'm not sure whether one has been added for this.
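
For context, a small sketch that checks the alignment of the pointers the #c2 testcase passes to baz. The helper is_aligned is hypothetical. Since a is 32-byte aligned and each offset is a multiple of 10000 doubles (80000 bytes, i.e. 32 * 2500), all five pointers are 32-byte aligned at run time. Presumably the compiler just cannot rely on that inside the noinline baz, which takes plain double pointers.

```cpp
#include <cassert>
#include <cstdint>

// Same array declaration as in the #c2 testcase.
double a[50000] __attribute__((aligned (32)));

// Hypothetical helper: true if p is a multiple of align bytes.
// align must be a nonzero power of two.
bool is_aligned (const void *p, std::uintptr_t align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  return (((std::uintptr_t) p) & (align - 1)) == 0;
}
```

For example, is_aligned (a + 10000, 32) holds, whereas is_aligned (a + 1, 16) does not, because a + 1 is only 8 bytes past a 32-byte boundary.
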

Comment 4 Jakub Jelinek 2011-07-08 12:21:35 UTC
__builtin_assume_aligned is now backported to 4.6-RH in gcc-4.6.1-2.fc{15,16} as well as gcc46-4.6.1-2.el{5,6}.
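
A minimal sketch of how the builtin could be applied to the loop from the #c2 testcase, assuming the caller really does pass 16-byte aligned pointers. The function name baz_aligned is hypothetical, and I have not measured this variant.

```cpp
#include <cassert>

// Hypothetical variant of baz: each pointer is passed through
// __builtin_assume_aligned, which returns its first argument and lets the
// compiler assume the result is 16-byte aligned.  Passing a pointer that is
// not 16-byte aligned is undefined behavior.
__attribute__((noinline))
void baz_aligned (double *out1, double *out2, double *out3,
                  double *in1, double *in2, int len)
{
  assert (len >= 0);
  double *o1 = (double *) __builtin_assume_aligned (out1, 16);
  double *o2 = (double *) __builtin_assume_aligned (out2, 16);
  double *o3 = (double *) __builtin_assume_aligned (out3, 16);
  const double *i1 = (const double *) __builtin_assume_aligned (in1, 16);
  const double *i2 = (const double *) __builtin_assume_aligned (in2, 16);
  for (int i = 0; i < len; ++i)
    {
      o1[i] = i1[i] * i2[i];
      o2[i] = i1[i] + i2[i];
      o3[i] = i1[i] - i2[i];
    }
}
```

The loop has to use the returned pointers (o1, i1, ...) rather than the original parameters, since the alignment assumption is attached to the builtin's return value.
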

Comment 5 Jakub Jelinek 2011-07-19 10:58:42 UTC
For GCC 4.4-RH, though, __builtin_assume_aligned isn't really backportable: it relies heavily on bit tracking in CCP, which was only introduced in GCC 4.6.

Comment 11 RHEL Program Management 2011-09-01 14:55:12 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.