Bug 1769410

Summary: Compiler vectorizes the loop incorrectly when the loop is inlined
Product: Red Hat Developer Toolset Reporter: Piyush Bhoot <pbhoot>
Component: gccAssignee: Marek Polacek <mpolacek>
Status: CLOSED ERRATA QA Contact: Alexandra Petlanová Hájková <ahajkova>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: DTS 9.1 RHEL 7CC: fweimer, jakub, kwalker, law, mcermak, mnewsome, mpolacek, ohudlick, tborcin
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: devtoolset-9-gcc-9.3.1-1.el7 Doc Type: Bug Fix
Doc Text:
Cause: A bug in the basic block vectorization handling in the vectorizer. Consequence: Vectorizing a loop with negative strides produced wrong code. Fix: Basic block vectorization has been amended. Result: The compiler produces code that executes correctly.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-05-26 06:07:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
reproducer none

Description Piyush Bhoot 2019-11-06 15:05:47 UTC
Created attachment 1633342 [details]
reproducer

Description of problem:
Compiler vectorizes the loop incorrectly when the loop is inlined

Version-Release number of selected component (if applicable):
The bug is present in GCC 7 through 9

Reduced test case attached to show bug in vectorization loops.

Files included are:

t.C and tt.C - source
run - build script, note that uncommenting line "OPTS="-03 -mavx2" causes the
problem found in run1.txt.
run1.txt - bad run output
run2.txt - good run output - commenting the above option and uncommenting the
options tagged as fixes the problem.

Comment 2 Jakub Jelinek 2019-11-07 20:02:29 UTC
Seems to have started with http://gcc.gnu.org/r238039
Simplified testcase:
struct Cnum
{
  float re, im;
  Cnum (float re_ = 0.0, float im_ = 0.0) : re(re_), im(im_) {}
};

static inline void
f2 (const Cnum *src1, const Cnum *src2, Cnum *dest, int N)
{
  for (int i = 0; i < N; i++)
    {
      const Cnum & A = src1[i];
      const Cnum & B = *src2;
      dest[i].re = (A.re * B.re - A.im * B.im);
      dest[i].im = (A.im * B.re + A.re * B.im);
      src2--;
    }
}

__attribute__((noinline, noclone, noipa)) void
f1 (const Cnum *src1, const Cnum *src2, Cnum *dest, int N)
{
  return f2 (src1, src2, dest, N);
}

int
main ()
{
  const int N = 512;
  Cnum mask[N];
  Cnum filter[N];
  Cnum out1[N];
  Cnum out2[N];
  for (int i = 0; i < N; ++i)
    {
      mask[i].re = (-0.4 * i + N * 0.30) / N;
      mask[i].im = ( 0.4 * i - N * 0.12) / N;
      filter[i].re = ( 0.7 * i - N * 0.21) / N;
      filter[i].im = ( 0.2 * i + N * 0.40) / N;
      out1[i].re = out1[i].im = 0;
      out2[i].re = out2[i].im = 0;
    }

  Cnum *dest1 = out1;
  Cnum *dest2 = out2;
  const Cnum *src1 = mask;
  const Cnum *src2 = filter + N;
  f1(src1, src2, dest1, N);
  f2(src1, src2, dest2, N);
  Cnum checksum1 = 0;
  Cnum checksum2 = 0;
  for (int i = 0; i < N; ++i)
    {
      checksum1.re += out1[i].re; checksum1.im += out1[i].im;
      checksum2.re += out2[i].re; checksum2.im += out2[i].im;
      if (i < 10 || i >= N - 10)
	__builtin_printf ("out1[%d]=%.6f,%.6f out2[%d]=%.6f,%.6f\n",
			  i, out1[i].re, out1[i].im,
			  i, out2[i].re, out2[i].im);
    }
  if (__builtin_fabs (checksum1.re - checksum2.re) > 0.01
      || __builtin_fabs (checksum1.im - checksum2.im) > 0.01)
    __builtin_abort ();
}

Even current trunk aborts, with -O3 -mavx2 as well as e.g. -O3 -mssse3, but doesn't with -O3 -msse3.

Comment 3 Jakub Jelinek 2019-11-07 20:15:15 UTC
On a closer look, this looks like invalid testcase.
In the first iteration of the loop, src2 is filter + N, which may not be dereferenced, as it points at the end of the filter object.
You can see it e.g. with -fsanitize=undefined,address.
Changing *src2 to src2[-1] fixes the UB in there (e.g. using filter + N - 1 would be invalid, as in the last iteration src2--;
would be invalid pointer arithmetics), but doesn't fix the difference with -O3 -mavx2 or -O3 -mssse3, starting with that revision.

Comment 4 Jakub Jelinek 2019-11-07 20:20:02 UTC
The out of line loop isn't vectorized, but just adding __restrict qualifiers to all the 3 pointers makes it vectorized.  I'll have a look in detail tomorrow.

Comment 6 Marek Polacek 2019-12-18 16:13:13 UTC
PR92420 has been fixed in trunk, but not backported to 9 yet.  Hopefully it will be backported soon, thus making the fix available in DTS 9.1.

Comment 7 Marek Polacek 2020-01-14 18:33:20 UTC
The fix still hasn't been backported to gcc-9.

Comment 8 Marek Polacek 2020-01-14 18:33:59 UTC
And it might not be, so moving to DTS 10.0 for now.

Comment 15 errata-xmlrpc 2020-05-26 06:07:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:2274