Bug 1769410 - Compiler vectorizes the loop incorrectly when the loop is inlined
Summary: Compiler vectorizes the loop incorrectly when the loop is inlined
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Developer Toolset
Classification: Red Hat
Component: gcc
Version: DTS 9.1 RHEL 7
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: rc
: ---
Assignee: Marek Polacek
QA Contact: Alexandra Petlanová Hájková
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-11-06 15:05 UTC by Piyush Bhoot
Modified: 2023-10-06 18:44 UTC (History)
9 users (show)

Fixed In Version: devtoolset-9-gcc-9.3.1-1.el7
Doc Type: Bug Fix
Doc Text:
Cause: A bug in the basic block vectorization handling in the vectorizer. Consequence: Vectorizing a loop with negative strides produced wrong code. Fix: Basic block vectorization has been amended. Result: The compiler produces code that executes correctly.
Clone Of:
Environment:
Last Closed: 2020-05-26 06:07:39 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
reproducer (12.50 KB, application/x-tar)
2019-11-06 15:05 UTC, Piyush Bhoot
no flags Details


Links
System ID Private Priority Status Summary Last Updated
GNU Compiler Collection 92420 0 P2 RESOLVED [8 Regression] Vectorization miscompilation with negative strides since r238039 2020-05-20 19:28:07 UTC
Red Hat Product Errata RHSA-2020:2274 0 None None None 2020-05-26 06:07:46 UTC

Description Piyush Bhoot 2019-11-06 15:05:47 UTC
Created attachment 1633342 [details]
reproducer

Description of problem:
Compiler vectorizes the loop incorrectly when the loop is inlined

Version-Release number of selected component (if applicable):
The bug is present in GCC 7 through 9

Reduced test case attached to show bug in vectorization loops.

Files included are:

t.C and tt.C - source
run - build script, note that uncommenting line "OPTS="-03 -mavx2" causes the
problem found in run1.txt.
run1.txt - bad run output
run2.txt - good run output - commenting the above option and uncommenting the
options tagged as fixes the problem.

Comment 2 Jakub Jelinek 2019-11-07 20:02:29 UTC
Seems to have started with http://gcc.gnu.org/r238039
Simplified testcase:
struct Cnum
{
  float re, im;
  Cnum (float re_ = 0.0, float im_ = 0.0) : re(re_), im(im_) {}
};

static inline void
f2 (const Cnum *src1, const Cnum *src2, Cnum *dest, int N)
{
  for (int i = 0; i < N; i++)
    {
      const Cnum & A = src1[i];
      const Cnum & B = *src2;
      dest[i].re = (A.re * B.re - A.im * B.im);
      dest[i].im = (A.im * B.re + A.re * B.im);
      src2--;
    }
}

__attribute__((noinline, noclone, noipa)) void
f1 (const Cnum *src1, const Cnum *src2, Cnum *dest, int N)
{
  return f2 (src1, src2, dest, N);
}

int
main ()
{
  const int N = 512;
  Cnum mask[N];
  Cnum filter[N];
  Cnum out1[N];
  Cnum out2[N];
  for (int i = 0; i < N; ++i)
    {
      mask[i].re = (-0.4 * i + N * 0.30) / N;
      mask[i].im = ( 0.4 * i - N * 0.12) / N;
      filter[i].re = ( 0.7 * i - N * 0.21) / N;
      filter[i].im = ( 0.2 * i + N * 0.40) / N;
      out1[i].re = out1[i].im = 0;
      out2[i].re = out2[i].im = 0;
    }

  Cnum *dest1 = out1;
  Cnum *dest2 = out2;
  const Cnum *src1 = mask;
  const Cnum *src2 = filter + N;
  f1(src1, src2, dest1, N);
  f2(src1, src2, dest2, N);
  Cnum checksum1 = 0;
  Cnum checksum2 = 0;
  for (int i = 0; i < N; ++i)
    {
      checksum1.re += out1[i].re; checksum1.im += out1[i].im;
      checksum2.re += out2[i].re; checksum2.im += out2[i].im;
      if (i < 10 || i >= N - 10)
	__builtin_printf ("out1[%d]=%.6f,%.6f out2[%d]=%.6f,%.6f\n",
			  i, out1[i].re, out1[i].im,
			  i, out2[i].re, out2[i].im);
    }
  if (__builtin_fabs (checksum1.re - checksum2.re) > 0.01
      || __builtin_fabs (checksum1.im - checksum2.im) > 0.01)
    __builtin_abort ();
}

Even current trunk aborts, with -O3 -mavx2 as well as e.g. -O3 -mssse3, but doesn't with -O3 -msse3.

Comment 3 Jakub Jelinek 2019-11-07 20:15:15 UTC
On a closer look, this looks like invalid testcase.
In the first iteration of the loop, src2 is filter + N, which may not be dereferenced, as it points at the end of the filter object.
You can see it e.g. with -fsanitize=undefined,address.
Changing *src2 to src2[-1] fixes the UB in there (e.g. using filter + N - 1 would be invalid, as in the last iteration src2--;
would be invalid pointer arithmetics), but doesn't fix the difference with -O3 -mavx2 or -O3 -mssse3, starting with that revision.

Comment 4 Jakub Jelinek 2019-11-07 20:20:02 UTC
The out of line loop isn't vectorized, but just adding __restrict qualifiers to all the 3 pointers makes it vectorized.  I'll have a look in detail tomorrow.

Comment 6 Marek Polacek 2019-12-18 16:13:13 UTC
PR92420 has been fixed in trunk, but not backported to 9 yet.  Hopefully it will be backported soon, thus making the fix available in DTS 9.1.

Comment 7 Marek Polacek 2020-01-14 18:33:20 UTC
The fix still hasn't been backported to gcc-9.

Comment 8 Marek Polacek 2020-01-14 18:33:59 UTC
And it might not be, so moving to DTS 10.0 for now.

Comment 15 errata-xmlrpc 2020-05-26 06:07:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:2274


Note You need to log in before you can comment on or make changes to this bug.