Created attachment 1633342 [details] reproducer Description of problem: Compiler vectorizes the loop incorrectly when the loop is inlined Version-Release number of selected component (if applicable): The bug is present in GCC 7 through 9 Reduced test case attached to show bug in vectorization loops. Files included are: t.C and tt.C - source run - build script, note that uncommenting line "OPTS="-03 -mavx2" causes the problem found in run1.txt. run1.txt - bad run output run2.txt - good run output - commenting the above option and uncommenting the options tagged as fixes the problem.
Seems to have started with http://gcc.gnu.org/r238039 Simplified testcase: struct Cnum { float re, im; Cnum (float re_ = 0.0, float im_ = 0.0) : re(re_), im(im_) {} }; static inline void f2 (const Cnum *src1, const Cnum *src2, Cnum *dest, int N) { for (int i = 0; i < N; i++) { const Cnum & A = src1[i]; const Cnum & B = *src2; dest[i].re = (A.re * B.re - A.im * B.im); dest[i].im = (A.im * B.re + A.re * B.im); src2--; } } __attribute__((noinline, noclone, noipa)) void f1 (const Cnum *src1, const Cnum *src2, Cnum *dest, int N) { return f2 (src1, src2, dest, N); } int main () { const int N = 512; Cnum mask[N]; Cnum filter[N]; Cnum out1[N]; Cnum out2[N]; for (int i = 0; i < N; ++i) { mask[i].re = (-0.4 * i + N * 0.30) / N; mask[i].im = ( 0.4 * i - N * 0.12) / N; filter[i].re = ( 0.7 * i - N * 0.21) / N; filter[i].im = ( 0.2 * i + N * 0.40) / N; out1[i].re = out1[i].im = 0; out2[i].re = out2[i].im = 0; } Cnum *dest1 = out1; Cnum *dest2 = out2; const Cnum *src1 = mask; const Cnum *src2 = filter + N; f1(src1, src2, dest1, N); f2(src1, src2, dest2, N); Cnum checksum1 = 0; Cnum checksum2 = 0; for (int i = 0; i < N; ++i) { checksum1.re += out1[i].re; checksum1.im += out1[i].im; checksum2.re += out2[i].re; checksum2.im += out2[i].im; if (i < 10 || i >= N - 10) __builtin_printf ("out1[%d]=%.6f,%.6f out2[%d]=%.6f,%.6f\n", i, out1[i].re, out1[i].im, i, out2[i].re, out2[i].im); } if (__builtin_fabs (checksum1.re - checksum2.re) > 0.01 || __builtin_fabs (checksum1.im - checksum2.im) > 0.01) __builtin_abort (); } Even current trunk aborts, with -O3 -mavx2 as well as e.g. -O3 -mssse3, but doesn't with -O3 -msse3.
On a closer look, this looks like invalid testcase. In the first iteration of the loop, src2 is filter + N, which may not be dereferenced, as it points at the end of the filter object. You can see it e.g. with -fsanitize=undefined,address. Changing *src2 to src2[-1] fixes the UB in there (e.g. using filter + N - 1 would be invalid, as in the last iteration src2--; would be invalid pointer arithmetics), but doesn't fix the difference with -O3 -mavx2 or -O3 -mssse3, starting with that revision.
The out of line loop isn't vectorized, but just adding __restrict qualifiers to all the 3 pointers makes it vectorized. I'll have a look in detail tomorrow.
PR92420 has been fixed in trunk, but not backported to 9 yet. Hopefully it will be backported soon, thus making the fix available in DTS 9.1.
The fix still hasn't been backported to gcc-9.
And it might not be, so moving to DTS 10.0 for now.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:2274