Description of problem:

There is an important multi-physics code used by the labs called pf3d. Several of its loops make heavy use of complex numbers, and the code runs much slower when compiled with GCC than when compiled with ICC. The perf reports readily show why. A perf record of the single-threaded code provided, built with ICC, shows:

  48.46%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] __svml_sincosf8_e9
   8.59%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] le_nrandom
   8.28%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] rotth
   7.44%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] init_real
   6.80%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] couple4_nocomp
   5.94%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] init_complex

The same thing for GCC 7.2.1 shows:

  45.51%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] __mulsc3
  16.79%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] rotth._omp_fn.0
  14.82%  pf3dtest-omp.gc  libm-2.17.so         [.] __sincosf
   5.95%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] couple4_nocomp._omp_fn.1
   5.86%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] couple4._omp_fn.0

This problem appears to have been observed before, in GCC PR81478.

Version-Release number of selected component (if applicable):

$ gcc --version
gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

devtoolset-7-gcc-7.3.1-5.4.el7.x86_64

How reproducible: always
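For reference, a minimal kernel of the shape that triggers this (a sketch, NOT the pf3d source -- the function name and loop body below are guessed from the symbol names in the profile). Built with a plain "gcc -O2 -c repro.c", the float-complex multiply goes through the libgcc helper __mulsc3, which is the symbol dominating the GCC profile above:

  /* repro.c: hypothetical kernel shaped like pf3d's rotth hot loop */
  #include <complex.h>
  #include <math.h>
  #include <stddef.h>

  void rot_phase(float complex *carr, const float *theta, size_t n)
  {
      for (size_t i = 0; i < n; i++)
          /* C99 Annex G semantics make GCC route this multiply through
           * the __mulsc3 library helper (or an inline expansion with a
           * __mulsc3 fallback) unless told otherwise. */
          carr[i] = carr[i] * (cosf(theta[i]) + I * sinf(theta[i]));
  }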
There are several options that affect how complex multiplication is expanded.

First of all, is this Fortran or C or C++? Are you using or not using -f{,no-}cx-limited-range and/or -f{,no-}cx-fortran-rules? What about -ffast-math, or at least -f{,no-}signaling-nans?

The C99 requirements are expensive. With -fcx-limited-range -fno-cx-fortran-rules the multiplication should usually be emitted inline and cheaper; -fcx-fortran-rules gives something in between. Based on these options (and on -O0, and on whether the code is considered cold), the complex multiplication is expanded either as the inline

  x = (ar*br - ai*bi) + i*(ar*bi + br*ai);

or as

  x = (ar*br - ai*bi) + i*(ar*bi + br*ai);
  if (isunordered (__real__ x, __imag__ x))
    x = __mulsc3 (a, b);

or as

  x = __mulsc3 (a, b);

So, does the program care about handling of NaNs in the complex multiplication or not?
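Spelled out as compilable C for float complex (a sketch -- only the __mulsc3 prototype is libgcc's; the wrapper function names are made up), the three expansions above look like:

  #include <complex.h>
  #include <math.h>

  /* libgcc's slow-path helper: takes the four scalar parts */
  float complex __mulsc3(float ar, float ai, float br, float bi);

  /* -fcx-limited-range: plain inline expansion, no NaN recovery */
  float complex mul_limited(float complex a, float complex b)
  {
      float ar = crealf(a), ai = cimagf(a);
      float br = crealf(b), bi = cimagf(b);
      return (ar*br - ai*bi) + (ar*bi + br*ai) * I;
  }

  /* default C99 rules in hot code: inline fast path, library fallback
   * only when the naive result came out NaN + NaN*i */
  float complex mul_c99(float complex a, float complex b)
  {
      float ar = crealf(a), ai = cimagf(a);
      float br = crealf(b), bi = cimagf(b);
      float complex x = (ar*br - ai*bi) + (ar*bi + br*ai) * I;
      if (isunordered(crealf(x), cimagf(x)))
          x = __mulsc3(ar, ai, br, bi);
      return x;
  }

  /* at -O0, or in code the compiler considers cold, GCC emits the
   * __mulsc3 call unconditionally */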
(In reply to Jakub Jelinek from comment #5)
> There are several options that affect how complex multiplication is
> expanded.
> First of all, is this Fortran or C or C++?

The example code is C.

> Are you using or not using -f{,no-}cx-limited-range and/or
> -f{,no-}cx-fortran-rules? What about -ffast-math, or at least
> -f{,no-}signaling-nans?
> The C99 requirements are expensive. With -fcx-limited-range
> -fno-cx-fortran-rules the multiplication should usually be emitted inline
> and cheaper; -fcx-fortran-rules gives something in between.
> Based on these options (and on -O0, and on whether the code is considered
> cold), the complex multiplication is expanded either as the inline
>   x = (ar*br - ai*bi) + i*(ar*bi + br*ai);
> or as
>   x = (ar*br - ai*bi) + i*(ar*bi + br*ai);
>   if (isunordered (__real__ x, __imag__ x))
>     x = __mulsc3 (a, b);
> or as
>   x = __mulsc3 (a, b);
>
> So, does the program care about handling of NaNs in the complex
> multiplication or not?

Finding out about the rest from the user. We'll see if they can use -ffast-math or any of the other options.
Regarding -ffast-math:

"I usually build the app (the parent code of the kernels) with float precision (float for real numbers and float complex for complex numbers). Fast-math options sometimes cause the loss of more bits than I would like. An example is the vector sincos function in Intel's MKL. It has roughly 1 bit less precision than you get from separate vsin and vcos functions. The problem is that some of the algorithms in pF3D "amplify noise". A vector sincos wouldn't be a problem if I used double precision."

Something to note about this, which applies more to glibc: when the performance work was done on pow and exp, the accuracy guarantee was reduced from 0.5ULP to 1ULP when operating on doubles. That suggests the same technique for improving sin and cos performance can't be applied over the single-precision float domain. I do not know where to record that fact.

With -ffast-math:

  35.75%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] rotth._omp_fn.0
  33.76%  pf3dtest-omp.gc  libm-2.17.so           [.] __sincosf
   8.70%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] couple4_nocomp._omp_fn.1
   6.11%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] couple4._omp_fn.0
   3.32%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] le_nrandom
   3.32%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] init_real
   2.29%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] init_complex

without:

  44.69%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] __mulsc3
  17.20%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] rotth._omp_fn.0
  14.57%  pf3dtest-omp.gc  libm-2.17.so         [.] __sincosf
   6.21%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] couple4._omp_fn.0
   6.00%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] couple4_nocomp._omp_fn.1
   2.37%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] init_real
   2.30%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] le_nrandom
   1.57%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] init_complex

The actual performance didn't improve sufficiently:

  *** results.gcc721.1    2018-05-09 10:04:53.190196000 -0700
  --- results.gcc721fm.1  2018-05-09 10:04:59.225152000 -0700
  +++ results.icc1801.1   2018-05-09 10:04:44.527316000 -0700
  *zones/sec/thread = 6.959040e+07, tot zones/sec = 6.959040e+07
  -zones/sec/thread = 9.925048e+07, tot zones/sec = 9.925048e+07
  +zones/sec/thread = 2.696842e+08, tot zones/sec = 2.696842e+08

Note that even with -ffast-math, GCC is still only about 1/3 the speed of ICC.

> So, does the program care about handling of NaNs in the complex
> multiplication or not?

"In production runs, I always try to generate an interrupt when a NAN or INF is generated and I flush denormals to zero. That means I don't anticipate any NANs to be generated and wouldn't make any provision for "fix up" after NAN generation. I want the code to throw an interrupt and I will then fix it so that the exception goes away. The sin() and cos() operate on real numbers and the angles in the complex plane should be in the range 0 to 2*pi or -pi to pi. That means we could skip some of the range reduction stuff that sin/cos library functions need to support. There shouldn't be any way to generate a NAN with a sin or cos call."

So we should be able to avoid those checks. I'll look carefully at the various -fcx-* options to see if I can find a set that further improves performance here. I'll rerun the tests with -fcx-limited-range -fno-cx-fortran-rules, without -ffast-math, for comparison.

Would it help to have a disassembly of the ICC-generated function to see what instructions it used?
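Given that answer (trap on NaN/INF rather than recover), the isunordered fallback buys this code nothing. A hedged sketch of running in that mode, using glibc's feenableexcept -- the exception mask and the flag set in the comment are assumptions to confirm with the user, not something pf3d is known to do exactly this way:

  /* Build sketch (the flag set under test, per the discussion above):
   *   gcc -O2 -fopenmp -fcx-limited-range -fno-cx-fortran-rules kernel.c -lm
   */
  #define _GNU_SOURCE
  #include <fenv.h>

  static void trap_fp_errors(void)
  {
      /* raise SIGFPE on invalid operations (NaN production), overflow
       * (INF), and divide-by-zero; flushing denormals to zero (FTZ/DAZ)
       * is configured separately via the x86 MXCSR register */
      feenableexcept(FE_INVALID | FE_OVERFLOW | FE_DIVBYZERO);
  }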
Even with -Ofast -mavx2 -ffast-math, GCC generates:

  33.84%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] rotth._omp_fn.0
  31.72%  pf3dtest-omp.gc  libm-2.17.so           [.] __sincosf
   9.58%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] couple4_nocomp._omp_fn.1
   6.41%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] couple4._omp_fn.0
   3.73%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] le_nrandom
   3.65%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] init_real

While ICC ends up with:

  47.03%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] __svml_sincosf8_e9
   8.81%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] rotth
   8.26%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] init_real
   8.20%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] le_nrandom
   7.05%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] couple4_nocomp
   5.39%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] init_complex
   4.62%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] couple4

Digging a bit deeper into the instructions being generated. When I look at the code that Intel generated, I see:

   6.75 │ vinser $0x1,0x10(%r14,%r15,4),%ymm0,%ymm0
   0.64 │ callq  _DYNAMIC+0x230
        │ vunpck %ymm0,%ymm1,%ymm2
        │ for (iy=0; iy<nyl; iy++) {
        │   thetloc= thetb+iy*nxl;
        │ #ifdef _OPENMP
        │ #pragma omp simd
        │ #endif
        │   for (ix=0; ix<nxl_loc; ix++) {
   6.42 │ add    $0x8,%r12d
        │ carr(ix,iy)= carr(ix,iy)*(COS(thetloc[ix])+IREAL*SIN(thetloc[ix]));
        │ vunpck %ymm0,%ymm1,%ymm3
        │ vperm2 $0x20,%ymm3,%ymm2,%ymm5
        │ vperm2 $0x31,%ymm3,%ymm2,%ymm11
   4.71 │ vmovsl (%rsi,%r15,8),%ymm4
        │ vmovsh (%rsi,%r15,8),%ymm6
        │ vmovsl 0x20(%rsi,%r15,8),%ymm10
        │ vmovsh 0x20(%rsi,%r15,8),%ymm12
   7.17 │ vshufp $0xb1,%ymm5,%ymm5,%ymm7
        │ vshufp $0xb1,%ymm11,%ymm11,%ymm13
        │ vmulps %ymm5,%ymm4,%ymm8
   8.14 │ vmulps %ymm7,%ymm6,%ymm9
   7.71 │ vmulps %ymm11,%ymm10,%ymm14
   4.82 │ vmulps %ymm13,%ymm12,%ymm15
   6.32 │ vaddsu %ymm9,%ymm8,%ymm0
   8.89 │ vaddsu %ymm15,%ymm14,%ymm1
  10.17 │ vmovup %ymm0,(%rsi,%r15,8)
   3.85 │ vmovup %ymm1,0x20(%rsi,%r15,8)
        │ for (iy=0; iy<nyl; iy++) {
        │   thetloc= thetb+iy*nxl;
        │ #ifdef _OPENMP
        │ #pragma omp simd
        │ #endif

The instructions that GCC generates are a much more compact set, but they don't seem to be nearly as fast:

        │ #ifdef _OPENMP
        │ #pragma omp simd
        │ #endif
        │   for (ix=0; ix<nxl_loc; ix++) {
        │ carr(ix,iy)= carr(ix,iy)*(COS(thetloc[ix])+IREAL*SIN(thetloc[ix]));
   0.01 │ vmovss 0xc(%rsp),%xmm2
   4.85 │ vmovss 0x8(%rsp),%xmm1
        │ vmulss %xmm4,%xmm2,%xmm0
  19.33 │ vmulss %xmm3,%xmm1,%xmm5
   1.85 │ vmulss %xmm3,%xmm2,%xmm2
   5.19 │ vmulss %xmm4,%xmm1,%xmm1
   3.61 │ vsubss %xmm5,%xmm0,%xmm0
  12.58 │ vaddss %xmm2,%xmm1,%xmm1
   9.06 │ vmovss %xmm0,-0x8(%r14)
   2.20 │ vmovss %xmm1,-0x4(%r14)
   6.32 │ cmp    %r14,%rbx
        │ jne    b0
        │ add    $0x1,%ebp
        │ add    0x18(%rsp),%r13
        │ add    0x10(%rsp),%rbx
        │ cmp    %ebp,%r12d
        │ jne    a0
        │    a sine and cosine.  This may be slower if a fast sincos()
        │    is available */
        │ nxl_loc= nxl;
        │
        │ #ifdef _OPENMP
        │ #pragma omp parallel for private(ix, iy, thetloc)

One big difference that immediately jumps out at me is that we're using xmm registers rather than ymm registers. Why is that?
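A plausible answer, offered as a guess: the loop body contains a sincosf call, and on this system (glibc 2.17) there is no vectorized sincosf that GCC 7 can emit a call to, so the vectorizer gives up on the whole loop and the complex multiply is left as scalar xmm code. ICC has no such problem because it calls its own 8-wide SVML routine, __svml_sincosf8_e9, which sits at the top of the ICC profile. Reconstructed from the annotation above, the loop in question looks like this (COS, SIN, IREAL and the carr macro are pf3d's; their exact definitions are assumed here):

  #pragma omp simd
  for (ix = 0; ix < nxl_loc; ix++)
      carr(ix, iy) = carr(ix, iy)
                     * (COS(thetloc[ix]) + IREAL * SIN(thetloc[ix]));

If that guess is right, pointing GCC at a vector math library (e.g. via -mveclibabi=svml, though its function coverage would need checking) might be worth an experiment.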
The fix for PR70291 has been implemented in GCC 9, so this is going to be fixed in DTS 9.
Is the fix one of those things that will be easy to backport? What are the chances it could make DTS 8.1? The place it really matters is DTS for ARM.
It doesn't seem that the fix will be backported to 8.1.
I heard from the guys at LLNL that ARM had done something -- evidently in glibc -- which improved the performance of this considerably. LLNL is happy with what they did. Does anybody happen to know what they did? Given that whatever the solution was is not in DTS, this may be filed against the wrong component now.
Just got additional information from LLNL. Evidently this is targeted correctly. The fix will be delivered in GCC 9, and it has to do with inlining cimag(), creal(), and conj() at the -O2 optimization level rather than requiring -Ofast.
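As a hedged illustration of what that change means (a toy example, not the pf3d code -- we only know the description of the fix from LLNL's report above): accessor-style complex operations like the following should compile to straight-line code at plain -O2 under GCC 9, where previously the fast expansion needed -Ofast:

  #include <complex.h>

  /* |z|^2 via z * conj(z); with the GCC 9 change this should be a few
   * multiplies and an add at -O2, with no library call */
  float magnitude2(float complex z)
  {
      return crealf(z * conjf(z));
  }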
Clearing the needinfo -- given Comment 10.
dev_ack+ since PR70291 is fixed in GCC 9.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:4134