Description of problem:

There is an important multi-physics code used by the labs called pf3d. Several of its loops make heavy use of complex numbers, and the code runs much slower when compiled with GCC than when compiled with ICC. The perf reports readily show why. A perf record of the single-threaded code provided, built with ICC, shows:

  48.46%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] __svml_sincosf8_e9
   8.59%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] le_nrandom
   8.28%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] rotth
   7.44%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] init_real
   6.80%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] couple4_nocomp
   5.94%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] init_complex

The same thing for GCC 7.2.1 shows:

  45.51%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] __mulsc3
  16.79%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] rotth._omp_fn.0
  14.82%  pf3dtest-omp.gc  libm-2.17.so         [.] __sincosf
   5.95%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] couple4_nocomp._omp_fn.1
   5.86%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] couple4._omp_fn.0

This problem appears to have been observed before, in GCC PR81478.

Version-Release number of selected component (if applicable):

$ gcc --version
gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

devtoolset-7-gcc-7.3.1-5.4.el7.x86_64

How reproducible: always
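For reference, a minimal kernel of the shape that triggers this (a sketch, NOT the pf3d source -- the function name and loop body below are guessed from the symbol names in the profile). Built with a plain "gcc -O2 -c repro.c", the float-complex multiply goes through the libgcc helper __mulsc3, which is the symbol dominating the GCC profile above:

  /* repro.c: hypothetical kernel shaped like pf3d's rotth hot loop */
  #include <complex.h>
  #include <math.h>
  #include <stddef.h>

  void rot_phase(float complex *carr, const float *theta, size_t n)
  {
      for (size_t i = 0; i < n; i++)
          /* C99 Annex G semantics make GCC route this multiply through
           * the __mulsc3 library helper (or an inline expansion with a
           * __mulsc3 fallback) unless told otherwise. */
          carr[i] = carr[i] * (cosf(theta[i]) + I * sinf(theta[i]));
  }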
There are several options that affect how complex multiplication is expanded.

First of all, is this Fortran or C or C++? Are you using or not using -f{,no-}cx-limited-range and/or -f{,no-}cx-fortran-rules? What about -ffast-math, or at least -f{,no-}signaling-nans?

The C99 requirements are expensive. With -fcx-limited-range -fno-cx-fortran-rules the multiplication should usually be emitted inline and cheaper; -fcx-fortran-rules gives something in between. Based on these options (and on -O0, and on whether the code is considered cold), the complex multiplication is expanded either as the inline

  x = (ar*br - ai*bi) + i*(ar*bi + br*ai);

or as

  x = (ar*br - ai*bi) + i*(ar*bi + br*ai);
  if (isunordered (__real__ x, __imag__ x))
    x = __mulsc3 (a, b);

or as

  x = __mulsc3 (a, b);

So, does the program care about handling of NaNs in the complex multiplication or not?
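Spelled out as compilable C for float complex (a sketch -- only the __mulsc3 prototype is libgcc's; the wrapper function names are made up), the three expansions above look like:

  #include <complex.h>
  #include <math.h>

  /* libgcc's slow-path helper: takes the four scalar parts */
  float complex __mulsc3(float ar, float ai, float br, float bi);

  /* -fcx-limited-range: plain inline expansion, no NaN recovery */
  float complex mul_limited(float complex a, float complex b)
  {
      float ar = crealf(a), ai = cimagf(a);
      float br = crealf(b), bi = cimagf(b);
      return (ar*br - ai*bi) + (ar*bi + br*ai) * I;
  }

  /* default C99 rules in hot code: inline fast path, library fallback
   * only when the naive result came out NaN + NaN*i */
  float complex mul_c99(float complex a, float complex b)
  {
      float ar = crealf(a), ai = cimagf(a);
      float br = crealf(b), bi = cimagf(b);
      float complex x = (ar*br - ai*bi) + (ar*bi + br*ai) * I;
      if (isunordered(crealf(x), cimagf(x)))
          x = __mulsc3(ar, ai, br, bi);
      return x;
  }

  /* at -O0, or in code the compiler considers cold, GCC emits the
   * __mulsc3 call unconditionally */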
(In reply to Jakub Jelinek from comment #5)
> There are several options that affect how complex multiplication is
> expanded.
> First of all, is this Fortran or C or C++?

The example code is C.

> Are you using or not using -f{,no-}cx-limited-range and/or
> -f{,no-}cx-fortran-rules? What about -ffast-math, or at least
> -f{,no-}signaling-nans?
> The C99 requirements are expensive. With -fcx-limited-range
> -fno-cx-fortran-rules the multiplication should usually be emitted inline
> and cheaper; -fcx-fortran-rules gives something in between.
> Based on these options (and on -O0, and on whether the code is considered
> cold), the complex multiplication is expanded either as the inline
>   x = (ar*br - ai*bi) + i*(ar*bi + br*ai);
> or as
>   x = (ar*br - ai*bi) + i*(ar*bi + br*ai);
>   if (isunordered (__real__ x, __imag__ x))
>     x = __mulsc3 (a, b);
> or as
>   x = __mulsc3 (a, b);
>
> So, does the program care about handling of NaNs in the complex
> multiplication or not?

Finding out about the rest from the user. We'll see if they can use -ffast-math or any of the other options.
Regarding -ffast-math:

"I usually build the app (the parent code of the kernels) with float precision (float for real numbers and float complex for complex numbers). Fast-math options sometimes cause the loss of more bits than I would like. An example is the vector sincos function in Intel's MKL. It has roughly 1 bit less precision than you get from separate vsin and vcos functions. The problem is that some of the algorithms in pF3D "amplify noise". A vector sincos wouldn't be a problem if I used double precision."

Something to note about this, which applies more to glibc: when the performance work was done on pow and exp, the accuracy guarantee was reduced from 0.5ULP to 1ULP when operating on doubles. That suggests the same technique for improving sin and cos performance can't be applied over the single-precision float domain. I do not know where to record that fact.

With -ffast-math:

  35.75%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] rotth._omp_fn.0
  33.76%  pf3dtest-omp.gc  libm-2.17.so           [.] __sincosf
   8.70%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] couple4_nocomp._omp_fn.1
   6.11%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] couple4._omp_fn.0
   3.32%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] le_nrandom
   3.32%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] init_real
   2.29%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] init_complex

without:

  44.69%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] __mulsc3
  17.20%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] rotth._omp_fn.0
  14.57%  pf3dtest-omp.gc  libm-2.17.so         [.] __sincosf
   6.21%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] couple4._omp_fn.0
   6.00%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] couple4_nocomp._omp_fn.1
   2.37%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] init_real
   2.30%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] le_nrandom
   1.57%  pf3dtest-omp.gc  pf3dtest-omp.gcc721  [.] init_complex

The actual performance didn't improve sufficiently:

  *** results.gcc721.1    2018-05-09 10:04:53.190196000 -0700
  --- results.gcc721fm.1  2018-05-09 10:04:59.225152000 -0700
  +++ results.icc1801.1   2018-05-09 10:04:44.527316000 -0700
  *zones/sec/thread = 6.959040e+07, tot zones/sec = 6.959040e+07
  -zones/sec/thread = 9.925048e+07, tot zones/sec = 9.925048e+07
  +zones/sec/thread = 2.696842e+08, tot zones/sec = 2.696842e+08

Note that even with -ffast-math, GCC is still only about 1/3 the speed of ICC.

> So, does the program care about handling of NaNs in the complex
> multiplication or not?

"In production runs, I always try to generate an interrupt when a NAN or INF is generated and I flush denormals to zero. That means I don't anticipate any NANs to be generated and wouldn't make any provision for "fix up" after NAN generation. I want the code to throw an interrupt and I will then fix it so that the exception goes away. The sin() and cos() operate on real numbers and the angles in the complex plane should be in the range 0 to 2*pi or -pi to pi. That means we could skip some of the range reduction stuff that sin/cos library functions need to support. There shouldn't be any way to generate a NAN with a sin or cos call."

So we should be able to avoid those checks. I'll look carefully at the various -fcx-* options to see if I can find a set that further improves performance here. I'll rerun the tests with -fcx-limited-range -fno-cx-fortran-rules, without -ffast-math, for comparison.

Would it help to have a disassembly of the ICC-generated function to see what instructions it used?
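Given that answer (trap on NaN/INF rather than recover), the isunordered fallback buys this code nothing. A hedged sketch of running in that mode, using glibc's feenableexcept -- the exception mask and the flag set in the comment are assumptions to confirm with the user, not something pf3d is known to do exactly this way:

  /* Build sketch (the flag set under test, per the discussion above):
   *   gcc -O2 -fopenmp -fcx-limited-range -fno-cx-fortran-rules kernel.c -lm
   */
  #define _GNU_SOURCE
  #include <fenv.h>

  static void trap_fp_errors(void)
  {
      /* raise SIGFPE on invalid operations (NaN production), overflow
       * (INF), and divide-by-zero; flushing denormals to zero (FTZ/DAZ)
       * is configured separately via the x86 MXCSR register */
      feenableexcept(FE_INVALID | FE_OVERFLOW | FE_DIVBYZERO);
  }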
Even with -Ofast -mavx2 -ffast-math, GCC generates:

  33.84%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] rotth._omp_fn.0
  31.72%  pf3dtest-omp.gc  libm-2.17.so           [.] __sincosf
   9.58%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] couple4_nocomp._omp_fn.1
   6.41%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] couple4._omp_fn.0
   3.73%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] le_nrandom
   3.65%  pf3dtest-omp.gc  pf3dtest-omp.gcc721fm  [.] init_real

While ICC ends up with:

  47.03%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] __svml_sincosf8_e9
   8.81%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] rotth
   8.26%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] init_real
   8.20%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] le_nrandom
   7.05%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] couple4_nocomp
   5.39%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] init_complex
   4.62%  pf3dtest-omp.ic  pf3dtest-omp.icc1801  [.] couple4

Digging a bit deeper into the instructions being generated. When I look at the code that Intel generated, I see:

   6.75 │ vinser $0x1,0x10(%r14,%r15,4),%ymm0,%ymm0
   0.64 │ callq  _DYNAMIC+0x230
        │ vunpck %ymm0,%ymm1,%ymm2
        │ for (iy=0; iy<nyl; iy++) {
        │   thetloc= thetb+iy*nxl;
        │ #ifdef _OPENMP
        │ #pragma omp simd
        │ #endif
        │   for (ix=0; ix<nxl_loc; ix++) {
   6.42 │ add    $0x8,%r12d
        │ carr(ix,iy)= carr(ix,iy)*(COS(thetloc[ix])+IREAL*SIN(thetloc[ix]));
        │ vunpck %ymm0,%ymm1,%ymm3
        │ vperm2 $0x20,%ymm3,%ymm2,%ymm5
        │ vperm2 $0x31,%ymm3,%ymm2,%ymm11
   4.71 │ vmovsl (%rsi,%r15,8),%ymm4
        │ vmovsh (%rsi,%r15,8),%ymm6
        │ vmovsl 0x20(%rsi,%r15,8),%ymm10
        │ vmovsh 0x20(%rsi,%r15,8),%ymm12
   7.17 │ vshufp $0xb1,%ymm5,%ymm5,%ymm7
        │ vshufp $0xb1,%ymm11,%ymm11,%ymm13
        │ vmulps %ymm5,%ymm4,%ymm8
   8.14 │ vmulps %ymm7,%ymm6,%ymm9
   7.71 │ vmulps %ymm11,%ymm10,%ymm14
   4.82 │ vmulps %ymm13,%ymm12,%ymm15
   6.32 │ vaddsu %ymm9,%ymm8,%ymm0
   8.89 │ vaddsu %ymm15,%ymm14,%ymm1
  10.17 │ vmovup %ymm0,(%rsi,%r15,8)
   3.85 │ vmovup %ymm1,0x20(%rsi,%r15,8)
        │ for (iy=0; iy<nyl; iy++) {
        │   thetloc= thetb+iy*nxl;
        │ #ifdef _OPENMP
        │ #pragma omp simd
        │ #endif

The instructions that GCC generates are a much more compact set, but they don't seem to be nearly as fast:

        │ #ifdef _OPENMP
        │ #pragma omp simd
        │ #endif
        │   for (ix=0; ix<nxl_loc; ix++) {
        │ carr(ix,iy)= carr(ix,iy)*(COS(thetloc[ix])+IREAL*SIN(thetloc[ix]));
   0.01 │ vmovss 0xc(%rsp),%xmm2
   4.85 │ vmovss 0x8(%rsp),%xmm1
        │ vmulss %xmm4,%xmm2,%xmm0
  19.33 │ vmulss %xmm3,%xmm1,%xmm5
   1.85 │ vmulss %xmm3,%xmm2,%xmm2
   5.19 │ vmulss %xmm4,%xmm1,%xmm1
   3.61 │ vsubss %xmm5,%xmm0,%xmm0
  12.58 │ vaddss %xmm2,%xmm1,%xmm1
   9.06 │ vmovss %xmm0,-0x8(%r14)
   2.20 │ vmovss %xmm1,-0x4(%r14)
   6.32 │ cmp    %r14,%rbx
        │ jne    b0
        │ add    $0x1,%ebp
        │ add    0x18(%rsp),%r13
        │ add    0x10(%rsp),%rbx
        │ cmp    %ebp,%r12d
        │ jne    a0
        │    a sine and cosine.  This may be slower if a fast sincos()
        │    is available */
        │ nxl_loc= nxl;
        │
        │ #ifdef _OPENMP
        │ #pragma omp parallel for private(ix, iy, thetloc)

One big difference that immediately jumps out at me is that we're using xmm registers rather than ymm registers. Why is that?
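A plausible answer, offered as a guess: the loop body contains a sincosf call, and on this system (glibc 2.17) there is no vectorized sincosf that GCC 7 can emit a call to, so the vectorizer gives up on the whole loop and the complex multiply is left as scalar xmm code. ICC has no such problem because it calls its own 8-wide SVML routine, __svml_sincosf8_e9, which sits at the top of the ICC profile. Reconstructed from the annotation above, the loop in question looks like this (COS, SIN, IREAL and the carr macro are pf3d's; their exact definitions are assumed here):

  #pragma omp simd
  for (ix = 0; ix < nxl_loc; ix++)
      carr(ix, iy) = carr(ix, iy)
                     * (COS(thetloc[ix]) + IREAL * SIN(thetloc[ix]));

If that guess is right, pointing GCC at a vector math library (e.g. via -mveclibabi=svml, though its function coverage would need checking) might be worth an experiment.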
The fix for PR70291 has been implemented in GCC 9, so this is going to be fixed in DTS 9.
Is the fix one of those things that will be easy to backport? What are the chances it could make DTS 8.1? The place it really matters is DTS for ARM.
It doesn't seem that the fix will be backported to 8.1.
I heard from the guys at LLNL that ARM had done something -- evidently in glibc -- which improved the performance of this considerably. LLNL is happy with what they did. Does anybody happen to know what they did? Given that whatever the solution was is not in DTS, this may be filed against the wrong component now.
Just got additional information from LLNL. Evidently this is targeted correctly. The fix will be delivered in GCC 9, and it has to do with inlining cimag(), creal(), and conj() at the -O2 optimization level rather than requiring -Ofast.
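As a hedged illustration of what that change means (a toy example, not the pf3d code -- we only know the description of the fix from LLNL's report above): accessor-style complex operations like the following should compile to straight-line code at plain -O2 under GCC 9, where previously the fast expansion needed -Ofast:

  #include <complex.h>

  /* |z|^2 via z * conj(z); with the GCC 9 change this should be a few
   * multiplies and an add at -O2, with no library call */
  float magnitude2(float complex z)
  {
      return crealf(z * conjf(z));
  }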
Clearing the needinfo -- given Comment 10.
dev_ack+ since PR70291 is fixed in GCC 9.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:4134