Bug 1422848 - Long compilation time for aarch64 OpenMP enabled application
Summary: Long compilation time for aarch64 OpenMP enabled application
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Developer Toolset
Classification: Red Hat
Component: gcc
Version: DTS 7.0 RHEL 7
Hardware: aarch64
OS: Unspecified
Assignee: Jakub Jelinek
QA Contact: Michael Petlan
Blocks: 1402684
Reported: 2017-02-16 11:42 UTC by Petr Sury
Modified: 2017-10-24 09:47 UTC (History)

Fixed In Version: devtoolset-7-gcc-7.1.1-7.el7
Last Closed: 2017-10-24 09:47:20 UTC


Attachments
Reproducer (207.59 KB, application/x-gzip)
2017-02-16 11:42 UTC, Petr Sury


Links
System | ID | Private | Priority | Status | Summary | Last Updated
GNU Compiler Collection | 78699 | 0 | None | None | None | 2017-03-08 21:06:55 UTC
Red Hat Bugzilla | 1389276 | 0 | medium | CLOSED | Linking errors on aarch64 for OpenMP enabled application - R_AARCH64_ABS64 used with TLS symbol work_lhs_ | 2021-02-22 00:41:40 UTC
Red Hat Product Errata | RHEA-2017:3016 | 0 | normal | SHIPPED_LIVE | new packages: devtoolset-7-gcc | 2017-10-24 13:21:49 UTC

Internal Links: 1389276

Description Petr Sury 2017-02-16 11:42:42 UTC
Created attachment 1250831
Reproducer

Hello everyone,
  on an APM X-Gene (Potenza A3) CPU with devtoolset-6 enabled (gcc version 6.3.1 20170118), compiling the LU benchmark (one of the NAS Parallel Benchmarks [1]) takes about 45 minutes for some reason; it should normally take a minute or two.

Version-Release number of selected component (if applicable):
  gcc version 6.3.1 20170118

How reproducible:
Compile the LU benchmark from the NAS Parallel Benchmarks (reproducer attached).

Steps to Reproduce:
1. download attached tar
2. untar it
3. execute reproducer.sh


Actual results:
  real	48m1.888s
  user	48m1.913s
  sys	0m0.170s

Expected results:
  Minute or two

Additional info:
[1] https://www.nas.nasa.gov/publications/npb.html

Comment 2 Jiri Hladky 2017-02-16 13:12:33 UTC
Some more comments: 

1) We need to use gcc version >= 6.3; see https://bugzilla.redhat.com/show_bug.cgi?id=1389276#c9

2) On HP m400 the compilation takes under 1 minute. The problem is specific to Mustang systems. 

Jirka

Comment 3 Jakub Jelinek 2017-02-16 13:22:58 UTC
With the same command line options (and no -march=native)?  Then the only reason I can think of is that you don't have enough memory and are swapping to death.

Comment 4 Jiri Hladky 2017-02-16 13:32:29 UTC
Both systems (HP m400 and Mustang) have the same amount of RAM - 16GB. 

@Petr - could you please check the exact command line options being used?

Comment 5 Jeff Bastian 2017-02-16 21:08:43 UTC
(In reply to Jiri Hladky from comment #4)
> Both systems (HP m400 and Mustang) have the same amount of RAM - 16GB. 

The HP m400 systems have 64GB RAM.

[root@hp-moonshot-03-c01 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:             63           0          61           0           1          57
Swap:            11           0          11
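
To rule out the swapping hypothesis from comment #3 while the compiler is running, swap usage can be read directly instead of from free or top. This is a minimal sketch, assuming a Linux /proc/meminfo; the helper name swap_used_kb is made up for illustration and is not part of the reproducer.

```shell
#!/bin/sh
# Print the amount of swap currently in use, in KiB, from a meminfo-style file.
# Defaults to /proc/meminfo (Linux only); swap_used_kb is a hypothetical helper.
swap_used_kb() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END { print total - free }' "${1:-/proc/meminfo}"
}
```

Calling swap_used_kb with no argument a few times during the build shows whether the value grows; a value that stays at 0 would match the "no swap used" observation.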

Comment 6 Jeff Bastian 2017-02-16 21:42:35 UTC
It seems to be related to the -O3 optimizations.  I removed that flag from config/make.def and the build time is much better without it.

I was watching the system with 'top' while building with -O3 and the memory usage was normal (plenty of free RAM, no swap used), the CPU was just pegged at 100% trying to do O3 optimizations.

::::::::::::::
:: With -O3 ::
::::::::::::::

[jbastian@centipede NPB3.3-OMP]$ which gfortran
/opt/rh/devtoolset-6/root/usr/bin/gfortran

[jbastian@centipede NPB3.3-OMP]$ gfortran --version | head -1
GNU Fortran (GCC) 6.3.1 20170118 (Red Hat 6.3.1-2)

[jbastian@centipede NPB3.3-OMP]$ time make lu CLASS=C
...
gfortran -c  -O3 -fopenmp -mcmodel=large rhs.f
^C
make: *** [lu] Interrupt

real	7m48.872s
user	0m4.571s
sys	0m0.134s

:::::::::::::::::
:: Without -O3 ::
:::::::::::::::::

[jbastian@centipede NPB3.3-OMP]$ vi config/make.def

[jbastian@centipede NPB3.3-OMP]$ make clean
...

[jbastian@centipede NPB3.3-OMP]$ time make lu CLASS=C
...
gfortran -c  -fopenmp -mcmodel=large rhs.f
gfortran -c  -fopenmp -mcmodel=large l2norm.f
...
gfortran -fopenmp -mcmodel=large -o ../bin/lu.C.x lu.o read_input.o domain.o setcoeff.o setbv.o exact.o setiv.o erhs.o ssor.o rhs.o l2norm.o jacld.o blts.o jacu.o buts.o error.o syncs.o pintgr.o verify.o ../common/print_results.o ../common/timers.o ../common/wtime.o 
make[2]: Leaving directory '/home/jbastian/NPB/bz1422848/NPB3.3-OMP/LU'
make[1]: Leaving directory '/home/jbastian/NPB/bz1422848/NPB3.3-OMP/LU'

real	0m3.015s
user	0m2.583s
sys	0m0.200s
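
Since the slowdown appears only with -O3, one way to narrow it down is to list which optimizer flags -O3 enables on top of -O2 and then try them one at a time. The sketch below uses GCC's documented -Q --help=optimizers query; it assumes the gfortran driver accepts that query the same way gcc does, and the helper names enabled_flags and only_in_second are hypothetical.

```shell
#!/bin/sh
# List the optimizer flags the compiler reports as [enabled] at a given level.
# FC can override the compiler; gfortran is assumed by default.
enabled_flags() {
    "${FC:-gfortran}" -Q "$1" --help=optimizers | awk '/\[enabled\]/ { print $1 }' | sort
}

# Print lines that appear only in the second of two sorted files.
only_in_second() {
    comm -13 "$1" "$2"
}

# Example (needs a compiler on PATH):
#   enabled_flags -O2 > o2.txt
#   enabled_flags -O3 > o3.txt
#   only_in_second o2.txt o3.txt
```

Each flag printed by the last command could then be added to the -O2 command line individually to see which one reproduces the long compile of rhs.f.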

Comment 7 Jeff Bastian 2017-02-16 21:44:27 UTC
Using -O2 also works well:

[jbastian@centipede NPB3.3-OMP]$ time make lu CLASS=C
...
gfortran -c  -O2 -fopenmp -mcmodel=large ssor.f
gfortran -c  -O2 -fopenmp -mcmodel=large rhs.f
gfortran -c  -O2 -fopenmp -mcmodel=large l2norm.f
...
real	0m6.423s
user	0m6.053s
sys	0m0.133s
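
The three runs above (-O3, no optimization flag, -O2) can be repeated on a single file with a time limit, so that a pathological -O3 compile does not block for most of an hour. This is only a sketch: the function name time_compile, the FC override, and the 120-second default are illustrative assumptions, and timeout is the GNU coreutils command.

```shell
#!/bin/sh
# Time one compile of a single source file at a given optimization level,
# giving up after a limit (in seconds). Uses the flags quoted in this report.
time_compile() {
    level=$1; src=$2; limit=${3:-120}
    start=$(date +%s)
    # $level is left unquoted on purpose so an empty level adds no argument.
    timeout "$limit" "${FC:-gfortran}" -c $level -fopenmp -mcmodel=large "$src"
    status=$?
    end=$(date +%s)
    printf '%s status=%s seconds=%s\n' "${level:-none}" "$status" "$((end - start))"
}

# Example: for level in "" -O2 -O3; do time_compile "$level" rhs.f; done
```

A status of 124 is what timeout returns when the limit was reached, which is how a hanging -O3 compile would show up.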

Comment 8 Jeff Bastian 2017-02-16 21:51:07 UTC
Petr, I'm not able to reproduce your HP m400 results.  That is, it also gets stuck in O3 optimizations for a very long time for me on an HP m400 system.


[jbastian@hp-moonshot-03-c01 NPB3.3-OMP]$ which gfortran
/opt/rh/devtoolset-6/root/usr/bin/gfortran

[jbastian@hp-moonshot-03-c01 NPB3.3-OMP]$ gfortran --version | head -1
GNU Fortran (GCC) 6.3.1 20170118 (Red Hat 6.3.1-2)

[jbastian@hp-moonshot-03-c01 NPB3.3-OMP]$ make clean
...

[jbastian@hp-moonshot-03-c01 NPB3.3-OMP]$ time make lu CLASS=C
...
gfortran -c  -O3 -fopenmp -mcmodel=large rhs.f
^C
make: *** [lu] Interrupt

real	2m35.502s
user	0m4.668s
sys	0m0.073s




While it was building, the top output (hiding idle processes):

top - 16:49:39 up 2 days, 22:46,  2 users,  load average: 0.83, 0.32, 0.20
Tasks: 222 total,   3 running, 219 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.5 us,  0.0 sy,  0.0 ni, 87.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 66940352 total, 63613312 free,  1152320 used,  2174720 buff/cache
KiB Swap: 11722688 total, 11722688 free,        0 used. 59443008 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
24482 jbastian  20   0  455808 309056  16064 R 100.0  0.5   1:44.36 f951
  511 root      20   0       0      0      0 S   0.3  0.0   0:04.85 xfsaild/dm+
24522 root      20   0  126016   8256   3968 R   0.3  0.0   0:00.06 top

Comment 9 Petr Sury 2017-02-17 19:41:03 UTC
Jirka, the flags are -O3 -fopenmp -mcmodel=large, or in some cases the same without -O3.

Jeff, it seems I was not using -O3 on the HP machine, which explains why I missed the performance problem.

Comment 10 Petr Sury 2017-03-02 10:31:11 UTC
Hello everyone,
  a short summary of the current status: the compilation problem appears to be caused by an optimization switch, specifically -O3.
  Compilation time with -O3 is about 50 minutes on both systems (HP m400 and Mustang); with -O2 it is about a minute or two.
  As for memory usage, both systems have more than enough memory (64 GB and 16 GB RAM). /usr/bin/time -v reports a maximum resident set size of 342848 kbytes (full logs below).

  If you need any more information, please let me know.

Log from the Mustang system

Command being timed: "make lu CLASS=C"
User time (seconds): 2884.58
System time (seconds): 0.24
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 48:04.61
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 342848
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 27944
Voluntary context switches: 220
Involuntary context switches: 1090
Swaps: 0
File system inputs: 0
File system outputs: 7040
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 65536
Exit status: 0

Log from the HP m400 system

Command being timed: "make lu CLASS=C"
User time (seconds): 3226.08
System time (seconds): 0.25
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 53:46.40
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 342848
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 27944
Voluntary context switches: 275
Involuntary context switches: 994
Swaps: 0
File system inputs: 1096
File system outputs: 7040
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 65536
Exit status: 0
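
For comparing logs like the two above, the elapsed field can be converted to seconds so it lines up with the "User time (seconds)" field. The following is a small sketch, assuming the GNU time -v output format shown here; elapsed_seconds is a hypothetical helper name.

```shell
#!/bin/sh
# Convert the "Elapsed (wall clock) time" field of a GNU `time -v` log to seconds.
# Handles both the h:mm:ss and m:ss forms; reads the named files or stdin.
elapsed_seconds() {
    awk -F': ' '/Elapsed \(wall clock\) time/ {
        n = split($NF, part, ":")
        secs = 0
        for (i = 1; i <= n; i++) secs = secs * 60 + part[i]
        printf "%.2f\n", secs
    }' "$@"
}
```

For the Mustang log, 48:04.61 converts to 2884.61 seconds, which is essentially the 2884.58 seconds of user time: the build is CPU-bound in a single process, not waiting on I/O or swap.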

Comment 18 Michael Petlan 2017-09-25 11:50:58 UTC
Reproduced with devtoolset-6-gcc on a Mustang. It is enough to compile rhs.f.
$ time scl enable devtoolset-6 -- gfortran -c  -O3 -fopenmp -mcmodel=large rhs.f

Without DTS (using system gcc) or with DTS-7, it compiles in about two seconds.
VERIFIED for devtoolset-7-gcc-gfortran-7.2.1-1.el7.aarch64.

Comment 20 Michael Petlan 2017-09-25 11:53:52 UTC
... and the results of the time command from comment #18:

real	21m59.579s
user	0m0.129s
sys	0m0.015s

(after hitting Ctrl-C)

Comment 22 errata-xmlrpc 2017-10-24 09:47:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3016

