Created attachment 1250831
Reproducer

Hello everyone,

on an APM X-Gene (Potenza A3) CPU with devtoolset-6 enabled (that is, gcc version 6.3.1 20170118), compiling the LU benchmark, one of the NAS Parallel Benchmarks [1], takes about 45 minutes for no obvious reason. It should normally take a minute or two.

Version-Release number of selected component (if applicable):
gcc version 6.3.1 20170118

How reproducible:
Compile the LU benchmark from the NAS Parallel Benchmarks (reproducer attached).

Steps to Reproduce:
1. Download the attached tar archive.
2. Untar it.
3. Execute reproducer.sh.

Actual results:
real 48m1.888s
user 48m1.913s
sys 0m0.170s

Expected results:
A minute or two.

Additional info:
[1] https://www.nas.nasa.gov/publications/npb.html
Some more comments:

1) We need to use gcc version >= 6.3. See https://bugzilla.redhat.com/show_bug.cgi?id=1389276#c9
2) On HP m400 the compilation takes under 1 minute. The problem is specific to Mustang systems.

Jirka
With the same command line options (and no -march=native)? Then the only reason I can think of is that you don't have enough memory and the system is swapping to death.
Both systems (HP m400 and Mustang) have the same amount of RAM - 16GB.

@Petr - could you please check the exact command line options being used?
(In reply to Jiri Hladky from comment #4)
> Both systems (HP m400 and Mustang) have the same amount of RAM - 16GB.

The HP m400 systems have 64GB RAM.

[root@hp-moonshot-03-c01 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:             63           0          61           0           1          57
Swap:            11           0          11
It seems to be related to the -O3 optimizations. I removed that flag from config/make.def and the build time is much better without it.

I was watching the system with 'top' while building with -O3 and the memory usage was normal (plenty of free RAM, no swap used), the CPU was just pegged at 100% trying to do O3 optimizations.

::::::::::::::
:: With -O3 ::
::::::::::::::
[jbastian@centipede NPB3.3-OMP]$ which gfortran
/opt/rh/devtoolset-6/root/usr/bin/gfortran
[jbastian@centipede NPB3.3-OMP]$ gfortran --version | head -1
GNU Fortran (GCC) 6.3.1 20170118 (Red Hat 6.3.1-2)
[jbastian@centipede NPB3.3-OMP]$ time make lu CLASS=C
...
gfortran -c -O3 -fopenmp -mcmodel=large rhs.f
^C
make: *** [lu] Interrupt

real 7m48.872s
user 0m4.571s
sys 0m0.134s

:::::::::::::::::
:: Without -O3 ::
:::::::::::::::::
[jbastian@centipede NPB3.3-OMP]$ vi config/make.def
[jbastian@centipede NPB3.3-OMP]$ make clean
...
[jbastian@centipede NPB3.3-OMP]$ time make lu CLASS=C
...
gfortran -c -fopenmp -mcmodel=large rhs.f
gfortran -c -fopenmp -mcmodel=large l2norm.f
...
gfortran -fopenmp -mcmodel=large -o ../bin/lu.C.x lu.o read_input.o domain.o setcoeff.o setbv.o exact.o setiv.o erhs.o ssor.o rhs.o l2norm.o jacld.o blts.o jacu.o buts.o error.o syncs.o pintgr.o verify.o ../common/print_results.o ../common/timers.o ../common/wtime.o
make[2]: Leaving directory '/home/jbastian/NPB/bz1422848/NPB3.3-OMP/LU'
make[1]: Leaving directory '/home/jbastian/NPB/bz1422848/NPB3.3-OMP/LU'

real 0m3.015s
user 0m2.583s
sys 0m0.200s
Using -O2 also works well:

[jbastian@centipede NPB3.3-OMP]$ time make lu CLASS=C
...
gfortran -c -O2 -fopenmp -mcmodel=large ssor.f
gfortran -c -O2 -fopenmp -mcmodel=large rhs.f
gfortran -c -O2 -fopenmp -mcmodel=large l2norm.f
...

real 0m6.423s
user 0m6.053s
sys 0m0.133s
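The comparison above edits config/make.def and rebuilds everything for each optimization level. A lighter way to repeat it might be to time a single source file per level and give up after a limit, so that a slow -O3 run does not block the session. This is only a sketch: the helper name time_level and the variables FC, SRC and LIMIT are made up for illustration, and the default values are assumptions rather than anything taken from the reproducer.

```shell
#!/bin/sh
# Sketch: time one compile of a single source file at a given optimization
# level, giving up after LIMIT seconds.
# FC, SRC and LIMIT are illustrative defaults (assumptions), override as needed.
FC=${FC:-gfortran}
SRC=${SRC:-rhs.f}
LIMIT=${LIMIT:-300}

time_level() {
    opt=$1
    start=$(date +%s)
    # timeout(1) from GNU coreutils exits with status 124 when the limit is hit.
    # $FC is left unquoted on purpose so it may hold a command plus arguments.
    if timeout "$LIMIT" $FC -c "$opt" -fopenmp -mcmodel=large "$SRC"; then
        result=ok
    else
        rc=$?
        if [ "$rc" -eq 124 ]; then
            result=TIMEOUT
        else
            result="failed($rc)"
        fi
    fi
    end=$(date +%s)
    # One line per level: flag, outcome, wall-clock seconds.
    printf '%s %s %ss\n' "$opt" "$result" "$((end - start))"
}
```

Run from the LU source directory, something like `for o in -O0 -O2 -O3; do time_level "$o"; done` would print one line per level. On an affected compiler the -O3 line would be expected to report TIMEOUT with the default limit, since the compile of rhs.f runs far longer than five minutes.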
Petr, I'm not able to reproduce your HP m400 results. That is, it also gets stuck in O3 optimizations for a very long time for me on an HP m400 system.

[jbastian@hp-moonshot-03-c01 NPB3.3-OMP]$ which gfortran
/opt/rh/devtoolset-6/root/usr/bin/gfortran
[jbastian@hp-moonshot-03-c01 NPB3.3-OMP]$ gfortran --version | head -1
GNU Fortran (GCC) 6.3.1 20170118 (Red Hat 6.3.1-2)
[jbastian@hp-moonshot-03-c01 NPB3.3-OMP]$ make clean
...
[jbastian@hp-moonshot-03-c01 NPB3.3-OMP]$ time make lu CLASS=C
...
gfortran -c -O3 -fopenmp -mcmodel=large rhs.f
^C
make: *** [lu] Interrupt

real 2m35.502s
user 0m4.668s
sys 0m0.073s

While it was building, the top output (hiding idle processes):

top - 16:49:39 up 2 days, 22:46,  2 users,  load average: 0.83, 0.32, 0.20
Tasks: 222 total,   3 running, 219 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.5 us,  0.0 sy,  0.0 ni, 87.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 66940352 total, 63613312 free,  1152320 used,  2174720 buff/cache
KiB Swap: 11722688 total, 11722688 free,        0 used. 59443008 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
24482 jbastian  20   0  455808 309056  16064 R 100.0  0.5   1:44.36 f951
  511 root      20   0       0      0      0 S   0.3  0.0   0:04.85 xfsaild/dm+
24522 root      20   0  126016   8256   3968 R   0.3  0.0   0:00.06 top
Jirka, the flags are -O3 -fopenmp -mcmodel=large, or in some cases the same without -O3.

Jeff, it seems I was not using -O3 on the HP machine, which explains why I missed the performance problem there.
Hello everyone,

a short summary of the current status. The compilation problem appears to be caused by an optimization switch, specifically -O3. With -O3 the compilation takes about 50 minutes on both systems (HP m400 and Mustang); with -O2 it takes a minute or two.

As for memory usage, both systems have more than enough memory (64 and 16 GB RAM): /usr/bin/time -v reports "Maximum resident set size (kbytes): 342848" (full logs below).

If you need any more information, please let me know.

Log from the Mustang system:
    Command being timed: "make lu CLASS=C"
    User time (seconds): 2884.58
    System time (seconds): 0.24
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 48:04.61
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 342848
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 27944
    Voluntary context switches: 220
    Involuntary context switches: 1090
    Swaps: 0
    File system inputs: 0
    File system outputs: 7040
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 65536
    Exit status: 0

Log from the HP m400 system:
    Command being timed: "make lu CLASS=C"
    User time (seconds): 3226.08
    System time (seconds): 0.25
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 53:46.40
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 342848
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 27944
    Voluntary context switches: 275
    Involuntary context switches: 994
    Swaps: 0
    File system inputs: 1096
    File system outputs: 7040
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 65536
    Exit status: 0
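To read the two figures that matter out of such a log (CPU time spent and peak memory), a small awk filter may be enough. The sketch below assumes the log was saved in the `/usr/bin/time -v` format shown above; summarize_time_log is a hypothetical helper name, not part of any tool.

```shell
#!/bin/sh
# Sketch: summarize a saved `/usr/bin/time -v` log read from stdin.
# summarize_time_log is a hypothetical helper name (assumption).
summarize_time_log() {
    awk -F': ' '
        # Field 2 is the value after the "label: " prefix.
        /User time \(seconds\)/                { user = $2 }
        /Maximum resident set size \(kbytes\)/ { rss = $2 }
        END {
            # Convert seconds to minutes and kbytes to MiB.
            printf "user %.1f min, peak RSS %.1f MiB\n", user / 60, rss / 1024
        }
    '
}
```

Fed the Mustang log above, for example via `summarize_time_log < mustang.log` (the file name is made up), this prints `user 48.1 min, peak RSS 334.8 MiB`. About 48 minutes of pure user time with a peak of roughly 335 MiB on a machine with 16 GB of RAM is consistent with the compiler being CPU-bound rather than swapping.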
Reproduced with devtoolset-6-gcc on a Mustang. It is enough to compile rhs.f.

$ time scl enable devtoolset-6 -- gfortran -c -O3 -fopenmp -mcmodel=large rhs.f

Without DTS (using system gcc) or with DTS-7, it compiles in about two seconds.

VERIFIED for devtoolset-7-gcc-gfortran-7.2.1-1.el7.aarch64.
... and the results from the time command from comment #18 (after hitting Ctrl-C):

real 21m59.579s
user 0m0.129s
sys 0m0.015s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:3016