Bug 2228124

Summary: gcc 12 takes up to 2.5x longer to compile PHP source code compared to gcc 11
Product: Red Hat Enterprise Linux 9
Reporter: Jiri Hladky <jhladky>
Component: gcc
Assignee: Marek Polacek <mpolacek>
gcc sub component: gcc-toolset-12
QA Contact: qe-baseos-tools-bugs
Status: CLOSED NOTABUG
Severity: unspecified
Priority: unspecified
CC: ahajkova, fweimer, jakub, jhladky, jmario, jvozar, kkolakow, mimehta, ohudlick, sipoyare
Version: 9.2
Keywords: Performance
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2023-08-02 13:10:07 UTC
Type: Bug
Attachments:
Standalone reproducer with results (flags: none)

Description Jiri Hladky 2023-08-01 13:08:37 UTC
Description of problem:

A gcc build that was itself compiled with -march=x86-64-v3 takes longer to compile the PHP sources.


Here is the GCC package compiled with -march=x86-64-v3
https://buildlogs.centos.org/9-stream/isa/x86_64/packages-optimized/Packages/g/gcc-12.2.1-4.el9sopt.x86_64.rpm

GCC V12 x86-64-v3 OPTIMIZED has these changed defaults compared to GCC V11:
-march=x86-64-v3
-ftree-vectorize

We observe the following runtimes:
Compilation with GCC V11 is faster than with GCC V12 x86-64-v3 Optimized. With 72 parallel jobs, runtime increases from 41 to 53 seconds. With one job (make -j 1), runtime increases from 321 seconds to 358 seconds.

Runtimes on https://beaker.engineering.redhat.com/view/intel-icelake-platinum-8351n-1s.lab.eng.brq2.redhat.com#details
GCC V11 - gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4)
72 threads: 574.31user 65.72system 0:41.25elapsed 1551%CPU (0avgtext+0avgdata 966408maxresident)k
1 thread:   299.77user 23.79system 5:21.01elapsed 100%CPU (0avgtext+0avgdata 971152maxresident)k

GCC V12 x86-64-v3 OPTIMIZED - gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4) - gcc-12.2.1-4.el9sopt.x86_64.rpm
72 threads: 616.48user 65.37system 0:53.17elapsed 1282%CPU (0avgtext+0avgdata 947592maxresident)k
1 thread:   336.83user 24.09system 5:58.45elapsed 100%CPU (0avgtext+0avgdata 952576maxresident)k
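
For orientation, the elapsed times above can be turned into slowdown ratios. The following is a minimal sketch, not part of the original measurements; the helper names elapsed_to_seconds and slowdown are made up for illustration, and the input is assumed to be the "elapsed" field as printed by /usr/bin/time.

```shell
#!/bin/bash
# Illustrative helpers (hypothetical names, not from the original report).

# Convert the elapsed field of /usr/bin/time (m:ss.ss or h:mm:ss) to seconds.
elapsed_to_seconds() {
    echo "$1" | LC_ALL=C awk -F: '{
        s = 0
        for (i = 1; i <= NF; i++) s = s * 60 + $i
        printf "%.2f\n", s
    }'
}

# Print new/old as a ratio with two decimal places.
slowdown() {
    old=$(elapsed_to_seconds "$1")
    new=$(elapsed_to_seconds "$2")
    LC_ALL=C awk -v o="$old" -v n="$new" 'BEGIN { printf "%.2f\n", n / o }'
}

echo "72 jobs: $(slowdown 0:41.25 0:53.17)x"   # GCC V11 vs GCC V12 x86-64-v3 OPTIMIZED
echo "1 job:   $(slowdown 5:21.01 5:58.45)x"
```

With the numbers above this gives roughly 1.29x for 72 jobs and 1.12x for a single job.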

I also tried enabling auto-vectorization and x86-64-v3 with GCC V11; contrary to my expectations, the runtime decreased to just 17 seconds!

$ ./configure --without-sqlite3 --without-pdo-sqlite CFLAGS="-ftree-vectorize -march=x86-64-v3"
17 threads: 237.23user 61.20system 0:17.14elapsed 1740%CPU (0avgtext+0avgdata 620640maxresident)k

I have also tried compilation with gcc-toolset-12 from RHEL-9.2, and the runtime was just 20 seconds with "-ftree-vectorize -march=x86-64-v3" CFLAGS - only slightly worse than with GCC V11. 
gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-7)
331.99user 83.56system 0:20.66elapsed 2010%CPU (0avgtext+0avgdata 686288maxresident)k

The runtimes for gcc-12.2.1-4.el9sopt (itself compiled with -march=x86-64-v3) are higher than for GCC V11 and gcc-toolset-12. This is surprising. We would expect it to benefit from ISA v3 instructions, in particular from SHLX/SHRX, and maybe from TZCNT/LZCNT and some marginally shorter instruction sequences for inline memcpy and memset, resulting in shorter compile times.


How reproducible: ALWAYS


Steps to Reproduce:
Reproducer (on an Icelake server with 72 CPUs it completes in roughly one minute).
===========================================================
dnf install libxml2-devel
wget http://mirror.cogentco.com/pub/php/php-7.4.2.tar.bz2
tar xvf php-7.4.2.tar.bz2
cd php-7.4.2/
./configure --without-sqlite3 --without-pdo-sqlite
/usr/bin/time --output=time.txt make -s -j $(nproc)
OR
/usr/bin/time --output=time.txt make -s -j 1
Check file time.txt
===========================================================

Compare compilation runtimes for GCC V11/V12 with "-ftree-vectorize -march=x86-64-v3" against GCC V12 x86-64-v3 OPTIMIZED (to be retrieved from https://buildlogs.centos.org/9-stream/isa/x86_64/packages-optimized/Packages/g/gcc-12.2.1-4.el9sopt.x86_64.rpm)


Actual results:
GCC V12 x86-64-v3 OPTIMIZED takes longer to compile PHP source code than GCC V11 and V12 compiled with the RHEL-9.2 default -march=x86-64-v2. Moreover, we have shown that the longer compilation times are not due to the changed default settings in GCC V12 x86-64-v3 OPTIMIZED, namely -march=x86-64-v3 and -ftree-vectorize.


Expected results:
We expect GCC V12 x86-64-v3 OPTIMIZED to have the same or better compilation times. 


Additional info:

Comment 2 Jiri Hladky 2023-08-02 12:05:16 UTC
Michey Mehta has analyzed the slowdown, and here are the key takeaways:

1) The slowdown is entirely unrelated to GCC being compiled with x86-64-v3. There is no need to use any extra repos to reproduce the problem. Using default GCC v11 from RHEL-9.2 and gcc-toolset-12 clearly shows the problem. 

2) Compiling file parse_date.c (this file is a preprocessed version of a file in the PHP sources) shows the problem:

Compile using this:
time gcc -ftime-report -O2 -fno-tree-vectorize -march=x86-64-v2 -c parse_date.c

gcc 11 takes about 7s, gcc 12 takes about 18s (on an AMD EPYC 7573X 32-Core Processor).

3) perf showed that iterate_fix_dominators got the most hits in gcc 12 (about 10%)

4) This is also visible in the -ftime-report output: for gcc 12, "dominance computation" takes 7s compared to 0.37s for gcc 11.

I have reproduced the problem on amd-epyc3-milanx-7573x-2s.lab.eng.brq2.redhat.com with AMD EPYC 7573X 32-Core Processor using these gcc versions. 

gcc v11: gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4)
gcc v12: gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-7)

I'm going to upload a tiny self-contained reproducer. 

Could you please review the slowdown and decide whether this is expected? 

Thanks a lot
Jirka

Comment 3 Jakub Jelinek 2023-08-02 12:08:35 UTC
I'm afraid I'm lost as to what you're actually measuring. Is it:
1) two builds of the same compiler, built with different ISA flags (like -march=...), measured with the same flags used to build PHP (that would show whether those ISA flags improve compilation speed on the workload), or
2) the same compiler with 2 different sets of options (e.g. different ISA flags etc.) when building the workload (in this case I'd note that whether the generated code is faster/smaller is far more important than any compilation speed differences), or
3) two different versions of the compiler built with the same ISA flags and using the same options on the workload (I'd say that in this case how well the generated code is optimized is even more important than compilation speed), or
4) some weird mix of these (then it is hard to guess)?
E.g. GCC 12 enables vectorization by default at -O2, while GCC 11 didn't; that can result in longer compile times, which pays off greatly if the generated code is faster.

Comment 4 Jiri Hladky 2023-08-02 12:09:48 UTC
Created attachment 1981305 [details]
Standalone reproducer with results

To reproduce the problem, do the following.

Install RHEL-9.2 and use default gcc v11 compiler.
./test.sh
scl enable gcc-toolset-12 'bash'
./test.sh
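
The attached test.sh is not reproduced in this report. As a rough sketch only, a script of this kind might look like the following; the function names, the log file naming, and the assumption that parse_date.c sits in the current directory are all illustrative guesses based on the log names and the compile command quoted in this bug, not the actual contents of the attachment.

```shell
#!/bin/bash
# Hypothetical sketch; NOT the attached test.sh.

# Turn the first line of "gcc --version" into a short tag, e.g.
# "gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4)" -> "11.3.1_20221121"
version_tag() {
    echo "$1" | awk '{ print $3 "_" $4 }'
}

# Compile one source file with the flags from comment 2 and keep the
# -ftime-report output and the bash "time" summary in a per-compiler log.
run_timed_compile() {
    src=$1
    tag=$(version_tag "$(gcc --version | head -n 1)")
    log="$(date +%Y-%b-%d_%Hh%Mm%Ss)_${tag}.log"
    # Both -ftime-report and the "time" keyword write to stderr.
    { time gcc -ftime-report -O2 -fno-tree-vectorize -march=x86-64-v2 -c "$src" ; } > "$log" 2>&1
    echo "$log"
}

# Usage (assumes parse_date.c is in the current directory):
#   run_timed_compile parse_date.c
```

Running it once with the default compiler and once inside the gcc-toolset-12 shell would leave one log per compiler version for comparison.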

Compare generated log files. In my case:

$ grep real *log
2023-Aug-02_12h24m38s_11.3.1_20221121.log:real  0m7.233s
2023-Aug-02_13h26m43s_12.2.1_20221121.log:real  0m18.097s

$ grep "dominance computation" *log
2023-Aug-02_12h24m38s_11.3.1_20221121.log: dominance computation              :   0.37 (  5%)   0.00 (  0%)   0.39 (  5%)     0  (  0%)
2023-Aug-02_13h26m43s_12.2.1_20221121.log: dominance computation              :   7.00 ( 40%)   0.00 (  0%)   7.23 ( 40%)     0  (  0%)
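
To put the two log excerpts side by side, the first ("usr") column of a -ftime-report phase line can be extracted and compared. This is a small sketch under the assumption that phase lines have the "name : usr (pct) sys (pct) wall (pct) ..." layout shown above; phase_usr and ratio are hypothetical helper names.

```shell
#!/bin/bash
# Illustrative helpers (hypothetical names); reads -ftime-report text on stdin.

# Print the first timing column (usr seconds) for the named phase.
phase_usr() {
    LC_ALL=C awk -F: -v phase="$1" '
        {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim padding around the phase name
        }
        name == phase { split($2, cols, " "); print cols[1]; exit }
    '
}

# Print new/old with one decimal place.
ratio() {
    LC_ALL=C awk -v o="$1" -v n="$2" 'BEGIN { printf "%.1f\n", n / o }'
}

# Phase lines copied from the two logs above (file name prefix removed).
gcc11=' dominance computation              :   0.37 (  5%)   0.00 (  0%)   0.39 (  5%)     0  (  0%)'
gcc12=' dominance computation              :   7.00 ( 40%)   0.00 (  0%)   7.23 ( 40%)     0  (  0%)'

old=$(echo "$gcc11" | phase_usr "dominance computation")
new=$(echo "$gcc12" | phase_usr "dominance computation")
echo "dominance computation: ${old}s -> ${new}s ($(ratio "$old" "$new")x)"
echo "wall clock: 7.233s -> 18.097s ($(ratio 7.233 18.097)x)"
```

On these numbers the dominance computation phase is about 18.9x slower with gcc 12, while the whole compile is about 2.5x slower.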

Comment 5 Jiri Hladky 2023-08-02 12:34:50 UTC
Hi Jakub,

I'm sorry for the confusion. We found the issue when recompiling RHEL-9.2 userspace packages with -march=x86-64-v3.

As described in comment #2, we later found that this is entirely unrelated to the x86-64-v3 recompilation.

The current issue is that this compilation:

time gcc -ftime-report -O2 -fno-tree-vectorize -march=x86-64-v2 -c parse_date.c

takes 7 seconds with gcc v11 from RHEL-9.2 and 18 seconds with gcc v12 from gcc-toolset-12.

Please note that we explicitly disable vectorization to make the comparison fairer.

I'm unable to assess if this is a real issue. As you noted, the increased compilation time can pay off if the resulting code is faster. 

I will leave the decision to you. Could you please get the testcase from comment #4, run it on RHEL-9.2, and judge whether this is a real problem? If yes, I can open a new BZ to avoid confusion. Feel free to close this BZ if this is not a real issue.

Thanks a lot for your help!
Jirka

Comment 6 Florian Weimer 2023-08-02 13:10:07 UTC
Sorry, this was a misunderstanding. The original compiler flags I saw did not include -O2, so I assumed the benchmark evaluated the compilation speed without optimization, which is arguably a more well-defined target. With -O2 and other optimization levels, there are of course complicated trade-offs between compile-time and run-time performance.

Comment 7 Jakub Jelinek 2023-08-02 14:02:18 UTC
I can reproduce it, but current gcc trunk is back at gcc 11 time (17.38s gcc 11, 30.80s gcc 12, 17.59s gcc trunk), so it doesn't seem to be worth even investigating, as we wouldn't be changing GCC 12 because of this anyway.
And compile time is hard to bisect on our gcc bisect seed, as everything there is unoptimized builds.