Description of problem:
gcc compiled with -march=x86-64-v3 takes longer to compile PHP sources.

Here is the GCC package compiled with -march=x86-64-v3:
https://buildlogs.centos.org/9-stream/isa/x86_64/packages-optimized/Packages/g/gcc-12.2.1-4.el9sopt.x86_64.rpm

GCC V12 x86-64-v3 OPTIMIZED has these changed defaults compared to GCC V11:
-march=x86-64-v3
-ftree-vectorize

We observe the following runtimes: Compilation with GCC V11 is faster than with GCC V12 x86-64-v3 OPTIMIZED. With 72 parallel jobs, runtime increases from 41 to 53 seconds. With one job (make -j 1), runtime increases from 321 seconds to 358 seconds.

Runtimes on https://beaker.engineering.redhat.com/view/intel-icelake-platinum-8351n-1s.lab.eng.brq2.redhat.com#details

GCC V11 - gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4)
72 threads: 574.31user 65.72system 0:41.25elapsed 1551%CPU (0avgtext+0avgdata 966408maxresident)k
1 thread: 299.77user 23.79system 5:21.01elapsed 100%CPU (0avgtext+0avgdata 971152maxresident)k

GCC V12 x86-64-v3 OPTIMIZED - gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4) - gcc-12.2.1-4.el9sopt.x86_64.rpm
72 threads: 616.48user 65.37system 0:53.17elapsed 1282%CPU (0avgtext+0avgdata 947592maxresident)k
1 thread: 336.83user 24.09system 5:58.45elapsed 100%CPU (0avgtext+0avgdata 952576maxresident)k

I have tried to turn on auto-vectorization and x86-64-v3 with GCC V11, but against my expectations, the runtime has decreased to just 17 seconds!

$ ./configure --without-sqlite3 --without-pdo-sqlite CFLAGS="-ftree-vectorize -march=x86-64-v3"
17 threads: 237.23user 61.20system 0:17.14elapsed 1740%CPU (0avgtext+0avgdata 620640maxresident)k

I have also tried compilation with gcc-toolset-12 from RHEL-9.2, and the runtime was just 20 seconds with "-ftree-vectorize -march=x86-64-v3" CFLAGS - only slightly worse than with GCC V11.

gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-7)
331.99user 83.56system 0:20.66elapsed 2010%CPU (0avgtext+0avgdata 686288maxresident)k

The runtimes for gcc-12.2.1-4.el9sopt (compiled itself with -march=x86-64-v3) are higher than for GCC V11 and gcc-toolset-12. This is surprising. We would expect it to benefit from ISA v3 instructions, in particular from SHLX/SHRX, and maybe TZCNT/LZCNT and some marginally shorter instruction sequences for inline memcpy and memset, resulting in shorter compile times.

How reproducible:
ALWAYS

Steps to Reproduce:
Reproducer (on an Icelake server with 72 cpus it completes in roughly one minute).
===========================================================
dnf install libxml2-devel
wget http://mirror.cogentco.com/pub/php/php-7.4.2.tar.bz2
tar xvf php-7.4.2.tar.bz2
cd php-7.4.2/
./configure --without-sqlite3 --without-pdo-sqlite
/usr/bin/time --output=time.txt make -s -j $(nproc)
OR
/usr/bin/time --output=time.txt make -s -j 1
Check file time.txt
===========================================================
Compare compilation runtimes for GCC V11/V12 with "-ftree-vectorize -march=x86-64-v3" against GCC V12 x86-64-v3 OPTIMIZED (to be retrieved from https://buildlogs.centos.org/9-stream/isa/x86_64/packages-optimized/Packages/g/gcc-12.2.1-4.el9sopt.x86_64.rpm)

Actual results:
GCC V12 x86-64-v3 OPTIMIZED takes longer to compile PHP source code than GCC V11 and V12 compiled with RHEL-9.2 defaults -march=x86-64-v2. Moreover, we have shown that the longer compilation times are not due to the changed default settings in GCC V12 x86-64-v3 OPTIMIZED, namely -march=x86-64-v3 and -ftree-vectorize.

Expected results:
We expect GCC V12 x86-64-v3 OPTIMIZED to have the same or better compilation times.

Additional info:
Both Intel Icelake and AMD Epyc3 show the runtime degradation:
http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/Phoronix/amd-epyc3-milan-7313-2s.tpb.lab.eng.brq.redhat.com/RHEL-9.3.0-20230718.0vsRHEL-9.2.0/2023-07-20T14:03:11.500000vs2023-07-20T12:35:20.100000/7e0944de-b3af-591f-bb5c-65f432f6f1fb/index.html#build-php_section
http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/Phoronix/intel-icelake-gold-6330-2s.lab.eng.brq2.redhat.com/RHEL-9.3.0-20230718.0vsRHEL-9.2.0/2023-07-20T14:03:11.500000vs2023-07-20T12:35:20.100000/ef759836-121e-5ca3-9dbf-1dbddb276410/index.html#build-php_section
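For anyone comparing their own time.txt files against the numbers in the description, here is a minimal sketch that turns the "elapsed" field of the default /usr/bin/time output into seconds and computes a relative slowdown. The function names elapsed_seconds and slowdown_percent are hypothetical helpers, not part of the reproducer, and the sketch assumes the default GNU time output format shown in the description.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the original reproducer).

# Read default /usr/bin/time output on stdin and print the elapsed
# wall-clock time in seconds. Handles m:ss.ss and h:mm:ss fields.
elapsed_seconds() {
    LC_ALL=C awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /elapsed$/) {
                t = $i
                sub(/elapsed$/, "", t)          # drop the "elapsed" suffix
                n = split(t, part, ":")
                secs = 0
                for (j = 1; j <= n; j++)        # fold h:m:s into seconds
                    secs = secs * 60 + part[j]
                printf "%.2f\n", secs
                exit
            }
        }
    }'
}

# Print how much slower $2 is than $1, in percent.
slowdown_percent() {
    LC_ALL=C awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}
```

A typical use would be elapsed_seconds < time.txt after each build. With the 72-job figures from the description, slowdown_percent 41.25 53.17 prints 28.9.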
Michey Mehta has analyzed the slowdown and here are the key takeaways:
1) The slowdown is entirely unrelated to GCC being compiled with x86-64-v3. There is no need to use any extra repos to reproduce the problem. Comparing the default GCC v11 from RHEL-9.2 with gcc-toolset-12 clearly shows the problem.
2) Compiling the file parse_date.c (a preprocessed version of a file in the PHP sources) shows the problem. Compile using this:
time gcc -ftime-report -O2 -fno-tree-vectorize -march=x86-64-v2 -c parse_date.c
gcc 11 takes about 7s, gcc 12 takes about 18s (on an AMD EPYC 7573X 32-Core Processor).
3) perf showed that iterate_fix_dominators got the most hits in gcc 12 (about 10%).
4) This is also seen in the -ftime-report output: for gcc 12, "dominance computation" takes 7s compared to 0.37s on gcc 11.

I have reproduced the problem on amd-epyc3-milanx-7573x-2s.lab.eng.brq2.redhat.com with an AMD EPYC 7573X 32-Core Processor using these gcc versions:
gcc v11: gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4)
gcc v12: gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-7)

I'm going to upload a tiny self-contained reproducer. Could you please review the slowdown and decide whether this is expected?

Thanks a lot
Jirka
I'm afraid I'm lost in what you're actually measuring. Is it two builds of the same compiler, built with different ISA flags (like -march=...) and measured with the same flags used to build PHP (that would show whether those ISA flags improve compilation speed on the workload)? Or the same compiler with 2 different sets of options (e.g. different ISA flags etc.) when building the workload (in this case I'd note that it is far more important whether the generated code is faster/smaller than any compilation speed differences)? Or two different versions of the compiler built with the same ISA flags and run with the same options on the workload (I'd say that in this case it is even more important how well the generated code is optimized than how fast it compiles)? Or some weird mix of these (then it is hard to guess)?

E.g. GCC 12 enables vectorization by default at -O2 while GCC 11 didn't; that can result in longer compile time, which greatly pays off if the generated code is faster.
Created attachment 1981305
Standalone reproducer with results

To reproduce the problem, do the following. Install RHEL-9.2 and use the default gcc v11 compiler.

./test.sh
scl enable gcc-toolset-12 'bash'
./test.sh

Compare the generated log files. In my case:

$ grep real *log
2023-Aug-02_12h24m38s_11.3.1_20221121.log:real 0m7.233s
2023-Aug-02_13h26m43s_12.2.1_20221121.log:real 0m18.097s

$ grep "dominance computation" *log
2023-Aug-02_12h24m38s_11.3.1_20221121.log: dominance computation : 0.37 ( 5%) 0.00 ( 0%) 0.39 ( 5%) 0 ( 0%)
2023-Aug-02_13h26m43s_12.2.1_20221121.log: dominance computation : 7.00 ( 40%) 0.00 ( 0%) 7.23 ( 40%) 0 ( 0%)
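The per-phase comparison above can also be scripted. Below is a minimal sketch that pulls the wall-clock seconds of one named phase out of a -ftime-report log. The function name phase_wall_seconds is a hypothetical helper, not part of the attached test.sh, and it assumes the column order usr, sys, wall, GGC seen in the GCC 11 and GCC 12 logs quoted in this comment.

```shell
#!/bin/sh
# Hypothetical helper (not part of the attached test.sh).
# Usage: phase_wall_seconds "dominance computation" some.log
# Assumes -ftime-report rows look like:
#   <phase name> : usr (pct) sys (pct) wall (pct) GGC (pct)
phase_wall_seconds() {
    LC_ALL=C awk -v phase="$1" '{
        pos = index($0, ":")
        if (pos == 0) next                      # not a timing row
        name = substr($0, 1, pos - 1)
        gsub(/^[ \t]+|[ \t]+$/, "", name)       # trim the phase name
        if (name != phase) next
        rest = substr($0, pos + 1)
        gsub(/\([^)]*\)/, "", rest)             # drop the "( 40%)" groups
        split(rest, col, " ")                   # col: usr, sys, wall, GGC
        print col[3]                            # wall-clock seconds
        exit
    }' "$2"
}
```

Run against the two logs above, this should print 0.39 for GCC 11 and 7.23 for GCC 12, provided the logs keep that column order. The function reads the log file directly, so the file name prefix that grep adds is not present in its input.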
Hi Jakub,

I'm sorry for the confusion. We found the issue when recompiling RHEL-9.2 userspace packages with -march=x86-64-v3. As described in comment #2, we later found that this is entirely unrelated to the x86-64-v3 recompilation.

The current issue is that this compilation:
time gcc -ftime-report -O2 -fno-tree-vectorize -march=x86-64-v2 -c parse_date.c
takes 7 seconds with gcc v11 from RHEL-9.2 and 18 seconds with gcc v12 from gcc-toolset-12. Please note that we explicitly disable vectorization to make the comparison fairer.

I'm unable to assess if this is a real issue. As you noted, the increased compilation time can pay off if the resulting code is faster. I will leave the decision to you. Could you please get the testcase from comment #4, run it on RHEL-9.2, and judge whether this is a real problem? If yes, I can open a new BZ to avoid confusion. Feel free to close this BZ if this is not a real issue.

Thanks a lot for your help!
Jirka
Sorry, this was a misunderstanding. The original compiler flags I saw did not include -O2, so I assumed the benchmark evaluated the compilation speed without optimization, which is arguably a more well-defined target. With -O2 and other optimization levels, there are of course complicated trade-offs between compile-time and run-time performance.
I can reproduce it, but current gcc trunk is back at the gcc 11 time (17.38s gcc 11, 30.80s gcc 12, 17.59s gcc trunk), so it doesn't seem to be worth even investigating, as we wouldn't be changing GCC 12 because of this anyway. And compile time is hard to bisect on our gcc bisect seed, as everything there is an unoptimized build.