On an AMD EPYC3 platform, I have compiled the linpack benchmark with the following three settings, using gcc (GCC) 13.1.1 20230614 (Red Hat 13.1.1-4) from F38:

CFLAGS=-O2 -DUNROLL -Wall -Wextra -Wshadow -march=x86-64-v2
CFLAGS=-O2 -DUNROLL -Wall -Wextra -Wshadow -march=x86-64-v3
CFLAGS=-O2 -DUNROLL -Wall -Wextra -Wshadow -march=x86-64-v3 -mtune=native

$ grep -PoH "[0-9]+ Kflops" *log
linpackd_x86-64-v2.log:5739246 Kflops
linpackd_x86-64-v3.log:4440715 Kflops
linpackd_x86-64-v3_mtune_native.log:4518391 Kflops

As you can see, the binary built with -march=x86-64-v3 is 22% slower than the one built with -march=x86-64-v2, and adding -mtune=native does not help. I'm puzzled by this result: x86-64-v3 (see https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels) enables AVX instructions, which should help performance. Could somebody please help analyze what is happening?

I used the Intel SDE tool (https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html) to verify that the binary compiled with -march=x86-64-v3 indeed uses AVX instructions:

wget https://downloadmirror.intel.com/784319/sde-external-9.24.0-2023-07-13-lin.tar.xz
tar xvf sde-external-9.24.0-2023-07-13-lin.tar.xz
sde-external-9.24.0-2023-07-13-lin/sde64 -mix -omix instruction.hist -- ./linpackd_x86-64-v3

When moving from -march=x86-64-v2 to -march=x86-64-v3, isa-ext-AVX went up from 202195 to 17731412574 instructions and isa-set-SSE2 went down from 35140277958 to 92179 instructions:

$ grep -H --max-count=1 isa-ext-AVX *hist
linpackd_x86-64-v2.hist:*isa-ext-AVX 202195
linpackd_x86-64-v3.hist:*isa-ext-AVX 17731412574
linpackd_x86-64-v3_mtune_native.hist:*isa-ext-AVX 17744399086

$ grep -H --max-count=1 isa-set-SSE2 *hist
linpackd_x86-64-v2.hist:*isa-set-SSE2 35140277958
linpackd_x86-64-v3.hist:*isa-set-SSE2 92179
linpackd_x86-64-v3_mtune_native.hist:*isa-set-SSE2 92176

Reproducible: Always

Steps to Reproduce:
1. dnf install hwloc-devel
2. make run
3. Compare the performance reported: grep -PoH "[0-9]+ Kflops" *log
4. Optional - install the SDE tool from Intel: https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html
4.1 Update the path to the sde64 binary in the script generate_instruction_histogram.sh
4.2 Run ./generate_instruction_histogram.sh to generate a histogram of instruction usage
4.3 linpackd_x86-64-v2.log:5739246 Kflops
    linpackd_x86-64-v3.log:4440715 Kflops
    linpackd_x86-64-v3_mtune_native.log:4518391 Kflops

Actual Results:
The binary compiled with -march=x86-64-v3 is 22% slower than the binary compiled with -march=x86-64-v2. This happens even with -mtune=native. Reproduced on the AMD EPYC3 platform; Intel platforms (tested on Ice Lake and Sapphire Rapids) are not affected.

Expected Results:
The binary compiled with -march=x86-64-v3 performs at least at the same level as the binary compiled with -march=x86-64-v2.
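For context, the hottest loop in this benchmark is the daxpy kernel; a minimal sketch of it follows (assuming the classic C translation of linpack with its daxpy/ddot/dgefa routines - the exact source in the reproducer may differ in details). This is the loop whose code generation changes between the two -march levels.

/* Sketch of linpack's daxpy kernel, which dominates the runtime.
 * At -O2 with -march=x86-64-v3 (which implies -mfma), GCC may contract
 * the multiply-add below into FMA instructions; with -march=x86-64-v2
 * it stays separate SSE-class multiply and add instructions, as the
 * SDE histograms above show. */
static void daxpy(int n, double da, const double *dx, double *dy)
{
    for (int i = 0; i < n; i++)
        dy[i] = dy[i] + da * dx[i];
}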
Created attachment 1980485 [details]
A standalone reproducer

Steps to Reproduce:
1. dnf install hwloc-devel
2. make run
3. Compare the performance reported: grep -PoH "[0-9]+ Kflops" *log

See the included README file for more details.
GCC generates a tight loop with a chain of dependent FMA instructions. On Zen-based AMD CPUs, such a dependent FMA chain causes a regression. A patch for GCC is being developed: https://gcc.gnu.org/legacy-ml/gcc-patches/2017-12/msg01053.html

The workaround for now is to compile with -mno-fma. We will need to wait until the patch is complete and merged into GCC.
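To illustrate what a tight dependent FMA chain looks like (an illustration of the pattern, not necessarily the exact loop GCC emits for linpack), consider a simple reduction. Once the multiply-add is contracted into an FMA, every iteration's FMA reads the accumulator produced by the previous one, so the loop runs at FMA latency rather than FMA throughput; on Zen cores the FMA latency is reportedly a cycle or two longer than that of a plain add, which is the cost the -mno-fma workaround avoids.

/* Illustration only: a reduction whose contraction into FMA forms a
 * loop-carried dependency chain through the accumulator 's'.
 * With -mfma:     s = fma(x[i], y[i], s)   -> chain of dependent FMAs
 * With -mno-fma:  t = x[i] * y[i]; s += t  -> only the (lower-latency)
 *                                             adds stay on the chain */
double dot(int n, const double *x, const double *y)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}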
Created attachment 1981064 [details]
Updated testcase with -mno-fma variant

Updated testcase with a -mno-fma variant. It includes all results from an AMD EPYC 7573X 32-Core server.

1) Log files with results

$ grep -PoH "[0-9]+ Kflops" *log
linpackd_x86-64-v2.log:5682088 Kflops
linpackd_x86-64-v3.log:4452227 Kflops
linpackd_x86-64-v3_mtune_native.log:4506376 Kflops
linpackd_x86-64-v3_mtune_native-no_fma.log:5645570 Kflops
linpackd_x86-64-v3-no_fma.log:5744116 Kflops

2) Instruction usage histograms

$ ls *hist | cat
linpackd_x86-64-v2.hist
linpackd_x86-64-v3.hist
linpackd_x86-64-v3_mtune_native.hist
linpackd_x86-64-v3_mtune_native-no_fma.hist
linpackd_x86-64-v3-no_fma.hist
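As a quick sanity check on which binaries really had FMA code generation enabled (a small standalone sketch, not part of the attached testcase), one can rely on GCC predefining the __FMA__ macro whenever -mfma is in effect, which -march=x86-64-v3 implies and -mno-fma switches back off:

/* Prints which FMA setting this translation unit was compiled with. */
#include <stdio.h>

int main(void)
{
#ifdef __FMA__
    puts("FMA enabled (e.g. -march=x86-64-v3)");
#else
    puts("FMA disabled (e.g. -march=x86-64-v2 or -mno-fma)");
#endif
    return 0;
}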
That patch has been in GCC for 5 years, so either it doesn't work in this case, or it is some other related bug. The current setting is:

/* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
   smaller FMA chain.  */
DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains",
          m_ZNVER1 | m_ZNVER2 | m_ZNVER3)

/* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
   smaller FMA chain.  */
DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains",
          m_ZNVER2 | m_ZNVER3 | m_ALDERLAKE | m_SAPPHIRERAPIDS | m_CORE_ATOM)

Which -mtune= do you get for -mtune=native in your case?

gcc -S -mtune=native -v -xc /dev/null -o /dev/null 2>&1 | grep mtune

should show that. If it is -mtune=znver4, perhaps we need to add | m_ZNVER4 to some of those. With -mtune=generic (the default) this workaround is not in effect.
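If -mtune=native does resolve to znver4 here, the change hinted at above would presumably look something like the following (a sketch against gcc/config/i386/x86-tune.def as quoted, not a reviewed patch; whether Zen 4 should be covered by one or both workarounds is exactly the open question):

/* Sketch only: extend the FMA-chain workarounds to Zen 4 as well. */
DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains",
          m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4)

DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains",
          m_ZNVER2 | m_ZNVER3 | m_ZNVER4 | m_ALDERLAKE | m_SAPPHIRERAPIDS
          | m_CORE_ATOM)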
(In reply to Jakub Jelinek from comment #4)
> With -mtune=generic (the default) this workaround is not in effect.

And -march=x86-64-v3 should use those default tunings. Maybe we should have an upstream discussion about whether to change the default tuning.
We certainly shouldn't change the default tuning (what we tune for with -mtune=generic); what we could consider is including | m_GENERIC in those. But that requires wide discussion between Intel and AMD, as -mtune=generic is tuning for recent chips from both of those vendors, and it matters how much a change gains on some CPUs versus how much it makes things slower on others.