On an AMD EPYC3 platform, I have compiled the linpack benchmark with the following three settings using gcc (GCC) 13.1.1 20230614 (Red Hat 13.1.1-4) from F38:

CFLAGS=-O2 -DUNROLL -Wall -Wextra -Wshadow -march=x86-64-v2
CFLAGS=-O2 -DUNROLL -Wall -Wextra -Wshadow -march=x86-64-v3
CFLAGS=-O2 -DUNROLL -Wall -Wextra -Wshadow -march=x86-64-v3 -mtune=native

$ grep -PoH "[0-9]+ Kflops" *log
linpackd_x86-64-v2.log:5739246 Kflops
linpackd_x86-64-v3.log:4440715 Kflops
linpackd_x86-64-v3_mtune_native.log:4518391 Kflops

As you can see, the binary built with -march=x86-64-v3 is about 22% slower than the one built with -march=x86-64-v2, and adding -mtune=native does not change that. I'm puzzled by this result: x86-64-v3 (see https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels) enables AVX instructions, which should help performance. Could somebody please help analyze what's happening?

I have used the Intel SDE tool (https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html) to verify that the binary compiled with -march=x86-64-v3 indeed uses AVX instructions:

wget https://downloadmirror.intel.com/784319/sde-external-9.24.0-2023-07-13-lin.tar.xz
tar xvf sde-external-9.24.0-2023-07-13-lin.tar.xz
sde-external-9.24.0-2023-07-13-lin/sde64 -mix -omix instruction.hist -- ./linpackd_x86-64-v3

isa-ext-AVX went up from 202195 to 17731412574 instructions and isa-set-SSE2 went down from 35140277958 to 92179 instructions when moving from -march=x86-64-v2 to -march=x86-64-v3:

$ grep -H --max-count=1 isa-ext-AVX *hist
linpackd_x86-64-v2.hist:*isa-ext-AVX 202195
linpackd_x86-64-v3.hist:*isa-ext-AVX 17731412574
linpackd_x86-64-v3_mtune_native.hist:*isa-ext-AVX 17744399086

$ grep -H --max-count=1 isa-set-SSE2 *hist
linpackd_x86-64-v2.hist:*isa-set-SSE2 35140277958
linpackd_x86-64-v3.hist:*isa-set-SSE2 92179
linpackd_x86-64-v3_mtune_native.hist:*isa-set-SSE2 92176

Reproducible: Always

Steps to Reproduce:
1. dnf install hwloc-devel
2. make run
3. Compare the performance reported: grep -PoH "[0-9]+ Kflops" *log
4. Optional - install the SDE tool from Intel: https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html
4.1 Update the path to the sde64 binary in the script generate_instruction_histogram.sh
4.2 Run ./generate_instruction_histogram.sh to generate a histogram of instruction usage
4.3 linpackd_x86-64-v2.log:5739246 Kflops
    linpackd_x86-64-v3.log:4440715 Kflops
    linpackd_x86-64-v3_mtune_native.log:4518391 Kflops

Actual Results:
The binary compiled with -march=x86-64-v3 is about 22% slower than the binary compiled with -march=x86-64-v2. This happens even with -mtune=native. Reproduced on an AMD EPYC3 platform. Intel platforms (tested on Icelake and Sapphire Rapids) are not affected.

Expected Results:
The binary compiled with -march=x86-64-v3 performs at least at the same level as the binary compiled with -march=x86-64-v2.
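For reference, linpack spends most of its runtime in a daxpy-style loop. Below is a minimal, hypothetical kernel (the file name daxpy.c and the function are mine, not taken from the benchmark sources) that should be enough to see the code-generation difference between the two -march levels with GCC 13:

/* daxpy.c - a minimal kernel in the spirit of linpack's hot loop
 * (dy[i] += da * dx[i]); illustrative only, not the benchmark source.
 * Inspect the generated code with e.g.:
 *   gcc -O2 -S -march=x86-64-v2 daxpy.c -o daxpy_v2.s   (SSE2 multiply/add)
 *   gcc -O2 -S -march=x86-64-v3 daxpy.c -o daxpy_v3.s   (AVX/FMA, vfmadd*)
 */
void daxpy(int n, double da, const double *dx, double *dy)
{
    for (int i = 0; i < n; i++)
        dy[i] = dy[i] + da * dx[i];
}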
Created attachment 1980485 [details]
A standalone reproducer

Steps to Reproduce:
1. dnf install hwloc-devel
2. make run
3. Compare the performance reported: grep -PoH "[0-9]+ Kflops" *log

See the included README file for more details.
GCC generates a tight loop with an FMA chain. On Zen-based AMD CPUs, an FMA chain with a dependency causes a regression. A patch for GCC is being developed: https://gcc.gnu.org/legacy-ml/gcc-patches/2017-12/msg01053.html

The workaround is to compile with -mno-fma. We will need to wait until the patch is complete and merged into GCC.
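To make the term concrete: an "FMA chain with a dependency" is a loop in which each fused multiply-add has to wait for the result of the previous one. The sketch below is illustrative only (a plain dot-product reduction, not the exact loop GCC generates for this benchmark). Here the accumulator carries the dependency, so with FMA enabled each iteration is bounded by FMA latency, while with -mno-fma the loop-carried part is just the add, which on some CPUs has lower latency than a fused multiply-add.

/* Illustrative sketch of a loop-carried FMA dependency chain; not the
 * actual loop from the benchmark.  With FMA available the compiler can
 * contract sum += dx[i] * dy[i] into one vfmadd whose input is the
 * previous sum, so each iteration waits for the previous FMA to finish.
 */
double ddot(int n, const double *dx, const double *dy)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += dx[i] * dy[i];   /* 'sum' is the loop-carried dependency */
    return sum;
}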
Created attachment 1981064 [details]
Updated testcase with -mno-fma variant

Updated testcase with a -mno-fma variant. It includes all results from an AMD EPYC 7573X 32-core server.

1) Log files with results

$ grep -PoH "[0-9]+ Kflops" *log
linpackd_x86-64-v2.log:5682088 Kflops
linpackd_x86-64-v3.log:4452227 Kflops
linpackd_x86-64-v3_mtune_native.log:4506376 Kflops
linpackd_x86-64-v3_mtune_native-no_fma.log:5645570 Kflops
linpackd_x86-64-v3-no_fma.log:5744116 Kflops

2) Instruction usage histograms

$ ls *hist | cat
linpackd_x86-64-v2.hist
linpackd_x86-64-v3.hist
linpackd_x86-64-v3_mtune_native.hist
linpackd_x86-64-v3_mtune_native-no_fma.hist
linpackd_x86-64-v3-no_fma.hist
That patch has been in GCC for 5 years, so either it doesn't work in this case, or this is some other related bug. The current setting is:

/* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
   smaller FMA chain.  */
DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains",
          m_ZNVER1 | m_ZNVER2 | m_ZNVER3)

/* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
   smaller FMA chain.  */
DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains",
          m_ZNVER2 | m_ZNVER3 | m_ALDERLAKE | m_SAPPHIRERAPIDS | m_CORE_ATOM)

What -mtune= do you get for -mtune=native in your case?

gcc -S -mtune=native -v -xc /dev/null -o /dev/null 2>&1 | grep mtune

should show that. If it is -mtune=znver4, perhaps we need to add | m_ZNVER4 to some of those. With -mtune=generic (the default) this workaround is not in effect.
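If -mtune=native does resolve to znver4, the change suggested above would presumably look roughly like this in gcc/config/i386/x86-tune.def (a sketch of the idea only, not a committed patch; which entries to extend is exactly what is being asked here):

/* Sketch only: extend the Zen tuning masks with m_ZNVER4.  */
DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains",
          m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4)
DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains",
          m_ZNVER2 | m_ZNVER3 | m_ZNVER4 | m_ALDERLAKE | m_SAPPHIRERAPIDS
          | m_CORE_ATOM)

As a quick experiment, it should also be possible to toggle these tunings directly with the developer-oriented option -mtune-ctrl (for example -mtune-ctrl=avoid_fma256_chains), which takes the string names from x86-tune.def, though that option is not intended for production use.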
(In reply to Jakub Jelinek from comment #4)
> With -mtune=generic (the default) this workaround is not in effect.

And -march=x86-64-v3 should use those default tunings. Maybe we should have an upstream discussion on whether we should change the default tuning.
We certainly shouldn't change the default tuning (what we tune with -mtune=generic); what we could consider is including | m_GENERIC in those entries. But that requires a wider discussion between Intel and AMD, as -mtune=generic tunes for recent chips from both of those vendors, and it matters how much this gains on some CPUs and how much it slows things down on others.
Fedora Linux 38 entered end-of-life (EOL) status on 2024-05-21.

Fedora Linux 38 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux, please feel free to reopen this bug against that version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see the version field.

If you are unable to reopen this bug, please file a new report against an active release.

Thank you for reporting this bug and we are sorry it could not be fixed.