Bug 2227278 - On AMD EPYC3, Linpack benchmark compiled with -march=x86-64-v3 runs slower by 22% compared to -march=x86-64-v2
Summary: On AMD EPYC3, Linpack benchmark compiled with -march=x86-64-v3 runs slower by 22% compared to -march=x86-64-v2
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: gcc
Version: 38
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Jakub Jelinek
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-28 15:28 UTC by Jiri Hladky
Modified: 2023-08-01 12:08 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: ---
Embargoed:


Attachments
Updated testcase with -mno-fma variant (65.28 KB, application/x-xz)
2023-08-01 11:39 UTC, Jiri Hladky


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 2227081 1 unspecified ASSIGNED GCC with -march=x86-64-v3 when compiling Linpack benchmarks on AMD EPYC3 results in performance drop by 25% 2023-08-07 15:37:21 UTC

Internal Links: 2227081

Description Jiri Hladky 2023-07-28 15:28:55 UTC
On the AMD EPYC3 platform, I compiled the Linpack benchmark with the following three settings, using gcc (GCC) 13.1.1 20230614 (Red Hat 13.1.1-4) from F38:

CFLAGS=-O2 -DUNROLL -Wall -Wextra -Wshadow -march=x86-64-v2
CFLAGS=-O2 -DUNROLL -Wall -Wextra -Wshadow -march=x86-64-v3
CFLAGS=-O2 -DUNROLL -Wall -Wextra -Wshadow -march=x86-64-v3 -mtune=native
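
(A quick sanity check of what actually changes between the two levels; this is a generic check, not part of my original measurements. One can ask GCC which ISA macros each level defines:

gcc -march=x86-64-v2 -dM -E -xc /dev/null | grep -E '__(AVX2?|FMA)__'
gcc -march=x86-64-v3 -dM -E -xc /dev/null | grep -E '__(AVX2?|FMA)__'

The v3 invocation should print #define lines for __AVX__, __AVX2__ and __FMA__, while the v2 one should print none of them. The newly enabled FMA turns out to be the relevant part, see comment 2.)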

$ grep -PoH "[0-9]+ Kflops" *log
linpackd_x86-64-v2.log:5739246 Kflops
linpackd_x86-64-v3.log:4440715 Kflops
linpackd_x86-64-v3_mtune_native.log:4518391 Kflops

As you can see, with -march=x86-64-v3 the resulting binary is 22% slower than with -march=x86-64-v2. This is true even if I use -mtune=native.

I'm puzzled by this result. x86-64-v3 (see https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels) enables AVX instructions, which should help performance.

Could somebody please help analyze what's happening? I used the Intel SDE tool
https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html
to verify that the binary compiled with -march=x86-64-v3 indeed uses AVX instructions.

wget https://downloadmirror.intel.com/784319/sde-external-9.24.0-2023-07-13-lin.tar.xz
tar xvf sde-external-9.24.0-2023-07-13-lin.tar.xz
sde-external-9.24.0-2023-07-13-lin/sde64 -mix -omix instruction.hist -- ./linpackd_x86-64-v3

isa-ext-AVX went up from 202195 to 17731412574 instructions, and isa-set-SSE2 went down from 35140277958 to 92179 instructions, when moving from -march=x86-64-v2 to -march=x86-64-v3.

$ grep -H --max-count=1 isa-ext-AVX *hist
linpackd_x86-64-v2.hist:*isa-ext-AVX                                                      202195
linpackd_x86-64-v3.hist:*isa-ext-AVX                                                 17731412574
linpackd_x86-64-v3_mtune_native.hist:*isa-ext-AVX                                                 17744399086

$ grep -H --max-count=1 isa-set-SSE2 *hist
linpackd_x86-64-v2.hist:*isa-set-SSE2                                                35140277958
linpackd_x86-64-v3.hist:*isa-set-SSE2                                                      92179
linpackd_x86-64-v3_mtune_native.hist:*isa-set-SSE2                                                      92176
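
(A rough SDE-free cross-check, using the same binary names as above: objdump counts the vfmadd instructions present in the binary rather than the instructions executed, so the numbers are not comparable to the SDE histograms, but the presence/absence pattern should match:

objdump -d linpackd_x86-64-v2 | grep -c vfmadd
objdump -d linpackd_x86-64-v3 | grep -c vfmadd

The v2 binary should contain no vfmadd instructions at all, while the v3 binary should contain many.)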


Reproducible: Always

Steps to Reproduce:
1.  dnf install hwloc-devel
2.  make run
3. Compare the performance reported:
grep -PoH "[0-9]+ Kflops" *log
4. Optional: install the SDE tool from Intel - https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html
4.1 Update the path to the sde64 binary in the script generate_instruction_histogram.sh
4.2 Run ./generate_instruction_histogram.sh to generate a histogram of instruction usage
4.3 Compare the generated instruction histograms (*.hist files)

Actual Results:  
The binary compiled with -march=x86-64-v3 is 22% slower than the binary compiled with -march=x86-64-v2. This happens even if I use -mtune=native. Reproduced on the AMD EPYC3 platform; Intel platforms (tested on Ice Lake and Sapphire Rapids) are not affected.

Expected Results:  
The binary compiled with -march=x86-64-v3 performs at least as well as the binary compiled with -march=x86-64-v2.

Comment 1 Jiri Hladky 2023-07-28 15:42:24 UTC
Created attachment 1980485 [details]
A standalone reproducer

Steps to Reproduce:
1.  dnf install hwloc-devel
2.  make run
3. Compare the performance reported:
grep -PoH "[0-9]+ Kflops" *log

See included README file for more details.

Comment 2 Jiri Hladky 2023-08-01 11:38:05 UTC
GCC generates a tight loop with an FMA chain.

For Zen-based AMD CPUs, a dependent FMA chain causes a regression. A GCC patch is being developed: https://gcc.gnu.org/legacy-ml/gcc-patches/2017-12/msg01053.html

The workaround is to compile with -mno-fma.

We will need to wait until the patch is complete and merged into GCC.
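
For readers without the attachment, here is a generic sketch of the kind of loop this is about. It is an illustration only, not code from the attached Linpack sources; the file name fma_chain.c and the ddot function are made up for the example. Each fused multiply-add depends on the previous one through the accumulator, which is exactly the dependent FMA chain pattern:

/* fma_chain.c - illustrative only, not the attached testcase */
#include <stddef.h>

/* Dot-product style reduction: every multiply-add depends on the previous
   one through 'sum', forming a dependent FMA chain in the loop. */
double ddot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

Compiling it both ways and comparing the assembly shows the effect of -mno-fma (exact output depends on the GCC version):

gcc -O2 -march=x86-64-v3 -S fma_chain.c -o fma_chain_fma.s
gcc -O2 -march=x86-64-v3 -mno-fma -S fma_chain.c -o fma_chain_nofma.s
grep -c vfmadd fma_chain_fma.s fma_chain_nofma.s

The first variant typically uses vfmadd instructions in the loop, while the -mno-fma variant falls back to separate multiply and add instructions.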

Comment 3 Jiri Hladky 2023-08-01 11:39:23 UTC
Created attachment 1981064 [details]
Updated testcase with -mno-fma variant

Updated testcase with the -mno-fma variant.

It includes all results from an AMD EPYC 7573X 32-core server.

1) Log files with results
$ grep -PoH "[0-9]+ Kflops" *log
linpackd_x86-64-v2.log:5682088 Kflops
linpackd_x86-64-v3.log:4452227 Kflops
linpackd_x86-64-v3_mtune_native.log:4506376 Kflops
linpackd_x86-64-v3_mtune_native-no_fma.log:5645570 Kflops
linpackd_x86-64-v3-no_fma.log:5744116 Kflops

2) Instruction usage histogram
$ ls *hist | cat
linpackd_x86-64-v2.hist
linpackd_x86-64-v3.hist
linpackd_x86-64-v3_mtune_native.hist
linpackd_x86-64-v3_mtune_native-no_fma.hist
linpackd_x86-64-v3-no_fma.hist

Comment 4 Jakub Jelinek 2023-08-01 11:47:46 UTC
That patch has been in GCC for 5 years, so either it doesn't work in this case, or it is some other related bug.
The current setting is
/* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
   smaller FMA chain.  */
DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3)

/* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
   smaller FMA chain.  */
DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
          | m_ALDERLAKE | m_SAPPHIRERAPIDS | m_CORE_ATOM)
What -mtune= do you get for -mtune=native in your case?  gcc -S -mtune=native -v -xc /dev/null -o /dev/null 2>&1 | grep mtune
should show that...  If it is -mtune=znver4, perhaps we need to add | m_ZNVER4 to some of those.
With -mtune=generic (the default) this workaround is not in effect.
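
(A hedged way to test that hypothesis without waiting for any tuning change, assuming the -mtune-ctrl feature names match the DEF_TUNE strings quoted above and that the option behaves the same in the F38 gcc; this has not been verified on the affected machine. GCC's developer-oriented -mtune-ctrl= option can force those workarounds on even with generic tuning, e.g. by rebuilding with

CFLAGS="-O2 -DUNROLL -Wall -Wextra -Wshadow -march=x86-64-v3 -mtune-ctrl=avoid_fma_chains,avoid_fma256_chains"

If that recovers the x86-64-v2 level of performance, it would point at the FMA-chain splitting rather than some other code generation difference.)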

Comment 5 Florian Weimer 2023-08-01 12:00:25 UTC
(In reply to Jakub Jelinek from comment #4)
> With -mtune=generic (the default) this workaround is not in effect.

And -march=x86-64-v3 should use those default tunings. Maybe we should have an upstream discussion about whether to change the default tuning.

Comment 6 Jakub Jelinek 2023-08-01 12:05:01 UTC
We certainly shouldn't change the default tuning (what we tune for with -mtune=generic); what we could consider is including | m_GENERIC in those.
But that requires a wide discussion between Intel and AMD, as -mtune=generic tunes for recent chips from both vendors, and it matters how much a change gains on some CPUs and how much it slows things down on others.

