Bug 2227278

Summary: On AMD EPYC3, Linpack benchmark compiled with -march=x86-64-v3 runs slower by 22% compared to -march=x86-64-v2
Product: Fedora
Component: gcc
Version: 38
Status: NEW
Severity: medium
Priority: unspecified
Hardware: x86_64
OS: Linux
Keywords: Performance
Reporter: Jiri Hladky <jhladky>
Assignee: Jakub Jelinek <jakub>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: dmalcolm, fweimer, jakub, jlaw, jwakely, mcermak, mpolacek, msebor, nickc, sipoyare
Attachments: Updated testcase with -mno-fma variant

Description Jiri Hladky 2023-07-28 15:28:55 UTC
On the AMD EPYC3 platform, I compiled the Linpack benchmark with the following three sets of flags, using gcc (GCC) 13.1.1 20230614 (Red Hat 13.1.1-4) from F38:

CFLAGS=-O2 -DUNROLL -Wall -Wextra -Wshadow -march=x86-64-v2
CFLAGS=-O2 -DUNROLL -Wall -Wextra -Wshadow -march=x86-64-v3
CFLAGS=-O2 -DUNROLL -Wall -Wextra -Wshadow -march=x86-64-v3 -mtune=native
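
For reference, the three variants can also be built by hand along these lines (a sketch only: the attached Makefile drives the real builds via "make run", and linpackd.c plus the output names are illustrative):

for v in x86-64-v2 x86-64-v3; do
    gcc -O2 -DUNROLL -Wall -Wextra -Wshadow -march=$v linpackd.c -o linpackd_$v -lm   # -lm assumed for sqrt()
done
gcc -O2 -DUNROLL -Wall -Wextra -Wshadow -march=x86-64-v3 -mtune=native \
    linpackd.c -o linpackd_x86-64-v3_mtune_native -lm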

$ grep -PoH "[0-9]+ Kflops" *log
linpackd_x86-64-v2.log:5739246 Kflops
linpackd_x86-64-v3.log:4440715 Kflops
linpackd_x86-64-v3_mtune_native.log:4518391 Kflops

As you can see, the binary built with -march=x86-64-v3 is 22% slower than the one built with -march=x86-64-v2, even when -mtune=native is used.

I'm puzzled by this result. x86-64-v3 (see https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels) enables AVX instructions, which should help performance.

Could somebody please help analyze what is happening? I used the Intel SDE tool (https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html) to verify that the binary compiled with -march=x86-64-v3 indeed uses AVX instructions:

wget https://downloadmirror.intel.com/784319/sde-external-9.24.0-2023-07-13-lin.tar.xz
tar xvf sde-external-9.24.0-2023-07-13-lin.tar.xz
sde-external-9.24.0-2023-07-13-lin/sde64 -mix -omix instruction.hist -- ./linpackd_x86-64-v3
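
The same can be done for all three binaries in one go (a sketch; the attached generate_instruction_histogram.sh may differ):

SDE=./sde-external-9.24.0-2023-07-13-lin/sde64
for bin in linpackd_x86-64-v2 linpackd_x86-64-v3 linpackd_x86-64-v3_mtune_native; do
    "$SDE" -mix -omix "$bin.hist" -- "./$bin"
done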

When moving from -march=x86-64-v2 to -march=x86-64-v3, the isa-ext-AVX count went up from 202195 to 17731412574 instructions, while the isa-set-SSE2 count went down from 35140277958 to 92179 instructions.

$ grep -H --max-count=1 isa-ext-AVX *hist
linpackd_x86-64-v2.hist:*isa-ext-AVX                                                      202195
linpackd_x86-64-v3.hist:*isa-ext-AVX                                                 17731412574
linpackd_x86-64-v3_mtune_native.hist:*isa-ext-AVX                                                 17744399086

$ grep -H --max-count=1 isa-set-SSE2 *hist
linpackd_x86-64-v2.hist:*isa-set-SSE2                                                35140277958
linpackd_x86-64-v3.hist:*isa-set-SSE2                                                      92179
linpackd_x86-64-v3_mtune_native.hist:*isa-set-SSE2                                                      92176


Reproducible: Always

Steps to Reproduce:
1.  dnf install hwloc-devel
2.  make run
3. Compare the performance reported:
grep -PoH "[0-9]+ Kflops" *log
4. Optional: install the Intel SDE tool from https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html
4.1 Update the path to the sde64 binary in the generate_instruction_histogram.sh script
4.2 Run ./generate_instruction_histogram.sh to generate a histogram of instruction usage
4.3 Compare the instruction histograms, e.g.:
grep -H --max-count=1 isa-ext-AVX *hist

Actual Results:  
The binary compiled with -march=x86-64-v3 is 22% slower than the binary compiled with -march=x86-64-v2. This happens even with -mtune=native. Reproduced on the AMD EPYC3 platform; Intel platforms (tested on Ice Lake and Sapphire Rapids) are not affected.

Expected Results:  
The binary compiled with -march=x86-64-v3 performs at least as well as the binary compiled with -march=x86-64-v2.

Comment 1 Jiri Hladky 2023-07-28 15:42:24 UTC
Created attachment 1980485 [details]
A standalone reproducer

Steps to Reproduce:
1.  dnf install hwloc-devel
2.  make run
3. Compare the performance reported:
grep -PoH "[0-9]+ Kflops" *log

See included README file for more details.

Comment 2 Jiri Hladky 2023-08-01 11:38:05 UTC
GCC generates a tight loop with an FMA chain.

On Zen-based AMD CPUs, an FMA chain with a dependency causes a regression. A patch for GCC is being developed: https://gcc.gnu.org/legacy-ml/gcc-patches/2017-12/msg01053.html

The workaround is to compile with -mno-fma.

We will need to wait until the patch is complete and merged into GCC.
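
For illustration only (a hedged sketch, not the attached reproducer; the file and function names are made up), a small dependent-accumulator loop shows the FMA chain appearing and disappearing with -mno-fma:

cat > ddot_chain.c <<'EOF'
/* Each iteration's fused multiply-add depends on the previous value of sum,
   so the FMAs form a serial dependency chain. */
double ddot_chain(const double *x, const double *y, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
EOF
gcc -O2 -march=x86-64-v3 -S ddot_chain.c -o ddot_chain_fma.s
gcc -O2 -march=x86-64-v3 -mno-fma -S ddot_chain.c -o ddot_chain_nofma.s
grep -c vfmadd ddot_chain_fma.s     # fused multiply-adds present, chained through sum
grep -c vfmadd ddot_chain_nofma.s   # prints 0: separate vmulsd/vaddsd instead

With -march=x86-64-v3 the multiply-add is contracted into vfmadd instructions that each depend on the previous value of sum; with -mno-fma separate multiply and add instructions are emitted.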

Comment 3 Jiri Hladky 2023-08-01 11:39:23 UTC
Created attachment 1981064 [details]
Updated testcase with -mno-fma variant

Updated testcase with the -mno-fma variant.

It includes all results from an AMD EPYC 7573X 32-core server.

1) Log files with results
$ grep -PoH "[0-9]+ Kflops" *log
linpackd_x86-64-v2.log:5682088 Kflops
linpackd_x86-64-v3.log:4452227 Kflops
linpackd_x86-64-v3_mtune_native.log:4506376 Kflops
linpackd_x86-64-v3_mtune_native-no_fma.log:5645570 Kflops
linpackd_x86-64-v3-no_fma.log:5744116 Kflops

2) Instruction usage histogram
$ ls *hist | cat
linpackd_x86-64-v2.hist
linpackd_x86-64-v3.hist
linpackd_x86-64-v3_mtune_native.hist
linpackd_x86-64-v3_mtune_native-no_fma.hist
linpackd_x86-64-v3-no_fma.hist

Comment 4 Jakub Jelinek 2023-08-01 11:47:46 UTC
That patch has been in GCC for 5 years, so either it doesn't work in this case or it is some other, related bug.
The current setting is:
/* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
   smaller FMA chain.  */
DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3)

/* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
   smaller FMA chain.  */
DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
          | m_ALDERLAKE | m_SAPPHIRERAPIDS | m_CORE_ATOM)
What -mtune= do you get for -mtune=native in your case?  gcc -S -mtune=native -v -xc /dev/null -o /dev/null 2>&1 | grep mtune
should show that...  If it is -mtune=znver4, perhaps we need to add | m_ZNVER4 to some of those.
With -mtune=generic (the default) this workaround is not in effect.
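
One way to check whether a missing m_ZNVER4 explains the gap (a sketch; linpackd.c and the output names are illustrative, and adjusting the attached Makefile works just as well) is to build the same -march=x86-64-v3 code with -mtune=znver3, where the workaround above is enabled, and with -mtune=znver4, where per the table it is not, and compare:

gcc -O2 -DUNROLL -march=x86-64-v3 -mtune=znver3 linpackd.c -o linpackd_v3_znver3 -lm
gcc -O2 -DUNROLL -march=x86-64-v3 -mtune=znver4 linpackd.c -o linpackd_v3_znver4 -lm
./linpackd_v3_znver3 | grep Kflops
./linpackd_v3_znver4 | grep Kflops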

Comment 5 Florian Weimer 2023-08-01 12:00:25 UTC
(In reply to Jakub Jelinek from comment #4)
> With -mtune=generic (the default) this workaround is not in effect.

And -march=x86-64-v3 should use those default tunings. Maybe we should have an upstream discussion about whether we should change the default tuning.

Comment 6 Jakub Jelinek 2023-08-01 12:05:01 UTC
We certainly shouldn't change the default tuning (what we tune for with -mtune=generic); what we could consider is including | m_GENERIC in those.
But that requires a wide discussion between Intel and AMD, as -mtune=generic tunes for recent chips from both vendors, and it matters how much it gains on some CPUs and how much it slows things down on others.