Bug 2227278

Summary: On AMD EPYC3, Linpack benchmark compiled with -march=x86-64-v3 runs slower by 22% compared to -march=x86-64-v2
Product: Fedora
Reporter: Jiri Hladky <jhladky>
Component: gcc
Assignee: Jakub Jelinek <jakub>
Status: CLOSED EOL
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: unspecified
Version: 38
CC: dmalcolm, fweimer, jakub, jlaw, jwakely, mcermak, mpolacek, msebor, nickc, sipoyare
Keywords: Performance
Hardware: x86_64
OS: Linux
Last Closed: 2024-05-28 13:37:00 UTC
Attachments:
- Updated testcase with -mno-fma variant

Description Jiri Hladky 2023-07-28 15:28:55 UTC
On an AMD EPYC3 platform, I compiled the Linpack benchmark with the following three sets of flags, using gcc (GCC) 13.1.1 20230614 (Red Hat 13.1.1-4) from F38:

CFLAGS=-O2 -DUNROLL -Wall -Wextra -Wshadow -march=x86-64-v2
CFLAGS=-O2 -DUNROLL -Wall -Wextra -Wshadow -march=x86-64-v3
CFLAGS=-O2 -DUNROLL -Wall -Wextra -Wshadow -march=x86-64-v3 -mtune=native

$ grep -PoH "[0-9]+ Kflops" *log
linpackd_x86-64-v2.log:5739246 Kflops
linpackd_x86-64-v3.log:4440715 Kflops
linpackd_x86-64-v3_mtune_native.log:4518391 Kflops

As you can see, the binary built with -march=x86-64-v3 is 22% slower than the one built with -march=x86-64-v2. This holds even with -mtune=native.

I'm puzzled by this result. x86-64-v3 (see https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels) enables AVX instructions, which should help performance.

Could somebody please help analyze what's happening? I used the Intel SDE tool
(https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html)
to verify that the binary compiled with -march=x86-64-v3 indeed uses AVX instructions:

wget https://downloadmirror.intel.com/784319/sde-external-9.24.0-2023-07-13-lin.tar.xz
tar xvf sde-external-9.24.0-2023-07-13-lin.tar.xz
sde-external-9.24.0-2023-07-13-lin/sde64 -mix -omix instruction.hist -- ./linpackd_x86-64-v3

Moving from -march=x86-64-v2 to -march=x86-64-v3, isa-ext-AVX went up from 202195 to 17731412574 instructions, while isa-set-SSE2 went down from 35140277958 to 92179 instructions.

$ grep -H --max-count=1 isa-ext-AVX *hist
linpackd_x86-64-v2.hist:*isa-ext-AVX                                                      202195
linpackd_x86-64-v3.hist:*isa-ext-AVX                                                 17731412574
linpackd_x86-64-v3_mtune_native.hist:*isa-ext-AVX                                                 17744399086

$ grep -H --max-count=1 isa-set-SSE2 *hist
linpackd_x86-64-v2.hist:*isa-set-SSE2                                                35140277958
linpackd_x86-64-v3.hist:*isa-set-SSE2                                                      92179
linpackd_x86-64-v3_mtune_native.hist:*isa-set-SSE2                                                      92176


Reproducible: Always

Steps to Reproduce:
1.  dnf install hwloc-devel
2.  make run
3. Compare the performance reported:
   grep -PoH "[0-9]+ Kflops" *log
   linpackd_x86-64-v2.log:5739246 Kflops
   linpackd_x86-64-v3.log:4440715 Kflops
   linpackd_x86-64-v3_mtune_native.log:4518391 Kflops
4. Optional - install the SDE tool from Intel - https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html
   4.1 Update the path to the sde64 binary in the script generate_instruction_histogram.sh
   4.2 Run ./generate_instruction_histogram.sh to generate a histogram of instruction usage

Actual Results:  
The binary compiled with -march=x86-64-v3 is 22% slower than the binary compiled with -march=x86-64-v2. This happens even with -mtune=native. Reproduced on the AMD EPYC3 platform; Intel platforms (tested on Ice Lake and Sapphire Rapids) are not affected.

Expected Results:  
The binary compiled with -march=x86-64-v3 performs at least as well as the binary compiled with -march=x86-64-v2.

Comment 1 Jiri Hladky 2023-07-28 15:42:24 UTC
Created attachment 1980485 [details]
A standalone reproducer

Steps to Reproduce:
1.  dnf install hwloc-devel
2.  make run
3. Compare the performance reported:
grep -PoH "[0-9]+ Kflops" *log

See included README file for more details.

Comment 2 Jiri Hladky 2023-08-01 11:38:05 UTC
GCC generates a tight loop with an FMA chain.

On Zen-based AMD CPUs, an FMA chain with a dependency causes a regression. A patch for GCC is being developed: https://gcc.gnu.org/legacy-ml/gcc-patches/2017-12/msg01053.html.

The workaround is to compile with -mno-fma.

We will need to wait until the patch is complete and merged into GCC.

Comment 3 Jiri Hladky 2023-08-01 11:39:23 UTC
Created attachment 1981064 [details]
Updated testcase with -mno-fma variant

Updated the testcase with an -mno-fma variant.

It includes all results from AMD EPYC 7573X 32-Core server.

1) Log files with results
$ grep -PoH "[0-9]+ Kflops" *log
linpackd_x86-64-v2.log:5682088 Kflops
linpackd_x86-64-v3.log:4452227 Kflops
linpackd_x86-64-v3_mtune_native.log:4506376 Kflops
linpackd_x86-64-v3_mtune_native-no_fma.log:5645570 Kflops
linpackd_x86-64-v3-no_fma.log:5744116 Kflops

2) Instruction usage histogram
$ ls *hist | cat
linpackd_x86-64-v2.hist
linpackd_x86-64-v3.hist
linpackd_x86-64-v3_mtune_native.hist
linpackd_x86-64-v3_mtune_native-no_fma.hist
linpackd_x86-64-v3-no_fma.hist

Comment 4 Jakub Jelinek 2023-08-01 11:47:46 UTC
That patch has been in GCC for 5 years, so either it doesn't work in this case, or this is some other related bug.
The current setting is
/* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
   smaller FMA chain.  */
DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3)

/* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
   smaller FMA chain.  */
DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
          | m_ALDERLAKE | m_SAPPHIRERAPIDS | m_CORE_ATOM)
What -mtune= do you get for -mtune=native in your case?  gcc -S -mtune=native -v -xc /dev/null -o /dev/null 2>&1 | grep -v mtune
should show that...  If it is -mtune=znver4, perhaps we need to add | m_ZNVER4 to some of those.
With -mtune=generic (the default) this workaround is not in effect.

Comment 5 Florian Weimer 2023-08-01 12:00:25 UTC
(In reply to Jakub Jelinek from comment #4)
> With -mtune=generic (the default) this workaround is not in effect.

And -march=x86-64-v3 should use those default tunings. Maybe we should have an upstream discussion about whether to change the default tuning.

Comment 6 Jakub Jelinek 2023-08-01 12:05:01 UTC
We certainly shouldn't change the default tuning (what we tune for with -mtune=generic); what we could consider is including | m_GENERIC in those.
But that requires wide discussion between Intel and AMD, as -mtune=generic tunes for recent chips from both vendors, and it matters how much a change gains on some CPUs versus how much it slows things down on others.

Comment 10 Aoife Moloney 2024-05-28 13:37:00 UTC
Fedora Linux 38 entered end-of-life (EOL) status on 2024-05-21.

Fedora Linux 38 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.