On an AMD EPYC3 platform, I have compiled the linpack benchmark with the following three settings, using gcc (GCC) 13.1.1 20230614 (Red Hat 13.1.1-4) from F38:

CFLAGS=-O2 -DUNROLL -Wall -Wextra -Wshadow -march=x86-64-v2
CFLAGS=-O2 -DUNROLL -Wall -Wextra -Wshadow -march=x86-64-v3
CFLAGS=-O2 -DUNROLL -Wall -Wextra -Wshadow -march=x86-64-v3 -mtune=native

$ grep -PoH "[0-9]+ Kflops" *log
linpackd_x86-64-v2.log:5739246 Kflops
linpackd_x86-64-v3.log:4440715 Kflops
linpackd_x86-64-v3_mtune_native.log:4518391 Kflops

As you can see, the binary built with -march=x86-64-v3 is 22% slower than the one built with -march=x86-64-v2, and adding -mtune=native does not help. I'm puzzled by this result: x86-64-v3 (see https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels) enables AVX instructions, which should help performance. Could somebody please help analyze what is happening?

I used the Intel SDE tool (https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html) to verify that the binary compiled with -march=x86-64-v3 indeed uses AVX instructions:

wget https://downloadmirror.intel.com/784319/sde-external-9.24.0-2023-07-13-lin.tar.xz
tar xvf sde-external-9.24.0-2023-07-13-lin.tar.xz
sde-external-9.24.0-2023-07-13-lin/sde64 -mix -omix instruction.hist -- ./linpackd_x86-64-v3

When moving from -march=x86-64-v2 to -march=x86-64-v3, isa-ext-AVX went up from 202195 to 17731412574 instructions and isa-set-SSE2 went down from 35140277958 to 92179 instructions:

$ grep -H --max-count=1 isa-ext-AVX *hist
linpackd_x86-64-v2.hist:*isa-ext-AVX 202195
linpackd_x86-64-v3.hist:*isa-ext-AVX 17731412574
linpackd_x86-64-v3_mtune_native.hist:*isa-ext-AVX 17744399086

$ grep -H --max-count=1 isa-set-SSE2 *hist
linpackd_x86-64-v2.hist:*isa-set-SSE2 35140277958
linpackd_x86-64-v3.hist:*isa-set-SSE2 92179
linpackd_x86-64-v3_mtune_native.hist:*isa-set-SSE2 92176

Reproducible: Always

Steps to Reproduce:
1. dnf install hwloc-devel
2. make run
3. Compare the performance reported: grep -PoH "[0-9]+ Kflops" *log
4. Optional - install the SDE tool from Intel: https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html
4.1 Update the path to the sde64 binary in the script generate_instruction_histogram.sh
4.2 Run ./generate_instruction_histogram.sh to generate a histogram of instruction usage
4.3 linpackd_x86-64-v2.log:5739246 Kflops
    linpackd_x86-64-v3.log:4440715 Kflops
    linpackd_x86-64-v3_mtune_native.log:4518391 Kflops

Actual Results:
The binary compiled with -march=x86-64-v3 is 22% slower than the binary compiled with -march=x86-64-v2. This happens even with -mtune=native. Reproduced on the AMD EPYC3 platform; Intel platforms (tested on Ice Lake and Sapphire Rapids) are not affected.

Expected Results:
The binary compiled with -march=x86-64-v3 performs at least at the same level as the binary compiled with -march=x86-64-v2.
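For context, the hottest loop in this benchmark is the daxpy kernel; a minimal sketch of it follows (assuming the classic C translation of linpack with its daxpy/ddot/dgefa routines - the exact source in the reproducer may differ in details). This is the loop whose code generation changes between the two -march levels.

/* Sketch of linpack's daxpy kernel, which dominates the runtime.
 * At -O2 with -march=x86-64-v3 (which implies -mfma), GCC may contract
 * the multiply-add below into FMA instructions; with -march=x86-64-v2
 * it stays separate SSE-class multiply and add instructions, as the
 * SDE histograms above show. */
static void daxpy(int n, double da, const double *dx, double *dy)
{
    for (int i = 0; i < n; i++)
        dy[i] = dy[i] + da * dx[i];
}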
Created attachment 1980485 [details]
A standalone reproducer

Steps to Reproduce:
1. dnf install hwloc-devel
2. make run
3. Compare the performance reported: grep -PoH "[0-9]+ Kflops" *log

See the included README file for more details.
GCC generates a tight loop with a chain of dependent FMA instructions. On Zen-based AMD CPUs, such a dependent FMA chain causes a regression. A patch for GCC is being developed: https://gcc.gnu.org/legacy-ml/gcc-patches/2017-12/msg01053.html

The workaround for now is to compile with -mno-fma. We will need to wait until the patch is complete and merged into GCC.
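To illustrate what a tight dependent FMA chain looks like (an illustration of the pattern, not necessarily the exact loop GCC emits for linpack), consider a simple reduction. Once the multiply-add is contracted into an FMA, every iteration's FMA reads the accumulator produced by the previous one, so the loop runs at FMA latency rather than FMA throughput; on Zen cores the FMA latency is reportedly a cycle or two longer than that of a plain add, which is the cost the -mno-fma workaround avoids.

/* Illustration only: a reduction whose contraction into FMA forms a
 * loop-carried dependency chain through the accumulator 's'.
 * With -mfma:     s = fma(x[i], y[i], s)   -> chain of dependent FMAs
 * With -mno-fma:  t = x[i] * y[i]; s += t  -> only the (lower-latency)
 *                                             adds stay on the chain */
double dot(int n, const double *x, const double *y)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}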
Created attachment 1981064 [details]
Updated testcase with -mno-fma variant

Updated testcase with a -mno-fma variant. It includes all results from an AMD EPYC 7573X 32-Core server.

1) Log files with results

$ grep -PoH "[0-9]+ Kflops" *log
linpackd_x86-64-v2.log:5682088 Kflops
linpackd_x86-64-v3.log:4452227 Kflops
linpackd_x86-64-v3_mtune_native.log:4506376 Kflops
linpackd_x86-64-v3_mtune_native-no_fma.log:5645570 Kflops
linpackd_x86-64-v3-no_fma.log:5744116 Kflops

2) Instruction usage histograms

$ ls *hist | cat
linpackd_x86-64-v2.hist
linpackd_x86-64-v3.hist
linpackd_x86-64-v3_mtune_native.hist
linpackd_x86-64-v3_mtune_native-no_fma.hist
linpackd_x86-64-v3-no_fma.hist
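As a quick sanity check on which binaries really had FMA code generation enabled (a small standalone sketch, not part of the attached testcase), one can rely on GCC predefining the __FMA__ macro whenever -mfma is in effect, which -march=x86-64-v3 implies and -mno-fma switches back off:

/* Prints which FMA setting this translation unit was compiled with. */
#include <stdio.h>

int main(void)
{
#ifdef __FMA__
    puts("FMA enabled (e.g. -march=x86-64-v3)");
#else
    puts("FMA disabled (e.g. -march=x86-64-v2 or -mno-fma)");
#endif
    return 0;
}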
That patch has been in GCC for 5 years, so either it doesn't work in this case, or it is some other related bug. The current setting is:

/* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
   smaller FMA chain.  */
DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains",
          m_ZNVER1 | m_ZNVER2 | m_ZNVER3)

/* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
   smaller FMA chain.  */
DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains",
          m_ZNVER2 | m_ZNVER3 | m_ALDERLAKE | m_SAPPHIRERAPIDS | m_CORE_ATOM)

Which -mtune= do you get for -mtune=native in your case?

gcc -S -mtune=native -v -xc /dev/null -o /dev/null 2>&1 | grep mtune

should show that. If it is -mtune=znver4, perhaps we need to add | m_ZNVER4 to some of those. With -mtune=generic (the default) this workaround is not in effect.
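If -mtune=native does resolve to znver4 here, the change hinted at above would presumably look something like the following (a sketch against gcc/config/i386/x86-tune.def as quoted, not a reviewed patch; whether Zen 4 should be covered by one or both workarounds is exactly the open question):

/* Sketch only: extend the FMA-chain workarounds to Zen 4 as well. */
DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains",
          m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4)

DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains",
          m_ZNVER2 | m_ZNVER3 | m_ZNVER4 | m_ALDERLAKE | m_SAPPHIRERAPIDS
          | m_CORE_ATOM)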
(In reply to Jakub Jelinek from comment #4)
> With -mtune=generic (the default) this workaround is not in effect.

And -march=x86-64-v3 should use those default tunings. Maybe we should have an upstream discussion about whether to change the default tuning.
We certainly shouldn't change the default tuning (what we tune for with -mtune=generic); what we could consider is including | m_GENERIC in those. But that requires wide discussion between Intel and AMD, as -mtune=generic is tuning for recent chips from both of those vendors, and it matters how much a change gains on some CPUs versus how much it makes things slower on others.