Bug 2227278
| Summary: | On AMD EPYC3, Linpack benchmark compiled with -march=x86-64-v3 runs slower by 22% compared to -march=x86-64-v2 | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Jiri Hladky <jhladky> |
| Component: | gcc | Assignee: | Jakub Jelinek <jakub> |
| Status: | NEW --- | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 38 | CC: | dmalcolm, fweimer, jakub, jlaw, jwakely, mcermak, mpolacek, msebor, nickc, sipoyare |
| Target Milestone: | --- | Keywords: | Performance |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
Description
Jiri Hladky
2023-07-28 15:28:55 UTC
Created attachment 1980485 [details]
A standalone reproducer
Steps to Reproduce:
1. dnf install hwloc-devel
2. make run
3. Compare the performance reported:
grep -PoH "[0-9]+ Kflops" *log
See included README file for more details.
GCC generates a tight loop with a dependent FMA chain. On Zen-based AMD CPUs, such a dependent FMA chain causes a performance regression. A GCC patch addressing this is being developed: https://gcc.gnu.org/legacy-ml/gcc-patches/2017-12/msg01053.html
The workaround is to compile with -mno-fma. We will need to wait until the patch is complete and merged into GCC.
Created attachment 1981064 [details]
Updated testcase with -mno-fma variant
Updated the testcase with a -mno-fma variant.
It includes all results from an AMD EPYC 7573X 32-Core server.
1) Log files with results
$ grep -PoH "[0-9]+ Kflops" *log
linpackd_x86-64-v2.log:5682088 Kflops
linpackd_x86-64-v3.log:4452227 Kflops
linpackd_x86-64-v3_mtune_native.log:4506376 Kflops
linpackd_x86-64-v3_mtune_native-no_fma.log:5645570 Kflops
linpackd_x86-64-v3-no_fma.log:5744116 Kflops
2) Instruction usage histogram
$ ls *hist | cat
linpackd_x86-64-v2.hist
linpackd_x86-64-v3.hist
linpackd_x86-64-v3_mtune_native.hist
linpackd_x86-64-v3_mtune_native-no_fma.hist
linpackd_x86-64-v3-no_fma.hist
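To put the Kflops figures above on a common scale, the relative slowdown against the x86-64-v2 baseline can be computed as in the following minimal sketch. The function name `slowdown_percent` is hypothetical and is not part of the attached reproducer; the input values are taken from the log output above.

```c
#include <stdio.h>

/* Relative slowdown of a candidate build versus a baseline build, in percent.
   Positive: the candidate reports fewer Kflops (slower).
   Negative: the candidate is faster than the baseline.
   Hypothetical helper, not part of the attached reproducer. */
double slowdown_percent(double baseline_kflops, double candidate_kflops)
{
    return 100.0 * (baseline_kflops - candidate_kflops) / baseline_kflops;
}

/* Print one comparison line for a named build variant. */
void print_slowdown(const char *variant, double baseline_kflops,
                    double candidate_kflops)
{
    printf("%s: %.1f%% slower than baseline\n", variant,
           slowdown_percent(baseline_kflops, candidate_kflops));
}
```

Calling `slowdown_percent(5682088.0, 4452227.0)` for the plain x86-64-v3 build gives roughly 21.6%, consistent with the 22% in the bug summary, while the x86-64-v3-no_fma build (5744116 Kflops) comes out slightly negative, that is, marginally faster than x86-64-v2.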
That patch has been in GCC for 5 years, so either it doesn't work in this case, or this is some other related bug.
The current setting is:
/* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
smaller FMA chain. */
DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3)
/* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
smaller FMA chain. */
DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
| m_ALDERLAKE | m_SAPPHIRERAPIDS | m_CORE_ATOM)
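For readers unfamiliar with the term, the sketch below shows what a loop-carried multiply-add chain looks like in C. It is an illustrative kernel only, not the Linpack hot loop from the attachment, and both function names are hypothetical. In the first function every iteration needs the previous value of `acc`; when FMA instructions are available, GCC may contract the multiply and add into a single FMA, which puts the whole FMA latency on that chain. The second function is a manual, swapped-in variant (not what the GCC tuning flag does) that splits the sum over two independent accumulators.

```c
#include <stddef.h>

/* One accumulator: each iteration depends on the previous acc.
   With FMA enabled the compiler may emit one dependent FMA per iteration. */
double dot_single_chain(const double *x, const double *y, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}

/* Two accumulators: two independent dependency chains that can overlap.
   Note: this changes the floating-point summation order, so results may
   differ from the single-chain version in the last bits. */
double dot_two_chains(const double *x, const double *y, size_t n)
{
    double acc0 = 0.0, acc1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += x[i] * y[i];
        acc1 += x[i + 1] * y[i + 1];
    }
    if (i < n)                 /* odd-length tail */
        acc0 += x[i] * y[i];
    return acc0 + acc1;
}
```

Whether a given loop in the reproducer is actually compiled to such a chain is best checked in the generated assembly or in the instruction histograms attached to this bug.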
Which -mtune= do you get for -mtune=native in your case?
gcc -S -mtune=native -v -xc /dev/null -o /dev/null 2>&1 | grep mtune
should show that. If it is -mtune=znver4, perhaps we need to add | m_ZNVER4 to some of those.
With -mtune=generic (the default) this workaround is not in effect.
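The reason the workaround is inactive under -mtune=generic can be pictured with a simplified model of the processor masks in the DEF_TUNE entries quoted above. This is a sketch under stated assumptions, not GCC's actual code: the enum, the bit positions, and the names `M`, `avoid_128fma_chains_mask` and `tune_enabled` are all hypothetical. The only thing taken from the thread is which processors appear in the X86_TUNE_AVOID_128FMA_CHAINS mask.

```c
#include <stdbool.h>

/* Hypothetical processor list; GCC's real enum is much longer. */
enum cpu { CPU_GENERIC, CPU_ZNVER1, CPU_ZNVER2, CPU_ZNVER3, CPU_ZNVER4 };

/* One bit per processor (bit positions are an assumption of this sketch). */
#define M(cpu) (1u << (cpu))

/* Mirrors the quoted setting: m_ZNVER1 | m_ZNVER2 | m_ZNVER3. */
const unsigned avoid_128fma_chains_mask =
    M(CPU_ZNVER1) | M(CPU_ZNVER2) | M(CPU_ZNVER3);

/* A tuning is active only if the processor being tuned for is in its mask. */
bool tune_enabled(unsigned mask, enum cpu tuned_for)
{
    return (mask & M(tuned_for)) != 0;
}
```

In this model `tune_enabled(avoid_128fma_chains_mask, CPU_GENERIC)` is false and the same call with `CPU_ZNVER3` is true, which matches the observation that the default tuning does not avoid FMA chains. Adding a processor to the mask is what the suggestions about `| m_ZNVER4` and `| m_GENERIC` in this thread refer to.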
(In reply to Jakub Jelinek from comment #4)
> With -mtune=generic (the default) this workaround is not in effect.
And -march=x86-64-v3 should use those default tunings. Maybe we should have an upstream discussion about whether we should change the default tuning.
We certainly shouldn't change the default tuning (that is, that we tune with -mtune=generic); what we could consider is including | m_GENERIC in those. But that requires wide discussion between Intel and AMD, as -mtune=generic tunes for recent chips from both vendors, and it matters how much it gains on some CPUs and how much it slows things down on others.