Bug 1880670
| Field | Value |
|---|---|
| Summary | glibc: memcpy calls are slower for x86_64 processors on RHEL 8 than on RHEL 7 |
| Product | Red Hat Enterprise Linux 8 |
| Component | glibc |
| Version | 8.2 |
| Status | CLOSED ERRATA |
| Severity | unspecified |
| Priority | unspecified |
| Reporter | Brandon Clark <brclark> |
| Assignee | Florian Weimer <fweimer> |
| QA Contact | Sergey Kolosov <skolosov> |
| Docs Contact | Zuzana Zoubkova <zzoubkov> |
| CC | alanm, amike, ashankar, brclark, casantos, chorn, codonell, dj, fweimer, jaeshin, jwright, kdudka, mbliss, mcermak, mkolbas, mnewsome, pfrankli, sajan.karumanchi, sipoyare, tnagata, vmukhame |
| Target Milestone | rc |
| Target Release | 8.0 |
| Keywords | Bugfix, Patch, Triaged, ZStream |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | glibc-2.28-137.el8 |
| Doc Type | Bug Fix |
| Clones | 1913750 (view as bug list) |
| Bug Depends On | 1893197 |
| Bug Blocks | 1913750 |
| Last Closed | 2021-05-18 14:36:39 UTC |
| Type | Bug |

Doc Text:

.The `glibc` string functions now avoid negative impact on the system cache on AMD64 and Intel 64 processors

Previously, the `glibc` implementation of string functions incorrectly estimated the amount of last-level cache available to a thread on 64-bit AMD and Intel processors. As a consequence, calling the `memcpy` function on large buffers either negatively impacted the overall cache performance of the system or slowed down the `memcpy` call itself.

With this update, the last-level cache size is no longer scaled with the number of reported hardware threads in the system. As a result, the string functions now bypass caches for large buffers, avoiding negative impact on the rest of the system cache.
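The "bypass caches for large buffers" behavior in the doc text refers to non-temporal stores. The sketch below is purely illustrative and is not glibc's implementation; the helper name `copy_nontemporal`, the 16-byte alignment assumption, and the buffer sizes are assumptions made for the example, which shows the mechanism with SSE2 intrinsics.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_loadu_si128, _mm_stream_si128 */
#include <stdlib.h>
#include <string.h>

/* Copy SIZE bytes from SRC to DST using non-temporal stores, so the
   copied data does not displace other data from the CPU caches.
   Illustrative only -- not glibc's implementation.  Assumes DST is
   16-byte aligned and SIZE is a multiple of 16.  */
static void
copy_nontemporal (void *dst, const void *src, size_t size)
{
  const __m128i *s = (const __m128i *) src;
  __m128i *d = (__m128i *) dst;
  for (size_t i = 0; i < size / sizeof (__m128i); i++)
    _mm_stream_si128 (d + i, _mm_loadu_si128 (s + i));
  _mm_sfence ();  /* order the non-temporal stores before later stores */
}

int
main (void)
{
  size_t size = 32 << 20;                 /* 32 MiB, as in the test below */
  void *src = aligned_alloc (16, size);
  void *dst = aligned_alloc (16, size);
  if (src == NULL || dst == NULL)
    return 1;
  memset (src, 0x5a, size);
  copy_nontemporal (dst, src, size);
  return memcmp (dst, src, size) != 0;
}
```

Whether such stores help or hurt depends on the buffer size relative to the cache, which is exactly what the non-temporal threshold discussed in the comments below controls.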
The sosreport shows this:

    processor   : 0
    vendor_id   : AuthenticAMD
    cpu family  : 23
    model       : 49
    model name  : AMD EPYC 7542 32-Core Processor
    stepping    : 0
    microcode   : 0x8301038
    cpu MHz     : 2428.418
    cache size  : 512 KB

We have a couple of those in Beaker, so we can check fairly quickly whether it's fixed upstream.

I configured an AMD EPYC 7452 32-Core Processor with RHEL 8.3. I built glibc upstream master as of commit 2deb7793907c7995b094b3778017c0ef0bd432d5. Testsuite results look clean.

Baseline: glibc-2.17-317.el7.x86_64.rpm:

    ./rpm/lib64/ld-linux-x86-64.so.2 --library-path /root/rpm/lib64/ ./test_memcpy 32
    32 MB = 1.763765 ms

RHEL 8: glibc-2.28-127.el8.x86_64:

    for i in a b c d e f g h i j; do ./test_memcpy 32; done
    32 MB = 3.400531 ms
    32 MB = 3.219160 ms
    32 MB = 3.211369 ms
    32 MB = 3.207968 ms
    32 MB = 3.216026 ms
    32 MB = 3.215049 ms
    32 MB = 3.207945 ms
    32 MB = 3.208351 ms
    32 MB = 3.203935 ms
    32 MB = 3.208126 ms

Picked routine is __memmove_avx_unaligned. This is slower than RHEL 7.

Upstream: glibc-2.32.9000:

    for i in a b c d e f g h i j; do ./build/elf/ld-linux-x86-64.so.2 --library-path /root/build/ ./test_memcpy 32; done
    32 MB = 1.766122 ms
    32 MB = 1.761784 ms
    32 MB = 1.760405 ms
    32 MB = 1.780899 ms
    32 MB = 1.760634 ms
    32 MB = 1.760430 ms
    32 MB = 1.762390 ms
    32 MB = 1.762698 ms
    32 MB = 1.763199 ms
    32 MB = 1.761510 ms

Picked routine is __memmove_avx_unaligned. This upstream performance matches RHEL 7 again, so something upstream has improved the performance.

The regression was fixed by:

    commit d3c57027470b78dba79c6d931e4e409b1fecfc80
    Author: Patrick McGehearty <patrick.mcgehearty>
    Date:   Mon Sep 28 20:11:28 2020 +0000

        Reversing calculation of __x86_shared_non_temporal_threshold

This makes some sense here, and upstream we discussed balancing the L3 cache among threads. I experimented a bit on the EPYC device:

- Reducing the cache usage from 75% to 50% worsens performance, e.g. ~1.91 ms.
- Reducing the cache usage from 75% to 66% gives about the same performance, e.g. ~1.76 ms.

So there is some room for fine-tuning this for the device. There are definitely some architectural details that matter here, and AMD is going to have to review this.

- In RHEL 8, __x86_shared_non_temporal_threshold is "cache * threads * 3 / 4".
  - The value of cache is 1572864.
  - The value of threads is 128.
  - Thus __x86_shared_non_temporal_threshold is 150,994,944 bytes, or 144 MiB (too high).
- In upstream master, __x86_shared_non_temporal_threshold is "cache * 3 / 4".
  - The value of cache is 1572864.
  - The value of threads is 128 (not used in this formula).
  - Thus __x86_shared_non_temporal_threshold is 1,179,648 bytes, or 1.125 MiB.

It is not clear to me whether the relevant code for AMD is correctly modelling the topology of the shared thread count for the cache:

    /* Figure out the number of logical threads that share L3.  */
    if (max_cpuid_ex >= 0x80000008)
      {
        /* Get width of APIC ID.  */
        __cpuid (0x80000008, max_cpuid_ex, ebx, ecx, edx);
        threads = 1 << ((ecx >> 12) & 0x0f);
      }

This is fairly rudimentary and could probably be improved. We divide the whole of the L3 cache size by the reported "threads" count to balance the cache usage. If the threads value is too high, we'll be giving threads less cache than they could use.

I have placed an unsupported test build with a backport of the relevant upstream patch here:

https://people.redhat.com/~fweimer/RbMvxmRwQE1x/glibc-2.28-130.el8.bz1880670.1/

I would appreciate it if those affected by this issue could test this build and report whether it addresses the issue. Thanks.
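To make the difference between the two formulas concrete, here is a small worked example in C using the values quoted above (cache = 1572864, threads = 128). Only the formulas, the values, and the name __x86_shared_non_temporal_threshold come from the comment; the variable names and the program itself are illustrative.

```c
#include <stdio.h>

int
main (void)
{
  unsigned long cache = 1572864;   /* shared-cache figure quoted above, in bytes */
  unsigned long threads = 128;     /* reported hardware threads sharing L3 */

  /* RHEL 8 (glibc 2.28): the threshold scales with the thread count.  */
  unsigned long rhel8 = cache * threads * 3 / 4;

  /* Upstream after commit d3c5702747: no scaling by the thread count.  */
  unsigned long upstream = cache * 3 / 4;

  printf ("RHEL 8 threshold:   %lu bytes (~%lu MiB)\n", rhel8, rhel8 >> 20);
  printf ("upstream threshold: %lu bytes (~%lu MiB)\n", upstream, upstream >> 20);

  /* A 32 MiB memcpy falls below the RHEL 8 threshold (cached stores,
     competing for L3) but above the upstream one (non-temporal stores),
     which is consistent with the timings quoted above.  */
  return 0;
}
```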
In August I posted a patch to bring in performance gains for memcpy/memmove on AMD machines, which tunes the non-temporal threshold parameter '__x86_shared_non_temporal_threshold' to 2/3 of the shareable cache per thread. Although this patch brought performance gains in walk-bench results, I did see a regression for memory ranges of 1 MB to 8 MB in large-bench results. I mentioned this in the cover note, looking for answers to the discrepancies in the results.

https://sourceware.org/pipermail/libc-alpha/2020-August/117080.html

As I did not get any response to my queries, I started working on a solution for AMD Zen architectures to fix this regression. On the verge of pushing my patch, I saw that Patrick's patch had already been committed to 2.32. Patrick's patch follows similar lines of tuning the non-temporal threshold and introduced the regression on AMD Zen machines mentioned earlier. I have rebased my patch and re-run the benchmark tests to come up with a solution that handles this regression on the master branch. This solution brings performance gains of ~44% for memory sizes greater than 16 MB, with no regression for 1 MB to 8 MB memory sizes in large-bench results. The patch "Optimizing memcpy for AMD Zen architecture" is posted for review.

Patch: https://sourceware.org/pipermail/libc-alpha/2020-October/118895.html

More details on the patch and performance numbers (cover note): https://sourceware.org/pipermail/libc-alpha/2020-October/118894.html

Since the problem described in this bug report should be resolved by a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: glibc security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1585

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
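As a rough illustration of the "2/3 of the shareable cache per thread" heuristic described above, the sketch below applies it to assumed Zen 2 topology figures (16 MiB of L3 per CCX, shared by 8 hardware threads). Both the topology numbers and the code are assumptions made for illustration; this is not the contents of the posted patch.

```c
#include <stdio.h>

int
main (void)
{
  unsigned long l3_per_ccx = 16UL << 20;  /* assumed: 16 MiB L3 per Zen 2 CCX */
  unsigned long threads_per_ccx = 8;      /* assumed: 4 cores x 2 SMT threads */

  unsigned long per_thread_share = l3_per_ccx / threads_per_ccx;
  unsigned long threshold = per_thread_share * 2 / 3;

  printf ("per-thread L3 share:          %lu KiB\n", per_thread_share >> 10);
  printf ("non-temporal threshold (2/3): %lu KiB\n", threshold >> 10);
  return 0;
}
```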
Created attachment 1715440: test_memcpy.c for reproduction steps.

Description of problem:

A customer has observed that a program that uses memcpy, compiled with generic tuning and run on an AMD EPYC processor, takes 1.32 s longer to process 1000 memcpy calls than on RHEL 7.

Version-Release number of selected component (if applicable):

glibc-2.28-101.el8.x86_64

How reproducible:

They were able to make this occur consistently.

Steps to Reproduce:

1. Use an AMD Zen 2 processor.
2. Compile the test file with the arguments '-mtune=generic -march=x86-64 -g -O3 test_memcpy.c -o test_memcpy64'.
3. Run the compiled application under the time command with '32' as the argument to test_memcpy.
4. Repeat the test on RHEL 7.

Actual results:

Execution of the compiled application is slower than on RHEL 7.

Expected results:

Execution is as fast as (or faster than) on RHEL 7.

Additional information:

The customer is currently using the following GLIBC_TUNABLES setting as a workaround:

GLIBC_TUNABLES=glibc.tune.hwcaps=-AVX_Usable,-AVX2_Usable,-Prefer_ERMS,-Prefer_FSRM,Prefer_No_AVX512,Prefer_No_VZEROUPPER,-AVX_Fast_Unaligned_Load,-ERMS
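The attached test_memcpy.c is not reproduced in this report. A minimal stand-in along the lines described (copy an N MiB buffer 1000 times and report the average per-copy time, matching the "32 MB = ... ms" output quoted in the comments) might look like the following; it is an illustrative sketch under those assumptions, not the actual attachment.

```c
/* Illustrative stand-in for attachment 1715440 (test_memcpy.c).
   Build, for example:
     gcc -mtune=generic -march=x86-64 -g -O3 test_memcpy.c -o test_memcpy64  */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int
main (int argc, char **argv)
{
  size_t mb = (argc > 1) ? strtoul (argv[1], NULL, 10) : 32;
  size_t size = mb << 20;
  int iterations = 1000;   /* the report mentions 1000 memcpy calls */

  char *src = malloc (size);
  char *dst = malloc (size);
  if (src == NULL || dst == NULL)
    return 1;
  memset (src, 0x5a, size);

  struct timespec start, end;
  clock_gettime (CLOCK_MONOTONIC, &start);
  for (int i = 0; i < iterations; i++)
    memcpy (dst, src, size);
  clock_gettime (CLOCK_MONOTONIC, &end);

  double elapsed_ms = (end.tv_sec - start.tv_sec) * 1e3
                      + (end.tv_nsec - start.tv_nsec) / 1e6;
  printf ("%zu MB = %f ms\n", mb, elapsed_ms / iterations);

  free (src);
  free (dst);
  return 0;
}
```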