Bug 2180462
| Field | Value |
|---|---|
| Summary | glibc: Memcpy throughput lower on RH8.4 compared to RH7.5 - same Skylake hardware |
| Product | Red Hat Enterprise Linux 8 |
| Reporter | Joe Mario <jmario> |
| Component | glibc |
| Assignee | DJ Delorie <dj> |
| Status | CLOSED ERRATA |
| QA Contact | Martin Coufal <mcoufal> |
| Severity | high |
| Docs Contact | Jacob Taylor Valdez <jvaldez> |
| Priority | unspecified |
| Version | 8.4 |
| CC | ashankar, barend.havenga, bgray, cdeardor, codonell, dj, fweimer, jvaldez, jyoung, mcermak, pandrade, pfrankli, sipoyare |
| Target Milestone | rc |
| Keywords | Regression, Triaged |
| Hardware | x86_64 |
| OS | Linux |
| Fixed In Version | glibc-2.28-236.el8 |
| Doc Type | Enhancement |
| Doc Text | Improved string and memory routine performance on Intel® Xeon® v5-based hardware in `glibc`. Previously, the default amount of cache used by `glibc` for string and memory routines resulted in lower than expected performance on Intel® Xeon® v5-based systems. With this update, the amount of cache to use has been tuned to improve performance. |
| Clones | 2213907 (view as bug list) |
| Bug Blocks | 2213907 |
| Last Closed | 2023-11-14 15:49:05 UTC |
| Type | Bug |
Description
Joe Mario
2023-03-21 14:33:19 UTC
Just to clarify, I was able to reproduce this on Intel Skylake and Cascade Lake servers. The problem appears less severe on an Ice Lake server. The default routine used for memcpy there is __memcpy_evex_unaligned_erms(), and the routine used with Prefer_ERMS is __memcpy_erms(). On the Ice Lake, I get the following using the attached memcpy reproducer:

~~~
# ./memcpy 3 500
Rate for 500 3MB memcpy iterations: 27.42 GB/sec

# GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS ./memcpy 3 500
Rate for 500 3MB memcpy iterations: 30.17 GB/sec
~~~

Joe, could you please add the cpuinfo for the CPU you're testing with?

Here's the lscpu output:

~~~
# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              56
On-line CPU(s) list: 0-55
Thread(s) per core:  1
Core(s) per socket:  28
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
BIOS Vendor ID:      Intel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
BIOS Model name:     Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
Stepping:            4
CPU MHz:             3800.000
CPU max MHz:         3800.0000
CPU min MHz:         1000.0000
BogoMIPS:            5000.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            39424K
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d arch_capabilities
~~~

There are two things we need to review.

(a) The tunables listed via ld.so --list-tunables do not seem to be populated with the correct values.
- The tunable values are used to influence the behaviour of memcpy algorithms.
- DJ Delorie is reviewing the tunable values, and the cpuid results, for this Skylake system.

(b) The dynamic runtime selection of the memcpy routine for Skylake.
- I'm reviewing the selection, and possible alternatives that might provide the best geomean performance for the *system*.
- I am in discussions with Intel engineers to review microbenchmark results.

I'll update the bug as we make progress.

We are discussing with Intel whether we should be adjusting the dynamic threshold differently for these processors (Skylake, Cascade Lake, and Ice Lake). To give a brief overview, the families have different L3 topologies, and to some degree we must balance single-threaded performance against multi-threaded performance. RHEL 7 tunes preferentially for single-threaded performance, with multi-threaded performance showing severe degradation in benchmarks like STREAM. In RHEL 8 and RHEL 9, performance is balanced between a single core and the whole system, and the key parameter here is how much L3 cache to allow each in-flight memory copy to consume.
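For readers who do not have the attached reproducer handy, the measurement above can be approximated with a few lines of C. This is only a sketch, not the actual attachment: the argument handling (copy size in MB, iteration count) and the GB/sec output format are assumptions chosen to mirror the runs shown above.

~~~
/* Hypothetical sketch of a memcpy throughput test, similar in spirit to the
   attached reproducer (which is not reproduced here).
   Usage: ./memcpy <size_in_MB> <iterations>  */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main (int argc, char **argv)
{
  size_t mb = argc > 1 ? strtoul (argv[1], NULL, 0) : 3;
  size_t iters = argc > 2 ? strtoul (argv[2], NULL, 0) : 500;
  size_t len = mb * 1024 * 1024;

  char *src = malloc (len);
  char *dst = malloc (len);
  if (src == NULL || dst == NULL)
    return 1;
  /* Touch all pages up front so the loop measures copying, not page faults.  */
  memset (src, 0x5a, len);
  memset (dst, 0, len);

  struct timespec t0, t1;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (size_t i = 0; i < iters; i++)
    memcpy (dst, src, len);
  clock_gettime (CLOCK_MONOTONIC, &t1);

  double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  double gbytes = (double) len * iters / (1024.0 * 1024.0 * 1024.0);
  printf ("Rate for %zu %zuMB memcpy iterations: %.2f GB/sec\n",
          iters, mb, gbytes / secs);
  return 0;
}
~~~

Built with something like gcc -O2 -o memcpy memcpy.c, running it with and without GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS (or with glibc.cpu.x86_non_temporal_threshold raised) is enough to see whether a given copy size is taking the non-temporal path.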
The change in RHEL 8.3 looks like this:

~~~
+- /* The large memcpy micro benchmark in glibc shows that 6 times of
+-    shared cache size is the approximate value above which non-temporal
+-    store becomes faster on a 8-core processor.  This is the 3/4 of the
+-    total shared cache size.  */
++ /* The default setting for the non_temporal threshold is 3/4 of one
++    thread's share of the chip's cache. For most Intel and AMD processors
++    with an initial release date between 2017 and 2020, a thread's typical
++    share of the cache is from 500 KBytes to 2 MBytes. Using the 3/4
++    threshold leaves 125 KBytes to 500 KBytes of the thread's data
++    in cache after a maximum temporal copy, which will maintain
++    in cache a reasonable portion of the thread's stack and other
++    active data. If the threshold is set higher than one thread's
++    share of the cache, it has a substantial risk of negatively
++    impacting the performance of other threads running on the chip. */
+  __x86_shared_non_temporal_threshold
+    = (cpu_features->non_temporal_threshold != 0
+       ? cpu_features->non_temporal_threshold
+-      : __x86_shared_cache_size * threads * 3 / 4);
++      : __x86_shared_cache_size * 3 / 4);
~~~

This is the current upstream glibc implementation as discussed between Intel, AMD, Red Hat, Oracle, and others who contributed to the overall tuning of this parameter for whole-system performance. Before the change in RHEL 8.3 we were aggressively favouring single-threaded performance. There is probably room to improve the current tuning on systems that have a unified L3, as Skylake does, and that is something we are still investigating.

Options for the customer look like this:

(a) Use the ERMS routine directly:

~~~
export GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS
~~~

(b) Raise the threshold for non-temporal stores (0x2ffff0 is roughly 3 MB; see the worked numbers at the end of this report):

~~~
export GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=0x2ffff0
~~~

Next steps:
- Continue to review tuning for Skylake, Cascade Lake and Ice Lake.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (glibc bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7107
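To put rough numbers on that change for the Xeon Platinum 8180 reported above (39424K of L3 per socket, 28 cores per socket, one thread per core), here is an illustrative calculation. It assumes __x86_shared_cache_size works out to one thread's share of the L3, as the new comment in the patch describes; treat the figures as approximations rather than the exact values glibc computes on this machine.

~~~
/* Rough, illustrative arithmetic for the non-temporal threshold on the
   Xeon Platinum 8180 in this report: 39424 KB of L3 shared by 28 cores
   (one thread per core).  Assumes __x86_shared_cache_size is one thread's
   share of the L3, per the patch comment above.  */
#include <stdio.h>

int main (void)
{
  long l3_bytes = 39424L * 1024;        /* L3 cache: 39424K from lscpu */
  long threads  = 28;                   /* cores per socket, 1 thread/core */
  long share    = l3_bytes / threads;   /* ~1408 KB per thread */

  long old_threshold = share * threads * 3 / 4;  /* pre-RHEL 8.3 formula */
  long new_threshold = share * 3 / 4;            /* RHEL 8.3+ formula */

  printf ("per-thread L3 share: ~%ld KB\n", share / 1024);          /* ~1408 KB */
  printf ("old threshold:       ~%ld MB\n", old_threshold / (1024 * 1024)); /* ~28 MB */
  printf ("new threshold:       ~%ld KB\n", new_threshold / 1024);  /* ~1056 KB */

  /* A 3 MB copy is far below the old ~28 MB threshold (cached copy) but
     above the new ~1 MB threshold, so it switches to non-temporal stores,
     which is what the reproducer's lower GB/sec reflects.  Raising the
     tunable to 0x2ffff0 (~3 MB) keeps such copies on the cached path.  */
  return 0;
}
~~~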