Created attachment 1952419 [details]
Simple memcpy reproducer

Description of problem:
Customer reported a performance regression from RHEL 7 to RHEL 8 on Intel Skylake.

Version-Release number of selected component (if applicable):

How reproducible:
The customer used the following example to demonstrate the problem.

# perf bench mem memcpy -f default --nr_loops 500 --size 3MB

That test achieved 8.5 GB/sec on RHEL-7.5, and only 5.3 GB/sec on RHEL-8.4. This is easily reproducible.

Steps to Reproduce:
Run the above test on RHEL-7.5 and again on RHEL-8.4. The customer had a 2-socket Skylake server. I have been able to reproduce this on a 2-socket Cascade Lake server.

Additional info:
Thanks to great triaging help from Carlos O'Donell, the problem is understood. It turns out glibc is selecting a sub-optimal memcpy routine for that processor.

On RHEL-7.5, it used the "__memcpy_ssse3_back()" routine, which was the optimal choice then. On RHEL-8.4, the glibc memcpy routine used is "__memmove_avx_unaligned_erms()".

On RHEL-8.4, if the "Prefer_ERMS" attribute is given to glibc, then the faster "__memmove_erms()" is used. For example, slow and fast cases:

# perf bench mem memcpy -f default --nr_loops 500 --size 3MB |grep GB
 5.468937 GB/sec
# GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS \
> perf bench mem memcpy -f default --nr_loops 500 --size 3MB |grep GB
 12.508272 GB/sec

I've also attached a simple memcpy reproducer to demonstrate the problem, as shown below:

# gcc -O memcpy.c -o memcpy
# ./memcpy --help
USAGE: ./memcpy size-in-MB loop-iterations
# ./memcpy 3 500
Rate for 500 3MB memcpy iterations: 7.30 GB/sec
# GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS ./memcpy 3 500
Rate for 500 3MB memcpy iterations: 27.29 GB/sec

The customer's system did boot with mitigations=off, and with transparent_hugepages (THP) disabled.
Neither is needed to reproduce this problem, but disabling THP does enable the simple memcpy reproducer to achieve much higher performance.
Just to clarify, I was able to reproduce this on Intel Skylake and Cascade Lake servers. The problem appears less severe on an Ice Lake server. The default routine used for memcpy there is __memcpy_evex_unaligned_erms(), and the routine used with Prefer_ERMS is __memcpy_erms().

On the Ice Lake, I get the following using the attached memcpy reproducer:

# ./memcpy 3 500
Rate for 500 3MB memcpy iterations: 27.42 GB/sec
# GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS ./memcpy 3 500
Rate for 500 3MB memcpy iterations: 30.17 GB/sec
Joe, could you please add the cpuinfo for the CPU you're testing with?
Here's the lscpu output:

# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              56
On-line CPU(s) list: 0-55
Thread(s) per core:  1
Core(s) per socket:  28
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
BIOS Vendor ID:      Intel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
BIOS Model name:     Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
Stepping:            4
CPU MHz:             3800.000
CPU max MHz:         3800.0000
CPU min MHz:         1000.0000
BogoMIPS:            5000.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            39424K
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d arch_capabilities
There are two things we need to review.

(a) The tunables listed via ld.so --list-tunables do not seem to be populated with the correct values.
- The tunable values are used to influence the behaviour of memcpy algorithms.
- DJ Delorie is reviewing the tunables values, and the cpuid results, for this Skylake system.

(b) The dynamic runtime selection of the memcpy routine for Skylake.
- I'm reviewing the selection, and possible alternatives that might provide the best geomean performance for the *system*.
- I am in discussions with Intel engineers to review microbenchmark results.

I'll update the bug as we make progress.
We are discussing with Intel whether we should be adjusting the dynamic threshold differently for these processors (Skylake, Cascade Lake, and Ice Lake).

To give a brief overview, the families include a different L3 topology, and to some degree we must balance single-thread performance vs. multi-threaded performance. RHEL 7 tunes preferentially for single-threaded performance, with multi-threaded performance showing severe degradation in benchmarks like STREAM. In RHEL 8 and RHEL 9, performance is balanced in the middle between a single core and the whole system, and the key parameter here is how much L3 cache to allow each in-flight memory copy to consume.

The change in RHEL 8.3 looks like this:
~~~
- /* The large memcpy micro benchmark in glibc shows that 6 times of
-    shared cache size is the approximate value above which non-temporal
-    store becomes faster on a 8-core processor. This is the 3/4 of the
-    total shared cache size. */
+ /* The default setting for the non_temporal threshold is 3/4 of one
+    thread's share of the chip's cache. For most Intel and AMD processors
+    with an initial release date between 2017 and 2020, a thread's typical
+    share of the cache is from 500 KBytes to 2 MBytes. Using the 3/4
+    threshold leaves 125 KBytes to 500 KBytes of the thread's data
+    in cache after a maximum temporal copy, which will maintain
+    in cache a reasonable portion of the thread's stack and other
+    active data. If the threshold is set higher than one thread's
+    share of the cache, it has a substantial risk of negatively
+    impacting the performance of other threads running on the chip. */
  __x86_shared_non_temporal_threshold
    = (cpu_features->non_temporal_threshold != 0
       ? cpu_features->non_temporal_threshold
-      : __x86_shared_cache_size * threads * 3 / 4);
+      : __x86_shared_cache_size * 3 / 4);
~~~
This is the current upstream glibc implementation, as discussed between Intel, AMD, Red Hat, Oracle, and others who contributed overall to the tuning of this parameter for whole-system performance. Before the change in RHEL 8.3 we were aggressively favouring single-threaded performance. There is probably room to improve the current tuning on systems that have a unified L3 like Skylake has, and that is something we are still investigating.

Options for the customer look like this:

(a) Use the ERMS routine directly:
export GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS

(b) Raise the threshold for non-temporal stores:
export GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=0x2ffff0

Next steps:
- Continue to review tuning for Skylake, Cascade Lake and Ice Lake.